Perl multiline regex



Solution 1

/m affects what ^ and $ match. You use neither, so /m has no effect.

You read only a single line at a time, so you only ever match against a single line. /m cannot cause the regex to match against data that is still waiting to be read from a file handle the regex engine knows nothing about.

You could load the entire file into memory by using -0777, then loop over all matches with /g instead of grabbing just the first.

Solution 2

This is pretty straightforward with just grep and sed:

grep adGroupId listado.txt | sed -E "s/[^0-9]+//g"

1. grep keeps only the lines that contain adGroupId.
2. sed removes everything that isn't a digit.

Solution 3

Depending on the exact structure of your data, you can make use of line numbers:

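while (<>) {
  if ( /NumberLong\("?(?<nr>\d+)/ ) {
    # $. is the current input line number; in the sample data adGroupId
    # falls on odd lines and keywordId on even lines
    print "$+{nr}" . ( $. % 2 ? "-" : "\n" );
  }
}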

Or use flags:

my $flag = 0;

while (<>) {
  if ( /NumberLong\("?(?<nr>\d+)/ ) {
    !$flag 
      ? (print "$+{nr}-" and $flag++)
      : (print "$+{nr}\n" and $flag--);
  }
}

Or with slurping:

use 5.010;
my $file;

{
  local $/;
  $file = <>;
}

while ($file =~ /adGroupId" : NumberLong\("?(?<first>\d+).+?keywordId" : NumberLong\("?(?<second>\d+)/gs ) {
  say "$+{first}-$+{second}";
}


Author by

Nicolas Rodríguez Seara

Love building solutions to everyday problems using software. Passionate, curious, entrepreneur. www.reclutapro.com

Updated on June 14, 2022

Comments

Nicolas Rodríguez Seara almost 2 years

I have a file full of JSON objects to parse, similar to this one:

{
"_id" : ObjectId("523a58c1e4b09611f4c58a66"),
"_items" : [
    {
        "adGroupId" : NumberLong(1230610621),
        "keywordId" : NumberLong("5458816773")
    },
    {
        "adGroupId" : NumberLong(1230613681),
        "keywordId" : NumberLong("3204196588")
    },
    {
        "adGroupId" : NumberLong(1230613681),
        "keywordId" : NumberLong("4340421772")
    },
    {
        "adGroupId" : NumberLong(1230615571),
        "keywordId" : NumberLong("10525630645")
    },
    {
        "adGroupId" : NumberLong(1230617641),
        "keywordId" : NumberLong("4178290208")
    }
]}

I want to extract the numbers from inside the NumberLong() calls. At first I needed just the keywordId, and managed to get it with:

cat listado.txt |& perl -ne 'print "$1," if /\"keywordId\" : NumberLong\(\"?(\d*)\"?\)/' > keywordIds.txt

This generated a comma-separated file with the numbers. I now also need the adGroupIds, so I'm trying the following regex, with no luck:

cat ./work/listado.txt |& perl -ne 'print "$1-$2," if /\"adGroupId\" : NumberLong\(\"?(\d*)\"?\),\s*\"keywordId\" : NumberLong\(\"?(\d*)\"?\)/m'

The regex matches, but I believe Perl is not matching across lines, even though I'm using /m.

Any ideas?

Hunter McMillen over 10 years

He claims to want the numbers. How is this any different? (Other than the lack of commas)

Nicolas Rodríguez Seara over 10 years

You are only capturing the adGroupId numbers. I need both adGroupId and keywordId, in a file like this: group1-keyword1, group2-keyword2, ...

ikegami over 10 years

There's a big difference between 1-2,3-4,5-6 and 1\n3\n5

Nicolas Rodríguez Seara over 10 years

That correctly returns the first group, output: "1230610621-5458816773,". How do I make it keep going? Oh, and the file is 100MB, so if I can avoid loading it all into memory, even better.

ikegami over 10 years

print "$1-$2," while /.../g; Or, without the extra comma, push @matches, "$1-$2" while /.../g; END { print join ',', @matches }

tijagi over 9 years

@Nicolas It surprises me that nobody has posted a sed variant yet. sed -nr 's/.*adGroupId.*\(([0-9]+)\).*/\1/; Te; N; s/\n.*keywordId.*\("([0-9]+)"\).*$/-\1/; H; :e ${g;s/^\n//;s/\n/,/g;p};' <file