Perl multiline regex

10,047

Solution 1

/m affects what ^ and $ match. You use neither, so /m has no effect.

You read only one line at a time, so the regex is only ever matched against a single line. /m cannot make the regex match against data that is still waiting to be read from a file handle the regex engine knows nothing about.

You could load the entire file into memory by using -0777 and loop over all matches instead of just grabbing the first.

Solution 2

This is pretty straightforward with just grep and sed:

grep adGroupId listado.txt | sed -E  "s/[^0-9]+//g"
  1. Match lines with adGroupId in them
  2. Remove everything that isn't a digit

Solution 3

Depending on the exact structure of your data, you can make use of line numbers:

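while (<>) {
  if ( /NumberLong\("?(?<nr>\d+)/ ) {
    # $. is the current input line number; this assumes adGroupId falls on
    # odd lines and keywordId on even lines, as in the sample data
    print "$+{nr}" . ( $. % 2 ? "-" : "\n" );
  }
}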

Or use flags:

my $flag = 0;  # false: next match is an adGroupId; true: next match is a keywordId

while (<>) {
  if ( /NumberLong\("?(?<nr>\d+)/ ) {
    print "$+{nr}" . ( $flag ? "\n" : "-" );
    $flag = !$flag;  # alternate between the two fields
  }
}

Or with slurping:

use 5.010;
my $file;

{
  local $/;
  $file = <>;
}

while ($file =~ /adGroupId" : NumberLong\("?(?<first>\d+).+?keywordId" : NumberLong\("?(?<second>\d+)/gs ) {
  say "$+{first}-$+{second}";
}
Author by

Nicolas Rodríguez Seara

Love building solutions to everyday problems using software. Passionate, curious, entrepreneur. www.reclutapro.com

Updated on June 14, 2022

Comments

  • Nicolas Rodríguez Seara
    Nicolas Rodríguez Seara almost 2 years

    I have a file full of JSON objects to parse, similar to this one:

    {
    "_id" : ObjectId("523a58c1e4b09611f4c58a66"),
    "_items" : [
        {
            "adGroupId" : NumberLong(1230610621),
            "keywordId" : NumberLong("5458816773")
        },
        {
            "adGroupId" : NumberLong(1230613681),
            "keywordId" : NumberLong("3204196588")
        },
        {
            "adGroupId" : NumberLong(1230613681),
            "keywordId" : NumberLong("4340421772")
        },
        {
            "adGroupId" : NumberLong(1230615571),
            "keywordId" : NumberLong("10525630645")
        },
        {
            "adGroupId" : NumberLong(1230617641),
            "keywordId" : NumberLong("4178290208")
        }
    ]}
    

    I want to extract the numbers from inside the NumberLong(). At first I needed just the keywordId, and managed to get it with:

    cat listado.txt |& perl -ne 'print "$1," if /\"keywordId\" : NumberLong\(\"?(\d*)\"?\)/' > keywordIds.txt
    

    This generated a comma-separated file with the numbers. I now also need the adGroupIds, so I'm trying the following regex, with no luck:

    cat ./work/listado.txt |& perl -ne 'print "$1-$2," if /\"adGroupId\" : NumberLong\(\"?(\d*)\"?\),\s*\"keywordId\" : NumberLong\(\"?(\d*)\"?\)/m'
    

    The regex matches, but I believe perl is not doing multiline, even though I'm using /m.

    Any ideas?

  • Hunter McMillen
    Hunter McMillen over 10 years
    He claims to want the numbers. How is this any different? (Other than the lack of commas)
  • Nicolas Rodríguez Seara
    Nicolas Rodríguez Seara over 10 years
    You are only capturing the adGroupId numbers. I need both adGroupId and keywordId, in a file like this: group1-keyword1, group2-keywd2, ...
  • ikegami
    ikegami over 10 years
    There's a big difference between 1-2,3-4,5-6 and 1\n3\n5
  • Nicolas Rodríguez Seara
    Nicolas Rodríguez Seara over 10 years
    That correctly returns the first group, output: "1230610621-5458816773,". How do I make it keep going? Oh, and the file is 100MB, so if I can avoid loading it all into memory, even better.
  • ikegami
    ikegami over 10 years
    print "$1-$2," while /.../g; Or, without the extra comma: push @matches, "$1-$2" while /.../g; END { print join ',', @matches }
  • tijagi
    tijagi over 9 years
    @Nicolas It surprises me that nobody posted a variant in sed yet. sed -nr 's/.*adGroupId.*\(([0-9]+)\).*/\1/; Te; N; s/\n.*keywordId.*\("([0-9]+)"\).*$/-\1/; H; :e ${g;s/^\n//;s/\n/,/g;p};' <file