How can I efficiently handle multiple Perl search/replace operations on the same string?

Solution 1

Problem #1

As there doesn't appear to be much structure shared by the individual regexes, there's not really a simpler or clearer way than just listing the commands as you have done. One common approach to decreasing repetition in code like this is to move $text into $_, so that instead of having to say:

$text =~ s/foo/bar/g;

You can just say:

s/foo/bar/g;

A common idiom for doing this is to use a degenerate for() loop as a topicalizer:

for($text)
{
  s/foo/bar/g;
  s/qux/meh/g;
  ...
}

The scope of this block will preserve any preexisting value of $_, so there's no need to explicitly localize $_.
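
A minimal sketch of the topicalizer idiom, assuming a made-up sample string (not taken from the question). It also checks that an outer $_ is left alone once the block ends:

```perl
use strict;
use warnings;

# Hypothetical sample input, chosen only for illustration.
my $text = "<p>Hello   <b>world</b></p>";

$_ = "outer value";    # a pre-existing $_ that should survive the loop

for ($text) {
    s/<[^>]+>/ /g;     # replace each tag with a space
    s/\s+/ /g;         # collapse runs of whitespace
    s/^\s+|\s+$//g;    # trim both ends
}

print "$text\n";       # $text was modified in place through the alias
print "$_\n";          # still "outer value"
```

Because for() aliases $_ to $text rather than copying it, the substitutions change $text itself.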

At this point, you've eliminated almost every non-boilerplate character -- how much shorter can it get, even in theory?

That is, unless what you really want (as your problem #2 suggests) is improved modularity, e.g., the ability to iterate over, report on, or count all the regexes.

Problem #2

You can use the qr// syntax to quote the "search" part of the substitution:

my $search = qr/(<[^>]+>)/;
$str =~ s/$search/foo,$1,bar/;

However, I don't know of a way to quote the "replacement" part adequately. I had hoped that qr// would work for this too, but it doesn't. There are two alternatives worth considering:

1. Use eval() in your foreach loop. This would enable you to keep your current %rxcheck2 hash. Downside: you should always be concerned about safety with string eval()s.

2. Use an array of anonymous subroutines:

my @replacements = (
    sub { $_[0] =~ s/<[^>]+>/ /g; },
    sub { $_[0] =~ s/\s+/ /g; },
    sub { $_[0] =~ s/[\(\{\[]\d+[\(\{\[]/ /g; },
    sub { $_[0] =~ s/\s+[<>]+\s+/\. /g; },
    sub { $_[0] =~ s/\s+/ /g; },
    sub { $_[0] =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; },
    sub { $_[0] =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; }
);

# Assume your data is in $_
foreach my $repl (@replacements) {
    $repl->($_);
}

You could of course use a hash instead, with a more descriptive string as the key, and/or you could use multivalued elements (or hash values) that carry comments or other information alongside each substitution.
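
One way to sketch the "multivalued elements" idea: each rule is a small hash holding a comment, a compiled pattern, and a code reference that builds the replacement. Calling the code reference from the replacement side under /e lets it see $1 without a string eval. The rule notes, the sample string, and the apply_rules helper are illustrative assumptions, not part of the original code.

```perl
use strict;
use warnings;

# Ordered list of rules; an array keeps the order of application fixed.
my @rules = (
    { note => 'strip tags',          re => qr/<[^>]+>/, to => sub { ' ' } },
    { note => 'collapse whitespace', re => qr/\s+/,     to => sub { ' ' } },
    { note => 'close up word-quote', re => qr/(\w) "/,  to => sub { "$1\"" } },
);

# Hypothetical helper: applies every rule in order and returns the
# cleaned string plus a per-rule count of substitutions made.
sub apply_rules {
    my ($str, @list) = @_;
    my %count;
    for my $rule (@list) {
        my ($re, $to) = @{$rule}{qw(re to)};
        # /e runs $to->() as code for each match; $1 is visible inside it.
        my $n = ($str =~ s/$re/$to->()/ge);
        $count{ $rule->{note} } = $n || 0;
    }
    return ($str, \%count);
}

my ($clean, $count) = apply_rules('<p>the end "</p>', @rules);
print "[$clean]\n";
print "$_: $count->{$_}\n" for sort keys %$count;
```

Keeping the note next to each rule makes it possible to report which substitutions actually fired, which is the kind of bookkeeping a bare list of s/// statements cannot give you.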

Solution 2

You say you are dealing with HTML. You are now realizing that this is pretty much a losing battle with fleeting and fragile solutions.

A proper HTML parser would make your life easier. HTML::Parser can be hard to use, but there are other very useful libraries on CPAN that I can recommend if you can specify what you are trying to do rather than how.

Solution 3

Hashes are not a good fit because they are unordered. I find that an array of arrays, where each inner array holds a compiled regex and a string to eval (actually a double eval), works best:

#!/usr/bin/perl

use strict;
use warnings;

my @replace = (
    [ qr/(bar)/ => '"<$1>"' ],
    [ qr/foo/   => '"bar"'  ],
);

my $s = "foo bar baz foo bar baz";

for my $replace (@replace) {
    $s =~ s/$replace->[0]/$replace->[1]/gee;
}

print "$s\n";
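
To see why two e flags are needed here, a small sketch comparing the flag combinations on the same stored replacement string may help. The variable names are illustrative:

```perl
use strict;
use warnings;

my $re   = qr/(bar)/;
my $repl = '"<$1>"';    # single-quoted: Perl source text for a string

# No /e: the replacement is interpolated once, which inserts the
# contents of $repl as-is; the $1 inside it is never expanded.
(my $plain = "foo bar") =~ s/$re/$repl/;

# One /e: the replacement is evaluated as the expression $repl, whose
# value is again just the stored text.
(my $once = "foo bar") =~ s/$re/$repl/e;

# Two /e: the value from the first evaluation is itself eval'd as Perl
# code, so "<$1>" is finally interpolated with the captured text.
(my $twice = "foo bar") =~ s/$re/$repl/ee;

print "$plain\n";    # foo "<$1>"
print "$once\n";     # foo "<$1>"
print "$twice\n";    # foo <bar>
```

Since the second evaluation is a string eval, the stored replacement strings should come from trusted code, not from user input.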

I think j_random_hacker's second solution is vastly superior to mine. Individual subroutines give you the most flexibility and are an order of magnitude faster than my /ee solution:

bar <bar> baz bar <bar> baz
bar <bar> baz bar <bar> baz
         Rate refs subs
refs  10288/s   -- -91%
subs 111348/s 982%   --

Here is the code that produces those numbers:

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark;

my @subs = (
    sub { $_[0] =~ s/(bar)/<$1>/g },
    sub { $_[0] =~ s/foo/bar/g },
);

my @refs = (
    [ qr/(bar)/ => '"<$1>"' ],
    [ qr/foo/   => '"bar"'  ],
);

my %subs = (
    subs => sub {
        my $s = "foo bar baz foo bar baz";
        for my $sub (@subs) {
            $sub->($s);
        }
        return $s;
    },
    refs => sub {
        my $s = "foo bar baz foo bar baz";
        for my $ref (@refs) {
            $s =~ s/$ref->[0]/$ref->[1]/gee;
        }
        return $s;
    }
);

for my $sub (keys %subs) {
    print $subs{$sub}(), "\n";
}

Benchmark::cmpthese -1, \%subs;
Author by

Jeff

Updated on June 07, 2022

Comments

  • Jeff
    Jeff about 2 years

    So my Perl script basically takes a string and then tries to clean it up by doing multiple search and replaces on it, like so:

    $text =~ s/<[^>]+>/ /g;
    $text =~ s/\s+/ /g;
    $text =~ s/[\(\{\[]\d+[\(\{\[]/ /g;
    $text =~ s/\s+[<>]+\s+/\. /g;
    $text =~ s/\s+/ /g;
    $text =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; # replace . **** Begin or . #### Begin or ) *The 
    $text =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; # . (blah blah) S... => . S...
    

    As you can see, I'm dealing with nasty html and have to beat it into submission.

    I'm hoping there is a simpler, aesthetically appealing way to do this. I have about 50 lines that look just like what is above.

    I have solved one version of this problem by using a hash where the key is the comment and the value is the regular expression, like so:

    %rxcheck = (
        'time of day' => '\d+:\d+',
        'starts with capital letters then a capital word' => '^([A-Z]+\s)+[A-Z][a-z]',
        'ends with a single capital letter' => '\b[A-Z]\.',
    );
    

    And this is how I use it:

    foreach my $key (keys %rxcheck) {
        if ($snippet =~ /$rxcheck{$key}/g) { blah blah }
    }
    

    The problem comes up when I try my hand at a hash where the key is the expression and the value is what I want to replace it with... and there is a $1 or $2 in it.

    %rxcheck2 = (
        '(\w) \"' => '$1\"',
    );
    

    The above is to do this:

    $snippet =~ s/(\w) \"/$1\"/g;
    

    But I can't seem to pass the "$1" part into the regex literally (I think that's the right word... it seems the $1 is being interpreted even though I used ' marks.) So this results in:

    if($snippet =~ /$key/$rxcheck2{ $key }/g){  }
    

    And that doesn't work.

    So 2 questions:

    Easy: How do I handle large numbers of regexes in an easily editable way, so I can change and add them without just copying and pasting the line before?

    Harder: How do I handle them using a hash (or array if I have, say, multiple pieces I want to include, like 1) part to search, 2) replacement 3) comment, 4) global/case insensitive modifiers), if that is in fact the easiest way to do this?

    Thanks for your help -

  • Ankit Roy
    Ankit Roy about 15 years
    +1. Good point about hashes being unordered -- the order of applying search & replace operations can make a big difference. I'm confused why 2 "e" flags are needed -- wouldn't one be enough? Could you step me through it?
  • Chas. Owens
    Chas. Owens about 15 years
    /e and /ee are safer than string eval
  • Ankit Roy
    Ankit Roy about 15 years
    @Chas: Definitely prettier in this case, but how are they safer?
  • Chas. Owens
    Chas. Owens about 15 years
    Due to a bug in the flag evaluation portion of regexes people found that each extra e added another level of eval. This was found to be handy, so it got promoted to a feature. With /e the first replace becomes '<$1>', that is you see '<$1>' in $s. The second e then evals '<$1>' producing the desired '<bar>' replacement.
  • Chas. Owens
    Chas. Owens about 15 years
    But I like the subroutine version.
  • Chas. Owens
    Chas. Owens about 15 years
    Hmm, I know /e is safer because it is more like eval {} than eval "", but /ee may not be safer, but I can't remember why.
  • Drew Stephens
    Drew Stephens about 15 years
    You can use Tie::DxHash to maintain insertion order: search.cpan.org/~kruscoe/Tie-DxHash-1.05/lib/Tie/DxHash.pm
  • Chas. Owens
    Chas. Owens about 15 years
    @dinomite Yes, but at the loss of performance with no real gain in readability. This isn't really a job for a hash (keys are not randomly accessed, there is no need for unique keys, the data is not unordered, etc). An array of coderefs seems to be the best solution.
  • Chas. Owens
    Chas. Owens about 15 years
    Good point, I was answering the general question of how to run multiple regexes against a string in a maintainable way, but the specific question is about running a regex on HTML, which is a no-no. See stackoverflow.com/questions/701166 for why and stackoverflow.com/questions/773340 for examples on how to use HTML parsers.
  • Ankit Roy
    Ankit Roy about 15 years
    @Chas: Thanks, but I'm wondering why you could/would not just say qr/(bar)/ => '<$1>' and then use a single /e. (I'm aware of /ee, /eee etc... so far I haven't found cause to use them but I'm on the lookout :))
  • Chas. Owens
    Chas. Owens about 15 years
    @j_random_hacker because /e is evaluating $ref->[1] not the contents of $ref->[1]. The double quoted string nature of the replace is removed when you say /e.
  • brian d foy
    brian d foy about 15 years
    /e is just a string eval. /ee is the same thing, but you take the result of the first /e and do it again. There isn't a safety feature by adding or subtracting an /e.
  • brian d foy
    brian d foy about 15 years
    HTML::Parser is often too much work for the nastiness of some data sources. If you can do a bunch of quick substitutions to regularize the input, you can make things easier down the road. This isn't a question about parsing HTML, but cleaning up dirty data.
  • Chas. Owens
    Chas. Owens about 15 years
    @j_random_hacker $ref->[1] is interpolated when there is no /e, but when /e is in effect there is no interpolation step.
  • Ankit Roy
    Ankit Roy about 15 years
    @Chas: I think I've finally got it -- /e implies no interpolation (like single quotes). Thanks for your patience :)
  • Ankit Roy
    Ankit Roy about 15 years
    I really like John Siracusa's edit, suggesting using "for ($mystr) { ... }" as a way to "topicalise" -- neat!
  • Olivier Dulac
    Olivier Dulac about 11 years
    @Chas.Owens: +1 for the very interesting (and quite generic) way to time and try different approaches. But in general, what is, for you, the most efficient way to do many search/replace operations in Perl? (Maybe neither of those two, since both need to call subs, which I believe adds overhead?) I'm writing a "colorizer" which looks for various simple-to-complex strings and adds ANSI color codes before and after each (or sometimes portions of them)... And it's sloooow when there are many search/replace operations or when the files to colorize get close to several megabytes...