How can I efficiently handle multiple Perl search/replace operations on the same string?

regex perl hash substitution

13,851

Solution 1

Problem #1

As there doesn't appear to be much structure shared by the individual regexes, there's not really a simpler or clearer way than just listing the commands as you have done. One common approach to decreasing repetition in code like this is to move $text into $_, so that instead of having to say:

$text =~ s/foo/bar/g;

You can just say:

s/foo/bar/g;

A common idiom for doing this is to use a degenerate for() loop as a topicalizer:

for($text)
{
  s/foo/bar/g;
  s/qux/meh/g;
  ...
}

The scope of this block will preserve any preexisting value of $_, so there's no need to explicitly localize $_.
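To make the aliasing concrete, here is a minimal sketch. The sample string and the "outer value" in $_ are made up for illustration. It shows that the substitutions land on $text itself and that $_ has its earlier value again after the block:

```perl
use strict;
use warnings;

# Hypothetical sample data, only for illustration.
my $text = "foo  and   qux";

$_ = "outer value";    # pretend something else is already using $_

for ($text) {          # $_ is an alias for $text inside this block
    s/foo/bar/g;
    s/qux/meh/g;
    s/\s+/ /g;
}

print "$text\n";       # the substitutions were applied to $text itself
print "$_\n";          # $_ holds its earlier value again
```

Because $_ is an alias rather than a copy, nothing has to be assigned back to $text afterwards.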

At this point, you've eliminated almost every non-boilerplate character -- how much shorter can it get, even in theory?

That changes only if what you really want (as your problem #2 suggests) is improved modularity, e.g., the ability to iterate over, report on, or count all the regexes.

Problem #2

You can use the qr// syntax to quote the "search" part of the substitution:

my $search = qr/(<[^>]+>)/;
$str =~ s/$search/foo,$1,bar/;

However I don't know of a way of quoting the "replacement" part adequately. I had hoped that qr// would work for this too, but it doesn't. There are two alternatives worth considering:

1. Use eval() in your foreach loop. This would enable you to keep your current %rxcheck2 hash. Downside: you should always be concerned about safety with string eval()s.

2. Use an array of anonymous subroutines:

my @replacements = (
    sub { $_[0] =~ s/<[^>]+>/ /g; },
    sub { $_[0] =~ s/\s+/ /g; },
    sub { $_[0] =~ s/[\(\{\[]\d+[\(\{\[]/ /g; },
    sub { $_[0] =~ s/\s+[<>]+\s+/\. /g },
    sub { $_[0] =~ s/\s+/ /g; },
    sub { $_[0] =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; },
    sub { $_[0] =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; }
);

# Assume your data is in $_
foreach my $repl (@replacements) {
    $repl->($_);
}

You could of course use a hash instead, with something more descriptive as the key, and/or store multivalued elements (or hash values) that carry comments or other information.
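One possible shape for such multivalued elements is an array of small records, each with a comment, a compiled pattern, and a closure that builds the replacement. This is only a sketch: the field names (comment, search, replace), the %hits counter, and the sample $snippet are assumptions made for illustration, not anything from the question. The capture is passed to the closure explicitly, so no string eval is involved:

```perl
use strict;
use warnings;

# Each rule is a hypothetical record: a comment, a compiled pattern,
# and a closure that receives $1 and returns the replacement text.
my @rules = (
    {
        comment => 'strip tags',
        search  => qr/<[^>]+>/,
        replace => sub { ' ' },
    },
    {
        comment => 'drop the space between a word character and a quote',
        search  => qr/(\w) "/,
        replace => sub { my ($word_char) = @_; $word_char . '"' },
    },
    {
        comment => 'collapse runs of whitespace',
        search  => qr/\s+/,
        replace => sub { ' ' },
    },
);

my $snippet = 'a 5 " pipe<br>  and more';   # made-up input
my %hits;

for my $rule (@rules) {
    # With /g, s/// returns the number of substitutions (false if none).
    my $n = $snippet =~ s/$rule->{search}/$rule->{replace}->($1)/ge;
    $hits{ $rule->{comment} } = $n || 0;
}

print "$snippet\n";
print "$_: $hits{$_}\n" for map { $_->{comment} } @rules;
```

Keeping the rules in an array means they are applied in the order written, and the comment field doubles as a label for reporting how often each rule fired.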

Solution 2

You say you are dealing with HTML. You are now realizing that this is pretty much a losing battle with fleeting and fragile solutions.

A proper HTML parser would make your life easier. HTML::Parser can be hard to use, but there are other very useful libraries on CPAN which I can recommend if you can specify what you are trying to do rather than how.

Solution 3

Hashes are not good because they are unordered. I find that an array of arrays, where each inner array holds a compiled regex and a string to eval (actually a double eval), works best:

#!/usr/bin/perl

use strict;
use warnings;

my @replace = (
    [ qr/(bar)/ => '"<$1>"' ],
    [ qr/foo/   => '"bar"'  ],
);

my $s = "foo bar baz foo bar baz";

for my $replace (@replace) {
    $s =~ s/$replace->[0]/$replace->[1]/gee;
}

print "$s\n";
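To see why two e flags are used here, the following sketch runs the same stored replacement with no /e, one /e, and two. The variable names ($template, $plain, $once, $twice) are made up for illustration. Since /ee evaluates the stored string as Perl code, it should only be used with replacement strings you wrote yourself:

```perl
use strict;
use warnings;

my $template = '"<$1>"';   # single-quoted: a literal string of Perl source

# No /e: the replacement is interpolated once, so the text of $template
# is inserted as-is, quotes and "$1" included.
(my $plain = 'foo bar') =~ s/(bar)/$template/;

# One /e: the replacement is evaluated as Perl code. That code is just the
# variable $template, whose value is still the literal string.
(my $once = 'foo bar') =~ s/(bar)/$template/e;

# Two /e: the result of the first evaluation is evaluated again as Perl
# code, so "<$1>" is finally interpolated with the captured text.
(my $twice = 'foo bar') =~ s/(bar)/$template/ee;

print "$plain\n$once\n$twice\n";
```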

I think j_random_hacker's second solution is vastly superior to mine. Individual subroutines give you the most flexibility and are an order of magnitude faster than my /ee solution:

bar <bar> baz bar <bar> baz
bar <bar> baz bar <bar> baz
         Rate refs subs
refs  10288/s   -- -91%
subs 111348/s 982%   --

Here is the code that produces those numbers:

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark;

my @subs = (
    sub { $_[0] =~ s/(bar)/<$1>/g },
    sub { $_[0] =~ s/foo/bar/g },
);

my @refs = (
    [ qr/(bar)/ => '"<$1>"' ],
    [ qr/foo/   => '"bar"'  ],
);

my %subs = (
    subs => sub {
        my $s = "foo bar baz foo bar baz";
        for my $sub (@subs) {
            $sub->($s);
        }
        return $s;
    },
    refs => sub {
        my $s = "foo bar baz foo bar baz";
        for my $ref (@refs) {
            $s =~ s/$ref->[0]/$ref->[1]/gee;
        }
        return $s;
    }
);

for my $sub (keys %subs) {
    print $subs{$sub}(), "\n";
}

Benchmark::cmpthese -1, \%subs;


Author by

Jeff

Updated on June 07, 2022

Comments

Jeff about 2 years
So my Perl script basically takes a string and then tries to clean it up by doing multiple search and replaces on it, like so:
```
$text =~ s/<[^>]+>/ /g;
$text =~ s/\s+/ /g;
$text =~ s/[\(\{\[]\d+[\(\{\[]/ /g;
$text =~ s/\s+[<>]+\s+/\. /g;
$text =~ s/\s+/ /g;
$text =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; # replace . **** Begin or . #### Begin or ) *The 
$text =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; # . (blah blah) S... => . S...
```
As you can see, I'm dealing with nasty html and have to beat it into submission.

I'm hoping there is a simpler, aesthetically appealing way to do this. I have about 50 lines that look just like what is above.

I have solved one version of this problem by using a hash where the key is the comment and the value is the regular expression, like so:
```
%rxcheck = (
    'time of day' => '\d+:\d+',
    'starts with capital letters then a capital word' => '^([A-Z]+\s)+[A-Z][a-z]',
    'ends with a single capital letter' => '\b[A-Z]\.'
);
```
And this is how I use it:
```
foreach my $key (keys %rxcheck) {
    if ($snippet =~ /$rxcheck{$key}/g) { blah blah }
}
```
The problem comes up when I try my hand at a hash where the key is the expression and it points to what I want to replace it with... and there is a $1 or $2 in it.
```
%rxcheck2 = (
    '(\w) \"' => '$1\"'
);
```
The above is to do this:
```
$snippet =~ s/(\w) \"/$1\"/g;
```
But I can't seem to pass the "$1" part into the regex literally (I think that's the right word... it seems the $1 is being interpreted even though I used ' marks.) So this results in:
```
if($snippet =~ /$key/$rxcheck2{ $key }/g){  }
```
And that doesn't work.

So 2 questions:

Easy: How do I handle large numbers of regex's in an easily editable way so I can change and add them without just cut and pasting the line before?

Harder: How do I handle them using a hash (or array if I have, say, multiple pieces I want to include, like 1) part to search, 2) replacement 3) comment, 4) global/case insensitive modifiers), if that is in fact the easiest way to do this?

Thanks for your help -
Ankit Roy about 15 years

+1. Good point about hashes being unordered -- the order of applying search & replace operations can make a big difference. I'm confused why 2 "e" flags are needed -- wouldn't one be enough? Could you step me through it?
Chas. Owens about 15 years

/e and /ee are safer than string eval
Ankit Roy about 15 years

@Chas: Definitely prettier in this case, but how are they safer?
Chas. Owens about 15 years

Due to a bug in the flag evaluation portion of regexes people found that each extra e added another level of eval. This was found to be handy, so it got promoted to a feature. With /e the first replace becomes '<$1>', that is you see '<$1>' in $s. The second e then evals '<$1>' producing the desired '<bar>' replacement.
Chas. Owens about 15 years

But I like the subroutine version.
Chas. Owens about 15 years

Hmm, I know /e is safer because it is more like eval {} than eval "", but /ee may not be safer, but I can't remember why.
Drew Stephens about 15 years

You can use Tie::DxHash to maintain insertion order: search.cpan.org/~kruscoe/Tie-DxHash-1.05/lib/Tie/DxHash.pm
Chas. Owens about 15 years

@dinomite Yes, but at the loss of performance with no real gain in readability. This isn't really a job for a hash (keys are not randomly accessed, there is no need for unique keys, the data is not unordered, etc). An array of coderefs seems to be the best solution.
Chas. Owens about 15 years

Good point. I was answering the general question of how to run multiple regexes against a string in a maintainable way, but the specific question is about running a regex on HTML, which is a no-no. See stackoverflow.com/questions/701166 for why and stackoverflow.com/questions/773340 for examples on how to use HTML parsers.
Ankit Roy about 15 years

@Chas: Thanks, but I'm wondering why you could/would not just say qr/(bar)/ => '<$1>' and then use a single /e. (I'm aware of /ee, /eee etc... so far I haven't found cause to use them but I'm on the lookout :))
Chas. Owens about 15 years

@j_random_hacker because /e is evaluating $ref->[1] not the contents of $ref->[1]. The double quoted string nature of the replace is removed when you say /e.
brian d foy about 15 years

/e is just a string eval. /ee is the same thing, but you take the result of the first /e and do it again. There isn't a safety feature by adding or subtracting an /e.
brian d foy about 15 years

HTML::Parser is often too much work for the nastiness of some data sources. If you can do a bunch of quick substitutions to regularize the input, you can make things easier down the road. This isn't a question about parsing HTML, but cleaning up dirty data.
Chas. Owens about 15 years

@j_random_hacker $ref->[1] is interpolated when there is no /e, but when /e is in effect there is no interpolation step.
Ankit Roy about 15 years

@Chas: I think I've finally got it -- /e implies no interpolation (like single quotes). Thanks for your patience :)
Ankit Roy about 15 years

I really like John Siracusa's edit, suggesting using "for ($mystr) { ... }" as a way to "topicalise" -- neat!
Olivier Dulac about 11 years

@Chas.Owens: +1 for the very interesting (and quite generic) way to time and try different ways. But in general, what is, for you, the most efficient way (and I mean, maybe not any of those 2, as those need to call subs, which I believe adds overhead?) to do many search/replace in Perl? I'm writing a "colorizer" which looks for various simple-to-complex strings and adds ANSI color codes before and after each (or sometimes portions of them)... And it's sloooow when there are many search/replace or when the files to colorize get close to several megabytes...