How can I efficiently handle multiple Perl search/replace operations on the same string?
Solution 1
Problem #1
As there doesn't appear to be much structure shared by the individual regexes, there's not really a simpler or clearer way than just listing the commands as you have done. One common approach to decreasing repetition in code like this is to move $text into $_, so that instead of having to say:
$text =~ s/foo/bar/g;
You can just say:
s/foo/bar/g;
A common idiom for doing this is to use a degenerate for() loop as a topicalizer:
for($text)
{
s/foo/bar/g;
s/qux/meh/g;
...
}
The scope of this block will preserve any preexisting value of $_, so there's no need to explicitly localize $_.
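As a minimal sketch of the topicalizer idiom (the sample string and the outer value of $_ are made up for illustration), the following shows that for() aliases $_ to $text, so the substitutions modify $text in place, and that the earlier value of $_ is back once the block ends:

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only to show the effect.
$_ = "outer value";
my $text = "foo  <b>qux</b>   foo";

for ($text) {
    s/<[^>]+>/ /g;    # strip tags; $_ is an alias for $text here
    s/\s+/ /g;        # collapse runs of whitespace
    s/foo/bar/g;
}

print "$text\n";      # prints "bar qux bar": $text itself was changed
print "$_\n";         # prints "outer value": $_ was restored after the loop
```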
At this point, you've eliminated almost every non-boilerplate character -- how much shorter can it get, even in theory?
That is, unless what you really want (as your problem #2 suggests) is improved modularity, e.g., the ability to iterate over, report on, or count all the regexes.
Problem #2
You can use the qr// syntax to quote the "search" part of the substitution:
my $search = qr/(<[^>]+>)/;
$str =~ s/$search/foo,$1,bar/;
However, I don't know of a way of quoting the "replacement" part adequately. I had hoped that qr// would work for this too, but it doesn't. There are two alternatives worth considering:
1. Use eval() in your foreach loop. This would enable you to keep your current %rxcheck2 hash. Downside: you should always be concerned about safety with string eval()s.
2. Use an array of anonymous subroutines:
my @replacements = (
sub { $_[0] =~ s/<[^>]+>/ /g; },
sub { $_[0] =~ s/\s+/ /g; },
sub { $_[0] =~ s/[\(\{\[]\d+[\(\{\[]/ /g; },
sub { $_[0] =~ s/\s+[<>]+\s+/\. /g },
sub { $_[0] =~ s/\s+/ /g; },
sub { $_[0] =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; },
sub { $_[0] =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; }
);
# Assume your data is in $_
foreach my $repl (@replacements) {
&{$repl}($_);
}
You could of course use a hash instead, with a more descriptive key for each substitution, and/or you could use multivalued elements (or hash values) that include comments or other information.
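The "multivalued elements" idea can be sketched as an array of small records, each pairing a comment with its substitution. This is only one possible layout: the field names (comment, code), the sample string, and the %hits counter are assumptions made for this example. An array is used so the substitutions run in a fixed order.

```perl
use strict;
use warnings;

# Each entry pairs a human-readable comment with the substitution it describes.
# The field names "comment" and "code" are hypothetical, chosen for this sketch.
my @replacements = (
    { comment => 'strip tags',            code => sub { $_[0] =~ s/<[^>]+>/ /g } },
    { comment => 'collapse whitespace',   code => sub { $_[0] =~ s/\s+/ /g } },
    { comment => 'close up before quote', code => sub { $_[0] =~ s/(\w) "/$1"/g } },
);

my $snippet = qq{<p>foo   bar "</p>};   # made-up input
my %hits;

for my $r (@replacements) {
    # s///g returns the number of substitutions made (false if none),
    # and that becomes the return value of the anonymous sub.
    my $count = $r->{code}->($snippet);
    $hits{ $r->{comment} } = $count || 0;
}

print "[$snippet]\n";
print "$_->{comment}: $hits{ $_->{comment} }\n" for @replacements;
```

Because @_ aliases the caller's arguments, each sub edits $snippet directly, and the counts give a simple report of which rules fired.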
Solution 2
You say you are dealing with HTML. You are now realizing that this is pretty much a losing battle with fleeting and fragile solutions.
A proper HTML parser would make your life easier. HTML::Parser can be hard to use, but there are other very useful libraries on CPAN which I can recommend if you can specify what you are trying to do rather than how.
Solution 3
Hashes are not good because they are unordered. I find that an array of arrays, where each inner array contains a compiled regex and a string to eval (actually it is a double eval), works best:
#!/usr/bin/perl
use strict;
use warnings;
my @replace = (
[ qr/(bar)/ => '"<$1>"' ],
[ qr/foo/ => '"bar"' ],
);
my $s = "foo bar baz foo bar baz";
for my $replace (@replace) {
$s =~ s/$replace->[0]/$replace->[1]/gee;
}
print "$s\n";
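To see why two e flags are needed when the replacement is stored in a variable, here is a small sketch comparing no flag, /e, and /ee on the same input. The variable names ($replacement, $plain, $e, $ee) are made up for this illustration.

```perl
use strict;
use warnings;

# The stored replacement is Perl source text for a double-quoted string.
my $replacement = '"<$1>"';

# No /e: $replacement is interpolated once; the $1 inside its value is not expanded.
(my $plain = "foo bar") =~ s/(bar)/$replacement/;

# /e: the replacement is the expression $replacement, whose value is still that text.
(my $e = "foo bar") =~ s/(bar)/$replacement/e;

# /ee: the value from the first step is itself eval'ed as Perl code, expanding $1.
(my $ee = "foo bar") =~ s/(bar)/$replacement/ee;

print "$plain\n";   # foo "<$1>"
print "$e\n";       # foo "<$1>"
print "$ee\n";      # foo <bar>
```

The second evaluation is a string eval of whatever the variable holds, so replacement strings used this way should come from a trusted source.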
I think j_random_hacker's second solution is vastly superior to mine. Individual subroutines give you the most flexibility and are an order of magnitude faster than my /ee solution:
bar <bar> baz bar <bar> baz
bar <bar> baz bar <bar> baz
         Rate  refs  subs
refs  10288/s    --  -91%
subs 111348/s  982%    --
Here is the code that produces those numbers:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;
my @subs = (
sub { $_[0] =~ s/(bar)/<$1>/g },
sub { $_[0] =~ s/foo/bar/g },
);
my @refs = (
[ qr/(bar)/ => '"<$1>"' ],
[ qr/foo/ => '"bar"' ],
);
my %subs = (
subs => sub {
my $s = "foo bar baz foo bar baz";
for my $sub (@subs) {
$sub->($s);
}
return $s;
},
refs => sub {
my $s = "foo bar baz foo bar baz";
for my $ref (@refs) {
$s =~ s/$ref->[0]/$ref->[1]/gee;
}
return $s;
}
);
for my $sub (keys %subs) {
print $subs{$sub}(), "\n";
}
Benchmark::cmpthese -1, \%subs;
Jeff
Updated on June 07, 2022
Comments
-
Jeff about 2 years
So my Perl script basically takes a string and then tries to clean it up by doing multiple search and replaces on it, like so:
$text =~ s/<[^>]+>/ /g;
$text =~ s/\s+/ /g;
$text =~ s/[\(\{\[]\d+[\(\{\[]/ /g;
$text =~ s/\s+[<>]+\s+/\. /g;
$text =~ s/\s+/ /g;
$text =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; # replace . **** Begin or . #### Begin or ) *The
$text =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; # . (blah blah) S... => . S...
As you can see, I'm dealing with nasty html and have to beat it into submission.
I'm hoping there is a simpler, aesthetically appealing way to do this. I have about 50 lines that look just like what is above.
I have solved one version of this problem by using a hash where the key is the comment and the value is the regular expression, like so:
%rxcheck = (
'time of day'=>'\d+:\d+',
'starts with capital letters then a capital word'=>'^([A-Z]+\s)+[A-Z][a-z]',
'ends with a single capital letter'=>'\b[A-Z]\.'
);
And this is how I use it:
foreach my $key (keys %rxcheck) {
if($snippet =~ /$rxcheck{ $key }/g){ blah blah }
}
The problem comes up when I try my hand at a hash where the key is the expression and it points to what I want to replace it with... and there is a $1 or $2 in it.
%rxcheck2 = ( '(\w) \"'=>'$1\"' );
The above is to do this:
$snippet =~ s/(\w) \"/$1\"/g;
But I can't seem to pass the "$1" part into the regex literally (I think that's the right word... it seems the $1 is being interpreted even though I used ' marks.) So this results in:
if($snippet =~ /$key/$rxcheck2{ $key }/g){ }
And that doesn't work.
So 2 questions:
Easy: How do I handle large numbers of regexes in an easily editable way so I can change and add them without just cutting and pasting the line before?
Harder: How do I handle them using a hash (or array if I have, say, multiple pieces I want to include, like 1) part to search, 2) replacement 3) comment, 4) global/case insensitive modifiers), if that is in fact the easiest way to do this?
Thanks for your help -
-
Ankit Roy about 15 years: +1. Good point about hashes being unordered -- the order of applying search & replace operations can make a big difference. I'm confused why 2 "e" flags are needed -- wouldn't one be enough? Could you step me through it?
-
Chas. Owens about 15 years: /e and /ee are safer than string eval
-
Ankit Roy about 15 years: @Chas: Definitely prettier in this case, but how are they safer?
-
Chas. Owens about 15 years: Due to a bug in the flag evaluation portion of regexes, people found that each extra e added another level of eval. This was found to be handy, so it got promoted to a feature. With /e the first replace becomes '<$1>', that is, you see '<$1>' in $s. The second e then evals '<$1>', producing the desired '<bar>' replacement.
-
Chas. Owens about 15 years: But I like the subroutine version.
-
Chas. Owens about 15 years: Hmm, I know /e is safer because it is more like eval {} than eval "", but /ee may not be safer, but I can't remember why.
-
Drew Stephens about 15 years: You can use Tie::DxHash to maintain insertion order: search.cpan.org/~kruscoe/Tie-DxHash-1.05/lib/Tie/DxHash.pm
-
Chas. Owens about 15 years: @dinomite Yes, but at the loss of performance with no real gain in readability. This isn't really a job for a hash (keys are not randomly accessed, there is no need for unique keys, the data is not unordered, etc). An array of coderefs seems to be the best solution.
-
Chas. Owens about 15 years: Good point, I was answering the general question of how to run multiple regexes against a string in a maintainable way, but the specific question is about running a regex on HTML, which is a no-no. See stackoverflow.com/questions/701166 for why and stackoverflow.com/questions/773340 for examples on how to use HTML parsers.
-
Ankit Roy about 15 years: @Chas: Thanks, but I'm wondering why you could/would not just say qr/(bar)/ => '<$1>' and then use a single /e. (I'm aware of /ee, /eee etc... so far I haven't found cause to use them but I'm on the lookout :))
-
Chas. Owens about 15 years: @j_random_hacker because /e is evaluating $ref->[1] not the contents of $ref->[1]. The double quoted string nature of the replace is removed when you say /e.
-
brian d foy about 15 years: /e is just a string eval. /ee is the same thing, but you take the result of the first /e and do it again. There isn't a safety feature by adding or subtracting an /e.
-
brian d foy about 15 years: HTML::Parser is often too much work for the nastiness of some data sources. If you can do a bunch of quick substitutions to regularize the input, you can make things easier down the road. This isn't a question about parsing HTML, but cleaning up dirty data.
-
Chas. Owens about 15 years: @j_random_hacker $ref->[1] is interpolated when there is no /e, but when /e is in effect there is no interpolation step.
-
Ankit Roy about 15 years: @Chas: I think I've finally got it -- /e implies no interpolation (like single quotes). Thanks for your patience :)
-
Ankit Roy about 15 years: I really like John Siracusa's edit, suggesting using "for ($mystr) { ... }" as a way to "topicalise" -- neat!
-
Olivier Dulac about 11 years: @Chas.Owens: +1 for the very interesting (and quite generic) way to time and try different ways. But in general, what is, for you, the most efficient way (and I mean, maybe not any of those 2, as those need to call subs, which I believe adds overhead?) to do many search/replace operations in Perl? I'm writing a "colorizer" which looks for various simple-to-complex strings and adds ANSI color codes before and after each (or sometimes portions of them)... And it's sloooow when there are many search/replace operations or when the files to colorize get close to several megabytes...