Replace multiple strings in a single pass
Solution 1
OK, a general solution. The following bash function requires 2k
arguments; each pair consists of a placeholder and a replacement. It's up to you to quote the strings appropriately to pass them into the function. If the number of arguments is odd, an implicit empty argument will be added, which will effectively delete occurrences of the last placeholder.
Neither placeholders nor replacements may contain NUL characters, but you may use standard C \
-escapes such as \0
if you need NUL
s (and consequently you are required to write \\
if you want a \
).
It requires the standard build tools which should be present on a posix-like system (lex and cc).
replaceholder() {
local dir=$(mktemp -d)
( cd "$dir"
{ printf %s\\n "%option 8bit noyywrap nounput" "%%"
printf '"%s" {fputs("%s", yyout);}\n' "${@//\"/\\\"}"
printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
} | lex && cc lex.yy.c
) && "$dir"/a.out
rm -fR "$dir"
}
We assume that \
is already escaped if necessary in the arguments
but we need to escape double quotes, if present. That's what the
second argument to the second printf does. Since the lex
default action is ECHO
, we don't need to worry about it.
Example run (with timings for the skeptical; it's just a cheap-o commodity laptop):
$ time echo AB | replaceholder A B B A
BA
real 0m0.128s
user 0m0.106s
sys 0m0.042s
$ time printf %s\\n AB{0000..9999} | replaceholder A B B A > /dev/null
real 0m0.118s
user 0m0.117s
sys 0m0.043s
For larger inputs it might be useful to provide an optimization flag to cc
, and for current Posix compatibility, it would be better to use c99
. An even more ambitious implementation might try to cache the generated executables instead of generating them each time, but they're not exactly expensive to generate.
Edit
If you have tcc, you can avoid the hassle of creating a temporary directory, and enjoy the faster compile time which will help on normal sized inputs:
treplaceholder () {
tcc -run <(
{
printf %s\\n "%option 8bit noyywrap nounput" "%%"
printf '"%s" {fputs("%s", yyout);}\n' "${@//\"/\\\"}"
printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
} | lex -t)
}
$ time printf %s\\n AB{0000..9999} | treplaceholder A B B A > /dev/null
real 0m0.039s
user 0m0.041s
sys 0m0.031s
Solution 2
printf 'STRING1STRING1\n\nSTRING2STRING1\nSTRING2\n' |
od -A n -t c -v -w1 |
sed 's/ \{1,3\}//;s/\\$/&&/;H;s/.*//;x
/\nS\nT\nR\nI\nN\nG\n1/s//STRING2/
/\nS\nT\nR\nI\nN\nG\n2/s//STRING1/
/\\n/!{x;d};s/\n//g;s/./\\&/g' |
xargs printf %b
###OUTPUT###
STRING2STRING2
STRING1STRING2
STRING1
Something like this will always replace each occurrence of your target strings only once as they occur in sed
's in stream at one bite per line. This is the fastest way I can imagine you'd do it. Then again, I don't write C. But this does reliably handle null delimiters if you wish it. See this answer for how it works. This has no problems with any contained special shell characters or similar - but it is ASCII locale specific, or, in other words, od
will not output multi-byte characters on the same line and will only do one per. If this is a problem you'll want to add in iconv
.
Related videos on Youtube
Ambroz Bizjak
Updated on September 18, 2022Comments
-
Ambroz Bizjak over 1 year
I'm looking for a way to replace placeholder strings in a template file with concrete values, with common Unix tools (bash, sed, awk, maybe perl). It is important that the replacement is done in a single pass, that is, what is already scanned/replaced must not be considered for another replacement. For example, these two attempts fail:
echo "AB" | awk '{gsub("A","B");gsub("B","A");print}' >> AA echo "AB" | sed 's/A/B/g;s/B/A/g' >> AA
The correct result in this case is of course BA.
In general, the solution should be equivalent to scanning the input left-to-right for a longest match to one of the given replacement strings, and for each match, performing a replacement and continuing from that point on in the input (none of the already read input nor the replacements performed should be considered for matches). Actually, the details don't matter, just that the results of the replacement are never considered for another replacement, in whole or in part.
NOTE I am only looking for correct generic solutions. Please do not propose solutions which fail for certain inputs (input files, search and replace pairs), however unlikely they may seem.
-
Admin almost 10 yearsI assume they're longer than one character? For this you could use
tr AB BA
. -
Admin almost 10 yearsYes, they're longer than one character. Assume that both the match and replacement strings are arbitrary strings.
-
Admin almost 10 years
echo "AB"|sed 's/A/!!TMP!!/g;s/B/A/g;s/!!TMP!!/B/g'
-
Admin almost 10 years@bersch It's not correct with respect to all possible inputs.
-
Admin almost 10 yearsIn that case you need to make your question more precise to make clear how general the patterns and inputs can be.
-
Admin almost 10 yearsAnd frankly, I wouldn't be surprised if someone considered your note a bit rude.
-
Admin almost 10 years@peterph I was assuming that unless stated otherwise solutions need to be generic and correct. For some reason unknown to me people in Unix circles like half-solutions a lot.
-
Admin almost 10 yearsWell, you're not going to make it any better like this... Back to the point: specify how general outputs and inputs can be.
-
Admin almost 10 yearsI'm afraid you'll need to do it exactly as you are describing it - parse from the beginning and replace as you go - i.e. not with regular expressions.
-
Admin almost 10 yearsAny chance a mail merge could do the job?
-
Admin almost 10 years@AmbrozBizjak: there is probably no better solution with
sed
. BTW: you can parse the string with a simple for loop. Nobody answered so far, perhaps because it is really not a problem. -
Admin almost 10 years@goldilocks - there are lots of little tools that can do it combination. You probably could do it with just
sed
if you used a couple instances, thel
function, and a couple of pipes between them. But I do it below withod
andsed
- well, andxargs
andprintf
... In fairness, though, it is ASCII specific.dc
could do this as well, I think, but with the same limitation. -
Admin almost 10 yearsregarding @goldilocks' comment - obligatory reference to the canonical SO question is in place.
-
-
Ambroz Bizjak almost 10 yearsI'm not sure if this is a joke or not ;)
-
rici almost 10 years@ambrozbizjak: It works, it's quick for large inputs and acceptably fast for small inputs. It might not use the tools you were thinking of but they are standard tools. Why would it be a joke?
-
goldilocks almost 10 years+1 For not being a joke! :D
-
goldilocks almost 10 years+1 Why do you say it only replaces "the earliest occurrence of your target strings"? In the output it looks as if it replaces all of them. I'm not asking to see it, but could this be done this way without hardcoding the values?
-
mikeserv almost 10 yearsThat would be POSIX portable like
fn() { tcc ; } <<CODE\n$(gen code)\nCODE\n
. Can I ask though - this an awesome answer and I upvoted it as soon as I read it - but I don't understand what's happening to the shell array? What does"${@//\"/\\\"}"
this do? -
rici almost 10 years@mikeserv: «For each argument as a quoted value ("$@"), replace all (//) occurrences of a quote (\") with (/) a backslash (\\) followed by a quote (\")». See Parameter expansion in the bash manual.
-
mikeserv almost 10 years@goldilocks - Yes - but only as soon as they occur. Maybe I should reword that. And yeah - you could just add a middle
sed
in and save up to a null or something then have thatsed
write this one's script; or put it in a shell function and give it values at one bite per line like"/$1/"
..."/$2/"
- maybe I'll write those functions too... -
rici almost 10 yearsThis doesn't seem to work in the case where the placeholders are
PLACE1
,PLACE2
andPLA
.PLA
always wins. OP says: "equivalent to scanning the input left-to-right for a longest match to one of the given replacement strings" (emphasis added) -
mikeserv almost 10 yearsI do generally understand parameter expansion - just not too familiar with
bash
isms. I didn't think you could do any direct expansions on the shell array itself. At least the spec calls its behavior unspecified. -
mikeserv almost 10 years@rici - thanks. Then I will have to do the null delimiters. Back in a flash.
-
rici almost 10 years@mikeserv. You could call it bashism, I guess. It also works with ksh and zsh, so it's pretty widespread. I don't know which one had it first.
-
mikeserv almost 10 yearsI think you're right -
ksh
I think. I'm bigot, I guess. Sorry. I'm just not very familiar with it. -
mikeserv almost 10 years@rici - I was just about to post another version, that will handle what you describe, but looking at it again and I don't think I should. He says longest for one of the given replacement strings. This does that. There's no indication that one string is a subset of another, only that the replaced value may be. I also don't think iterating over a list is a valid way to solve the problem. Given the problem as I understand it, this is a working solution.