Replace multiple strings in a single pass

20,774

Solution 1

OK, a general solution. The following bash function requires 2k arguments; each pair consists of a placeholder and a replacement. It's up to you to quote the strings appropriately to pass them into the function. If the number of arguments is odd, an implicit empty argument will be added, which will effectively delete occurrences of the last placeholder.

Neither placeholders nor replacements may contain NUL characters, but you may use standard C \-escapes such as \0 if you need NULs (and consequently you are required to write \\ if you want a \).

It requires the standard build tools which should be present on a posix-like system (lex and cc).

replaceholder() {
  local dir=$(mktemp -d)
  ( cd "$dir"
    { printf %s\\n "%option 8bit noyywrap nounput" "%%"
      printf '"%s" {fputs("%s", yyout);}\n' "${@//\"/\\\"}"
      printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
    } | lex && cc lex.yy.c
  ) && "$dir"/a.out
  rm -fR "$dir"
}

We assume that \ is already escaped if necessary in the arguments but we need to escape double quotes, if present. That's what the second argument to the second printf does. Since the lex default action is ECHO, we don't need to worry about it.

Example run (with timings for the skeptical; it's just a cheap-o commodity laptop):

$ time echo AB | replaceholder A B B A
BA

real    0m0.128s
user    0m0.106s
sys     0m0.042s
$ time printf %s\\n AB{0000..9999} | replaceholder A B B A > /dev/null

real    0m0.118s
user    0m0.117s
sys     0m0.043s

For larger inputs it might be useful to provide an optimization flag to cc, and for current Posix compatibility, it would be better to use c99. An even more ambitious implementation might try to cache the generated executables instead of generating them each time, but they're not exactly expensive to generate.

Edit

If you have tcc, you can avoid the hassle of creating a temporary directory, and enjoy the faster compile time which will help on normal sized inputs:

treplaceholder () { 
  tcc -run <(
  {
    printf %s\\n "%option 8bit noyywrap nounput" "%%"
    printf '"%s" {fputs("%s", yyout);}\n' "${@//\"/\\\"}"
    printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
  } | lex -t)
}

$ time printf %s\\n AB{0000..9999} | treplaceholder A B B A > /dev/null

real    0m0.039s
user    0m0.041s
sys     0m0.031s

Solution 2

printf 'STRING1STRING1\n\nSTRING2STRING1\nSTRING2\n' |
od -A n -t c -v -w1 |
sed 's/ \{1,3\}//;s/\\$/&&/;H;s/.*//;x
     /\nS\nT\nR\nI\nN\nG\n1/s//STRING2/
     /\nS\nT\nR\nI\nN\nG\n2/s//STRING1/
     /\\n/!{x;d};s/\n//g;s/./\\&/g' |
     xargs printf %b

###OUTPUT###

STRING2STRING2

STRING1STRING2
STRING1

Something like this will always replace each occurrence of your target strings only once as they occur in sed's in stream at one bite per line. This is the fastest way I can imagine you'd do it. Then again, I don't write C. But this does reliably handle null delimiters if you wish it. See this answer for how it works. This has no problems with any contained special shell characters or similar - but it is ASCII locale specific, or, in other words, od will not output multi-byte characters on the same line and will only do one per. If this is a problem you'll want to add in iconv.

Share:
20,774

Related videos on Youtube

Ambroz Bizjak
Author by

Ambroz Bizjak

Updated on September 18, 2022

Comments

  • Ambroz Bizjak
    Ambroz Bizjak over 1 year

    I'm looking for a way to replace placeholder strings in a template file with concrete values, with common Unix tools (bash, sed, awk, maybe perl). It is important that the replacement is done in a single pass, that is, what is already scanned/replaced must not be considered for another replacement. For example, these two attempts fail:

    echo "AB" | awk '{gsub("A","B");gsub("B","A");print}'
    >> AA
    
    echo "AB" | sed 's/A/B/g;s/B/A/g'
    >> AA
    

    The correct result in this case is of course BA.

    In general, the solution should be equivalent to scanning the input left-to-right for a longest match to one of the given replacement strings, and for each match, performing a replacement and continuing from that point on in the input (none of the already read input nor the replacements performed should be considered for matches). Actually, the details don't matter, just that the results of the replacement are never considered for another replacement, in whole or in part.

    NOTE I am only looking for correct generic solutions. Please do not propose solutions which fail for certain inputs (input files, search and replace pairs), however unlikely they may seem.

    • Admin
      Admin almost 10 years
      I assume they're longer than one character? For this you could use tr AB BA.
    • Admin
      Admin almost 10 years
      Yes, they're longer than one character. Assume that both the match and replacement strings are arbitrary strings.
    • Admin
      Admin almost 10 years
      echo "AB"|sed 's/A/!!TMP!!/g;s/B/A/g;s/!!TMP!!/B/g'
    • Admin
      Admin almost 10 years
      @bersch It's not correct with respect to all possible inputs.
    • Admin
      Admin almost 10 years
      In that case you need to make your question more precise to make clear how general the patterns and inputs can be.
    • Admin
      Admin almost 10 years
      And frankly, I wouldn't be surprised if someone considered your note a bit rude.
    • Admin
      Admin almost 10 years
      @peterph I was assuming that unless stated otherwise solutions need to be generic and correct. For some reason unknown to me people in Unix circles like half-solutions a lot.
    • Admin
      Admin almost 10 years
      Well, you're not going to make it any better like this... Back to the point: specify how general outputs and inputs can be.
    • Admin
      Admin almost 10 years
      I'm afraid you'll need to do it exactly as you are describing it - parse from the beginning and replace as you go - i.e. not with regular expressions.
    • Admin
      Admin almost 10 years
      Any chance a mail merge could do the job?
    • Admin
      Admin almost 10 years
      @AmbrozBizjak: there is probably no better solution with sed. BTW: you can parse the string with a simple for loop. Nobody answered so far, perhaps because it is really not a problem.
    • Admin
      Admin almost 10 years
      @goldilocks - there are lots of little tools that can do it combination. You probably could do it with just sed if you used a couple instances, the l function, and a couple of pipes between them. But I do it below with od and sed - well, and xargs and printf... In fairness, though, it is ASCII specific. dc could do this as well, I think, but with the same limitation.
    • Admin
      Admin almost 10 years
      regarding @goldilocks' comment - obligatory reference to the canonical SO question is in place.
  • Ambroz Bizjak
    Ambroz Bizjak almost 10 years
    I'm not sure if this is a joke or not ;)
  • rici
    rici almost 10 years
    @ambrozbizjak: It works, it's quick for large inputs and acceptably fast for small inputs. It might not use the tools you were thinking of but they are standard tools. Why would it be a joke?
  • goldilocks
    goldilocks almost 10 years
    +1 For not being a joke! :D
  • goldilocks
    goldilocks almost 10 years
    +1 Why do you say it only replaces "the earliest occurrence of your target strings"? In the output it looks as if it replaces all of them. I'm not asking to see it, but could this be done this way without hardcoding the values?
  • mikeserv
    mikeserv almost 10 years
    That would be POSIX portable like fn() { tcc ; } <<CODE\n$(gen code)\nCODE\n. Can I ask though - this an awesome answer and I upvoted it as soon as I read it - but I don't understand what's happening to the shell array? What does "${@//\"/\\\"}" this do?
  • rici
    rici almost 10 years
    @mikeserv: «For each argument as a quoted value ("$@"), replace all (//) occurrences of a quote (\") with (/) a backslash (\\) followed by a quote (\")». See Parameter expansion in the bash manual.
  • mikeserv
    mikeserv almost 10 years
    @goldilocks - Yes - but only as soon as they occur. Maybe I should reword that. And yeah - you could just add a middle sed in and save up to a null or something then have that sed write this one's script; or put it in a shell function and give it values at one bite per line like "/$1/"..."/$2/" - maybe I'll write those functions too...
  • rici
    rici almost 10 years
    This doesn't seem to work in the case where the placeholders are PLACE1, PLACE2 and PLA. PLA always wins. OP says: "equivalent to scanning the input left-to-right for a longest match to one of the given replacement strings" (emphasis added)
  • mikeserv
    mikeserv almost 10 years
    I do generally understand parameter expansion - just not too familiar with bashisms. I didn't think you could do any direct expansions on the shell array itself. At least the spec calls its behavior unspecified.
  • mikeserv
    mikeserv almost 10 years
    @rici - thanks. Then I will have to do the null delimiters. Back in a flash.
  • rici
    rici almost 10 years
    @mikeserv. You could call it bashism, I guess. It also works with ksh and zsh, so it's pretty widespread. I don't know which one had it first.
  • mikeserv
    mikeserv almost 10 years
    I think you're right - ksh I think. I'm bigot, I guess. Sorry. I'm just not very familiar with it.
  • mikeserv
    mikeserv almost 10 years
    @rici - I was just about to post another version, that will handle what you describe, but looking at it again and I don't think I should. He says longest for one of the given replacement strings. This does that. There's no indication that one string is a subset of another, only that the replaced value may be. I also don't think iterating over a list is a valid way to solve the problem. Given the problem as I understand it, this is a working solution.