Using regular expressions (regex) in sed

Solution 1

You need an automated solution: there are too many things to quote and keep track of.

A two-step solution (not 100% perfect, since there may be pathological corner cases) is:

  1. Get the string verbatim in a variable.

    • Why? Because the contents of a quoted variable ("$var") are never modified again by the shell.
    • How? Use a here-document with a quoted delimiter.

    The steps are:

    • Write: IFS= read -r var <<\END on a command line
    • copy and paste the exact same string you want to process, press enter
    • write END and press enter again.

    Then, the variable var will contain the exact same string you copied on the command line, no changes, no quote removal, no nothing, just the string.

    What you should see is:

    $ IFS= read -r var <<\END
    > $GLOBALS['timechecks']=addTimeCheck_sparky($GLOBALS['timechecks'], number_format(microtime(true),6,'.',''), __LINE__, basename(__FILE__));
    > END
    

    Done, yes, really, that's all the complex part, copy and paste.
    You can echo the string:

    $ echo "$var"
    $GLOBALS['timechecks']=addTimeCheck_sparky($GLOBALS['timechecks'], number_format(microtime(true),6,'.',''), __LINE__, basename(__FILE__));
    

    Well, you had better use printf '%s\n' "$var" to avoid issues with some values of var that may start with a -, but in this example echo works ok.
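To see why the backslash in <<\END matters, here is a minimal sketch contrasting a quoted and an unquoted delimiter. The names demo, literal and expanded are made up for this illustration.

```shell
demo=EXPANDED

# Quoted delimiter (\END): the body is passed through verbatim.
IFS= read -r literal <<\END
$demo['x'] stays as typed
END

# Unquoted delimiter (END): the shell expands $demo before read sees the line.
IFS= read -r expanded <<END
$demo['x'] gets expanded
END

printf '%s\n' "$literal" "$expanded"
```

Only the first form keeps the dollar sign, which is why step 1 relies on it.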

From this point on you will need no other typing/input/"manual escape" done.
You just need to copy-paste the command below.

  2. Use the var value to generate the exact regex used in sed to match it exactly. The kind of regex that sed accepts is called BRE (Basic Regular Expression) by POSIX.
    In BRE, there are several special characters: \ . [ * ^ $.
    If all those characters get quoted, the regex is actually a verbatim string of the original. That is easy to do (\.*^$[):

    $ echo "$var" | sed 's#\([\.*^$[]\)#\\\1#g'
    \$GLOBALS\['timechecks']=addTimeCheck_sparky(\$GLOBALS\['timechecks'], number_format(microtime(true),6,'\.',''), __LINE__, basename(__FILE__));
    

    That has quoted (escaped) any backslash (\), opening ([), dot (.), asterisk (*), circumflex (^) and dollar-sign ($) present. That would break any possible regex construct in var and convert all of them into a simple string. It breaks any "bracket expression" ([), any "any char" (.), any repetition (*), any anchor (^$) and any backslash (\).
    Note that any (, ), { or } doesn't require escaping. If not escaped, they remain as they are, and therefore are not the special \(. If escaped (\() they become \\(, also losing any special value.

    There may be pathological corner cases that I am not able to see right now, but 99.2% of the time that simple conversion ought to be enough.

Then, you can capture the changed string, and use it in sed:

$ reg=$(echo "$var" | sed 's#\([\.*^$[]\)#\\\1#g')

$ echo "$var" | sed 's#'"$reg"'# ===any string=== #'
 ===any string=== 

If the conversion was correct, the sed command should capture the whole initial string and replace it with the right side string.

Of course, if you want a shorter part of the string matched, just start with the part that you want to match.
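As a rough sketch, the two steps can be wrapped into a small helper and tried on sample input. The function name escape_bre and the sample lines are assumptions for illustration; the sketch also assumes the target contains no # (the delimiter used here) and no newline.

```shell
# escape_bre is a hypothetical helper name, not part of sed or the shell.
# It applies the same escaping command shown above to its first argument.
escape_bre() {
    printf '%s\n' "$1" | sed 's#\([\.*^$[]\)#\\\1#g'
}

# Step 1: capture a shorter part of the string verbatim.
IFS= read -r target <<\END
$GLOBALS['timechecks']=addTimeCheck_sparky(
END

# Step 2: turn it into a BRE that matches it literally.
reg=$(escape_bre "$target")

# Try it on three sample lines; only the middle one contains the target.
result=$(printf '%s\n' 'foo' "${target}1, 2);" 'bar' | sed 's#'"$reg"'#REPLACED(#')
printf '%s\n' "$result"
```

The middle line should come out as REPLACED(1, 2); while foo and bar pass through unchanged.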

Additional: If you want to see what kind of string you should have written to get the right string inside a variable (which requires an additional layer of quoting), you can use (bash 4.4+):

$ myvar=$(echo "${var}" | sed 's#\([\.*^$[]\)#\\\1#g')
$ echo "${myvar@Q}"
'\$GLOBALS\['\''timechecks'\'']=addTimeCheck_sparky(\$GLOBALS\['\''timechecks'\''], number_format(microtime(true),6,'\''\.'\'','\'''\''), __LINE__, basename(__FILE__));'

If you write something like:

$ myvar='\$GLOBALS\['\''timechecks'\'']=addTimeCheck_sparky(\$GLOBALS\['\''timechecks'\''], number_format(microtime(true),6,'\''\.'\'','\'''\''), __LINE__, basename(__FILE__));'

One level of quoting gets removed and you get inside myvar the required string to work with.

You can compare with your original attempt and see where it was going wrong:

Bad:     \$GLOBALS\['\''timechecks'\''\]=addTimeCheck_sparky[(]$GLOBALS\['\''timechecks'\''\][,][ ]number_format[(]microtime[(]true[)][,]6[,]'\''\.'\''[,]'\'''\''[)][,][ ]__LINE__[],[ ]basename[(]__FILE__[)][)][;]
Good:   '\$GLOBALS\['\''timechecks'\'']=addTimeCheck_sparky(\$GLOBALS\['\''timechecks'\''], number_format(microtime(true),6,'\''\.'\'','\'''\''), __LINE__, basename(__FILE__));'

Hope that this gives you a general foolproof procedure to quote anything.

Note: I built the procedure above for BRE, the basic regexes of sed. Those are the only regexes that sed understands by default. If sed is called as sed -E, then Extended Regular Expressions (ERE) are used. There are some changes for ERE. The special characters list grows to: .[\()*+?{|^$, so the escaping should be (no, we cannot use extended regexes here, as they do not allow for back-references):

sed 's@\([\.()*+?{|^$[]\)@\\\1@g'

You can see how it works on this page I prepared
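A minimal sketch of the ERE variant in use, assuming a sed that supports -E (both GNU and BSD sed do). The helper name escape_ere and the sample string are made up for illustration.

```shell
# escape_ere is a hypothetical helper name; it runs the ERE escaping command
# from the answer (itself written in BRE) on its first argument.
escape_ere() {
    printf '%s\n' "$1" | sed 's@\([\.()*+?{|^$[]\)@\\\1@g'
}

target='f(a+b)*2'
reg=$(escape_ere "$target")
printf '%s\n' "$reg"

# The escaped string is then used as a pattern with sed -E.
result=$(printf '%s\n' 'x = f(a+b)*2;' | sed -E 's@'"$reg"'@Y@')
printf '%s\n' "$result"
```

Without the escaping, the parentheses, + and * in the target would be read as ERE operators and the literal text would not match.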

I am not addressing PCRE (Perl), JavaScript, PHP or any of the many other regex flavors, as sed cannot use them at all.

Related:

BRE -- POSIX Basic Regular Expressions

Solution 2

\$GLOBALS\['\''timechecks'\''\]=addTimeCheck_sparky[(]$GLOBALS
                                                      ^

There's an unescaped $ there.

\['\''timechecks'\''\][,][ ]number_format[(]microtime[(]true[)]
[,]6[,]'\''\.'\''[,]'\'''\''[)][,][ ]__LINE__[],[ ]basename[(]__FILE__[)][)][;]
                                              ^^

And that should probably be [,].

Not escaping that $ doesn't even really matter (at least with GNU sed), but that [],[ ] is a bracket expression with ], a comma, [ and a space inside. It's a valid regex though, just not what you wanted, so it won't produce any errors.
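A quick way to convince yourself of what [],[ ] matches is to run it against a few short sample lines (the inputs here are made up for illustration):

```shell
# A ] placed first inside a bracket expression is literal, so [],[ ] matches
# exactly one character out of: ]  ,  [  space
result=$(printf '%s\n' 'a,b' 'a]b' 'a[b' 'a b' 'a;b' | sed 's/[],[ ]/X/')
printf '%s\n' "$result"
```

The first four lines become aXb; the last one has none of those characters and stays a;b.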

But really, quoting is so painful to do. Sometimes it's better to just avoid it.

Let's just put the pattern and replacements strings in some files, along with a test file:

$ cat pat 
$GLOBALS['timechecks']=addTimeCheck_sparky($GLOBALS['timechecks'], number_format(microtime(true),6,'.',''), __LINE__, basename(__FILE__));
$ cat repl
hello!
$ cat test.txt
foo
$GLOBALS['timechecks']=addTimeCheck_sparky($GLOBALS['timechecks'], number_format(microtime(true),6,'.',''), __LINE__, basename(__FILE__));
bar

and then, replace the strings with Perl:

$ pat=$(< pat) repl=$(< repl) perl -i.bak -pe 's/\Q$ENV{pat}/$ENV{repl}/' test.txt
$ cat test.txt
foo
hello!
bar

When the strings are read from files, there's no need for quoting on the shell command line. Also, when the pattern comes from a variable, and \Q is used, there's no need to escape the special characters in the pattern. Here, I passed the strings to Perl through the environment, since it works better with -i than command line arguments. -p makes perl act a bit like sed in that it runs the given script for each input line, and -i.bak is like sed's -i.

Related question: Why is there no generator that accepts the target string as input and provides the regex that will find it?

Well. Usually regexes are used with patterns meant to match multiple strings, and there it might be hard for a program to know what parts can be varying. Though if you're always looking for a fixed string, it would be somewhat simple to just escape the special characters. But then you wouldn't actually need a regex engine in the first place. It's just that they're rather ubiquitous in the common Unix tools.

You mentioned in the comments that:

Come to think of it, if a line matches this string, that is all I need to know to replace it: $GLOBALS['timechecks']=addTimeCheck_sparky

Something like

sed -e 's/^.*GLOBALS..timechecks..=addTimeCheck_sparky.*$/hello/'

could be used to match against that and replace the whole line. Granted, that would also match #GLOBALS_atimecheckses=addTimeCheck_sparky and related variants, since I cheated and just replaced all the special characters with a dot (.). But you get the idea.

Also, you can always take a backup copy of the original file first, then run diff original.txt processed.txt to review any changes.
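A small sketch of that review workflow, using a scratch directory and made-up sample content. Instead of editing in place, the output goes to a second file, so the original doubles as the backup.

```shell
workdir=$(mktemp -d) || exit 1
printf '%s\n' 'foo' "\$GLOBALS['timechecks']=addTimeCheck_sparky(1);" 'bar' \
    > "$workdir/original.txt"

# Replace whole matching lines, writing to a new file rather than in place.
sed -e 's/^.*GLOBALS..timechecks..=addTimeCheck_sparky.*$/hello/' \
    "$workdir/original.txt" > "$workdir/processed.txt"

# diff exits with status 1 when the files differ, which is expected here.
changes=$(diff "$workdir/original.txt" "$workdir/processed.txt") || true
printf '%s\n' "$changes"
```

If the diff shows only the lines you meant to change, the processed file can then replace the original.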

Solution 3

Works for me:

sed -- 's/\$GLOBALS\['\''timechecks'\''\]/completely_different_string/g' <<'END'
foo
$GLOBALS['timechecks']=addTimeCheck_sparky($GLOBALS['timechecks'], number_format(microtime(true),6,'.',''), __LINE__, basename(__FILE__));
bar
END
foo
completely_different_string=addTimeCheck_sparky(completely_different_string, number_format(microtime(true),6,'.',''), __LINE__, basename(__FILE__));
bar

This works with both the default BSD sed and GNU sed on a Mac.


A matter of terminology: there is no "bash sed". bash is your interactive shell and it's also a programming language. sed is a different programming language. From bash's point of view, sed is just another command found in your $PATH, like ls or grep or ...

Author by

DanAllen

Database application developer.

Updated on September 18, 2022

Comments

  • DanAllen
    DanAllen over 1 year

    This is a specific example of a general subject I fail to grasp.

    For years, I have used regex and sed to find/replace all occurrences of a string in all the files in a directory recursively, using something like this:

    #FIND $GLOBALS['timechecks'] and REPLACE with completely_different_string
    shopt -s globstar dotglob;
    for file in /var/www/**/*; do
      if [[ -f $file ]] && [[ -w $file ]]; then
        sed -i -- 's/\$GLOBALS\['\''timechecks'\''\]/completely_different_string/g' "$file"
      fi
    done
    

    The problem is, there is something basic about using Regex in bash I have got away without knowing. As a result, I cannot figure out a solution to a particular example.

    TARGET STRING WHERE I AM STUCK

    $GLOBALS['timechecks']=addTimeCheck_sparky($GLOBALS['timechecks'], number_format(microtime(true),6,'.',''), __LINE__, basename(__FILE__));
    

    REGEX I CAME UP WITH NOT WORKING

    This is just the sed line from my script with the search regex I came up with, to no avail.

    \$GLOBALS\['\''timechecks'\''\]=addTimeCheck_sparky[(]$GLOBALS\['\''timechecks'\''\][,][ ]number_format[(]microtime[(]true[)][,]6[,]'\''\.'\''[,]'\'''\''[)][,][ ]__LINE__[],[ ]basename[(]__FILE__[)][)][;]
    

    REGEX DEBUGGER

    I used a regex debugger for this example, which shows the regex finding my target string, but it is not working for me. The debugger is at this link. Here is the regex it shows finding my target string:

    \$GLOBALS\['timechecks\'\]=addTimeCheck_sparky\(\$GLOBALS\[\'timechecks\'\], number_format\(microtime\(true\),6,\'\.\',''\), __LINE__, basename\(__FILE__\)\)
    

    PROBLEMS WITH OUTPUT FOR REGEX DEBUGGER:

    First, I tried my regex in the de

    1. I don't know why the debugger's regex works when I run it there, but not in my bash script.
    2. The regex looks "wrong" when compared to what I have learned to use for regex in bash with sed
    3. The regex from the debugger does not work when I plug it into the script I use for doing this task.
    4. Since I don't understand it, I cannot fix it

    I think the basic problem is that I am clueless about converting valid regex from the debugger to work in bash/sed.

    I searched for "how to use regex with sed in bash," but have not found an explanation of the fact this is even a potential problem.

    Related question: Why is there no generator that accepts the target string as input and provides the regex that will find it?

    • Pankaj Goyal
      Pankaj Goyal almost 4 years
      Yikes. Okay, that's a particularly complicated (if not necessarily complex) regex you're struggling with there. Not least because of the 's and all the other special characters embedded in your pattern: some of which are special to sed, and some of which are special to bash (e. g. the aforementioned '). One thing that might simplify things at the slight risk of some false positives is to use some single-character wildcards (.) where you've got characters you have to otherwise escape (like ', $, (, etc.). The fewer escapes the better.
    • DanAllen
      DanAllen almost 4 years
      False positives would be awesome, except then strings would be changed that should not be changed. The number of files being searched is 400,000+ Is there a way to use the wildcards without a practical risk for false positives? It is not like there are all kinds of strings that almost match this one. Come to think of it, if a line matches this string, that is all I need to know to replace it: $GLOBALS['timechecks']=addTimeCheck_sparky That said, I really need to get a handle on the concepts of why the regex from the debugger is valid but does not work in bash/sed
    • Angel Todorov
      Angel Todorov almost 4 years
      That regex debugger offers javascript and pcre flavours of regular expressions. sed uses neither of those: you can use basic or extended regexes. GNU sed documents its regular expressions at https://www.gnu.org/software/gnulib/manual/html_node/Regular-expression-syntaxes.html
    • roaima
      roaima almost 4 years
      Me, I'd not use single quotes to quote an expression containing single quotes if I could possibly avoid it. This works here, sed -- "s/\$GLOBALS\['timechecks'\]/completely_different_string/g"
    • DanAllen
      DanAllen almost 4 years
      @DopeGhoti Are there aspects of handling single quotes unique to bash which can affect a bash script using sed with regex?
    • DanAllen
      DanAllen almost 4 years
      @roaima 100% agree with using double quote for enclosing an expression containing single quotes. I have been in habit of using single quotes for enclosing expressions containing $, which added a big problem in this example. I have become accustomed to enclosing with single quotes, then escaping single quotes like this: sed -- 's/\$GLOBALS['\''timechecks'\'']/completely_different_string/g' It is difficult to notice, there are no double quotation characters, but two single quotation characters after a \ for each single quote. I am switching to the much simpler way you suggested.
    • DanAllen
      DanAllen almost 4 years
      @glenn jackman The fact that there are different regex's for javascript, pcre, and two varieties of regex for sed is a crucial element to filling the knowledge gap I am seeking to fill by asking the question I posted here.
    • DanAllen
      DanAllen almost 4 years
      @roaima Can sed be run without a shell?
  • DanAllen
    DanAllen almost 4 years
    In this example, all that is needed is sed '/pattern/d', because I was dropping the lines matching a small part of the long string. The example did not require matching the whole string, but I did not recognize that at the time I posted the question. The key insight provided in this answer is the suggestion of avoiding complications by putting the long match string into files whose contents do not have to be escaped. I hate the culture on all stack sites, this one included.
  • ilkkachu
    ilkkachu almost 4 years
    @DanAllen, I'm not sure what you mean exactly with the culture but I'm sorry if you feel that way. I tried to answer the question as I saw it, and only later noticed the comment, and tried to add some note about that too, not really knowing if you already knew how to do it. "Missing a sport" was meant as a jest, but I can see it probably wasn't that amusing, so sorry about that.
  • DanAllen
    DanAllen almost 4 years
    I am going to develop an answer to my question, based on what I have learned as a result of posting it. Culture will be inherent to the answer and the response, if any, it receives. Your answer is not the focus of the culture problem I have. The culture problem is the unhelpful constraint on what is supported and the unhelpful mannerisms that are paramount here.
  • Angel Todorov
    Angel Todorov almost 4 years
    It seems like every tool that implements regular expressions has their own little quirks. Your best bet is to consult the docs for each tool you're using. Take advantage of the collective wisdom of SO: click the regular-expression tag and read the "more info" page
  • ilkkachu
    ilkkachu almost 4 years
    Since I brought Perl to this, it should probably be mentioned that one of the quirks is escaping /, $ and backslash. You can't escape those with [/] [$] and [\] in Perl, but have to use \/, \$ and \\ .
  • JJoao
    JJoao almost 4 years
    ...or perl -0pe '$x=qx{cat pat};s/\Q$x/qx{cat rep}/e'
  • ilkkachu
    ilkkachu almost 4 years
    @JJoao, yeah. Though the slight difference with that is that Perl's qx// doesn't remove the trailing newlines from the output like the shell's command substitution does. So if you have patterns that should match partial lines, you'd need to take care that the files don't contain a newline. Or use $x=qx{cat pat}; chomp $x; $y=qx{cat rep}; chomp $y; s/\Q$x/$y/e or something like that.
  • DanAllen
    DanAllen almost 4 years
    This is brilliant innovation, what the dr. ordered. I am hung up in a couple spots. Putting the string into a var I see perfectly. I see "The basic three special characters in a sed regex" and then special chars on a regex .[*^$ I gather all those are incorporated into sed under 's#([\.*^$[])#\\\1#g' I also gather that within bash, this is BRE (a term I had to lookup) and end of that part of story. To adapt to PCRE, js, php or other flavor of regex, there can be different special characters to treat the way .[*^$ are treated in item 2 above? The rest I have to try.
  • done
    done almost 4 years
    @DanAllen Thanks for your words. (1) The description was a bit rough, I wrote something when I started and some hours later I completed it, there was a completely different idea at the two points in time, hope it is better now. (2)I added a direct link to BRE in POSIX, where it lists the special characters for BRE. (3)BRE are the regex for sed, that is not decided by the shell, those programs are independent. (4)There is no reason or need to adapt to PCRE as sed (and what you asked about sed -i ...) can not use such regex.
  • DanAllen
    DanAllen almost 4 years
    Those edits are going to help, because I was finding myself mighty confused by some of the details in the last third the earlier version.