Regular expression in bash script


Solution 1

From man 7 regex:

A bracket expression is a list of characters enclosed in "[]". …

… To include a literal '-', make it the first or last character…. [A]ll other special characters, including '\', lose their special significance within a bracket expression.

Trying the regexp with egrep gives an error:

$ echo "username : username usergroup" | egrep "^([a-zA-Z0-9\-_]+ : [a-zA-Z0-9\-_]+) (usergroup)$"
egrep: Invalid range end

Here is a simpler version, that also gives an error:

$ echo 'hi' | egrep '[\-_]'
egrep: Invalid range end

Since \ is not special inside a bracket expression, \-_ is a range from \ to _, just as a-z is a range from a to z. To match a literal -, put it at the end, like [_-], or:

$ echo "username : username usergroup" | egrep "^([a-zA-Z0-9_-]+ : [a-zA-Z0-9_-]+) (usergroup)$"
username : username usergroup

This should work regardless of your libc version (in either egrep or bash).
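
Applied to the [[ ... =~ ... ]] test from the question, the same fix might look like the sketch below. It assumes bash (the =~ operator is not available in plain sh), and check_groups is a made-up helper name used only for illustration.

```shell
#!/bin/bash

# '-' is the last character in each bracket expression, so it is a literal
# hyphen rather than a range operator.
regex='^([a-zA-Z0-9_-]+ : [a-zA-Z0-9_-]+) (usergroup)$'

# check_groups is a hypothetical helper, not part of the original script.
check_groups() {
    # $regex is left unquoted on purpose: quoting it would make bash
    # treat the pattern as a literal string.
    if [[ "$1" =~ $regex ]]; then
        echo "Match!"
    else
        echo "No match"
    fi
}

check_groups "username : username usergroup"
check_groups "user-name : user_name usergroup"
check_groups "username : username othergroup"
```

Keeping the pattern in a variable and expanding it unquoted on the right-hand side of =~ is the same approach the question's script already uses; only the position of the hyphen changes.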

edit: This actually depends on your locale settings too. The manpage does warn about this:

Ranges are very collating-sequence-dependent, and portable programs should avoid relying on them.

For example:

$ echo '\_' | LC_ALL=en_US.UTF8 egrep '[\-_]'
egrep: Invalid range end
$ echo '\_' | LC_ALL=C egrep '[\-_]'
\_

Of course, even though it didn't error, it isn't doing what you want:

$ echo '\^_' | LC_ALL=C egrep '^[\-_]+$'
\^_

It's a range, which in ASCII includes \, ], ^, and _.
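
To see which characters the accidental range covers, it can be probed one character at a time. This is a minimal sketch, assuming a grep that supports -E and -q; in_range is a made-up helper name, and LC_ALL=C is forced so that the range is evaluated by byte value.

```shell
#!/bin/sh

# in_range is a hypothetical helper: it succeeds when the single character
# in $1 matches the bracket expression [\-_] under the C locale.
in_range() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[\-_]$'
}

# Probe the characters around the range, plus the hyphen the original
# pattern was trying to match.
for ch in '\' ']' '^' '_' '[' '-'; do
    if in_range "$ch"; then
        printf '%s: inside the range\n' "$ch"
    else
        printf '%s: outside the range\n' "$ch"
    fi
done
```

In ASCII the range runs from byte 0x5C (\) to 0x5F (_), so [ (0x5B) and - (0x2D) fall outside it. That is why the pattern never matched a hyphen even where it compiled.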

Solution 2

General rule with regexps (and any bugs in larger pieces of code): cut it down and rebuild it step by step or use bisecting - whatever works better for you.

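In this case the culprit appeared to be the underscore: escaping it with a backslash made the pattern match. Note that a backslash is literal inside a bracket expression, so [\-\_] is read as the one-character range from \ to \ followed by _. The pattern compiles, but it still does not match a literal -.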
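
The step-by-step approach can be sketched as a small loop that grows the pattern one piece at a time. This assumes bash; try_regex is a made-up helper name. It relies on the behaviour documented in the bash manual that [[ ... =~ ... ]] returns 2 when the regular expression is syntactically incorrect, which an if/else like the one in the question would otherwise report as "No match".

```shell
#!/bin/bash

# try_regex is a hypothetical helper: it reports how bash treats one
# candidate regex against one subject string.
try_regex() {
    local subject=$1 regex=$2 status
    if [[ $subject =~ $regex ]]; then
        status=0
    else
        status=$?   # 1 = no match, 2 = regex could not be compiled
    fi
    case $status in
        0) echo "match" ;;
        1) echo "no match" ;;
        *) echo "invalid regex (status $status)" ;;
    esac
}

subject="username : username usergroup"

# Grow the pattern piece by piece; the first step that stops matching
# points at the piece that broke it. Results for the steps containing
# \-_ depend on the locale in use.
for regex in \
    '^[a-zA-Z0-9]+' \
    '^[a-zA-Z0-9\-_]+' \
    '^([a-zA-Z0-9\-_]+ : [a-zA-Z0-9\-_]+)' \
    '^([a-zA-Z0-9\-_]+ : [a-zA-Z0-9\-_]+) (usergroup)$'
do
    printf '%-55s %s\n' "$regex" "$(try_regex "$subject" "$regex")"
done
```

Reporting the exit status separately makes it possible to tell a pattern that does not match from one that the regex library rejected outright.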


Author: Adam Westh


Updated on September 18, 2022

Comments

  • Adam Westh almost 2 years

    This is my first time bash scripting so I'm probably making an easy mistake.

    Basically, I'm trying to write a script that gets the groups of a user, and if they are in a certain group, it will log that accordingly. Evidently there will be more functionality, but there's no point building that when I can't even get the regex working!

    So far, I have this:

    #!/bin/bash
    
    regex="^([a-zA-Z0-9\-_]+ : [a-zA-Z0-9\-_]+) (usergroup)$"
    
    # example output
    groups="username : username usergroup"
    
    echo "$groups" >> /home/jrdn/log
    
    if [[ "$groups" =~ $regex ]]; then
        echo "Match!" >> /home/jrdn/log
    else
        echo "No match" >> /home/jrdn/log
    fi
    

    Every place I've tried that regex, it works. But in the bash script, it only ever outputs the $groups, followed by No match. So can someone tell me what's wrong with it?

    • manatwork over 10 years
      What makes you think anything is wrong with it?
    • Adam Westh over 10 years
      It echoes "No match". Could be something wrong with the comparison, there's something wrong somewhere.
    • peterph over 10 years
      Works for me. What version of bash do you have?
    • Adam Westh over 10 years
      GNU bash, version 4.2.37(1)-release (x86_64-pc-linux-gnu)
    • manatwork over 10 years
      Works for me too. bash 4.1.10(4). pastebin.com/PgyiZujJ Actually I see no reason for it not to work. How do you run it?
    • peterph over 10 years
      Interesting, looks like something in your environment. How about trying a much simpler regex like trying to match ^a on "asd" and "qwe" and then expanding it piece by piece?
    • Adam Westh over 10 years
      @manatwork: just running it like: ./install.sh @peterph: running ^([a]) against abc and dbc returns the proper results
    • peterph over 10 years
      @jrdnhannah then try to slowly re-create your target regexp: first match ^([a-zA-Z0-9\-_]+), then add the colon, and so on... you should find out pretty soon where the problem is.
    • Adam Westh over 10 years
      @peterph I just tried running it on my mac, on the off chance it works... And it does. I will simplify it though, and work out what my box doesn't like, and then try and figure out why it doesn't like it. Thanks
    • terdon over 10 years
      Same here with bash 4.2.45. Escaping the underscore fixed it. Weird. @jrdnhannah could you write that up as an answer and accept it please?
    • Adam Westh over 10 years
      Since I've only just signed up to the Unix SE, it requires me to wait 8 hours before answering my own. Happy to mark it as answered if somebody else does, though.
    • peterph over 10 years
      There you go. Interesting thing is that my Bash 4.2.45 was ok with the unescaped underscore.
    • derobert over 10 years
      Sounds like a bug in bash and/or [e]glibc. Broken on my Debian 4.2.45(1). Same problem with egrep; so this is probably eglibc, not bash. I have 2.17-92+b1. Actually, by the docs, the regex is wrong...
    • terdon over 10 years
      @peterph seriously? It worked on your bash 4.2.45(1)? Which distro?
    • derobert over 10 years
      @terdon bash just calls libc's regex functions, probably. So it depends on the libc version, not the bash version. See my answer... (Or maybe even on the collation sequence you have in use)
    • peterph over 10 years
      @terdon seems that my LC_COLLATE=POSIX (which is the only thing differing from my [ll_CC].utf8) "saved" me again. :)
  • manatwork over 10 years
    Interesting. My egrep gives no error, just matches it correctly.
  • derobert over 10 years
    @manatwork your collation sequence probably allows the range....
  • manatwork over 10 years
    I don't know much about collation. You mean this: LC_COLLATE="en_US.UTF-8"?
  • derobert over 10 years
    @manatwork I've edited the question to give an example. Note it may be different on your system, because sometimes those collation (sorting) sequences change.
  • manatwork over 10 years
    Yes, thank you. I noticed the edit too late.
  • derobert over 10 years
    @manatwork It's OK, I almost filed a bug report before I noticed the attempt to escape -...