How to use [\w]+ in a regular expression in sed?


Solution 1

Different tools and versions thereof support different variants of regular expressions. The documentation of each will tell you what they support.

Standards exist so that one can rely on a minimum set of features that are available across all conforming applications.

For instance, all modern implementations of sed and grep implement basic regular expressions as specified by POSIX (at least one version or the other of the standard, but that standard has not evolved a lot in that regard in the last few decades).

In POSIX BRE and ERE, you have the [:alnum:] character class. That matches letters and digits in your locale (note that often includes a lot more than a-zA-Z0-9 unless the locale is C).

So:

grep -x '[[:alnum:]_]\{1,\}'

matches one or more alnums or _.
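As a minimal sketch of the command above: the helper name match_words and the sample input lines are made up for illustration, and LC_ALL=C is forced only so that [:alnum:] means exactly a-zA-Z0-9 here.

```shell
# Hypothetical helper: print only lines made up entirely of alnums or "_".
# LC_ALL=C pins the locale so [:alnum:] is exactly a-zA-Z0-9.
match_words() {
  LC_ALL=C grep -x '[[:alnum:]_]\{1,\}'
}

printf '%s\n' 'foo_bar' 'foo-bar' 'abc123' | match_words
# prints:
# foo_bar
# abc123
```

The line foo-bar is dropped because -x requires the whole line to match and "-" is not in the bracket expression.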

[\w] is required by POSIX to match either backslash or w. So you won't find a grep or sed implementation where that's available (unless via non-standard options).
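A short sketch of what that means in practice, using a made-up input string and a hypothetical helper name (bracket_w). Inside the brackets, the backslash and the w are two ordinary characters:

```shell
# Hypothetical helper: replace every run of "\" or "w" characters with "gone".
# Inside a bracket expression, \w is NOT a word-character class.
bracket_w() {
  sed 's/[\w]\{1,\}/gone/g'
}

printf '%s\n' 'whe\ere' | bracket_w
# prints: gonehegoneere
```

Only the w and the backslash are replaced; the letters h, e and r are left alone, which would not be the case if [\w] meant "any word character".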

The behaviour for \w alone is not specified by POSIX, so implementations are allowed to do what they want. GNU grep added that a long time ago.

GNU grep used to have its own regexp engine; however, it now uses GNU libc's one (though it does embed its own copy).

It's meant to match alnums and underscore in your locale. However, at the time of writing it has a bug in that it only matches single-byte characters (for instance, it does not match é in a UTF-8 locale even though that's clearly a letter, and even though it does match é in all the locales where é is a single byte).

There also is a \w regexp operator in perl regexp and in PCRE. PCRE/perl are not POSIX regular expressions, they're just another thing altogether.

Now, with the way GNU grep -P uses PCRE, it's got the same issue as without -P. It can be worked around there though by using (*UCP) (though that also has side-effects in non-UTF8 locales).

GNU sed also uses GNU libc's regexps for its own regexps. It uses them in such a way, though, that it doesn't have the same bug as GNU grep.

GNU sed doesn't support PCREs. There's some evidence in the code that it has been attempted before, but it doesn't seem to be on the agenda anymore.

If you want Perl's regular expressions, just use perl though.

Otherwise, I'd say that rather than trying to rely on a bogus non-standard feature of your particular implementation of sed/grep, it would be better to stick with the standard and use [_[:alnum:]].
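A minimal sketch of that advice applied to sed, extended with "/" the way one might have written [\w/]. The helper name word_or_slash and the sample input are illustrative only, and "|" is used as the s command delimiter purely so the "/" in the bracket expression needs no special care.

```shell
# Hypothetical helper: replace the first run of alnums, "_" or "/" with "gone".
# LC_ALL=C pins [:alnum:] to a-zA-Z0-9 for a predictable result.
word_or_slash() {
  LC_ALL=C sed 's|[_[:alnum:]/]\{1,\}|gone|'
}

printf '%s\n' 'usr/local_bin etc' | word_or_slash
# prints: gone etc
```

This uses only POSIX BRE syntax, so it should not depend on any \w extension being present.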

Solution 2

You are correct - \w is part of PCRE - perl compatible regular expressions. It's not part of the 'standard' regex though. http://www.regular-expressions.info/posix.html

Some versions of sed may support it, but I'd suggest the easiest way is to just use perl in sed mode by specifying the -p flag. (Along with the -e). (More detail in perlrun)

But you don't need [] around it in that example - that's for groups of valid stuff.

echo here  | perl -pe 's/\w+/gone/'

Or on Windows:

C:\>echo here  | perl -pe "s/\w+/gone/"
gone
C:\>echo here  | perl -pe "s/[\w\/]+/gone/"
gone

See perlre for more PCRE stuff.

You can get perl here: http://www.activestate.com/activeperl/downloads

Solution 3

I suspect that grep and sed are deciding differently when to apply the [] and when to expand the \w. In perl regex, \w means any word character, and [] defines a character class that matches any one of the characters inside it. If you "expand" the \w before the [], it will be a character class of all the word characters. If, instead, you do [] first, you will have a character class with two characters, \ and w, so it would match any pattern containing one or more of those two characters.

So it seems that sed is seeing the [] and treating it as containing the exact chars to match instead of honoring the special sequence \w as perl and grep do. Of course, the [] are completely unnecessary in this example, but one could perhaps imagine cases where it would be important, but then you could make it work with parens and ors.
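A rough sketch of the "parens and ors" idea. It assumes GNU sed, since \w outside brackets is a non-standard extension and other sed implementations may not support it; the helper name word_or_slash_alt and the input are made up for illustration.

```shell
# Assumes GNU sed: \w outside a bracket expression is a GNU extension.
# Hypothetical helper: replace the first run of word characters or "/" with
# "gone", using a group with alternation instead of [\w/].
word_or_slash_alt() {
  sed -E 's#(\w|/)+#gone#'
}

printf '%s\n' 'usr/local_bin etc' | word_or_slash_alt
# prints: gone etc
```

The "#" delimiter is chosen only so that neither "/" nor the "|" alternation operator clashes with the s command's delimiter.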


Joao Garin
Author by

Joao Garin

Updated on September 18, 2022

Comments

  • Sobrique
    Sobrique about 9 years
    I would be surprised if that were so. \ is an escape code, and you'd use it for escaping delimiters. Inherently that means it has to have a higher precedence than any thing else. I think it more likely that it's not implemented because \w isn't part of the regular expression spec
  • Eric Renouf
    Eric Renouf about 9 years
    Well, empirically it seems to be the case using GNU sed for me: echo whe\\ere | sed -r 's/[\w]+/gone/g' gives me gonehegoneere, as though it is matching each of the \ and w and doing the substitution
  • Sobrique
    Sobrique about 9 years
    In which case, it's probably a quoting problem. Either way - perl can do it :).
  • bers
    bers about 9 years
    I can confirm what Eric Renouf is seeing. So we want to unescape the backslash somehow? :)
  • bers
    bers about 9 years
    [_[:alnum:]] is a nice workaround which allows me to extend it just like [\w/] ([_[:alnum:]/] in that case).
  • bers
    bers about 9 years
    Thanks! Stéphane Chazelas' answer is a little closer to what I asked for (since I don't have perl installed - a du*b Windows user, I guess), so I accepted his answer.
  • Eric Renouf
    Eric Renouf about 9 years
    I don't think that's the right answer. Sed just doesn't support mixing the different types of character class definitions, so the answer is if you must use both types of character classes pick another tool, or if you're picking sed use the syntax it supports
  • Stéphane Chazelas
    Stéphane Chazelas about 9 years
    \w was in GNU grep (in the 80s) before being in perl and in GNU emacs probably even before that.
  • Stéphane Chazelas
    Stéphane Chazelas about 7 years
    This answer is now outdated with regards to the limitations of GNU grep.