PCRE-regex Use grep to exclude a capturing group


grep's name comes from the ed command g/re/p. Its primary purpose is to print the lines that match a regexp. It is not its role to edit the content of those lines; you have sed (the stream editor) or awk for that.

Now, some grep implementations, starting with GNU grep, added a -o option to print the matched portion of each line (what is matched by the regexp, not its capture groups). Some grep implementations, such as GNU's again (with -P) or pcregrep, also support PCREs for their regexps.
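As a small illustration of that point, here is a sketch (the sample input is made up) showing that -o prints the whole matched portion even when the regexp contains a group:

```shell
# -o prints the whole matched portion, not the content of the (...) group.
# "abc123 rest" is an arbitrary sample input.
out=$(printf 'abc123 rest\n' | grep -oE '[a-z]+([0-9]+)')
printf '%s\n' "$out"   # prints: abc123
```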

pcregrep actually added a -o<n> option to print the content of a capture group. So you could do:

pcregrep -o1 -o2 --om-separator=' ' '\.zoo\.(\d+).*:\s+(.*)'

But here, the obvious standard solution is to use sed:

sed -n 's/^.*\.zoo\.\([0-9]\{1,\}\).*:[[:space:]]\{1,\}/\1 /p'
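For instance, applied to the sample line from the question, that sed command can be tried like this (the variable names line and out are only for illustration):

```shell
# Sample line taken from the question.
line='FOO_1BAR.zoo.2.someString:More-RandomString (string here too): 0.45654343'

# Replace everything up to and including ":<spaces>" with "<digits> ",
# and print only the lines where the substitution succeeded.
out=$(printf '%s\n' "$line" |
  sed -n 's/^.*\.zoo\.\([0-9]\{1,\}\).*:[[:space:]]\{1,\}/\1 /p')
printf '%s\n' "$out"   # prints: 2 0.45654343
```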

Or if you want perl regexps, use perl:

perl -lne 'print "$1 $2" if /\.zoo\.(\d+).*:\s+(.*)/'

With GNU grep, if you don't mind the matches to appear on different lines, you can do:

$ grep -Po '\.zoo\.\K\d+|:\s+\K.*' < file
2
0.45654343
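If the two values are wanted on one line after all, one possible follow-up (not part of the grep command itself) is to pair consecutive output lines with paste. This is only a sketch: it assumes every input line yields exactly two matches, otherwise the pairing goes out of step.

```shell
# Sample line taken from the question.
line='FOO_1BAR.zoo.2.someString:More-RandomString (string here too): 0.45654343'

# grep prints each match on its own line; paste joins them two by two.
out=$(printf '%s\n' "$line" |
  grep -Po '\.zoo\.\K\d+|:\s+\K.*' |
  paste -d ' ' - -)
printf '%s\n' "$out"   # prints: 2 0.45654343
```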

Note that while \K resets the start of the matched portion, that doesn't mean you can get away with the two parts of the alternation overlapping.

grep -Po '\.zoo\.(\K\d+|.*: \K.*)'

would not work, just like echo foobar | grep -Po 'foo|foob' wouldn't work (at printing both foo and foob). foo|foob first matches foo, and then grep looks for other potential matches in the input after that foo, so starting at the b of bar, where it can't find any more.
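The foobar case is easy to check with a grep that supports -P. The second command, with the alternatives swapped, is added here only for contrast:

```shell
# PCRE tries the alternatives left to right, so foo wins at position 0
# and the search resumes at "bar", where nothing matches.
first=$(echo foobar | grep -Po 'foo|foob')
printf '%s\n' "$first"    # prints: foo

# With the longer alternative first, foob is matched instead (still one match).
second=$(echo foobar | grep -Po 'foob|foo')
printf '%s\n' "$second"   # prints: foob
```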

Above with grep -Po '\.zoo\.\K\d+|:\s+\K.*', we only look for :<spaces><anything> in the second part of the alternation. That does match in the part that is after .zoo.<digits> but that also means it would find those :<spaces><anything> anywhere in the input, not only when they follow .zoo.<digits>.
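To see that caveat in action, here is a sketch with a made-up input line that contains no .zoo.<digits> at all; the second alternative still matches on its own:

```shell
# No ".zoo.<digits>" here, yet ":<spaces><anything>" is still reported.
out=$(printf 'no zoo here: stray value\n' |
  grep -Po '\.zoo\.\K\d+|:\s+\K.*')
printf '%s\n' "$out"   # prints: stray value
```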

There is a way to work around that though, using another PCRE special operator: \G. \G matches at the start of the subject. For a single match, that's equivalent to ^, but with multiple matches (think of sed/perl's g flag in s/.../.../g), as with -o where grep tries to find all the matches in the line, it also matches at the end of the previous match. So if you make it:

grep -Po '\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*'

Here (?!^) is a negative look-ahead operator that means "not at the beginning of the line". With it, \G will only match after a previous successful (non-empty) match, so .*:\s+\K.* will only match if it follows a previous successful match, and that can only be the .zoo.<digits> one, since the other part of the alternation matches until the end of the line.

On an input like:

.zoo.1.zoo.2 tar: blah

That would output:

1
2
blah

If you did not want that, though, you'd also want the first part of the alternation to only match at the beginning of the line. Something like:

grep -Po '^.*?\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*'

That still outputs 2 on an input like .zoo.2 no colon character or .zoo.2 blah:. You could work around that with a look-ahead operator in the first part of the alternation, looking for at least one non-space after :<spaces> (and also using $ to avoid issues with non-characters):

grep -Po '^.*?\.zoo\.\K\d+(?=.*:\s+\S.*$)|(?!^)\G.*:\s+\K\S.*$'
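A way to check that last regexp against the inputs discussed above, assuming a grep with -P whose -o handles \G as described in this answer (the variable names are only for illustration):

```shell
re='^.*?\.zoo\.\K\d+(?=.*:\s+\S.*$)|(?!^)\G.*:\s+\K\S.*$'

# Sample line from the question: both values are expected, one per line.
good='FOO_1BAR.zoo.2.someString:More-RandomString (string here too): 0.45654343'
out_good=$(printf '%s\n' "$good" | grep -Po "$re")

# The two problem inputs mentioned above: no output is expected for either.
out_nocolon=$(printf '%s\n' '.zoo.2 no colon character' | grep -Po "$re")
out_nothing_after=$(printf '%s\n' '.zoo.2 blah:' | grep -Po "$re")

printf '[%s]\n' "$out_good" "$out_nocolon" "$out_nothing_after"
```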

You'd probably need a few pages of comments to explain that regexp, so I would still go for the straightforward sed/perl solutions...


Author: Inian


Updated on September 18, 2022

Comments

  • Inian
    Inian over 1 year

    I am using GNU grep with the -P PCRE Regex support for matching strings from a file. The input file has lines containing strings like:

    FOO_1BAR.zoo.2.someString:More-RandomString (string here too): 0.45654343
    

    I want to capture the numbers 2 and 0.45654343 from the above line. I used a regEx

    grep -Po ".zoo.\K[\d+](.*):\ (.*)$" file
    

    But this is producing me a result as

    2.someString:More-RandomString (string here too): 0.45654343
    

    I am able to get the first number from the first capturing group as 2, and also to match a capturing group at the end of the line. But I am not able to skip the words/lines between two capturing groups.

    I know for a fact that I have a group (.*) that is capturing those words in the middle. What I've tried to do is include another \K to ignore it as

    grep -Po ".zoo.\K[\d+](.*):\K (.*)$" file
    

    But that gave me only the second capture group as 0.556984.

    Also with a non-capturing group with the (?:) syntax as

    grep -Po ".zoo.\K[\d+](?=.someString:More-RandomString (string here too)):\ (.*)$"
    

    But this gave me nothing. What am I missing here?

    • Admin
      Admin over 7 years
      You're missing basic understanding of how Perl regexps are supposed to work. You're also missing basic sense of not trying to do this with a single grep command.
    • Admin
      Admin over 7 years
      @SatoKatsura: I wanted to use a single grep and I hoped it would be possible. And what is the reason for "You're missing basic understanding of how Perl regexps are supposed to work"? I made a decent attempt at solving the issue.
    • Admin
      Admin over 7 years
      \K doesn't do what you seem to think it does. Neither does [\d+].
    • Admin
      Admin over 7 years
      @SatoKatsura: Why do you think that? Can you point me how is it incorrect?
    • Admin
      Admin over 7 years
      Because (1) it doesn't make sense to have more than one \K in the same regexp, and (2) how do you explain the output of something like this: echo 1+2 | grep -Po '[\d+]'?
    • Admin
      Admin over 7 years
      @SatoKatsura: Appreciate your comments. Will learn more about PCRE syntaxes.
  • Inian
    Inian over 7 years
    Appreciate your answer, did I miss something in my question. Do you mean that I simply can't do what I intended to do with a single grep?
  • Stéphane Chazelas
    Stéphane Chazelas over 7 years
    @Inian, You can't easily with a single invocation of the current version of GNU grep (the one I suppose you're trying to use as it seems it supports -P and -o though that could also be the one of FreeBSD/OS/X that are rewrites of GNU grep). You can with other grep implementations like pcregrep. But I argue you're picking the wrong tool for the task. Use sed to edit streams.
  • Inian
    Inian over 7 years
    I am quite easily able to do this only using bash native regex as [[ "$string" =~ .zoo.([[:digit:]]+).*:\ (.*)$ ]] and print as printf "%s\t%s\n" "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]//[[:blank:]]}"
  • Inian
    Inian over 7 years
    Thought grep could do this in someway. Anyway am accepting this answer agreeing it can't be done with a single invocation and some useful stuff on pcregrep which I haven't used before.
  • Inian
    Inian over 7 years
    Actually, the syntax grep -Po '\.zoo\.\K\d+|: \K.*' worked fine for me? But is there a way you can tell me to remove the whitespaces in the 2nd capturing group? It is currently printing it with a space in a new line.
  • Stéphane Chazelas
    Stéphane Chazelas over 7 years
    See edit. Replaced one space with \s+ as I suppose you had more than one space after the :. Also added a way to make sure the :\s+.* only matches if .zoo.<digits> has been found beforehand.
  • Inian
    Inian over 7 years
    Now this a great answer!