PCRE-regex Use grep to exclude a capturing group
grep's name comes from the g/re/p ed command. Its primary purpose is to print the lines that match a regexp. It's not its role to edit the content of those lines. You have sed (the stream editor) or awk for that.
Now, some grep implementations, starting with GNU grep, added a -o option to print the matched portion of each line (what is matched by the regexp, not its capture groups). Some grep implementations, like GNU's again (with -P) or pcregrep, also support PCREs for their regexps.
pcregrep actually added a -o<n> option to print the content of a capture group. So you could do:
pcregrep -o1 -o2 --om-separator=' ' '\.zoo\.(\d+).*:\s+(.*)'
But here, the obvious standard solution is to use sed:
sed -n 's/^.*\.zoo\.\([0-9]\{1,\}\).*:[[:space:]]\{1,\}/\1 /p'
Or if you want perl regexps, use perl:
perl -lne 'print "$1 $2" if /\.zoo\.(\d+).*:\s+(.*)/'
With GNU grep, if you don't mind the matches appearing on different lines, you can do:
$ grep -Po '\.zoo\.\K\d+|:\s+\K.*' < file
2
0.45654343
Note that while \K resets the start of the matched portion, that doesn't mean you can get away with the two parts of the alternation overlapping. grep -Po '\.zoo\.(\K\d+|.*: \K.*)' would not work, just like echo foobar | grep -Po 'foo|foob' wouldn't work (at printing both foo and foob). foo|foob first matches foo, and then grep looks for other potential matches in the input after the foo, so starting at the b of bar, and can't find any more after that.
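The non-overlapping behaviour can be seen directly. The sketch below assumes a GNU grep built with PCRE support (-P); the variable names and the second, reordered pattern are only illustrations.

```shell
# Assumes GNU grep with -P support.
# "foo" is matched first; the search then resumes at "bar",
# so "foob" is never reported.
overlap_out=$(echo foobar | grep -Po 'foo|foob')
printf '%s\n' "$overlap_out"

# Putting the longer alternative first changes which one is found,
# but there is still only one match for the line.
longest_out=$(echo foobar | grep -Po 'foob|foo')
printf '%s\n' "$longest_out"
```

The first command prints only foo and the second only foob; neither prints both.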
Above, with grep -Po '\.zoo\.\K\d+|:\s+\K.*', we only look for :<spaces><anything> in the second part of the alternation. That does match in the part that is after .zoo.<digits>, but that also means it would find those :<spaces><anything> anywhere in the input, not only when they follow .zoo.<digits>.
There is a way to work around that though, using another PCRE special operator: \G. \G matches at the start of the subject. For a single match, that's equivalent to ^, but with multiple matches (think of sed/perl's g flag in s/.../.../g), like with -o where grep tries to find all the matches in the line, it also matches after the end of the previous match. So if you make it:
grep -Po '\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*'
where (?!^) is a negative look-ahead operator that means "not at the beginning of the line", that \G will only match after a previous successful (non-empty) match, so .*:\s+\K.* will only match if it follows a previous successful match, and that can only be the .zoo.<digits> one since the other part of the alternation matches until the end of the line.
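To see what \G changes, the two patterns can be compared on the question's sample line plus a second line that has a :<spaces> part but no .zoo.<digits>. This is a sketch assuming a GNU grep with -P support; the second input line and the variable names are made up for the illustration.

```shell
# Assumes GNU grep with -P support. The second input line is a made-up
# example containing ":<spaces><anything>" but no ".zoo.<digits>".
input='FOO_1BAR.zoo.2.someString:More-RandomString (string here too): 0.45654343
nothing here: stray value'

# Without \G, the second alternative also matches on the line
# that has no ".zoo.<digits>".
plain_out=$(printf '%s\n' "$input" | grep -Po '\.zoo\.\K\d+|:\s+\K.*')

# With (?!^)\G, the second alternative can only continue from the end
# of a previous match on the same line.
anchored_out=$(printf '%s\n' "$input" |
  grep -Po '\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*')

printf 'plain:\n%s\nanchored:\n%s\n' "$plain_out" "$anchored_out"
```

The plain pattern should report 2, 0.45654343 and stray value, while the \G variant should report only 2 and 0.45654343.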
However, on an input like:
.zoo.1.zoo.2 tar: blah
That would output:
1
2
blah
If you did not want that, you'd also want the first part of the alternation to only match at the beginning of the line. Something like:
grep -Po '^.*?\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*'
That still outputs 2 on an input like .zoo.2 no colon character or .zoo.2 blah:. You could work around that with a look-ahead operator in the first part of the alternation, and by looking for at least one non-space after :<spaces> (also using $ to avoid issues with non-characters):
grep -Po '^.*?\.zoo\.\K\d+(?=.*:\s+\S.*$)|(?!^)\G.*:\s+\K\S.*$'
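A sketch of how that last pattern behaves on the inputs discussed above, again assuming a GNU grep with -P support (the variable names are only for illustration):

```shell
# Assumes GNU grep with -P support. Inputs: the question's sample line
# and the three edge cases mentioned in the text.
input='FOO_1BAR.zoo.2.someString:More-RandomString (string here too): 0.45654343
.zoo.2 no colon character
.zoo.2 blah:
.zoo.1.zoo.2 tar: blah'

# First alternative: digits after the first ".zoo." on the line, but only
# if ":<spaces><non-space>" follows later on. Second alternative: the text
# after the last ":<spaces>", only directly after a previous match.
strict_out=$(printf '%s\n' "$input" |
  grep -Po '^.*?\.zoo\.\K\d+(?=.*:\s+\S.*$)|(?!^)\G.*:\s+\K\S.*$')

printf '%s\n' "$strict_out"
```

The second and third lines should produce nothing, the first should give 2 and 0.45654343, and the last should give 1 and blah (the second .zoo.2 is no longer reported).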
You'd probably need a few pages of comments to explain that regexp, so I would still go for the straightforward sed/perl solutions...
Inian
Updated on September 18, 2022
Comments
-
Inian over 1 year
I am using GNU grep with the -P PCRE regex support for matching strings from a file. The input file has lines containing strings like:
FOO_1BAR.zoo.2.someString:More-RandomString (string here too): 0.45654343
I want to capture the numbers 2 and 0.45654343 from the above line. I used the regex
grep -Po ".zoo.\K[\d+](.*):\ (.*)$" file
But this gives me the result
2.someString:More-RandomString (string here too): 0.45654343
I am able to get the first number from the first capturing group as 2, and also to match a capturing group at the end of the line. But I am not able to skip the words between the two capturing groups.
I know for a fact that I have a group (.*) that is capturing those words in the middle. What I've tried to do is include another \K to ignore it, as
grep -Po ".zoo.\K[\d+](.*):\K (.*)$" file
But that gave me only the second capture group, as 0.556984.
I also tried a non-capturing group with the (?:) syntax, as
grep -Po ".zoo.\K[\d+](?=.someString:More-RandomString (string here too)):\ (.*)$"
But this gave me nothing. What am I missing here?
-
Admin over 7 years
You're missing basic understanding of how Perl regexps are supposed to work. You're also missing basic sense of not trying to do this with a single grep command.
-
Admin over 7 years
@SatoKatsura: I wanted to use a single grep and I hoped it would be possible. And the reason for "You're missing basic understanding of how Perl regexps are supposed to work"? I made a decent attempt at solving the issue.
-
Admin over 7 years
\K doesn't do what you seem to think it does. Neither does [\d+].
-
Admin over 7 years
@SatoKatsura: Why do you think that? Can you point out how it is incorrect?
-
Admin over 7 years
Because (1) it doesn't make sense to have more than one \K in the same regexp, and (2) how do you explain the output of something like this: echo 1+2 | grep -Po '[\d+]'?
-
Admin over 7 years
@SatoKatsura: Appreciate your comments. Will learn more about PCRE syntaxes.
-
Inian over 7 years
Appreciate your answer. Did I miss something in my question? Do you mean that I simply can't do what I intended to do with a single grep?
-
Stéphane Chazelas over 7 years
@Inian, You can't easily with a single invocation of the current version of GNU grep (the one I suppose you're trying to use, as it seems it supports -P and -o, though that could also be the one of FreeBSD/OS/X that are rewrites of GNU grep). You can with other grep implementations like pcregrep. But I argue you're picking the wrong tool for the task. Use sed to edit streams.
-
Inian over 7 years
I am quite easily able to do this using only bash native regex, as
[[ "$string" =~ .zoo.([[:digit:]]+).*:\ (.*)$ ]]
and print as
printf "%s\t%s\n" "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]//[[:blank:]]}"
-
Inian over 7 years
Thought grep could do this in some way. Anyway, I am accepting this answer, agreeing it can't be done with a single invocation, and for the useful material on pcregrep, which I haven't used before.
-
Inian over 7 years
Actually, the syntax grep -Po '\.zoo\.\K\d+|: \K.*' worked fine for me? But is there a way to remove the whitespace in the 2nd capturing group? It is currently printing it with a space on a new line.
-
Stéphane Chazelas over 7 years
See edit. Replaced one space with \s+ as I suppose you had more than one space after the :. Also added a way to make sure the :\s+.* only matches if .zoo.<digits> has been found beforehand.
-
Inian over 7 years
Now this is a great answer!