How to get group results using grep?

Assuming that the pattern in pattern.txt is

(.*)(\d+)(.*)

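then, since \d is a Perl-style escape, using it with GNU grep (built with PCRE support) would be a matter of

grep -P -f pattern.txt line.txt

i.e., search in line.txt for lines matching the Perl-compatible regular expression in pattern.txt (with -E, GNU grep does not treat \d as "a digit", so the pattern would need [[:digit:]] or [0-9] instead), which, given the data in the question, produces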

This order was placed for QT3000! OK?

The issue with your command was that you used -e -f. The -e option is used for explicitly saying "the next argument is the expression". This means that -e -f will be interpreted as "the regular expression to use is -f". You then applied this in searching for matches in both the files mentioned on the command line.
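
The effect of -e -f can be seen in isolation with a small sketch. The scratch directory and the file notes.txt are hypothetical and only serve the demonstration; line.txt holds the line from the question.

```shell
# Hypothetical demo files in a scratch directory
tmp=$(mktemp -d)
printf 'This order was placed for QT3000! OK?\n' > "$tmp/line.txt"
printf 'use -f to name a pattern file\n' > "$tmp/notes.txt"

# -e takes "-f" as the pattern, so notes.txt and line.txt are both
# treated as files to search for the literal string "-f"
wrong=$(cd "$tmp" && grep -e -f notes.txt line.txt)
printf '%s\n' "$wrong"

rm -rf "$tmp"
```

Only the line in notes.txt is printed, prefixed with its file name because two files were searched. Nothing in line.txt contains the string -f, which is why the command in the question printed nothing.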

A secondary issue was the \\d in the pattern.txt file, which matches a backslash followed by the character d, i.e. the literal string \d.

The pattern has a few other "issues". First of all, it uses a non-standard expression to match a digit, \d. This is better written as [[:digit:]] or as the range [0-9] (in the POSIX standard locale). Since regular expressions match on substrings, as opposed to filename globbing patterns, which are always automatically anchored, neither of the .* bits of the pattern is needed. Likewise, the parentheses are not needed at all, as they serve no function in the pattern. The + isn't needed either, as a single digit would be matched by the preceding expression (a single digit is "one or more digits").

This means that to extract all lines that contain (at least) one digit, you may instead use the pattern [[:digit:]] or [0-9], or \d if you want to keep using Perl-like expressions with GNU grep (via its -P option), with no other decorations. For the difference between these, please see Difference between [0-9], [[:digit:]] and \d.
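
As a quick check that the extra decorations do not change which lines are selected, here is a minimal sketch using only POSIX character classes. The second input line is a hypothetical addition so that there is something to filter out.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'This order was placed for QT3000! OK?' \
              'No numbers on this line' > "$tmp/line.txt"

# The original shape of the pattern, with \d replaced by [[:digit:]]
decorated=$(grep -E '(.*)([[:digit:]]+)(.*)' "$tmp/line.txt")

# The bare pattern: any line containing at least one digit
bare=$(grep '[[:digit:]]' "$tmp/line.txt")

printf '%s\n' "$bare"
rm -rf "$tmp"
```

Both commands select the same single line, since grep only decides whether a line matches, not which part of it was captured.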

To get the three different outputs that you show in the question, use sed rather than grep. You want to use sed because grep can only print matching lines (or words), but not really modify the data matched.

  1. Insert Found value: in front of any line containing a digit, and print those lines:

    $ sed -n '/[[:digit:]]/s/^/Found value: /p' line.txt
    Found value: This order was placed for QT3000! OK?
    
  2. Insert Found value: in front of any line containing a digit, and print each such line only up to and including the third digit of its first run of digits (fewer digits are printed if that run is shorter than three digits):

    $ sed -n '/[[:digit:]]/s/\([^[:digit:]]*[[:digit:]]\{1,3\}\).*/Found value: \1/p' line.txt
    Found value: This order was placed for QT300
    
  3. Insert Found value: in front of any line containing a digit, and print the last digit from the line:

    $ sed -n '/[[:digit:]]/s/.*\([[:digit:]]\).*/Found value: \1/p' line.txt
    Found value: 0
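
For comparison, the closest that grep itself gets to a capture group is printing only the matched text. The following is a minimal sketch, assuming a grep that supports the non-standard -o option (GNU and BSD grep do); the prefix is added by printf, not by grep.

```shell
tmp=$(mktemp -d)
printf 'This order was placed for QT3000! OK?\n' > "$tmp/line.txt"

# -o prints each matched substring on its own line instead of the whole line
digits=$(grep -o '[[:digit:]]\{1,\}' "$tmp/line.txt")
printf 'Found value: %s\n' "$digits"

rm -rf "$tmp"
```

This prints the whole run of digits, 3000, but there is no way to pick out one group among several, which is why sed is the better tool for the outputs above.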
    

Using a regular expression equivalent to the one you used, we can see which bits of the text each group matches:

$ sed 's/\(.*\)\([[:digit:]]\{1,\}\)\(.*\)/(\1)(\2)(\3)/' line.txt
(This order was placed for QT300)(0)(! OK?)

Note that \2 only matches the last digit on the line as the preceding .* is greedy.
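
If the intent is for the second group to capture the whole run of digits, one option is to stop the first group from consuming digits at all. This is a sketch of that variation, not something the question asked for:

```shell
tmp=$(mktemp -d)
printf 'This order was placed for QT3000! OK?\n' > "$tmp/line.txt"

# [^[:digit:]]* cannot swallow digits, so group 2 gets the full first run
parts=$(sed 's/\([^[:digit:]]*\)\([[:digit:]]\{1,\}\)\(.*\)/(\1)(\2)(\3)/' "$tmp/line.txt")
printf '%s\n' "$parts"

rm -rf "$tmp"
```

With the line from the question, the three groups become "This order was placed for QT", "3000" and "! OK?".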

Author: Nicholas Saunders

Updated on September 18, 2022

Comments

  • Nicholas Saunders over 1 year

    How would I get this output:

    Found value: This order was placed for QT3000! OK?
    

    or

    Found value: This order was placed for QT300
    

    or

    Found value: 0
    

    using line.txt and pattern.txt as below:

    [nsaunders@rolly regex]$ 
    [nsaunders@rolly regex]$ grep -e -f pattern.txt line.txt 
    [nsaunders@rolly regex]$ 
    [nsaunders@rolly regex]$ cat pattern.txt 
    (.*)(\\d+)(.*)
    [nsaunders@rolly regex]$ 
    [nsaunders@rolly regex]$ cat line.txt 
    This order was placed for QT3000! OK?
    [nsaunders@rolly regex]$ 
    

    utilizing something similar to m.group(0) from a tutorial on regex.

    Perhaps grep doesn't have such notion as:

    Groups and capturing
    Group number
    
    Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:
    
        1       ((A)(B(C)))
        2       (A)
        3       (B(C))
        4       (C)
    
    Group zero always stands for the entire expression.
    
    Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete. 
    
    • Sundeep almost 4 years
    • Nicholas Saunders almost 4 years
      I'm using the -e switch @Sundeep, but is that not sufficient? Perhaps you would elaborate a bit, and thanks for the link.
    • Sundeep almost 4 years
      could you explain how does one line This order was placed for QT3000! OK? translates to three lines of output? you need -E switch for () to act as capture groups.. \d is not supported by grep (unless you have GNU grep which has PCRE support)
    • Nicholas Saunders almost 4 years
      oh, pardon, what I mean is generate each of those three lines using some notion of group(x) with grep. thanks, I updated the question. I'm not sure how \d factors in here. but, yes, I'm asking about capture groups. Hmm, I'm looking into -e versus -E now, thanks...
    • Sundeep almost 4 years
      do you want to print all lines containing a digit character? grep '[0-9]' line.txt ?
    • Sundeep almost 4 years
      if you want only the digits, grep -oE '[0-9]+' (provided your grep supports the -o option)