How to use sed, awk, or gawk to print only what is matched?

Solution 1

My sed (Mac OS X) didn't work with +. I tried * instead and added the p flag to print the match:

sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt

For matching at least one numeric character without +, I would use:

sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
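To tie this back to the myvalue=$( ... ) form from the question, here is a minimal sketch. It assumes the sample input from the question, generated inline with a small helper (sample is a made-up name) instead of being read from a file, and it uses the basic-regex interval \{1,\} as another way to say "one or more" without +.

```shell
# Sample input from the question, generated inline (an assumption; use your own file instead).
sample() { printf '%s\n' a b c abc12345xyz a b c; }

# \{1,\} is the basic-regex spelling of "one or more", so this avoids +.
myvalue=$(sample | sed -n 's/^.*abc\([0-9]\{1,\}\)xyz.*$/\1/p')

echo "$myvalue"   # prints 12345
```

With a real file, drop the helper and pass the file name as the last argument to sed.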

Solution 2

You can use sed to do this:

sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
  • -n suppresses the automatic printing of every line
  • -r enables extended regular expressions, so you don't have to escape the capture group parens ()
  • \1 the capture group match
  • /g global match
  • /p print the result of a successful substitution
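A quick way to check what the flags do, as a sketch rather than a definitive reference. It uses -E instead of -r (accepted by BSD sed and by recent GNU sed, as far as I know) and a made-up input line that contains two matches.

```shell
# Made-up input with two candidate matches on one line.
line='abc12345xyzabc777xyz'

# The leading .* is greedy, so only the last match survives, with or without g.
without_g=$(printf '%s\n' "$line" | sed -En 's/.*abc([0-9]+)xyz.*/\1/p')
with_g=$(printf '%s\n' "$line" | sed -En 's/.*abc([0-9]+)xyz.*/\1/gp')

echo "$without_g $with_g"   # prints 777 777
```

The substitution consumes the whole line in one go, so there is nothing left for g to match a second time.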

I wrote a tool for myself that makes this easier

rip 'abc(\d+)xyz' '$1'

Solution 3

I use perl to make this easier for myself. e.g.

perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'

This runs Perl. The -n option instructs Perl to read one line at a time from STDIN and execute the code on it. The -e option specifies the instruction to run.

The instruction runs a regexp on each line read and, if it matches, prints the contents of the first set of brackets ($1).

You can also do this with multiple file names on the end, e.g.

perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt

Solution 4

You can use awk with match() to access the captured group (the three-argument form of match() is a gawk extension):

$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345

This tries to match the pattern abc[0-9]+xyz. If it succeeds, it stores the matched pieces in the array matches, where matches[1] holds the text captured by the group [0-9]+. Since match() returns the character position, or index, where the matched substring begins (1 if it starts at the beginning of the string, 0 if there is no match), a successful match is a true condition and triggers the print action.
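If gawk is not available, a rough equivalent in plain awk is sketched below. It relies only on the two-argument match() and the RSTART/RLENGTH variables, and it assumes the prefix and suffix have the fixed lengths of "abc" and "xyz" from the question; a variable-length prefix would need a different approach.

```shell
# Sample input from the question, generated inline.
digits=$(printf '%s\n' a b c abc12345xyz a b c |
    awk 'match($0, /abc[0-9]+xyz/) {
        # RSTART and RLENGTH describe the whole match; skip "abc" and drop "xyz".
        print substr($0, RSTART + 3, RLENGTH - 6)
    }')

echo "$digits"   # prints 12345
```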


With grep you can use a look-behind and look-ahead:

$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345

$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345

This matches the pattern [0-9]+ only when it occurs between abc and xyz, and prints just the digits.
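The -P option needs a grep built with PCRE support, which is not always the case. A hedged alternative that only needs -o and -E: let grep cut out each whole match, then strip the fixed prefix and suffix with sed. The input line here is made up to show that several matches on one line are all reported.

```shell
# Made-up line with two matches to show that each one is reported.
values=$(printf '%s\n' 'abc12345xyz and abc777xyz' |
    grep -oE 'abc[0-9]+xyz' |
    sed -e 's/^abc//' -e 's/xyz$//')

echo "$values"
```

grep -o prints every match on its own line, so the sed stage sees one match per line and can anchor on ^ and $.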

Solution 5

If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.

If not then here's the best sed I could come up with:

sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'

... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one.)

The problem with something like:

sed -e 's/.*\([0-9]*\).*/&/' 

.... or

sed -e 's/.*\([0-9]*\).*/\1/'

... is that sed only supports "greedy" matching, so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match, or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).
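The difference is easy to see side by side. This is a small sketch using the abc12345xyz line from the question; the variable names are only for illustration.

```shell
# Greedy: the first .* swallows the whole line, so the group captures nothing.
greedy=$(echo 'abc12345xyz' | sed -e 's/.*\([0-9]*\).*/\1/')

# Negated class: [^0-9]* cannot run past the first digit, so the group gets the number.
fixed=$(echo 'abc12345xyz' | sed -n 's/[^0-9]*\([0-9][0-9]*\).*/\1/p')

echo "greedy=[$greedy] fixed=[$fixed]"   # prints greedy=[] fixed=[12345]
```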

Author: Stéphane

Linux, Ubuntu, C++ developer. https://www.linkedin.com/in/scharette http://www.ccoderun.ca/

Updated on May 06, 2021

Comments

  • Stéphane
    Stéphane about 3 years

    I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.

    But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:

    Example regular expression:

    .*abc([0-9]+)xyz.*
    

    Example input file:

    a
    b
    c
    abc12345xyz
    a
    b
    c
    

    As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do, is from within my bash script have:

    myvalue=$( sed <...something...> input.txt )
    

    Things I've tried include:

    sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
    sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
    
  • Stéphane
    Stéphane over 14 years
    Thanks, but we don't have access to perl, which is why I was asking about sed/awk/gawk.
  • Stéphane
    Stéphane over 14 years
    Interesting... So there isn't a simple way to apply a complex regular expression and get back just what is in the (...) section? Cause while I see what you did here first with grep then with sed, our real situation is much more complex than dropping "abc" and "xyz". The regular expression is used because lots of different text can appear on either side of the text I'd like to extract.
  • paxdiablo
    paxdiablo over 14 years
    I'm sure there is a better way if the REs are really complex. Perhaps if you provided a few more examples or a more detailed description, we could adjust our answers to suit.
  • Stéphane
    Stéphane over 14 years
    Thank you, this worked for me as well once I used * instead of +.
  • Stéphane
    Stéphane over 14 years
    ...and the "p" option to print the match, which I didn't know about either. Thanks again.
  • SourceSeeker
    SourceSeeker over 14 years
    I had to escape the + and then it worked for me: sed -n 's/^.*abc\([0-9]\+\)xyz.*$/\1/p'
  • Stéphane
    Stéphane over 14 years
    This doesn't seem to work. It prints the entire line instead of the match.
  • ghostdog74
    ghostdog74 over 14 years
    in your sample input file , that pattern is the whole line. right??? if you know the pattern is going to be in a specific field: use $1, $2 etc.. eg gawk '$1 ~ /.*abc([0-9]+)xyz.*/' file
  • SourceSeeker
    SourceSeeker over 14 years
    You can just combine two of your sed commands in this way: sed -n 's/[^0-9]*\([0-9]\+\).*/\1/p'
  • Stéphane
    Stéphane over 14 years
    Previously didn't know about -o option on grep. Nice to know. But it prints the entire match, not the "(...)". So if you are matching on "abc([[:digit:]]+)xyz" then you get the "abc" and "xyz" as well as the digits.
  • anddam
    anddam about 11 years
    That's because you're not using the modern RE format, so + is an ordinary character and you're supposed to express repetition with the {,} syntax. You can use the -E sed option to enable the modern RE format. Check re_format(7), specifically the last paragraph of DESCRIPTION: developer.apple.com/library/mac/#documentation/Darwin/Reference/…
  • Mark Lakata
    Mark Lakata about 11 years
    This does not output the numeric value ([0-9+]), this outputs the entire line.
  • cincodenada
    cincodenada over 10 years
    A clever, workable solution if you need to (or want to) use gawk. You noted this, but to be clear: non-GNU awk doesn't have gensub(), and therefore doesn't support this.
  • Nik Reiman
    Nik Reiman over 7 years
    This is by far the best, and most well-explained answer so far!
  • fedorqui
    fedorqui over 7 years
    Nice! However, it may be best to use match() to access the captured groups. See my answer for this.
  • r4phG
    r4phG over 6 years
    With some explanation, it's way better to understand what's wrong with our issue. Thank you !
  • Bruno Bronosky
    Bruno Bronosky over 4 years
    Thanks for reminding me of grep -o! I was trying to do this with sed and struggled with my need to find multiple matches on some lines. My solution is stackoverflow.com/a/58308239/117471
  • Jonathan Leffler
    Jonathan Leffler over 3 years
    As well as the -E option, you can use \{1,\} (in place of * or +) to count one or more repeats. You can specify a lower bound or an upper bound or both.
  • Avihai Marchiano
    Avihai Marchiano over 2 years
    1. You don't need both the -n and the /p. You just need one of them. 2. The global flag has no effect here, because sed is greedy, so with or without it you will get the same result for multiple occurrences: sed -r 's/.*abc([0-9]+)xyz.*/\1/' <<< abc12345xyzabc777xyz AND sed -r 's/.*abc([0-9]+)xyz.*/\1/g' <<< abc12345xyzabc777xyz Both yield: 777
  • Ilia Choly
    Ilia Choly over 2 years
    @AvihaiMarchiano I just tested and it seems like you're right about the /g flag. But removing either -n or /p results in no output being printed for me.