How to extract multiple pattern from each line with sed, awk, or grep

text-processing awk sed grep

10,931

Solution 1

Assumptions

The input is a text file that contains strings (sequences of non-blank characters) separated by sequences of blank characters. Each line contains a specific word (known at runtime) followed (not necessarily immediately) by a string that is a number in the fashion of a version number. (Apparently this means only that it begins with a digit.)

It must be possible to specify the word to look for as a parameter at runtime. For example, to search for the word tech, we should be able to say something like

word=tech

and let the command (or script) use $word. The word should be matched exactly; e.g., “technology”, “nanotech” and “Tech” should not be matched. The word should contain only letters, digits, and _ (underscore) — punctuation characters, and, especially, characters that are special in regular expressions — may produce undesired results. For each qualifying line, the command should output the word and the number, separated by a space (and nothing else). If the file contains lines that do not conform to these assumptions (for example, not containing the desired word or any number), behavior is undefined. In particular, such non-conforming lines may simply be ignored.

For all the below commands, $word will be assumed to be defined as described above.

Note: Each of these commands can be formulated in different ways. In some cases, the differences are trivial.

`grep`

plain `grep`

I couldn’t figure out how to do this.

plain `grep` with an assist

The command

grep    "\<$word\>\|\<[[:digit:]][[:graph:]]*\>"

will match every line that contains either the word (\<$word\>) or (\|) a number (\<[[:digit:]][[:graph:]]*\>). ([[:graph:]] means a letter, digit or punctuation character; i.e., anything other than a blank.) The output from this command in --color mode is slightly interesting:

grep -o "\<$word\>\|\<[[:digit:]][[:graph:]]*\>"

outputs each matching string — and only the matching strings — on separate lines:

tech
1.2
tech
1
tech
0.1
tech
10.1.3
tech
7.5
tech
8.0
tech
0.x
tech
1.3.x
tech
5.x
tech
2.0.4x

So then we do

grep -o "\<$word\>\|\<[[:digit:]][[:graph:]]*\>" (input_file) | sed "/$word/ { N; s/\n/ / }"

to take the above output and join each line containing the word (tech) with the following line (separating them with a space):

tech 1.2
tech 1
tech 0.1
tech 10.1.3
tech 7.5
tech 8.0
tech 0.x
tech 1.3.x
tech 5.x
tech 2.0.4x

`pcregrep`

pcregrep -o1 -o2 --om-separator=' ' "\b($word)\b.*?\b(\d\S*)"

matches the word and a number (\b is a word boundary, \d is a digit, and \S is any character other than a space), capturing each of them in a (…) group. Then it uses -o to output only the matching strings — but, in pcregrep, you can say -o1 -o2 to output capture groups 1 and 2. The --om-separator=' ', obviously, specifies what to put between the strings.

Note: since this uses .*? (non-greedy match), if there are multiple numbers in the input line, this will find the first one. The other commands will find the last one.

`sed`

sed -n "s/.*\(\<$word\>\).*[[:blank:]]\(\<[[:digit:]][[:graph:]]*\).*/\1 \2/p"

Similar to the pcregrep command, this matches the strings in capture groups and then outputs them as \1 \2.

`awk`

awk -v the_word="$word" '
    {
        w=0             # Index of word
        n=0             # Index of number
        for (i=0; i<=NF; i++) {
                if ($i == the_word) w=i
                if (substr($i,1,1) ~ /[[:digit:]]/) n=i
        }
        if (w>0  &&  n>w) print $w, $n
    }'

This looks for the word (the_word) and a number (a string whose first character is a digit). If it finds them both, in that order, it prints them both.

Note: this will recognize the word only if it is fully free-standing. The other commands will match it if it touches punctuation; e.g.,

The cyber clock goes tech, tock …
This contains the word (tech) …

Solution 2

The following code should do as desired.

searchword="tech"
(cat << EOF
tech -this is sample and ... 1.2 sample
bla bla bla test tech 1 one two
foo bar foo bar tech and [email protected] 0.1
bar bar bar foo foo tech 10.1.3
tech 7.5 and continue this line
sample_test one two tech 8.0
one two there tech 0.x *sample bla bla
www sample tech 1.3.x foo bar $foo #bar
foobar barbar bla bla tech 5.x
tech testtest 2.0.4x this is sample
EOF
) | grep $searchword |\
grep -o '\b[0-9x][0-9x]*\b\|\b[0-9][0-9]*\.[0-9x][0-9x]*\b\|\b[0-9][0-9]*\.[0-9][0-9]*\.[0-9x][0-9x]*\b' |\
sed "s/^/$searchword /"

will yield you

tech 1.2
tech 1
tech 0.1
tech 10.1.3
tech 7.5
tech 8.0
tech 0.x
tech 1.3.x
tech 5.x
tech 2.0.4x

at least with

bash GNU bash, version 4.4.5(1)-release
sed sed (GNU sed) 4.2.2
grep grep (GNU grep) 2.27

I'd be glad it this answer helped you, or else you would consider being more explicit and more explaining in your question

10,931

alrz

Updated on September 18, 2022

Comments

alrz over 1 year
I have a text file that looks like this:
```
tech -this is sample and ... 1.2 sample
bla bla bla test tech 1 one two
foo bar foo bar tech and [email protected] 0.1
bar bar bar foo foo tech 10.1.3
tech 7.5 and continue this line
sample_test one two tech 8.0
one two there tech 0.x *sample bla bla
www sample tech 1.3.x foo bar $foo #bar
foobar barbar bla bla tech 5.x
tech testtest 2.0.4x this is sample
```
I want to extract sample text — a word like tech and a number pattern like this 7.5 and other number patterns.

(actually the number patterns are version-control style version numbers)

and then get output as follows:
```
tech 1.2
tech 1
tech 0.1
tech 10.1.3
tech 7.5
tech 8.0
tech 0.x
tech 1.3.x
tech 5.x
tech 2.0.4x
```
- jordanm over 7 years
  
  It's not clear what the significance or pattern is with the word tech. Why not just hard code it? How do you know it's the word you are looking for?
- alrz over 7 years
  
  word <tech> is a example. Suppose word <apache>. actually word is not a pattern. it is exactly a word
- G-Man Says 'Reinstate Monica' over 7 years
  
  (1) That comment is as clear as mud. Please explain better. (2) Please explain where the 0.x in the seventh line comes from.
- Sergiy Kolodyazhnyy over 7 years
  
  What about one two there tech *sample bla bla ? There's tech but no number. What should be done in that case ?
- alrz over 7 years
  
  sorry you are right.