How to extract multiple pattern from each line with sed, awk, or grep
Solution 1
Assumptions
The input is a text file that contains strings (sequences of non-blank characters) separated by sequences of blank characters. Each line contains a specific word (known at runtime) followed (not necessarily immediately) by a string that is a number in the fashion of a version number. (Apparently this means only that it begins with a digit.)
It must be possible to specify the word to look for as a parameter at runtime. For example, to search for the word tech, we should be able to say something like
word=tech
and let the command (or script) use $word
.
The word should be matched exactly;
e.g., “technology”, “nanotech” and “Tech” should not be matched.
The word should contain only letters, digits, and _
(underscore) —
punctuation characters, and, especially,
characters that are special in regular expressions —
may produce undesired results.
For each qualifying line,
the command should output the word and the number,
separated by a space (and nothing else).
If the file contains lines that do not conform to these assumptions
(for example, not containing the desired word or any number),
behavior is undefined.
In particular, such non-conforming lines may simply be ignored.
For all the below commands,
$word
will be assumed to be defined as described above.
Note: Each of these commands can be formulated in different ways. In some cases, the differences are trivial.
grep
plain grep
I couldn’t figure out how to do this.
plain grep
with an assist
The command
grep "\<$word\>\|\<[[:digit:]][[:graph:]]*\>"
will match every line that contains either the word (\<$word\>
)
or (\|
) a number (\<[[:digit:]][[:graph:]]*\>
).
([[:graph:]]
means a letter, digit or punctuation character;
i.e., anything other than a blank.)
The output from this command in --color
mode is slightly interesting:
grep -o "\<$word\>\|\<[[:digit:]][[:graph:]]*\>"
outputs each matching string — and only the matching strings — on separate lines:
tech
1.2
tech
1
tech
0.1
tech
10.1.3
tech
7.5
tech
8.0
tech
0.x
tech
1.3.x
tech
5.x
tech
2.0.4x
So then we do
grep -o "\<$word\>\|\<[[:digit:]][[:graph:]]*\>" (input_file) | sed "/$word/ { N; s/\n/ / }"to take the above output and join each line containing the word (tech) with the following line (separating them with a space):
tech 1.2
tech 1
tech 0.1
tech 10.1.3
tech 7.5
tech 8.0
tech 0.x
tech 1.3.x
tech 5.x
tech 2.0.4x
pcregrep
pcregrep -o1 -o2 --om-separator=' ' "\b($word)\b.*?\b(\d\S*)"
matches the word and a number (\b
is a word boundary,
\d
is a digit, and \S
is any character other than a space),
capturing each of them in a (
…)
group.
Then it uses -o
to output only the matching strings —
but, in pcregrep
, you can say -o1 -o2
to output capture groups 1 and 2.
The --om-separator=' '
, obviously,
specifies what to put between the strings.
Note: since this uses .*?
(non-greedy match),
if there are multiple numbers in the input line,
this will find the first one.
The other commands will find the last one.
sed
sed -n "s/.*\(\<$word\>\).*[[:blank:]]\(\<[[:digit:]][[:graph:]]*\).*/\1 \2/p"
Similar to the pcregrep
command,
this matches the strings in capture groups and then outputs them as \1 \2
.
awk
awk -v the_word="$word" '
{
w=0 # Index of word
n=0 # Index of number
for (i=0; i<=NF; i++) {
if ($i == the_word) w=i
if (substr($i,1,1) ~ /[[:digit:]]/) n=i
}
if (w>0 && n>w) print $w, $n
}'
This looks for the word (the_word
)
and a number (a string whose first character is a digit).
If it finds them both, in that order, it prints them both.
Note: this will recognize the word only if it is fully free-standing. The other commands will match it if it touches punctuation; e.g.,
The cyber clock goes tech, tock …
This contains the word (tech) …
Solution 2
The following code should do as desired.
searchword="tech"
(cat << EOF
tech -this is sample and ... 1.2 sample
bla bla bla test tech 1 one two
foo bar foo bar tech and [email protected] 0.1
bar bar bar foo foo tech 10.1.3
tech 7.5 and continue this line
sample_test one two tech 8.0
one two there tech 0.x *sample bla bla
www sample tech 1.3.x foo bar $foo #bar
foobar barbar bla bla tech 5.x
tech testtest 2.0.4x this is sample
EOF
) | grep $searchword |\
grep -o '\b[0-9x][0-9x]*\b\|\b[0-9][0-9]*\.[0-9x][0-9x]*\b\|\b[0-9][0-9]*\.[0-9][0-9]*\.[0-9x][0-9x]*\b' |\
sed "s/^/$searchword /"
will yield you
tech 1.2
tech 1
tech 0.1
tech 10.1.3
tech 7.5
tech 8.0
tech 0.x
tech 1.3.x
tech 5.x
tech 2.0.4x
at least with
- bash
GNU bash, version 4.4.5(1)-release
- sed
sed (GNU sed) 4.2.2
- grep
grep (GNU grep) 2.27
I'd be glad it this answer helped you, or else you would consider being more explicit and more explaining in your question
Related videos on Youtube
alrz
Updated on September 18, 2022Comments
-
alrz over 1 year
I have a text file that looks like this:
tech -this is sample and ... 1.2 sample bla bla bla test tech 1 one two foo bar foo bar tech and [email protected] 0.1 bar bar bar foo foo tech 10.1.3 tech 7.5 and continue this line sample_test one two tech 8.0 one two there tech 0.x *sample bla bla www sample tech 1.3.x foo bar $foo #bar foobar barbar bla bla tech 5.x tech testtest 2.0.4x this is sample
I want to extract sample text — a word like tech and a number pattern like this 7.5 and other number patterns.
(actually the number patterns are version-control style version numbers)
and then get output as follows:
tech 1.2 tech 1 tech 0.1 tech 10.1.3 tech 7.5 tech 8.0 tech 0.x tech 1.3.x tech 5.x tech 2.0.4x
-
jordanm over 7 yearsIt's not clear what the significance or pattern is with the word
tech
. Why not just hard code it? How do you know it's the word you are looking for? -
alrz over 7 yearsword <tech> is a example. Suppose word <apache>. actually word is not a pattern. it is exactly a word
-
G-Man Says 'Reinstate Monica' over 7 years(1) That comment is as clear as mud. Please explain better. (2) Please explain where the
0.x
in the seventh line comes from. -
Sergiy Kolodyazhnyy over 7 yearsWhat about
one two there tech *sample bla bla
? There'stech
but no number. What should be done in that case ? -
alrz over 7 yearssorry you are right.
-