Non-greedy matching using ? with grep
Solution 1
If you have GNU grep, you can use -P to make the match non-greedy:
$ tr -d \\012 < price.html | grep -Po '<tr>.*?</tr>'
The -P option enables Perl Compatible Regular Expressions (PCRE), which are needed for non-greedy matching with ? because Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE) do not support it.
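To see the difference concretely, here is a minimal sketch that assumes GNU grep built with PCRE support. The `sample` string is a made-up stand-in for the output of the `tr` step, not data from price.html.

```shell
# Hypothetical sample: two rows already joined onto one line,
# as the tr step in the commands above would produce.
sample='<tr>stuff</tr><tr>more stuff</tr>'

# ERE (-E): .* is greedy and runs to the last </tr> on the line.
greedy=$(printf '%s\n' "$sample" | grep -oE '<tr>.*</tr>')

# PCRE (-P): .*? is lazy and stops at the first </tr> it can reach.
lazy=$(printf '%s\n' "$sample" | grep -oP '<tr>.*?</tr>')

printf 'greedy:\n%s\n\nlazy:\n%s\n' "$greedy" "$lazy"
```

The greedy pattern reports the whole line as a single match, while the lazy one reports each row on its own line.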
If you are using -P, you could also use lookarounds to avoid printing the tags in the match, like so:
$ tr -d \\012 < price.html | grep -Po '(?<=<tr>).*?(?=</tr>)'
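As a rough illustration of what the lookaround version prints, the sketch below runs it on a made-up one-line sample (an assumption, not the real price.html). It again assumes GNU grep with PCRE support.

```shell
# Hypothetical sample standing in for the newline-stripped HTML.
sample='<tr>stuff</tr><tr>more stuff</tr>'

# (?<=<tr>) requires <tr> just before the match and (?=</tr>) requires
# </tr> just after it; neither is included in what -o prints.
inner=$(printf '%s\n' "$sample" | grep -oP '(?<=<tr>).*?(?=</tr>)')

printf '%s\n' "$inner"
```

Only the text between the tags is printed, one row per line.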
If you don't have GNU grep and the HTML is well formed, you could just do:
$ tr -d \\012 < price.html | grep -o '<tr>[^<]*</tr>'
Note: The above example won't work with nested tags within <tr>.
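The limitation in the note can be reproduced with a small sketch. The `rows` string is a hypothetical input chosen so that one row contains a nested tag.

```shell
# Hypothetical rows: one with plain text, one with a nested <td> cell.
rows='<tr>plain</tr><tr><td>cell</td></tr>'

# [^<]* cannot cross the "<" that opens <td>, so the second row never matches.
matched=$(printf '%s\n' "$rows" | grep -o '<tr>[^<]*</tr>')

printf '%s\n' "$matched"
```

Only the first row is reported; the row containing `<td>` is silently skipped, which is why this pattern suits only rows without inner markup.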
Solution 2
Non-greedy matching is not part of the Extended Regular Expression syntax supported by grep -E. Use grep -P instead if you have that, or switch to Perl / Python / Ruby / what have you. (Oh, and pcregrep.)
Of course, if you really mean <tr>[^<>]*</tr>, you should say that instead; then plain old grep will work fine.
You could (tediously) extend the regex to accept nested tags which are not <tr>, but of course it's better to use a proper HTML parser than to spend a lot of time rediscovering why regular expressions are not the right tool for this.
Solution 3
.*? is Perl regular expression syntax. Change your grep to:
grep -oP '<tr>.*?</tr>'
Solution 4
Try a Perl-style regexp:
$ grep -Po '<tr>.*?</tr>' input
<tr>stuff</tr>
<tr>more stuff</tr>
Sven Richter
Updated on June 30, 2022
Comments
-
Sven Richter almost 2 years
I'm writing a bash script which analyses an HTML file, and I want to get the content of each single <tr>...</tr>. So my command looks like:
$ tr -d \\012 < price.html | grep -oE '<tr>.*?</tr>'
But it seems that grep gives me the result of:
$ tr -d \\012 < price.html | grep -oE '<tr>.*</tr>'
How can I make .* non-greedy?
-
glenn jackman over 10 years
Or, if he only wants the contents of the tr tag:
grep -oP '(?<=<tr>).*?(?=</tr>)'
-- using look-arounds to omit the actual tags
-
glenn jackman over 10 years
The last example (using "[^<]*") is unlikely to work, since there will inevitably be "td" or "th" tags within "tr".
-
Chris Seymour over 10 years
@glennjackman good point. I will leave it in the answer, however, as the general principle might be useful to onlookers.