Non-greedy matching using ? with grep

Solution 1

If you have GNU grep, you can use -P to enable non-greedy matching:

$ tr -d \\012 < price.html | grep -Po '<tr>.*?</tr>'

The -P option enables Perl Compatible Regular Expressions (PCRE), which are needed for non-greedy matching with ?, because Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE) do not support it.
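A minimal sketch of the difference, assuming GNU grep built with PCRE support. The sample string and variable names are made up for illustration and are not taken from price.html:

```shell
# Hypothetical one-line input holding two rows.
sample='<tr>stuff</tr><tr>more stuff</tr>'

# ERE: .* is greedy, so a single match runs to the last </tr>.
greedy=$(printf '%s\n' "$sample" | grep -oE '<tr>.*</tr>')

# PCRE: .*? stops at the first </tr>, giving one match per row.
lazy=$(printf '%s\n' "$sample" | grep -oP '<tr>.*?</tr>')

printf 'greedy:\n%s\n' "$greedy"
printf 'lazy:\n%s\n' "$lazy"
```

The greedy variant should print the whole string as one match, while the lazy variant should print each row on its own line.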

With -P you can also use look-arounds to keep the tags out of the printed match:

$ tr -d \\012 < price.html | grep -Po '(?<=<tr>).*?(?=</tr>)'
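As a quick check of the look-around form, here is a small sketch on a made-up input (again assuming GNU grep with PCRE support; `sample` and `contents` are illustrative names):

```shell
# Hypothetical one-line input holding two rows.
sample='<tr>stuff</tr><tr>more stuff</tr>'

# (?<=<tr>) and (?=</tr>) assert the tags without including them in the match.
contents=$(printf '%s\n' "$sample" | grep -oP '(?<=<tr>).*?(?=</tr>)')

printf '%s\n' "$contents"
```

Each line of output should hold only the text between one pair of tags.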

If you don't have GNU grep (or it was built without PCRE support) and the HTML is well formed, you could use a negated bracket expression instead:

$ tr -d \\012 < price.html | grep -o '<tr>[^<]*</tr>'

Note: The above example won't match rows that contain nested tags such as <td> or <th>, because [^<]* cannot match past a <.
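To see that limitation concretely, here is a small sketch with two made-up rows, one flat and one containing a <td>. It assumes a grep that supports -o (GNU and BSD grep both do):

```shell
# Hypothetical rows: one without nested tags, one with a <td> inside.
flat='<tr>1.99</tr>'
nested='<tr><td>1.99</td></tr>'

# The flat row matches, because nothing between the tags is a '<'.
flat_match=$(printf '%s\n' "$flat" | grep -o '<tr>[^<]*</tr>')

# The nested row does not: [^<]* stops at the '<' of <td>.
# '|| true' keeps a non-zero grep exit status from aborting a 'set -e' script.
nested_match=$(printf '%s\n' "$nested" | grep -o '<tr>[^<]*</tr>' || true)

printf 'flat:   [%s]\n' "$flat_match"
printf 'nested: [%s]\n' "$nested_match"
```

The nested row produces no match at all, so real table markup will usually come back empty with this pattern.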

Solution 2

Non-greedy matching is not part of the Extended Regular Expression syntax supported by grep -E. Use grep -P instead if you have that, or switch to Perl / Python / Ruby / what have you. (Oh, and pcregrep.)

Of course, if you really mean

<tr>[^<>]*</tr>

you should say that instead; then plain old grep will work fine.

You could (tediously) extend the regex to accept nested tags which are not <tr>, but of course it's better to use a proper HTML parser than to spend a lot of time rediscovering why regular expressions are not the right tool for this.
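As a rough illustration of how tedious that gets, here is one possible ERE, offered as a sketch rather than a vetted pattern. It lets the row body contain anything except the literal text </tr. It is a simplification: it does not reject a nested <tr>, and malformed input such as a doubled < can fool it. The sample input is made up.

```shell
# Hypothetical one-line input with nested <td> and <th> tags.
sample='<tr><td>1.99</td></tr><tr><th>x</th></tr>'

# Body alternatives: a non-'<' character, '<' not followed by '/',
# '</' not followed by 't', or '</t' not followed by 'r'.
rows=$(printf '%s\n' "$sample" |
  grep -oE '<tr>([^<]|<[^/]|</[^t]|</t[^r])*</tr>')

printf '%s\n' "$rows"
```

Because the repeated group can never consume </tr, each match ends at the first closing row tag even without a lazy quantifier.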

Solution 3

.*? is Perl regular expression syntax. Change your grep to

grep -oP '<tr>.*?</tr>'

Solution 4

Try Perl-style regular expressions (-P):

$ grep -Po '<tr>.*?</tr>' input
<tr>stuff</tr>
<tr>more stuff</tr>

Author: Sven Richter

Updated on June 30, 2022

Comments

  • Sven Richter, almost 2 years ago

    I'm writing a bash script that analyses an HTML file, and I want to get the content of each <tr>...</tr> element. My command looks like this:

    $ tr -d \\012 < price.html | grep -oE '<tr>.*?</tr>'
    

    But grep seems to give me the same result as:

    $ tr -d \\012 < price.html | grep -oE '<tr>.*</tr>'
    

    How can I make .* non-greedy?

  • glenn jackman, over 10 years ago
    Or, if he only wants the contents of the tr tag: grep -oP '(?<=<tr>).*?(?=</tr>)' -- using look-arounds to omit the actual tags
  • glenn jackman, over 10 years ago
    The last example (using "[^<]*") is unlikely to work, since there will inevitably be "td" or "th" tags within "tr".
  • Chris Seymour, over 10 years ago
    @glennjackman Good point. I will leave it in the answer, however, as the general principle might be useful to onlookers.