Sed to print only first pattern match of the line

Solution 1

The .* in the regex pattern is greedy: it matches as long a string as it can, so the quotes that end up being matched are the last ones on the line.

Since the separator is only one character here, we can use a negated bracket expression to match anything but a quote, i.e. [^"], and then repeat it with * to match a run of characters that aren't quotes.

$ echo '... "foo" ... "bar" ...' | sed 's/[^"]*"\([^"]*\)".*/\1/'
foo
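
To make the difference concrete, here is a small sketch that runs both patterns on input shaped like the question's. The variable names and the sample HTML are only for illustration.

```shell
# Sample input modelled on the question (illustrative only).
line='<td><a href="data1">abc</a> ... <a href="data2">abc</a>'

# Greedy: the leading .* swallows everything up to the last pair of quotes.
greedy=$(printf '%s\n' "$line" | sed 's/.*"\(.*\)".*/\1/')

# Negated bracket: [^"]* cannot cross a quote, so the first pair is matched.
first=$(printf '%s\n' "$line" | sed 's/[^"]*"\([^"]*\)".*/\1/')

printf 'greedy: %s\nfirst:  %s\n' "$greedy" "$first"
# greedy: data2
# first:  data1
```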

Another way would be to just remove everything up to the first quote, then remove everything starting from the (new) first quote:

$ echo '... "foo" ... "bar" ...' | sed 's/^[^"]*"//; s/".*$//'
foo
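
One caveat: both commands above print lines that contain no quotes unchanged. A common way around that, sketched here, is to run sed with -n and add the p flag to the substitution, so that only lines where the substitution succeeded are printed. The function name first_quoted is made up for this example.

```shell
# Print the first double-quoted string of each line;
# lines without a quoted string produce no output.
first_quoted() {
  sed -n 's/[^"]*"\([^"]*\)".*/\1/p'
}

printf '%s\n' 'no quotes here' '... "foo" ... "bar" ...' | first_quoted
# foo
```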

In Perl regexes, the * and + quantifiers can be made non-greedy by appending a question mark, so .*? would match anything, but as few characters/bytes as possible.

Solution 2

I won't bore you with the classic warning against using simple regular expressions to parse HTML. Suffice it to say that you should use a dedicated parser instead. That said, the issue here is that sed uses greedy matching, so it will always match the longest possible string. This means that your first .* runs as far as it can and only stops at the last pair of quotes on the line.

You could do this in sed (see below), but using a tool that allows non-greedy matches would be simpler:

$ perl -pe 's/.*?"(.*?)".*/$1/' file
data1

Since sed doesn't support non-greedy matches, you need some other trickery. The simplest would be to use the "not quotes" approach in ikkachu's answer (Solution 1 above). Here's an alternative:

$ rev file | sed 's/.*"\(.*\)".*/\1/' | rev
data1

This just reverses each line of the file character by character (rev), uses your original approach, which now works since the first occurrence has become the last, and then reverses the result back again.
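
The intermediate step is easier to follow on a short line. This is only a sketch with made-up input, and it assumes the rev utility (part of util-linux on most Linux systems) is available.

```shell
line='a "foo" b "bar" c'

# Step 1: reverse the characters of the line.
printf '%s\n' "$line" | rev
# c "rab" b "oof" a

# Steps 2 and 3: the greedy pattern picks the last quoted string of the
# reversed line ("oof"), and the final rev turns it back into foo.
printf '%s\n' "$line" | rev | sed 's/.*"\(.*\)".*/\1/' | rev
# foo
```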

Solution 3

Here are a couple of ways you could pull out data1 from your input:

grep -oP '^[^"]*"\K[^"]*'

sed -ne '
   /\n/!{y/"/\n/;D;}
   P
'

perl -lne '/"([^"]*)"/ and print($1),last'
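
The sed script is terse, so here is the same script spread out with comments, as a reading aid rather than a different method. It assumes a sed that accepts \n in the y command and # comments inside a script (GNU sed does); the function name first_quoted_sed is hypothetical.

```shell
first_quoted_sed() {
  sed -n '
    # A freshly read line has no newline in the pattern space.
    /\n/!{
      # Turn every double quote into a newline.
      y/"/\n/
      # Delete up to the first newline; if text remains, restart the
      # script on it without reading the next input line.
      D
    }
    # Second pass: print what is now in front of the first newline,
    # i.e. the text that sat between the first two quotes.
    P
  '
}

printf '%s\n' '<a href="data1">abc</a> <a href="data2">abc</a>' | first_quoted_sed
# data1
```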

Solution 4

Although the question is not tagged with awk, why not use it? It is as simple as this:

awk -F\" '{print $2}' infile.txt 
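
A small sketch of what the field split does, using made-up sample input. It also shows cut with the same delimiter, as suggested in the comments. The two differ on lines without any quote: awk prints an empty line for them, while cut prints such lines unchanged unless -s is given.

```shell
# Hypothetical sample input: one line with quotes, one without.
sample() {
  printf '%s\n' '<a href="data1">abc</a> <a href="data2">abc</a>' 'no quotes'
}

# With " as the field separator, $1 is the text before the first quote
# and $2 is the text between the first and second quote.
sample | awk -F'"' '{print $2}'

# cut splits the same way; -s suppresses lines without the delimiter.
sample | cut -s -d'"' -f2
```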

Solution 5

You can also do a non-greedy search using the lookahead and lookbehind of Perl regular expressions:

cat data | grep -Po '(?<=href=").*?(?=")' | head -n1
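
Note that head -n1 keeps only the first match of the whole input. The sketch below, which assumes GNU grep built with PCRE support (-P) and uses made-up two-line input, contrasts that pattern with the anchored \K variant from the comments, which yields the first href of every line.

```shell
# Hypothetical sample input with two links per line.
sample() {
  printf '%s\n' \
    '<a href="a1">x</a> <a href="a2">y</a>' \
    '<a href="b1">z</a> <a href="b2">w</a>'
}

# Lookbehind/lookahead: every href value, one per output line.
sample | grep -Po '(?<=href=").*?(?=")'

# Anchored variant: \K drops the prefix from the reported match,
# and ^ limits it to the first href of each line.
sample | grep -Po '^.*?href="\K.*?(?=")'
```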
Author by

GypsyCosmonaut

Updated on September 20, 2022

Comments

  • GypsyCosmonaut
    GypsyCosmonaut over 1 year

    I have some data like

    <td><a href="data1">abc</a> ... <a href="data2">abc</a> ... <a href="data3">abc</a>
    

    ( Would refer to above line as data in code below )

    I need data1 in between the first " and " so I do

    echo 'data' | sed 's/.*"\(.*\)".*/\1/'
    

    but it always returns the last string in between " and ", i.e. in this case it would return data3 instead of data1

    In order to get data1, I end up doing

    echo 'data' | sed 's/.*"\(.*\)".*".*".*".*".*/\1/'
    

    How do I get data1 without this much redundancy in sed?

  • Stéphane Chazelas
    Stéphane Chazelas over 6 years
    cut -d \" -f 2 should be enough
  • GypsyCosmonaut
    GypsyCosmonaut over 6 years
    Your solutions worked, but I really didn't understand how [^"]* works. Could you please give an explanation or provide a link to one?
  • GypsyCosmonaut
    GypsyCosmonaut over 6 years
    rev is nice until I learn perl regex, thanks
  • Stéphane Chazelas
    Stéphane Chazelas over 6 years
    To complicate things further, see also *+ in perl for super-greedy (won't give anything back, won't backtrack). s/^.*+"(.*?)".*/$1/ would not even match as the .*+ would match the whole line and not backtrack to allow the rest to be matched. s/^[^"]*+"(.*?)".*/ could in theory improve performance though I'd expect perl would know how to optimise it in that case.
  • Stéphane Chazelas
    Stéphane Chazelas over 6 years
    grep -Po '^.*?href="\K.*?(?=")' to return only the first of each line.
  • Alessio
    Alessio over 6 years
    HTML::Parser or HTML::TokeParser are good modules to start with for parsing HTML. Add in LWP if you need to fetch the HTML first (e.g. to write a web-scraping robot).
  • AdminBee
    AdminBee over 3 years
    Welcome to the site, and thank you for your contribution. Please note, though, that the OP is explicitly looking for a solution to extract string between double-quotes (" ... "). You may want to edit your answer to explain how this can be achieved using your approach.
