Sed to print only first pattern match of the line

8,826

Solution 1

The .* in the regex pattern is greedy: it matches as long a string as it can, so the quotes that end up being matched are the last ones on the line.
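To see the greediness in action, here is a minimal sketch on a short sample string. The function name last_quoted is only an illustrative label, not something from the question.

```shell
# Greedy match: the leading .* swallows as much as it can, so the
# captured group ends up being the LAST quoted string on the line.
last_quoted() {
    printf '%s\n' "$1" | sed 's/.*"\(.*\)".*/\1/'
}

last_quoted '... "foo" ... "bar" ...'
# prints: bar
```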

Since the separator is only one character here, we can use a negated bracket expression, [^"], to match any character except a quote, and repeat it with * to match a run of characters that aren't quotes.

$ echo '... "foo" ... "bar" ...' | sed 's/[^"]*"\([^"]*\)".*/\1/'
foo
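If it helps to see what each piece of that pattern matches, the following sketch marks the matched part with angle brackets. The variable name line and the bracket markers are just for illustration.

```shell
line='... "foo" ... "bar" ...'

# [^"]* alone matches the longest run of non-quote characters at the
# start of the line, so it stops right before the first quote.
printf '%s\n' "$line" | sed 's/[^"]*/<&>/'
# prints: <... >"foo" ... "bar" ...

# Adding the quotes and a capture group isolates the first quoted string.
printf '%s\n' "$line" | sed 's/[^"]*"\([^"]*\)"/<\1>/'
# prints: <foo> ... "bar" ...
```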

Another way would be to just remove everything up to the first quote, then remove everything starting from the (new) first quote:

$ echo '... "foo" ... "bar" ...' | sed 's/^[^"]*"//; s/".*$//'
foo
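One caveat with both commands: when a line contains no quotes at all, the substitutions don't match and sed prints the line unchanged. If the input might contain such lines, a possible variant (a sketch, with first_quoted as a made-up function name) is to suppress automatic printing with -n and print only when the substitution succeeded:

```shell
# Print the first quoted string of each line; lines without a complete
# pair of quotes are skipped instead of being echoed unchanged.
first_quoted() {
    sed -n 's/[^"]*"\([^"]*\)".*/\1/p'
}

printf '%s\n' 'no quotes here' '... "foo" ... "bar" ...' | first_quoted
# prints: foo
```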

In Perl regexes, the * and + quantifiers can be made non-greedy by appending a question mark, so .*? would match anything, but as few characters/bytes as possible.

Solution 2

I won't bore you with the classic warning against using simple regular expressions to parse HTML. Suffice it to say that you should use a dedicated parser instead. That said, the issue here is that sed uses greedy matching, so each .* always matches the longest possible string. This means that your first .* swallows everything up to the last pair of quotes on the line.

You could do this in sed (see below), but using a tool that allows non-greedy matches would be simpler:

$ perl -pe 's/.*?"(.*?)".*/$1/' file
data1

Since sed doesn't support non-greedy matches, you need some other trickery. The simplest would be to use the "not quotes" approach in ikkachu's answer. Here's an alternative:

$ rev file | sed 's/.*"\(.*\)".*/\1/' | rev
data1

This just reverses each line of the file (rev), uses your original approach, which now works since the first occurrence is now the last, and then reverses the result back again.
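To make the intermediate step visible, here is a small sketch that runs the pipeline in two stages on a shortened version of the sample line. It assumes rev (from util-linux) is installed, and the variable name line is only for illustration.

```shell
line='<td><a href="data1">abc</a> ... <a href="data2">abc</a>'

# Step 1: rev reverses the characters of each line, so the first
# quoted string becomes the last one (and is spelled backwards).
printf '%s\n' "$line" | rev

# Step 2: the greedy pattern now captures that last string ("1atad"),
# and the final rev turns it back into "data1".
printf '%s\n' "$line" | rev | sed 's/.*"\(.*\)".*/\1/' | rev
```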

Solution 3

Here are a couple of ways you could pull out data1 from your input:

grep -oP '^[^"]*"\K[^"]*'

sed -ne '
   /\n/!{y/"/\n/;D;}
   P
'

perl -lne '/"([^"]*)"/ and print($1),last'
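The sed script above is rather terse, so here is the same script again with comments, as I read it. The wrapper function first_quoted_sed is a hypothetical name added only so the snippet can be run on its own.

```shell
first_quoted_sed() {
    sed -n '
        # No newline in the pattern space yet: this is a freshly read line.
        /\n/!{
            # Turn every double quote into a newline...
            y/"/\n/
            # ...then delete up to and including the first newline and
            # restart the script on what is left, without reading more input.
            D
        }
        # Second pass: print up to the first newline, i.e. the text that
        # sat between the first and second quote.
        P
    '
}

printf '%s\n' '<td><a href="data1">abc</a> ... <a href="data2">abc</a>' | first_quoted_sed
```

Lines with fewer than two quotes print nothing: after the first D there is no newline left, so the second D simply discards the rest.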

Solution 4

The question isn't tagged awk, but why not use it, since it makes this simple:

awk -F\" '{print $2}' infile.txt
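This works because splitting the line on the double quote puts the first quoted string in the second field. As a hedged variation, the sketch below also skips lines that don't contain a complete pair of quotes; first_quoted_awk is a name made up for the example.

```shell
# With " as the field separator, $2 is the text between the first and
# second quote. NF >= 3 means the line held at least two quotes.
first_quoted_awk() {
    awk -F'"' 'NF >= 3 { print $2 }'
}

printf '%s\n' 'no quotes here' '<td><a href="data1">abc</a> ... <a href="data2">abc</a>' |
    first_quoted_awk
# prints: data1
```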

Solution 5

You can also do a non-greedy search with grep's Perl-compatible regular expressions (-P), using lookahead and lookbehind:

grep -Po '(?<=href=").*?(?=")' data | head -n1


Author by

GypsyCosmonaut

Updated on September 20, 2022

Comments

GypsyCosmonaut over 1 year
I have some data like
```
<td><a href="data1">abc</a> ... <a href="data2">abc</a> ... <a href="data3">abc</a>
```
(I will refer to the line above as data in the code below.)

I need data1, the text between the first pair of double quotes, so I do
```
echo 'data' | sed 's/.*"\(.*\)".*/\1/'
```
but it always returns the last string between double quotes, i.e. in this case it returns data3 instead of data1.

In order to get data1, I end up doing
```
echo 'data' | sed 's/.*"\(.*\)".*".*".*".*".*/\1/'
```
How do I get data1 in sed without this much redundancy?
Stéphane Chazelas over 6 years

cut -d \" -f 2 should be enough
GypsyCosmonaut over 6 years

Your solutions worked, but I really didn't understand how [^"]* works. Could you please give an explanation or provide a link to one?
GypsyCosmonaut over 6 years

rev is nice until I learn perl regex, thanks
Stéphane Chazelas over 6 years

To complicate things further, see also *+ in perl for super-greedy (won't give anything back, won't backtrack). s/^.*+"(.*?)".*/$1/ would not even match as the .*+ would match the whole line and not backtrack to allow the rest to be matched. s/^[^"]*+"(.*?)".*/ could in theory improve performance though I'd expect perl would know how to optimise it in that case.
Stéphane Chazelas over 6 years

grep -Po '^.*?href="\K.*?(?=")' to return only the first of each line.
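For comparison, a sketch of that command on two made-up input lines. It assumes a GNU grep built with PCRE support, and first_href_per_line is a hypothetical wrapper name. Because the pattern is anchored with ^, at most one match per line is reported, whereas piping all matches through head -n1 keeps only the first match of the whole input.

```shell
first_href_per_line() {
    # ^.*? lazily skips to the first href=", \K drops that prefix from
    # the reported match, and .*?(?=") stops before the closing quote.
    grep -Po '^.*?href="\K.*?(?=")'
}

printf '%s\n' \
    '<a href="data1">abc</a> ... <a href="data2">abc</a>' \
    '<a href="data3">abc</a> ... <a href="data4">abc</a>' |
    first_href_per_line
```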
Alessio over 6 years

HTML::Parser or HTML::TokeParser are good modules to start with for parsing HTML, and add in LWP if you need to fetch the HTML first (e.g. to write a web-scraping robot).
AdminBee over 3 years

Welcome to the site, and thank you for your contribution. Please note, though, that the OP is explicitly looking for a solution to extract string between double-quotes (" ... "). You may want to edit your answer to explain how this can be achieved using your approach.