Sed to print only first pattern match of the line
Solution 1
The .* in the regex pattern is greedy: it matches as long a string as it can, so the quotes that are matched will be the last ones.
Since the separator is only one character here, we can use a negated bracket expression to match anything but a quote, i.e. [^"], and then repeat it to match a run of characters that aren't quotes.
$ echo '... "foo" ... "bar" ...' | sed 's/[^"]*"\([^"]*\)".*/\1/'
foo
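To see the difference side by side, here is a small sketch that runs the greedy pattern from the question and the negated-bracket pattern on the same line. The helper names greedy_capture and first_capture are made up for illustration.

```shell
# Hypothetical helpers; both read lines on standard input.

# The leading .* is greedy, so the capture ends up on the last quoted string.
greedy_capture() { sed 's/.*"\(.*\)".*/\1/'; }

# [^"]* cannot cross a quote, so the capture is the first quoted string.
first_capture() { sed 's/[^"]*"\([^"]*\)".*/\1/'; }

line='... "foo" ... "bar" ...'
printf '%s\n' "$line" | greedy_capture   # prints: bar
printf '%s\n' "$line" | first_capture    # prints: foo
```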
Another way would be to just remove everything up to the first quote, then remove everything starting from the (new) first quote:
$ echo '... "foo" ... "bar" ...' | sed 's/^[^"]*"//; s/".*$//'
foo
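Note that both commands above pass lines without any quotes through unchanged. If such lines should be dropped instead, one possible sketch is to suppress the default output with -n and print only when the substitution succeeded (the p flag). The function name first_quoted is an assumption for illustration.

```shell
# Hypothetical helper: print the first double-quoted string of each line,
# and print nothing for lines that contain no quoted string.
first_quoted() {
    sed -n 's/[^"]*"\([^"]*\)".*/\1/p'
}

printf '%s\n' 'no quotes here' '... "foo" ... "bar" ...' | first_quoted
# prints: foo
```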
In Perl regexes, the * and + quantifiers can be made non-greedy by appending a question mark, so .*? would match anything, but as few characters/bytes as possible.
Solution 2
I won't bore you with the classic warning against using simple regular expressions to parse HTML. Suffice it to say that you should use a dedicated parser instead. That said, the issue here is that sed uses greedy matching, so it will always match the longest possible string. This means that your leading .* runs as far along the line as it can, and the quotes that end up being matched are the last pair.
You could do this in sed (see below), but using a tool that allows non-greedy matches would be simpler:
$ perl -pe 's/.*?"(.*?)".*/$1/' file
data1
Since sed doesn't support non-greedy matches, you need some other trickery. The simplest would be to use the "not quotes" approach in ilkkachu's answer. Here's an alternative:
$ rev file | sed 's/.*"\(.*\)".*/\1/' | rev
data1
This just reverses each line (rev), applies your original approach, which now works since the first occurrence has become the last, and then reverses the result back again.
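The intermediate stages may make the trick easier to follow. This is only a sketch: the sample line is a shortened version of the input from the question, and first_quoted_rev is a made-up name. It assumes rev (from util-linux) is installed.

```shell
# Hypothetical helper wrapping the rev | sed | rev pipeline.
first_quoted_rev() {
    rev | sed 's/.*"\(.*\)".*/\1/' | rev
}

line='<a href="data1">abc</a> <a href="data2">abc</a>'

# Stage 1: every line is reversed character by character.
printf '%s\n' "$line" | rev

# Stage 2: the greedy pattern now captures the last quoted string of the
# reversed line, which is the first one of the original line, spelled backwards.
printf '%s\n' "$line" | rev | sed 's/.*"\(.*\)".*/\1/'   # prints: 1atad

# Stage 3: reversing again restores the original spelling.
printf '%s\n' "$line" | first_quoted_rev                 # prints: data1
```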
Solution 3
Here are a couple of ways you could pull out data1 from your input:
grep -oP '^[^"]*"\K[^"]*'
sed -ne '
/\n/!{y/"/\n/;D;}
P
'
perl -lne '/"([^"]*)"/ and print($1),last'
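The sed variant above is terse, so here is the same script spread over several lines with comments, as I read it. The wrapper name first_quoted_dp is an assumption for illustration, and the multi-line layout is mine rather than the original author's.

```shell
# Hypothetical wrapper around the sed script above, with comments added.
first_quoted_dp() {
    sed -n '
        # No newline in the pattern space: this is a freshly read line.
        /\n/!{
            # Turn every double quote into a newline.
            y/"/\n/
            # Delete up to and including the first newline, then restart
            # the script on what is left without reading new input.
            D
        }
        # Second pass: print up to the first newline, i.e. the text that
        # was between the first and second quotes.
        P
    '
}

printf '%s\n' '<td><a href="data1">abc</a> ... <a href="data2">abc</a>' | first_quoted_dp
# prints: data1
```

Lines with fewer than two quotes produce no output, because D empties the pattern space before P is ever reached.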
Solution 4
Although the question is not tagged awk, why not use it? It is as simple as:
awk -F\" '{print $2}' infile.txt
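This works because, with " as the field separator, $1 is everything before the first quote and $2 is the text between the first and second quotes. A short sketch follows; the NF guard and the function name first_quoted_awk are my own additions, not part of the answer.

```shell
# Hypothetical helper: split each line on double quotes and print field 2.
# NF >= 3 means the line contains at least two quotes, so $2 really is a
# complete quoted string; other lines are skipped.
first_quoted_awk() {
    awk -F'"' 'NF >= 3 { print $2 }'
}

printf '%s\n' 'no quotes' '<td><a href="data1">abc</a> ... <a href="data2">abc</a>' | first_quoted_awk
# prints: data1
```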
Solution 5
You can also do a non-greedy search using the lookahead and lookbehind of Perl regular expressions:
cat data | grep -Po '(?<=href=").*?(?=")' | head -n1
GypsyCosmonaut
Updated on September 20, 2022
Comments
-
GypsyCosmonaut over 1 year
I have some data like
<td><a href="data1">abc</a> ... <a href="data2">abc</a> ... <a href="data3">abc</a>
(I will refer to the above line as data in the code below.)
I need data1, which sits between the first " and the next ", so I do
echo 'data' | sed 's/.*"\(.*\)".*/\1/'
but it always returns the last string between " and ", i.e. in this case it returns data3 instead of data1.
In order to get data1, I end up doing
echo 'data' | sed 's/.*"\(.*\)".*".*".*".*".*/\1/'
How do I get data1 without this much redundancy in sed?
-
Stéphane Chazelas over 6 years
cut -d \" -f 2 should be enough
-
GypsyCosmonaut over 6 years
Your solutions worked, but I really didn't understand how [^"]* works. Could you please give an explanation or provide a link to one?
-
GypsyCosmonaut over 6 years
rev is nice until I learn perl regex, thanks
-
Stéphane Chazelas over 6 years
To complicate things further, see also *+ in perl for super-greedy matching (won't give anything back, won't backtrack). s/^.*+"(.*?)".*/$1/ would not even match, as the .*+ would match the whole line and not backtrack to allow the rest to be matched. s/^[^"]*+"(.*?)".*/ could in theory improve performance, though I'd expect perl would know how to optimise it in that case.
-
Stéphane Chazelas over 6 years
grep -Po '^.*?href="\K.*?(?=")' to return only the first of each line.
-
Alessio over 6 years
HTML::Parser or HTML::TokeParser are good modules to start with for parsing HTML. Add in LWP if you need to fetch the HTML first (e.g. to write a web-scraping robot).
-
AdminBee over 3 years
Welcome to the site, and thank you for your contribution. Please note, though, that the OP is explicitly looking for a solution to extract the string between double quotes (" ... "). You may want to edit your answer to explain how this can be achieved using your approach.