Why is sed not working?
Solution 1
Because you are using PCRE (Perl Compatible Regular Expressions) syntax and sed doesn't understand that; it uses Basic Regular Expressions (BRE) by default. BRE has no \d, and \s is only a GNU extension rather than part of standard BRE. You are also escaping all sorts of things that don't need to be escaped (neither the \= nor the \> is doing anything useful) while not escaping things that do need to be escaped (+ just means the symbol + in BRE; in GNU sed you need \+ for "one or more").
This should do what you need:
sed 's/" width="[0-9]\+">//g' file
Or, using Extended Regular Expressions:
sed -E 's/"\s*width="[0-9]+">//g' file
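The difference between the two syntaxes can be checked on one sample line from the question. This is only a sketch: the variable names are made up for illustration, and it uses the portable spellings [0-9][0-9]* and [[:space:]] in place of the GNU-only \+ and \s, on the assumption that portability matters.

```shell
# Sample line taken from the question.
line='website.com/path/to/file/234432517.gif" width="620">'

# BRE (default): an unescaped + is a literal plus sign, so nothing matches
# and the line comes back unchanged.
bre_plain=$(printf '%s\n' "$line" | sed 's/" width="[0-9]+">//g')

# BRE, portable spelling of "one or more digits": [0-9][0-9]*
bre_fixed=$(printf '%s\n' "$line" | sed 's/" width="[0-9][0-9]*">//g')

# ERE (-E): + is a quantifier; [[:space:]] is the POSIX whitespace class.
ere_fixed=$(printf '%s\n' "$line" | sed -E 's/"[[:space:]]*width="[0-9]+">//g')

printf '%s\n' "$bre_plain" "$bre_fixed" "$ere_fixed"
```

Only the first result should still carry the width attribute; the other two should print the bare URL.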
Finally, as a general rule, never use sed -i without first testing the command without the -i to be sure it works or, if you do, at least use -i.bak (-i followed by any suffix will do this) to create a backup.
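A minimal sketch of that workflow, run against a scratch copy so it is safe to try. The temporary directory and the variable names are assumptions for illustration; only the file name output comes from the question.

```shell
# Work on a scratch copy so the sketch cannot damage real data.
tmpdir=$(mktemp -d)
file="$tmpdir/output"
printf '%s\n' 'website.com/path/to/file/234432517.gif" width="620">' > "$file"

# 1. Dry run: without -i the result goes to stdout and the file is untouched.
preview=$(sed -E 's/" width="[0-9]+">//g' "$file")

# 2. Happy with the preview? Edit in place, keeping the original as output.bak.
sed -E -i.bak 's/" width="[0-9]+">//g' "$file"

edited=$(cat "$file")        # the modified file
backup=$(cat "$file.bak")    # the untouched original

# Clean up the scratch directory.
rm -rf "$tmpdir"
```

If the in-place edit turns out to be wrong, the .bak file can simply be copied back over the original.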
Solution 2
Here is my sed solution:
sed -E 's/(.*)" width="[0-9]+">/\1/' filename
As an alternative to sed, I suggest using grep to extract the data from the file. This would work for you:
grep -o "website.*\.gif" filename
And as terdon suggested, here is a lookahead solution using grep:
grep -Po '.*(?="\swidth="\d*">)' filename
cut is also a good option in your situation:
cut -f1 -d'"' filename
Solution 3
Or, for a shorter substitution, simply remove everything after gif:
sed 's/gif.*/gif/' file
The .* matches any number of any characters. This works as long as the text you want to drop always follows a string you can locate, and that string occurs only once per line. For a line such as website.com/path/to/gif/xyz.gif" width..." the pattern would match at the earlier gif and give undesired results.
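The pitfall can be reproduced with a made-up line whose path also contains gif. This is a sketch only: the sample line is hypothetical, and anchoring on the closing quote is one possible way around the problem, not part of the answer above.

```shell
# Hypothetical line whose path also contains the string "gif".
line='website.com/path/to/gif/xyz.gif" width="620">'

# Pattern from this answer: matches from the FIRST "gif" to the end of the
# line, so the file name is lost.
naive=$(printf '%s\n' "$line" | sed 's/gif.*/gif/')

# Requiring ".gif" followed by the closing quote skips the directory name.
anchored=$(printf '%s\n' "$line" | sed 's/\.gif".*/.gif/')

printf '%s\n' "$naive" "$anchored"
```

The first result is cut off at the directory name, while the second keeps the full URL.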
Andrew Pullins
Updated on September 18, 2022
Comments
-
Andrew Pullins over 1 year
I have some HTML that I am trying to extract links from. Right now the file looks like this.
website.com/path/to/file/234432517.gif" width="620">
website.com/path/to/file/143743e53.gif" width="620">
website.com/path/to/file/123473232.gif" width="620">
website.com/path/to/file/634132317.gif" width="620">
website.com/path/to/file/432432173.gif" width="620">
I am trying to use sed to remove the " width="620"> from all the lines. Here is my sed code:
sudo sed -i "s/\"\swidth\=\"\d+\"\>//g" output
Why is this not working? Everything I google leads to code that looks like this, but it does not work for some reason.
-
Aaron almost 7 years
Another solution to your problem would be to use cut: cut -d'"' -f1 will return the first field separated by ", that is, the gif URL. Assuming the URL is fixed length, cut -c 1-38 would also work, returning the first 38 characters of each line, which make up the URL.
-
David Foerster almost 7 years
-
terdon almost 7 years
@DavidFoerster strictly speaking, this is just ASCII text. The example shown isn't even valid HTML and it is absolutely simple enough that regular expressions can indeed deal with it. While we're all fond of that gem of an answer and it is absolutely true in general, it doesn't mean that you can never parse any XML/HTML-like data with simple tools. Only that it is usually a bad idea unless you're Tom Christiansen.
-
-
pLumo almost 7 years
Good idea. But -P is not needed in that case, and I would use [^\"]* instead of .*\.gif. That would be less specific.
-
Ravexina almost 7 years
Yeah, edited... I was testing something which didn't work ;)
-
terdon almost 7 years
A better approach using grep would be grep -oP '.*(?=" width="\d+">)' file since that i) doesn't assume the presence of any string (like "website" in your example) not mentioned by the OP and ii) uses the same basic idea as the OP so we can be sure it will match their data.
-
Ravexina almost 7 years
@terdon I'll add a lookahead solution right now ;) thanks.
-
pLumo almost 7 years
There are hundreds of working solutions with grep and sed. I love it.
-
vaquito almost 7 years
Generally speaking, if you have a choice for these things, go with Perl, as it has one of the most powerful regular expression engines available in a command-line tool.
-
Andrew Pullins almost 7 years
Oh, I did not know there were different regular expression languages. I just filled my data into regexer.com, made up the regex and assumed it would work. Thanks.