Why is sed not working?

command-line text-processing sed regex

27,782

Solution 1

Because you are using PCRE (Perl Compatible Regular Expressions) syntax, which sed doesn't understand: it uses Basic Regular Expressions (BRE) by default. It has no \d, and \s is only available as a GNU extension. You are also escaping all sorts of things that don't need to be escaped (neither the \= nor the \> is doing anything useful; in GNU sed, \> is actually an end-of-word anchor rather than a literal >) while not escaping things that do need to be escaped (+ just means the literal symbol + in BRE; you need \+, itself a GNU extension, for "one or more").

This should do what you need:

sed 's/" width="[0-9]\+">//g' file

Or, using Extended Regular Expressions:

sed -E 's/"\s*width="[0-9]+">//g' file
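To see the difference for yourself, here is a minimal sketch, assuming GNU sed and one sample line copied from the question. The variable names are only for illustration. It runs the original expression and both corrected ones on the same input:

```shell
# Sample line taken from the question.
line='website.com/path/to/file/234432517.gif" width="620">'

# The original expression: \d is not a digit class in sed and the bare +
# is a literal plus sign in BRE, so nothing matches and the line passes
# through unchanged.
broken=$(printf '%s\n' "$line" | sed "s/\"\swidth\=\"\d+\"\>//g")

# BRE with an explicit character class; \+ is a GNU extension.
bre=$(printf '%s\n' "$line" | sed 's/" width="[0-9]\+">//g')

# ERE, where + needs no backslash.
ere=$(printf '%s\n' "$line" | sed -E 's/" width="[0-9]+">//g')

printf 'broken: %s\nbre:    %s\nere:    %s\n' "$broken" "$bre" "$ere"
```

Only the last two lines of output should have the width attribute stripped.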

Finally, as a general rule, never use sed -i without first running the same command without -i to be sure it works or, if you do, at least use -i.bak (any suffix given to -i will do this) to create a backup.
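That workflow could look roughly like this. It is a sketch only: it works on a throwaway copy created with mktemp, and the file name output is borrowed from the question, not something you have to use.

```shell
# Work on a throwaway copy so the example cannot touch real data.
tmpdir=$(mktemp -d)
printf '%s\n' 'website.com/path/to/file/234432517.gif" width="620">' > "$tmpdir/output"

# 1. Dry run: without -i, sed prints the result and leaves the file alone.
preview=$(sed -E 's/" width="[0-9]+">//g' "$tmpdir/output")
echo "$preview"

# 2. Once the preview looks right, edit in place and keep a backup copy.
sed -i.bak -E 's/" width="[0-9]+">//g' "$tmpdir/output"

# output now holds the edited text; output.bak holds the original.
edited=$(cat "$tmpdir/output")
backup=$(cat "$tmpdir/output.bak")

# Clean up the temporary directory.
rm -rf "$tmpdir"
```

If the in-place edit goes wrong, the original is still there in output.bak.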

Solution 2

Here is my sed solution:

sed -E 's/(.*)" width="[0-9]+">/\1/' filename

As an alternative to sed, I suggest using grep to extract the data from the file. This would work for you:

grep -o "website.*\.gif" filename

And as terdon suggested, here is a lookahead solution using grep:

grep -Po '.*(?="\swidth="\d*">)' filename

cut is also a good option in your situation:

cut -f1 -d'"' filename
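As a quick sanity check, the sed, grep -o and cut variants above can be run side by side. This sketch assumes two sample lines from the question held in a shell variable instead of a file. The lookahead variant is left out because grep -P is not available in every build of grep.

```shell
# Two sample lines from the question.
sample='website.com/path/to/file/234432517.gif" width="620">
website.com/path/to/file/143743e53.gif" width="620">'

# Strip the trailing attribute with sed.
via_sed=$(printf '%s\n' "$sample" | sed -E 's/(.*)" width="[0-9]+">/\1/')

# Print only the part of each line that matches.
via_grep=$(printf '%s\n' "$sample" | grep -o 'website.*\.gif')

# Keep the first field, using the double quote as the delimiter.
via_cut=$(printf '%s\n' "$sample" | cut -f1 -d'"')

printf '%s\n' "$via_sed" "$via_grep" "$via_cut"
```

On this input all three should print the same two URLs. They would diverge on data where, for example, a line does not start with website or contains an earlier double quote.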

Solution 3

Or, for a shorter substitution, simply remove everything after gif:

sed 's/gif.*/gif/' file

The .* matches any number of any characters, so this works as long as what you want to lose always comes after a string that you can locate, and there are no other instances of that string earlier in the line. On website.com/path/to/gif/xyz.gif" width..." it would match from the earlier gif and so give undesired results.
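A small sketch of that pitfall, using a made-up path that contains gif twice. The second command is just one possible way to tighten the pattern, assuming the URL always ends in .gif followed by a double quote:

```shell
# Hypothetical line where "gif" also appears as a directory name.
line='website.com/path/to/gif/xyz.gif" width="620">'

# sed starts at the leftmost "gif", so the file name is lost as well.
too_short=$(printf '%s\n' "$line" | sed 's/gif.*/gif/')
echo "$too_short"    # website.com/path/to/gif

# Requiring the dot before and the quote after pins the match to the extension.
anchored=$(printf '%s\n' "$line" | sed 's/\.gif".*/.gif/')
echo "$anchored"     # website.com/path/to/gif/xyz.gif
```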

Andrew Pullins

Updated on September 18, 2022

Comments

Andrew Pullins over 1 year
I have some HTML that I am trying to extract links from. Right now the file looks like this.
```
website.com/path/to/file/234432517.gif" width="620">
website.com/path/to/file/143743e53.gif" width="620">
website.com/path/to/file/123473232.gif" width="620">
website.com/path/to/file/634132317.gif" width="620">
website.com/path/to/file/432432173.gif" width="620">
```
I am trying to use sed to remove the " width="620"> from all the lines. Here is my sed code:
```
sudo sed -i "s/\"\swidth\=\"\d+\"\>//g" output
```
Why is this not working? Everything I google leads to some code that looks like this, but this does not work for some reason.
- Aaron almost 7 years
  
  Another solution to your problem would be to use cut : cut -d'"' -f1 will return the first field separated by ", that is the gif url. Assuming the url is fixed length, cut -c 1-38 would also work, returning the 38 first characters of each line that compose the url.
- David Foerster almost 7 years
  
  Don't parse XML (or other context-free grammars) with regular expressions!
- terdon almost 7 years
  
  @DavidFoerster strictly speaking, this is just ASCII text. The example shown isn't even valid HTML and it is absolutely simple enough that regular expressions can indeed deal with it. While we're all fond of that gem of an answer and it is absolutely true in general, it doesn't mean that you can never parse any XML/HTML-like data with simple tools. Only that it is usually a bad idea unless you're Tom Christiansen.
pLumo almost 7 years

Good idea. But -P is not needed in that case, and I would use [^\"]* instead of .*\.gif. That would be less specific.
Ravexina almost 7 years

Yeah, edited... I was testing something which didn't work ;)
terdon almost 7 years

A better approach using grep would be grep -oP '.*(?=" width="\d+">)' file since that i) doesn't assume the presence of any string (like "website" in your example) not mentioned by the OP and ii) uses the same basic idea as the OP so we can be sure it will match their data.
Ravexina almost 7 years

@terdon I'll add a lookahead solution right now ;) thanks.
pLumo almost 7 years

There are hundreds of working solutions with grep and sed. I love it.
vaquito almost 7 years

Generally speaking, if you have a choice for these things, go with Perl, as it has one of the most powerful regular expression engines available in a command-line tool.
Andrew Pullins almost 7 years

Oh, I did not know there were different regular expression languages. I just put my data into regexer.com, made up the regex, and assumed it would work. Thanks.