Easiest way to extract the URLs from an HTML page using sed or awk only


Solution 1

You could also do something like this (provided you have lynx installed)...

Lynx versions < 2.8.8

lynx -dump -listonly my.html

Lynx versions >= 2.8.8 (courtesy of @condit)

lynx -dump -hiddenlinks=listonly my.html
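
If you only need the bare URLs, the numbered reference list that lynx prints can be trimmed with awk. A minimal sketch, assuming the usual dump layout where each entry looks like " 3. http://example.com/":

# Print only the URL column of the reference list (layout assumption noted above)
lynx -dump -listonly my.html | awk '/^[[:space:]]*[0-9]+\./ { print $2 }'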

Solution 2

You asked for it:

$ wget -O - http://stackoverflow.com | \
  grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
  sed -e 's/^<a href=["'"'"']//i' -e 's/["'"'"']$//i'

This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.
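
Since the question asks for sed or awk only, here is a rough awk-only variant of the same idea. It is only a sketch: it assumes GNU awk (for the regex record separator and IGNORECASE), it only catches double-quoted href values, and the same caveats about regex-parsing HTML apply:

$ wget -O - http://stackoverflow.com |
  gawk 'BEGIN { RS = "<a[ \t]"; IGNORECASE = 1 }       # split the input at each opening <a tag (GNU awk)
        NR > 1 && match($0, /href="[^"]*"/) {          # first double-quoted href in each record
          print substr($0, RSTART + 6, RLENGTH - 7)    # drop the leading href=" and the closing quote
        }'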

Solution 3

grep "<a href=" sourcepage.html
  |sed "s/<a href/\\n<a href/g" 
  |sed 's/\"/\"><\/a>\n/2'
  |grep href
  |sort |uniq
  1. The first grep looks for lines containing URLs. You can tighten the pattern afterwards if you only want local pages, i.e. relative paths rather than http URLs.
  2. The first sed adds a newline in front of each <a href tag, so each anchor starts on its own line.
  3. The second sed shortens each line after the second double quote (the closing quote of the URL), replacing the rest with a closing </a> tag and a newline. Both seds together give you each URL on a single line, but there is still some garbage, so
  4. the second grep href cleans the mess up, and
  5. sort and uniq give you one instance of each URL present in sourcepage.html (a quick sanity check of the whole pipeline is sketched below this list).
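
A quick way to sanity-check the pipeline is to run it on a throwaway sample file (the file contents below are just an illustration; GNU sed is assumed for the \n in the replacements):

printf '%s\n' '<p>See <a href="http://a.example/">A</a> and <a href="/b.html">B</a></p>' > sourcepage.html
grep "<a href=" sourcepage.html |
  sed "s/<a href/\\n<a href/g" |
  sed 's/\"/\"><\/a>\n/2' |
  grep href |
  sort | uniq
# <a href="/b.html"></a>
# <a href="http://a.example/"></a>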

Solution 4

With Xidel, an HTML/XML data extraction tool, this can be done via:

$ xidel --extract "//a/@href" http://example.com/

With conversion to absolute URLs:

$ xidel --extract "//a/resolve-uri(@href, base-uri())" http://example.com/

Solution 5

I made a few changes to Greg Bacon's solution (Solution 2 above):

cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'

This fixes two problems:

  1. We match anchors that don't have href as their first attribute.
  2. We cover the possibility of several anchors on the same line (a quick check against a sample line follows this list).
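
As a quick check, the modified pipeline handles both cases at once; the sample line below (a hypothetical index.html) has a title attribute before href and two anchors on one line (GNU sed assumed for the \n in the replacement):

printf '%s\n' '<p><a title="T" href="http://a.example/">A</a> and <a href="/b.html">B</a></p>' > index.html
cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
# http://a.example/
# /b.html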

Comments

  • codaddict
    codaddict almost 2 years

    I want to extract the URL from within the anchor tags of an html file. This needs to be done in BASH using SED/AWK. No perl please.

    What is the easiest way to do this?

  • ghostdog74
    ghostdog74 over 14 years
    complicated, and fails when href is like this: ... HREF="somewhere.com" ADD_DATE="1197958879" LAST_MODIFIED="1249591429"> ...
  • Ralph M. Rickenbach
    Ralph M. Rickenbach over 14 years
    Does this work for '<a href="aktuell.de.selfhtml.org" target="_blank">SELFHTML aktuell</a>'
  • nes1983
    nes1983 over 14 years
    I tried it on the daringfireball page itself and it found all links. Other solutions may fail because href= could appear somewhere inside regular text. It's difficult to get this absolutely right without parsing the HTML according to its grammar.
  • ghostdog74
    ghostdog74 over 14 years
    If I say it works (maybe not 100%, but 99.99% of the time), would you believe it? :) The best thing is to try it yourself on various pages and see.
  • monksy
    monksy about 12 years
    You don't need a cat before the grep. Just put f.html at the end of the grep.
  • Crisboot
    Crisboot almost 12 years
    Almost perfect, but what about these two cases: 1. You are only matching anchors that start with <a href; what about <a title="Title" href="sample">Match me</a>? 2. What if there are two anchors on the same line? I made these modifications to the original solution: cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a/\n<a/g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
  • Crisboot
    Crisboot almost 12 years
    But at least it solves the problem; none of the other solutions does.
  • Jeremy J Starcher
    Jeremy J Starcher over 11 years
    Nice break down of what each step should do.
  • kisp
    kisp over 10 years
    And grep -o can fail due to a bug in some versions of grep.
  • condit
    condit about 10 years
    In Lynx 2.8.8 this has become lynx -dump -hiddenlinks=listonly my.html
  • SomniusX
    SomniusX almost 10 years
    This really did the job; many thanks for this great awk bundle!
  • arjan
    arjan about 8 years
    This is the easiest and simplest answer. Just do e.g. wget http://sed.sourceforge.net/grabbag/scripts/list_urls.sed -O ~/bin/list_urls.sed && chmod +x ~/bin/list_urls.sed to get the script, and then wget http://www.example.com -O - | ~/bin/list_urls.sed > example.com.urls.txt to get the urls in a text file!
  • Raúl Salinas-Monteagudo
    Raúl Salinas-Monteagudo almost 7 years
    I strongly suggest using a pipeline instead of temporary files: lynx -listonly -dump "$url" | awk 'FNR > 2 {print $2}'
  • smihael
    smihael over 6 years
    concat expects 2 arguments, but here only one (the base URL) is given. err:XPST0017: unknown function: concat #1 Did you mean: In module w3.org/2005/xpath-functions: concat #2-65535
  • Ingo Karkat
    Ingo Karkat over 6 years
    @smihael: You're right, that's superfluous here. Removed it. Thanks for noticing!
  • simon
    simon about 6 years
    The best option here if you don't want to use Lynx and your anchors don't start with <a href...
  • Roman Chernyatchik
    Roman Chernyatchik over 5 years
    Thanks, this works on Mac, unlike many of the other solutions mentioned above.