Easiest way to extract the URLs from an HTML page using sed or awk only
Solution 1
You could also do something like this (provided you have lynx installed)...
Lynx versions < 2.8.8
lynx -dump -listonly my.html
Lynx versions >= 2.8.8 (courtesy of @condit)
lynx -dump -hiddenlinks=listonly my.html
Solution 2
You asked for it:
$ wget -O - http://stackoverflow.com | \
grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
sed -e 's/^<a href=["'"'"']//i' -e 's/["'"'"']$//i'
This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.
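If your grep supports PCRE (GNU grep's -P option), a shorter sketch avoids most of the quote-escaping; it assumes double-quoted href attributes and is just as fragile as any regex approach:

```shell
# \K discards the matched 'href="' prefix, leaving only the URL itself.
# Sample HTML is piped in here in place of a live wget download.
printf '<p><a href="https://example.com/a">A</a> <a href="/b">B</a></p>\n' |
  grep -oP 'href="\K[^"]*'
# prints:
# https://example.com/a
# /b
```

Single-quoted or unquoted href values slip through this pattern, so extend it if your pages mix quoting styles.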
Solution 3
grep "<a href=" sourcepage.html
|sed "s/<a href/\\n<a href/g"
|sed 's/\"/\"><\/a>\n/2'
|grep href
|sort |uniq
- The first grep looks for lines containing URLs. To restrict it to local pages (relative paths rather than http URLs), add more patterns here.
- The first sed adds a newline in front of each a href tag with the \n.
- The second sed shortens each URL after the 2nd " on the line by replacing it with a closing /a tag and a newline. Together the two seds put each URL on its own line, but some garbage remains, so
- the second grep on href cleans up the mess, and
- sort and uniq leave one instance of each URL present in sourcepage.html.
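Since the question asks for sed/awk, the same split-then-trim idea can be sketched in awk alone: treat `<a ` as the record separator and cut away everything around the quoted href value. This is a sketch assuming GNU awk or mawk (multi-character RS treated as a regex) and double-quoted hrefs:

```shell
# tr flattens the page to one line so records are split purely on '<a ';
# within each record, strip the text before href=" and after the closing quote.
# Sample HTML stands in for a real page.
printf '<a href="one.html">1</a>\n<a title="t" href="two.html">2</a>\n' |
  tr '\n' ' ' |
  awk 'BEGIN { RS = "<a " }
       /href=/ { sub(/.*href="/, ""); sub(/".*/, ""); print }' |
  sort | uniq
# prints:
# one.html
# two.html
```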
Solution 4
With the Xidel - HTML/XML data extraction tool, this can be done via:
$ xidel --extract "//a/@href" http://example.com/
With conversion to absolute URLs:
$ xidel --extract "//a/resolve-uri(@href, base-uri())" http://example.com/
Solution 5
I made a few changes to Greg Bacon's solution:
cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
This fixes two problems:
- We now match anchors where href is not the first attribute.
- We cover the possibility of several anchors on the same line.
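For completeness, the split-then-trim approach also works with sed alone. A sketch, assuming GNU sed (BSD sed does not accept \n in the replacement text) and double-quoted hrefs:

```shell
# First sed puts every anchor on its own line; second prints only the
# quoted href value from each line. Sample HTML stands in for index.html.
printf '<a title="t" href="x.html">X</a><a href="y.html">Y</a>\n' |
  sed 's/<a /\n<a /g' |
  sed -n 's/.*href="\([^"]*\)".*/\1/p'
# prints:
# x.html
# y.html
```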
Author by
codaddict (Abhijit Rao), Microsoft, Mountain View, CA
Updated on July 08, 2022

Comments
-
codaddict almost 2 years
I want to extract the URL from within the anchor tags of an html file. This needs to be done in BASH using SED/AWK. No perl please.
What is the easiest way to do this?
-
ghostdog74 over 14 years: Complicated, and fails when href is like this: ... HREF="somewhere.com" ADD_DATE="1197958879" LAST_MODIFIED="1249591429"> ...
-
Ralph M. Rickenbach over 14 years: Does this work for '<a href="aktuell.de.selfhtml.org" target="_blank">SELFHTML aktuell</a>'?
-
nes1983 over 14 years: I tried it on the daringfireball page itself and it found all links. Other solutions may fail because href= could be somewhere inside regular text. It's difficult to get this absolutely right without parsing the HTML according to its grammar.
-
ghostdog74 over 14 years: If I say it works (maybe not 100%, but 99.99% of the time), would you believe?? :). The best is to try it out yourself on various pages and see.
-
monksy about 12 years: You don't need a cat before the grep. Just put f.html at the end of the grep.
-
Crisboot almost 12 years: Almost perfect, but what about these two cases: 1. You are matching only anchors that start with href as the first attribute, e.g. <a title="Title" href="sample">Match me</a> 2. What if there are two anchors on the same line? I made these modifications to the original solution:
cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a/\n<a/g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
-
Crisboot almost 12 years: But at least it solves the problem; none of the other solutions do.
-
Jeremy J Starcher over 11 years: Nice breakdown of what each step should do.
-
kisp over 10 years: And grep -o can fail due to a bug in some versions of grep.
-
condit about 10 years: In Lynx 2.8.8 this has become
lynx -dump -hiddenlinks=listonly my.html
-
SomniusX almost 10 years: This really did the work, many thanks for this great awk bundle!
-
arjan about 8 years: This is the easiest and simplest answer. Just do e.g.
wget http://sed.sourceforge.net/grabbag/scripts/list_urls.sed -O ~/bin/list_urls.sed && chmod +x ~/bin/list_urls.sed
to get the script, and then
wget http://www.example.com -O - | ~/bin/list_urls.sed > example.com.urls.txt
to get the URLs in a text file!
-
Raúl Salinas-Monteagudo almost 7 years: I strongly suggest using a pipeline instead of temporary files: lynx -listonly -dump "$url" | awk 'FNR > 2 {print $2}'
-
smihael over 6 years: concat expects 2 arguments, but here only one (the base URL) is given. err:XPST0017: unknown function: concat #1 Did you mean: In module w3.org/2005/xpath-functions: concat #2-65535
-
Ingo Karkat over 6 years: @smihael: You're right, that's superfluous here. Removed it. Thanks for noticing!
-
simon about 6 years: The best option here if you don't want to use Lynx and your anchors don't start with <a href...
-
Roman Chernyatchik over 5 years: Thanks, this works on Mac, unlike many other solutions mentioned above.