Easiest way to extract the URLs from an HTML page using sed or awk only


Solution 1

You could also do something like this (provided you have lynx installed)...

Lynx versions < 2.8.8

lynx -dump -listonly my.html

Lynx versions >= 2.8.8 (courtesy of @condit)

lynx -dump -hiddenlinks=listonly my.html
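
If you only need the bare URLs, the numbered reference list that lynx prints can be trimmed with awk. A minimal sketch, assuming the usual dump layout where each entry looks like " 3. http://example.com/":

# Print only the URL column of the reference list (layout assumption noted above)
lynx -dump -listonly my.html | awk '/^[[:space:]]*[0-9]+\./ { print $2 }'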

Solution 2

You asked for it:

$ wget -O - http://stackoverflow.com | \
  grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
  sed -e 's/^<a href=["'"'"']//i' -e 's/["'"'"']$//i'

This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.
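
Since the question asks for sed or awk only, here is a rough awk-only variant of the same idea. It is only a sketch: it assumes GNU awk (for the regex record separator and IGNORECASE), it only catches double-quoted href values, and the same caveats about regex-parsing HTML apply:

$ wget -O - http://stackoverflow.com |
  gawk 'BEGIN { RS = "<a[ \t]"; IGNORECASE = 1 }       # split the input at each opening <a tag (GNU awk)
        NR > 1 && match($0, /href="[^"]*"/) {          # first double-quoted href in each record
          print substr($0, RSTART + 6, RLENGTH - 7)    # drop the leading href=" and the closing quote
        }'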

Solution 3

grep "<a href=" sourcepage.html
  |sed "s/<a href/\\n<a href/g" 
  |sed 's/\"/\"><\/a>\n/2'
  |grep href
  |sort |uniq
  1. The first grep looks for lines containing URLs. You can tighten the pattern afterwards if you only want local pages, i.e. relative paths rather than http URLs.
  2. The first sed adds a newline in front of each <a href tag, so each anchor starts on its own line.
  3. The second sed shortens each line after the second double quote (the closing quote of the URL), replacing the rest with a closing </a> tag and a newline. Both seds together give you each URL on a single line, but there is still some garbage, so
  4. the second grep href cleans the mess up, and
  5. sort and uniq give you one instance of each URL present in sourcepage.html (a quick sanity check of the whole pipeline is sketched below this list).
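
A quick way to sanity-check the pipeline is to run it on a throwaway sample file (the file contents below are just an illustration; GNU sed is assumed for the \n in the replacements):

printf '%s\n' '<p>See <a href="http://a.example/">A</a> and <a href="/b.html">B</a></p>' > sourcepage.html
grep "<a href=" sourcepage.html |
  sed "s/<a href/\\n<a href/g" |
  sed 's/\"/\"><\/a>\n/2' |
  grep href |
  sort | uniq
# <a href="/b.html"></a>
# <a href="http://a.example/"></a>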

Solution 4

With Xidel, an HTML/XML data extraction tool, this can be done via:

$ xidel --extract "//a/@href" http://example.com/

With conversion to absolute URLs:

$ xidel --extract "//a/resolve-uri(@href, base-uri())" http://example.com/

Solution 5

I made a few changes to Greg Bacon's solution (Solution 2 above):

cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'

This fixes two problems:

  1. We match anchors that don't have href as their first attribute.
  2. We cover the possibility of several anchors on the same line (a quick check against a sample line follows this list).
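
As a quick check, the modified pipeline handles both cases at once; the sample line below (a hypothetical index.html) has a title attribute before href and two anchors on one line (GNU sed assumed for the \n in the replacement):

printf '%s\n' '<p><a title="T" href="http://a.example/">A</a> and <a href="/b.html">B</a></p>' > index.html
cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
# http://a.example/
# /b.html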

Comments

  • codaddict
    codaddict almost 2 years

    I want to extract the URL from within the anchor tags of an html file. This needs to be done in BASH using SED/AWK. No perl please.

    What is the easiest way to do this?

  • ghostdog74
    ghostdog74 over 14 years
    complicated, and fails when href is like this: ... HREF="somewhere.com" ADD_DATE="1197958879" LAST_MODIFIED="1249591429"> ...
  • Ralph M. Rickenbach
    Ralph M. Rickenbach over 14 years
    Does this work for '<a href="aktuell.de.selfhtml.org" target="_blank">SELFHTML aktuell</a>'
  • nes1983
    nes1983 over 14 years
    I tried it on the daringfireball page itself and it found all links. Other solutions may fail because href= could appear somewhere inside regular text. It's difficult to get this absolutely right without parsing the HTML according to its grammar.
  • ghostdog74
    ghostdog74 over 14 years
    If I say it works (maybe not 100%, but 99.99% of the time), would you believe it? :) The best thing is to try it yourself on various pages and see.
  • monksy
    monksy about 12 years
    You don't need a cat before the grep. Just put f.html at the end of the grep.
  • Crisboot
    Crisboot almost 12 years
    Almost perfect, but what about these two cases: 1. You are only matching anchors that start with <a href; what about <a title="Title" href="sample">Match me</a>? 2. What if there are two anchors on the same line? I made these modifications to the original solution: cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a/\n<a/g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
  • Crisboot
    Crisboot almost 12 years
    But at least it solves the problem; none of the other solutions does.
  • Jeremy J Starcher
    Jeremy J Starcher over 11 years
    Nice break down of what each step should do.
  • kisp
    kisp over 10 years
    And grep -o can fail due to a bug in some versions of grep.
  • condit
    condit about 10 years
    In Lynx 2.8.8 this has become lynx -dump -hiddenlinks=listonly my.html
  • SomniusX
    SomniusX almost 10 years
    This really did the job; many thanks for this great awk bundle!
  • arjan
    arjan about 8 years
    This is the easiest and simplest answer. Just do e.g. wget http://sed.sourceforge.net/grabbag/scripts/list_urls.sed -O ~/bin/list_urls.sed && chmod +x ~/bin/list_urls.sed to get the script, and then wget http://www.example.com -O - | ~/bin/list_urls.sed > example.com.urls.txt to get the urls in a text file!
  • Raúl Salinas-Monteagudo
    Raúl Salinas-Monteagudo almost 7 years
    I strongly suggest using a pipeline instead of temporary files: lynx -listonly -dump "$url" | awk 'FNR > 2 {print $2}'
  • smihael
    smihael over 6 years
    concat expects 2 arguments, but here only one (the base URL) is given. err:XPST0017: unknown function: concat #1 Did you mean: In module w3.org/2005/xpath-functions: concat #2-65535
  • Ingo Karkat
    Ingo Karkat over 6 years
    @smihael: You're right, that's superfluous here. Removed it. Thanks for noticing!
  • simon
    simon about 6 years
    The best option here if you don't want to use Lynx and your anchors don't start with <a href...
  • Roman Chernyatchik
    Roman Chernyatchik over 5 years
    Thanks, this works on Mac, unlike many of the other solutions mentioned above.