Spider a Website and Return URLs Only


Solution 1

The absolute last thing I want to do is download and parse all of the content myself (i.e. create my own spider). Once I learned that Wget writes to stderr by default, I was able to redirect it to stdout and filter the output appropriately.

wget --spider --force-html -r -l2 $url 2>&1 \
  | grep '^--' | awk '{ print $3 }' \
  | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' \
  > urls.m3u

Wget logs each request on a line that starts with --<timestamp>--  <URL>, so grep '^--' keeps those lines and awk '{ print $3 }' pulls out the URL field; the final grep -v drops images, CSS and JS source files. This gives me a list of the content-resource URIs that are spidered. From there, I can send the URIs off to a third-party tool for processing to meet my needs.

The output still needs to be streamlined slightly (as shown above, it produces duplicates), but it's almost there and I haven't had to do any parsing myself.
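
If the duplicates matter, a sort -u pass at the end of the same pipeline collapses them. A minimal, untested variation of the command above:

wget --spider --force-html -r -l2 $url 2>&1 \
  | grep '^--' | awk '{ print $3 }' \
  | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' \
  | sort -u \
  > urls.m3u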

Solution 2

Create a few regular expressions to extract the addresses from all

<a href="(ADDRESS_IS_HERE)">.

Here is the solution I would use:

wget -q http://example.com -O - | \
    tr "\t\r\n'" '   "' | \
    grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | \
    sed -e 's/^.*"\([^"]\+\)".*$/\1/g'

This will output all http, https, ftp, and ftps links from a webpage. It will not give you relative URLs, only full URLs (a variant that also keeps relative links is sketched at the end of this solution).

Explanation regarding the options used in the series of piped commands:

wget -q suppresses wget's normal progress output (quiet mode). wget -O - makes the downloaded file get echoed to stdout, rather than saved to disk.

tr is the Unix character translator, used in this example to translate newlines and tabs to spaces and to convert single quotes into double quotes so we can simplify our regular expressions.

grep -i makes the search case-insensitive, and grep -o makes it output only the matching portions.

sed is the Stream EDitor Unix utility, which allows for filtering and transformation operations.

sed -e just lets you feed it an expression.

Running this little script on "http://craigslist.org" yielded quite a long list of links:

http://blog.craigslist.org/
http://24hoursoncraigslist.com/subs/nowplaying.html
http://craigslistfoundation.org/
http://atlanta.craigslist.org/
http://austin.craigslist.org/
http://boston.craigslist.org/
http://chicago.craigslist.org/
http://cleveland.craigslist.org/
...
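
If relative links are needed as well, a small tweak can drop the scheme requirement from the grep pattern so every href value is kept, relative or absolute. A sketch, assuming the href attributes are double-quoted after the tr pass:

wget -q http://example.com -O - | \
    tr "\t\r\n'" '   "' | \
    grep -i -o '<a[^>]\+href[ ]*=[ ]*"[^"]\+"' | \
    sed -e 's/^.*"\([^"]\+\)".*$/\1/g'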

Solution 3

I've used a tool called xidel

xidel http://server -e '//a/@href' | 
grep -v "http" | 
sort -u | 
xargs -L1 -I {}  xidel http://server/{} -e '//a/@href' | 
grep -v "http" | sort -u

A little hackish but gets you closer! This is only the first level. Imagine packing this up into a self-recursive script!
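
As a rough illustration of that idea, a wrapper along these lines might work. This is an untested sketch that reuses only the xidel -e call shown above; the http://server base URL is the same placeholder, and the crawl function name and depth limit of 2 are made up for the example:

#!/bin/bash
# Recursively print same-site links starting from a base URL.
# Naive sketch: no visited-set, so pages may be fetched more than once.
crawl() {
    local url=$1 depth=$2
    echo "$url"
    [ "$depth" -le 0 ] && return
    xidel "$url" -e '//a/@href' 2>/dev/null \
        | grep -v "http" \
        | sort -u \
        | while read -r link; do
              # crude join, mirroring the http://server/{} pattern above
              crawl "http://server/$link" $((depth - 1))
          done
}

crawl http://server 2 | sort -u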

Author by

Rob Wilkerson

I am a development manager and engineer, but I haven't always been one. I've also been an architect, a carpenter and a paratrooper (never a butcher, baker or candlestick maker). I have nearly 15 years of experience designing, engineering and developing Internet solutions. That experience extends to building and leading intra- and international development teams and organizing those teams around an evolving set of tools, standards, practices and processes. Sadly, I still can't design my way out of a wet paper bag.

Updated on July 05, 2022

Comments

  • Rob Wilkerson
    Rob Wilkerson almost 2 years

    I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, but rather a simple list of URIs. I can get reasonably close to this idea with Wget using the --spider option, but when piping that output through a grep, I can't seem to find the right magic to make it work:

    wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:'
    

    The grep filter seems to have absolutely no effect on the wget output. Have I got something wrong, or is there another tool I should try that's more geared towards providing this kind of limited result set?

    UPDATE

    So I just found out offline that, by default, wget writes to stderr. I missed that in the man pages (in fact, I still haven't found it if it's in there). Once I redirected stderr to stdout, I got closer to what I need:

    wget --spider --force-html -r -l1 http://somesite.com 2>&1 | grep 'Saving to:'
    

    I'd still be interested in other/better means for doing this kind of thing, if any exist.

  • AKX
    AKX about 11 years
    wget -r --spider -l1 -A mp3 http://example.com/page-with-mp3s 2>&1 | grep -Eio http.+mp3 was a good magic ticket for me. Thanks!
  • Snowy
    Snowy almost 11 years
    Very cool. But the Win32 versions of the tools are choking... Somewhere. Can you say how to modify them for Cygwin or straight Windows?
  • Jay Taylor
    Jay Taylor almost 11 years
    @Snowy I'm not sure what you mean by "choking". Cygwin should work fine. You could also try using curl instead of wget.
  • AL the X
    AL the X over 10 years
    I typically pass that output to sort | uniq to remove duplicates, FYI.
  • Joe
    Joe about 9 years
    Thanks ... that looks perfect for scripting a workaround to my wget problem ( opendata.stackexchange.com/q/4851/263 )
  • erdomester
    erdomester almost 9 years
    I know 5 years have passed since this answer, but can you speed up the process? It takes seconds or even minutes for sites with 200 URLs.
  • erdomester
    erdomester almost 9 years
    I would like to point out that @Rob wanted to get all URLs from a website and not from a webpage.
  • BarbaraKwarc
    BarbaraKwarc over 7 years
    OK, never mind, I changed the grep command to this: grep -i -o '<a[^>]\+href[ ]*=[ \t]*"[^"]\+">[^<]*</a>' and removed the sed, and it seems to do the job. Now I just need to parse these A tags somehow.
  • Volomike
    Volomike about 6 years
    You can shorten the time greatly on this if you replace the first grep and awk commands with a single egrep -o 'https?://[^ ]+'. I also recommend piping to sort | uniq, because that can reduce the work of the third-party tool on repeat URLs.
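
Putting those two comment suggestions together, the pipeline from solution 1 might be shortened to something like the sketch below (sort -u is equivalent to sort | uniq, and $url stands in for the starting address):

wget --spider --force-html -r -l2 $url 2>&1 \
  | egrep -o 'https?://[^ ]+' \
  | sort -u \
  > urls.m3u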