Can I use WGET to generate a sitemap of a website given its URL?

Solution 1

# Crawl the site without downloading anything and log every request
wget --spider --recursive --no-verbose --output-file=wgetlog.txt http://somewebsite.com
# Extract the crawled URLs from the log and escape ampersands for XML
sed -n "s@.\+ URL:\([^ ]\+\) .\+@\1@p" wgetlog.txt | sed "s@&@\&amp;@g" > sedlog.txt

This creates a file called sedlog.txt that contains all the links found on the specified website. You can then use PHP or a shell script to convert this plain-text sitemap into an XML sitemap. Tweak the parameters of the wget command (accept/reject, include/exclude) to get only the links you need.
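For example, a minimal shell sketch along these lines (the sedlog.txt and sitemap.xml names simply follow from the commands above, and it assumes one already-escaped URL per line) wraps each crawled URL in the tags the sitemap protocol expects:

{
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  # one <url><loc> entry per crawled URL
  while read -r url; do
    echo "  <url><loc>$url</loc></url>"
  done < sedlog.txt
  echo '</urlset>'
} > sitemap.xml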

Solution 2

You can use this Perl script to do the trick: http://code.google.com/p/perlsitemapgenerator/


Comments

  • Salman A
    Salman A almost 2 years

    I need a script that can spider a website and return the list of all crawled pages in plain-text or similar format; which I will submit to search engines as sitemap. Can I use WGET to generate a sitemap of a website? Or is there a PHP script that can do the same?

  • Salman A
    Salman A over 13 years
It'll generate one by scanning the file system, but it won't "crawl". The sites I want to spider are dynamic.
  • Julian
    Julian almost 13 years
    +1 Couldn't quite use it like that as it was giving me a bunch of errors (probably because of different wget/sed versions). But once I did some tweaking, it worked like a charm. Thanks!
  • Liam
    Liam over 9 years
    You should add a small delay between requests using --wait=1, otherwise it might affect the performance of the site.
  • Phani Rithvij
    Phani Rithvij about 3 years
Combined with tee (unix.stackexchange.com/a/128476/312058) you can also see the output on stdout, or tail -f is even better; see the combined sketch after these comments.
  • GDP2
    GDP2 about 3 years
    @Julian Yes, I had the same issue. On macOS, I had to use gsed instead of the builtin sed. Thanks for the tip!