How to get list of URLs for a domain

18,163

Solution 1

It seems there is no royal road to web crawling, so I will just stick with my current approach.

Also, I found that most search engines only expose the first 1,000 results anyway.

Solution 2

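Some webmasters publish Sitemaps, which are XML files listing the URLs the site wants crawled (often, but not necessarily, every URL on the domain). A site's Sitemaps are commonly referenced from its robots.txt. Where no Sitemap exists, there is no general solution except crawling. If you do use a crawler, please obey robots.txt.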
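As a rough illustration of the Sitemap route, here is a minimal sketch using only the Python standard library. It assumes the site lists its Sitemaps in robots.txt and serves them as uncompressed XML. The helper names (`parse_sitemap`, `collect_urls`, `sitemaps_from_robots`), the `my-crawler` user agent and the example.com URLs are all made up for the example, and `RobotFileParser.site_maps()` requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_data):
    """Return (page_urls, child_sitemap_urls) for one sitemap document."""
    root = ET.fromstring(xml_data)
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc")
            if el.text and el.text.strip()]
    # A <sitemapindex> points at further sitemaps rather than pages.
    if root.tag == SITEMAP_NS + "sitemapindex":
        return [], locs
    return locs, []


def collect_urls(sitemap_url, fetch=None, seen=None):
    """Follow sitemap indexes recursively and return all page URLs found."""
    # Default fetcher does a plain HTTP GET; gzipped sitemaps are not handled.
    fetch = fetch or (lambda url: urlopen(url, timeout=10).read())
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against index loops
        return []
    seen.add(sitemap_url)
    pages, children = parse_sitemap(fetch(sitemap_url))
    for child in children:
        pages.extend(collect_urls(child, fetch, seen))
    return pages


def sitemaps_from_robots(robots_url, user_agent="my-crawler"):
    """Read robots.txt and return (parser, list of declared sitemap URLs)."""
    rp = RobotFileParser(robots_url)
    rp.read()  # network call
    return rp, (rp.site_maps() or [])


if __name__ == "__main__":
    # Hypothetical domain; replace with the site you are interested in.
    rp, sitemaps = sitemaps_from_robots("https://example.com/robots.txt")
    for sm in sitemaps:
        for url in collect_urls(sm):
            # Report whether robots.txt permits fetching each listed URL.
            print(url, rp.can_fetch("my-crawler", url))
```

If robots.txt declares no Sitemap, the list comes back empty and you are back to crawling, in which case the same `can_fetch` check tells you which URLs you are allowed to request.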

Author by

hoju


Updated on June 15, 2022

Comments

  • hoju
    hoju almost 2 years

    I would like to generate a list of URLs for a domain but I would rather save bandwidth by not crawling the domain myself. So is there a way to use existing crawled data?

    One solution I thought of would be to do a Yahoo site search, which lets me download the first 1,000 results in TSV format. However, to get all the records I would have to scrape the search results. Google also supports site search but doesn't offer an easy way to download the data.

    Can you think of a better way that would work with most (if not all) websites?

    thanks, Richard