How to get list of URLs for a domain
Solution 1
It seems there is no royal road to web crawling, so I will just stick with my current approach...
Also, I found that most search engines only expose the first 1000 results anyway.
Solution 2
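Some webmasters offer Sitemaps, which are XML files listing the URLs the site wants crawlers to know about (usually most pages, though not necessarily every URL on the domain). Where no Sitemap is published, there is no general solution except crawling. If you do use a crawler, please obey robots.txt.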
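As a rough illustration of the Sitemap route, here is a minimal Python sketch using only the standard library. It assumes you have already downloaded the sitemap XML and the robots.txt text yourself. The helper names (parse_sitemap, allowed_urls) and the example.com sample data are made up for this example, and real sites may use sitemap index files or gzip-compressed sitemaps that need extra handling.

```python
import xml.etree.ElementTree as ET
from urllib import robotparser

# Namespace defined by the sitemaps.org protocol (version 0.9).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text):
    """Return (kind, urls) for a sitemap document.

    kind is "urlset" for a plain sitemap, or "sitemapindex" when the
    <loc> entries point at further sitemap files instead of pages.
    """
    root = ET.fromstring(xml_text)
    kind = root.tag.replace(SITEMAP_NS, "")
    urls = [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
    return kind, urls


def allowed_urls(robots_txt, urls, user_agent="*"):
    """Keep only the URLs that robots.txt permits for user_agent."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [u for u in urls if rp.can_fetch(user_agent, u)]


# Hypothetical sample data standing in for files fetched from a real site.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/private/report</loc></url>
</urlset>"""

SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"

if __name__ == "__main__":
    kind, urls = parse_sitemap(SAMPLE_SITEMAP)
    for url in allowed_urls(SAMPLE_ROBOTS, urls):
        print(url)
```

If parse_sitemap reports "sitemapindex", each returned URL is itself a sitemap that would need to be fetched and parsed the same way. Sites often advertise the sitemap location with a "Sitemap:" line in robots.txt, so that file is a reasonable first place to look.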
Comments
hoju almost 2 years
I would like to generate a list of URLs for a domain but I would rather save bandwidth by not crawling the domain myself. So is there a way to use existing crawled data?
One solution I thought of would be to do a Yahoo site search, which lets me download the first 1000 results in TSV format. However, to get all the records beyond that I would have to scrape the search results. Google also supports site search but doesn't offer an easy way to download the data.
Can you think of a better way that would work with most (if not all) websites?
thanks, Richard