Website crawler/spider to get site map
Solution 1
After a lot of research, no tool satisfied me, so I'm coding my own using Scrapy (http://scrapy.org/doc/).
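For anyone taking the same route, here is a minimal sketch of what such a Scrapy spider could look like. This is not the author's actual code: the class name, the parse_page callback, and example.org are all illustrative placeholders, and it assumes Scrapy is installed.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SiteMapSpider(CrawlSpider):
    name = "sitemap"
    allowed_domains = ["example.org"]   # stay on the target site
    start_urls = ["http://example.org/"]

    # Follow every in-domain link; record each page that is reached
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        yield {"url": response.url}     # one record per crawled URL

Saved as sitemap_spider.py, it can be run with "scrapy runspider sitemap_spider.py -o map.json" to dump the list of discovered URLs as JSON.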
Solution 2
Here is an example of one written in Python 2:
(Taken from http://theanti9.wordpress.com/2009/02/14/python-web-crawler-in-less-than-50-lines/ )
That site also links to a GitHub project, http://github.com/theanti9/PyCrawler, which is a more robust version by the same author.
import re
import urllib2
import urlparse

tocrawl = set(["http://www.facebook.com/"])  # frontier: URLs still to visit
crawled = set([])                            # URLs already visited

keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile(r'<a\s[^>]*href=[\'"](.*?)[\'"]')

while tocrawl:
    crawling = tocrawl.pop()
    print crawling
    url = urlparse.urlparse(crawling)
    try:
        response = urllib2.urlopen(crawling)
    except Exception:
        continue  # skip pages that fail to fetch
    msg = response.read()
    # Pull out the page title, if there is one
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos + 7)
        if endPos != -1:
            print msg[startPos + 7:endPos]
    # Pull out the meta keywords, if there are any
    keywordlist = keywordregex.findall(msg)
    if keywordlist:
        print keywordlist[0].split(", ")
    crawled.add(crawling)
    # Resolve each link to an absolute URL and queue unseen ones
    for link in linkregex.findall(msg):
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link
        if link not in crawled:
            tocrawl.add(link)
Solution 3
I personally use Kapow Katalyst, but I guess it's out of your budget. If not, it's probably the most intuitive software for creating spiders, and it can do much more if you need it.
Solution 4
Technically speaking there is no foolproof way of extracting the directory structure of a website.
This is because HTTP is not a network file system. The only thing you can do with HTTP is follow the links from the starting page. Furthermore, nothing requires the starting page to link only to its immediate subdirectories. A top-level index.html page may, for example, link directly to "foo/baz/blah.html", deep in some subdirectory.
Edit:
To generate basic site maps, there are online tools commonly known as sitemap generators. One such tool is web-site-map.com, which produces a sitemap in XML.
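Once you have such an XML sitemap, listing its URLs is straightforward. A minimal sketch in Python 2, assuming the file follows the standard sitemaps.org protocol and using example.org as a placeholder:

import urllib2
import xml.etree.ElementTree as ET

# The namespace defined by the sitemaps.org protocol
NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

tree = ET.parse(urllib2.urlopen('http://example.org/sitemap.xml'))
for loc in tree.iter(NS + 'loc'):
    print loc.text   # one URL per <loc> entry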
If you are comfortable with programming, you can also write your own web spider, with a set of rules specific to a particular site.
Comments
-
ack__ over 1 year
I need to retrieve a whole website map, in a format like:
- http://example.org/
- http://example.org/product/
- http://example.org/service/
- http://example.org/about/
- http://example.org/product/viewproduct/
I need it to be link-based (no file or directory brute-force), like:
parse homepage -> retrieve all links -> explore them -> retrieve links, ...
And I also need the ability to detect whether a page is a "template", so as not to retrieve all of its "child pages". For example, if the following links are found:
- http://example.org/product/viewproduct?id=1
- http://example.org/product/viewproduct?id=2
- http://example.org/product/viewproduct?id=3
I need to get http://example.org/product/viewproduct only once.
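A minimal sketch of that collapsing step (in Python 2, where the module is urlparse; on Python 3 it is urllib.parse): stripping the query string maps every id variant onto one canonical URL.

from urlparse import urlparse, urlunparse

def canonical(url):
    p = urlparse(url)
    # Keep scheme, host and path; drop params, query and fragment
    return urlunparse((p.scheme, p.netloc, p.path, '', '', ''))

seen = set()
for u in ['http://example.org/product/viewproduct?id=1',
          'http://example.org/product/viewproduct?id=2',
          'http://example.org/product/viewproduct?id=3']:
    seen.add(canonical(u))
# seen == set(['http://example.org/product/viewproduct'])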
I've looked into HTTrack and wget (with its spider option), but nothing conclusive so far.
The tool should be downloadable, and I'd prefer it to run on Linux. It can be written in any language.
Thanks
-
ack__ over 11 years
Indeed, I'm looking for a follow-link style spider. It's no problem if sites have links that don't point only to subdirectories; the software can later trim the content found and organize it into a tree view. I don't want to rely on XML sitemaps, as they don't present all of a site's content. And as for programming my own spider, that is much more complicated than it looks (see various threads on Stack Overflow), and it takes a huge amount of time.
-
ack__ over 11 years
Thanks, I didn't know about this one. I'll take a look, although I don't have the budget for it at this time.
-
Hashim Aziz over 3 years
Did you get anywhere with this project?
-
ack__ over 3 years
I did! I used a mix of Scrapy and PhantomJS to achieve my goals. Nothing I can share though, as it's now licensed as part of a commercial product.