Website crawler/spider to get site map

Solution 1

After a lot of research, no tool satisfied me, so I'm coding my own using http://scrapy.org/doc/
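
For context, a minimal link-following spider in Scrapy might look like the sketch below. This is written against a recent Scrapy release (the project has evolved a lot); the spider name, allowed domain, and start URL are placeholders to adapt to the target site:

import scrapy

class SiteMapSpider(scrapy.Spider):
    # Placeholder name, domain and start URL -- point these at the site to map.
    name = "sitemap"
    allowed_domains = ["example.org"]
    start_urls = ["http://example.org/"]

    def parse(self, response):
        # Record every page that is actually reached.
        yield {"url": response.url, "title": response.css("title::text").get()}
        # Queue every in-page link; Scrapy de-duplicates requests by default.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Running it with "scrapy runspider sitemap_spider.py -o pages.json" dumps one entry per crawled page, which can then be arranged into a site map.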

Solution 2

Here is an example of one written in Python:

(Taken from http://theanti9.wordpress.com/2009/02/14/python-web-crawler-in-less-than-50-lines/ )

That website also links to a GitHub project, http://github.com/theanti9/PyCrawler, which is a more robust version by the same author.

import re
import urllib2
import urlparse

# Frontier of URLs still to visit, and set of URLs already visited.
tocrawl = set(["http://www.facebook.com/"])
crawled = set([])
keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')

while 1:
    try:
        # Take any URL from the frontier; stop when it is empty.
        crawling = tocrawl.pop()
        print crawling
    except KeyError:
        break
    url = urlparse.urlparse(crawling)
    try:
        response = urllib2.urlopen(crawling)
    except Exception:
        # Skip pages that fail to download.
        continue
    msg = response.read()
    # Extract and print the page title, if present.
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos+7)
        if endPos != -1:
            title = msg[startPos+7:endPos]
            print title
    # Extract and print the meta keywords, if present.
    keywordlist = keywordregex.findall(msg)
    if len(keywordlist) > 0:
        keywordlist = keywordlist[0].split(", ")
        print keywordlist
    links = linkregex.findall(msg)
    crawled.add(crawling)
    for link in links:
        # Turn relative links into absolute URLs before queueing them.
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link
        if link not in crawled:
            tocrawl.add(link)

Solution 3

I personally use Kapow Katalyst, but I guess it's out of your budget. If not, it's probably the most intuitive software for creating spiders, and it can do much more if you need it to.

Solution 4

Technically speaking there is no foolproof way of extracting the directory structure of a website.

This is because HTTP is not a network file system. The only thing you can do with HTTP is follow the links from the starting page. Furthermore, there's nothing that requires the starting page to have links only to its immediate subdirectory. A top level index.html page may, for example, have a direct link to "foo/baz/blah.html", deep in some subdirectory.
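
In other words, the best a tool can do is crawl, record the URLs it actually reaches, and then arrange their paths into a tree after the fact. A rough sketch of that last step (Python 3, with made-up URLs standing in for a real crawl result):

from urllib.parse import urlparse

def build_tree(urls):
    # Nested dicts keyed by path segment, e.g. {"product": {"viewproduct": {}}}.
    tree = {}
    for url in urls:
        parts = [p for p in urlparse(url).path.split("/") if p]
        node = tree
        for part in parts:
            node = node.setdefault(part, {})
    return tree

def print_tree(node, indent=0):
    for name in sorted(node):
        print("    " * indent + name)
        print_tree(node[name], indent + 1)

# Hypothetical URLs discovered by a crawl.
print_tree(build_tree([
    "http://example.org/",
    "http://example.org/foo/baz/blah.html",
    "http://example.org/product/viewproduct",
]))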

Edit:

  • To generate a basic site map, there are online tools commonly known as sitemap generators. One such tool is web-site-map.com, which produces the sitemap in XML.

  • If you are comfortable with programming, you can write your own web spider with a set of rules specific to the particular site; a small sketch follows.
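
As an illustration of such a rule, the sketch below (Python 3; the example.org URLs are placeholders) drops query strings so that pages which differ only in their parameters, such as /product/viewproduct?id=1 and ?id=2, collapse into a single sitemap entry:

from urllib.parse import urlparse, urlunparse

def normalize(url):
    # Hypothetical site-specific rule: strip the query string and fragment so
    # "template" pages that differ only in parameters count once.
    parts = urlparse(url)
    return urlunparse((parts.scheme, parts.netloc, parts.path, "", "", ""))

seen = set()
for url in ["http://example.org/product/viewproduct?id=1",
            "http://example.org/product/viewproduct?id=2"]:
    seen.add(normalize(url))

print(seen)  # {'http://example.org/product/viewproduct'}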

Comments

  • ack__
    ack__ over 1 year

    I need to retrieve a whole website map, in a format like:

    I need it to be link-based (no file or directory brute-force), like:

    parse homepage -> retrieve all links -> explore them -> retrieve links, ...

    I also need the ability to detect whether a page is a "template", so as not to retrieve all of its "child pages". For example, if the following links are found:

    I need to get http://example.org/product/viewproduct only once.

    I've looked into HTTrack and wget (with the spider option), but nothing conclusive so far.

    The software/tool should be downloadable, and I'd prefer that it run on Linux. It can be written in any language.

    Thanks

  • ack__
    ack__ over 11 years
    Indeed, I'm looking for a follow-the-links style spider. It's no problem that sites don't link only to their immediate sub-directories; the software can later trim the content found and organize it into a tree view. I don't want to rely on XML sitemaps, as they don't present all of the site's content. As for programming my own spider, that is much more complicated than it looks (see various threads on Stack Overflow) and takes a huge amount of time.
  • ack__
    ack__ over 11 years
    Thanks, I didn't know about this one. I'll take a look, although I don't have the budget for it at this time.
  • Hashim Aziz
    Hashim Aziz over 3 years
    Did you get anywhere with this project?
  • ack__
    ack__ over 3 years
    I did! I used a mix of Scrapy and PhantomJS to achieve my goals. Nothing I can share though, as it's now licensed as part of a commercial product.