Why is wget not willing to download recursively?

Solution 1

I tested this, and found the issue:

wget respects robots.txt unless explicitly told not to.

wget -r http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html
--2015-12-31 12:29:52--  http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html
Resolving www.comp.brad.ac.uk (www.comp.brad.ac.uk)... 143.53.133.30
Connecting to www.comp.brad.ac.uk (www.comp.brad.ac.uk)|143.53.133.30|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 878 [text/html]
Saving to: ‘www.comp.brad.ac.uk/research/GIP/tutorials/index.html’

www.comp.brad.ac.uk/research/GI 100%[======================================================>]     878  --.-KB/s   in 0s     

2015-12-31 12:29:53 (31.9 MB/s) - ‘www.comp.brad.ac.uk/research/GIP/tutorials/index.html’ saved [878/878]

Loading robots.txt; please ignore errors.
--2015-12-31 12:29:53--  http://www.comp.brad.ac.uk/robots.txt
Reusing existing connection to www.comp.brad.ac.uk:80.
HTTP request sent, awaiting response... 200 OK
Length: 26 [text/plain]
Saving to: ‘www.comp.brad.ac.uk/robots.txt’

www.comp.brad.ac.uk/robots.txt  100%[======================================================>]      26  --.-KB/s   in 0s     

2015-12-31 12:29:53 (1.02 MB/s) - ‘www.comp.brad.ac.uk/robots.txt’ saved [26/26]

FINISHED --2015-12-31 12:29:53--

As you can see, wget did exactly what you asked it to do.

What does the robots.txt say in this case?

cat robots.txt
User-agent: *
Disallow: /

So this site doesn't want robots downloading anything, at least not robots that read and follow robots.txt; usually this means the site doesn't want to be indexed by search engines.
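
If you want to check a site's robots.txt before starting a recursive download, you can print it without saving anything; a quick sketch (-q silences the progress output, -O - writes the file to standard output):

wget -q -O - http://www.comp.brad.ac.uk/robots.txt

To tell wget to ignore robots.txt, pass -e robots=off: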

wget -r -erobots=off  http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html
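
The -e option executes the string that follows as if it were a line in your .wgetrc, so robots=off switches off the robots check for this run only. If you do bypass robots.txt, it's also polite to pace your requests; a sketch with an arbitrary one-second delay between fetches:

wget -r -e robots=off --wait=1 http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html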

Now, if wget is simply too powerful for you to learn, that's fine too, but don't make the error of thinking the flaw is in wget.

There's a risk to doing recursive downloads of a site, however, so it's sometimes best to use limits to avoid grabbing the entire site:

wget -r -erobots=off -l2 -np  http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html
  • -l2 means 2 levels at most; -l stands for "level" (recursion depth).
  • -np means don't go UP in the tree, only down from the start page; -np stands for "no parent". (The same command with long options is shown after this list.)
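
If the command ends up in a script, the long spellings of the same options are easier to read (same behaviour, just more verbose):

wget --recursive --level=2 --no-parent -e robots=off http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html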

It just depends on the target page; sometimes you want to specify exactly what to get and what not to get. In this case, for example, you are only getting the default .html/.htm pages, not graphics, PDFs, or music/video files. The -A option lets you list the extension types you want to grab, as sketched below.
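
For instance, a sketch that also accepts PDFs and common image formats (the suffix list is only an illustration; note that wget still downloads HTML pages so it can follow their links, and keeps only the files whose names match the accept list):

wget -r -e robots=off -l2 -np -A html,pdf,jpg,png http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html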

By the way, I checked, and my wget, version 1.17, is from 2015. I'm not sure what version you are using. Python, I believe, was also created in the 90s, so by your reasoning Python is also junk from the 90s.

I admit the wget --help output is quite dense and feature-rich, as is the wget man page, so it's understandable that someone wouldn't want to read all of it, but there are tons of online tutorials that explain how to do the most common wget tasks.

Solution 2

Same answer as above, but without unnecessary smugness:

wget respects the site's robots.txt and may not recurse if robots.txt disallows it. To disable this behavior, add the flag -e robots=off.
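
If you always want this behaviour, the equivalent setting can live in your per-user configuration file instead of being passed on every command line (a sketch; ~/.wgetrc is the usual per-user location):

# ~/.wgetrc
robots = off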

Comments

  • foobar (almost 2 years ago)

    The command

    $ wget -r http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html
    

    only downloads index.html and robots.txt for me, even though there are links in it to further pages in the same directory. For example

    <A HREF="viewp.html">Viewpoint specification</A>
    

    Why does wget ignore that?

  • foobar (over 8 years ago)
    Yes, it is a flaw: if I say recursive, then it should do just that! Otherwise it is misdocumented. By the way, I knew about the levels, but it was clear that this site has only a few. I am not a robot.
  • foobar (over 8 years ago)
    There is a reason we have (user) interfaces (and documentation) for software. Division of labour! One cannot learn every little technical detail! man wget says "Turn on recursive retrieving." and not "Turn on recursive retrieving but stop if robots.txt recommends so." I want to be in charge of my software, not some webmaster who clearly failed with his robots.txt.