Using wget to mirror a website and everything from the first level of external sites


Solution 1

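Unfortunately, wget cannot do this on its own. The obvious attempt, -H -l 1, does not do what you expect: the depth limit applies to the whole crawl, so wget only follows the links found on the start page (on any host) instead of mirroring the whole site plus one level of external pages. What you want is HTTrack.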

httrack --ext-depth=1 http://example.com

This can also be abbreviated as httrack -%e1 http://example.com. Note that HTTrack counts levels starting at 1, not 0, so it won't follow links found on external pages unless you increase the depth.
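As a minimal sketch, the HTTrack call can be wrapped in a small shell function that also picks an output directory with -O. The function name mirror_httrack, the default directory ./mirror and the DRY_RUN switch are assumptions made for illustration; they are not part of HTTrack.

```shell
# Hypothetical helper: mirror a site with HTTrack, including external
# pages one level deep. The name, default directory and DRY_RUN switch
# are illustrative assumptions.
mirror_httrack() {
    url=$1
    dest=${2:-./mirror}   # assumed default output directory

    # With DRY_RUN=1, print the command instead of running it.
    if [ "${DRY_RUN:-0}" = 1 ]; then run=echo; else run=; fi

    # -O sets the output directory; --ext-depth=1 fetches external
    # pages one level deep.
    $run httrack "$url" -O "$dest" --ext-depth=1
}
```

Running DRY_RUN=1 mirror_httrack http://example.com prints the command without touching the network, which is a cheap way to check the options before starting a long crawl.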

Solution 2

I would combine two commands: wget -m -k -K -p http://example.com && wget -r -k -K -H -N -l 1 http://example.com.

About the two commands: wget -m -k -K -p http://example.com mirrors the site (-m is equivalent to -r -N --level=inf --no-remove-listing), converts the links to point at your local mirror (-k), backs up each original file before it is converted (-K), and downloads all prerequisites needed to view the mirror properly (-p).

The second command, wget -r -k -K -H -N -l 1 http://example.com, does essentially the same, but only one level deep and spanning all hosts (-H). It checks timestamps with -N, so files that are already up to date are not downloaded again. I left out -p here because, combined with -H, it could pull in a very large number of files.
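The two passes described above can be sketched as a small shell function. The function name mirror_with_externals and the DRY_RUN switch are assumptions for illustration only; the wget options are exactly the ones from the one-liner.

```shell
# Hypothetical helper for the two-pass approach. The function name and
# the DRY_RUN switch are illustrative assumptions, not wget features.
mirror_with_externals() {
    url=$1

    # With DRY_RUN=1, print each command instead of running it.
    if [ "${DRY_RUN:-0}" = 1 ]; then run=echo; else run=; fi

    # Pass 1: full mirror of the site itself. Like the && in the
    # one-liner, stop here if wget reports a failure.
    $run wget -m -k -K -p "$url" || return 1

    # Pass 2: one level deep, spanning hosts; -N skips files that are
    # already up to date locally.
    $run wget -r -k -K -H -N -l 1 "$url"
}
```

Running DRY_RUN=1 mirror_with_externals http://example.com prints both wget commands without downloading anything. Keep in mind that wget exits with a non-zero status for many partial failures, such as a single missing page, so the second pass may be skipped more often than intended.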


Admin
Author by

Admin

Updated on September 17, 2022

Comments

  • Admin
    Admin over 1 year

    I need to mirror a particular website (all the pages under that particular domain) and any pages (but not whole sites) that the website links to.

    I'm confused about how to do this.

    wget -r --level=inf (or some other variant) will mirror the site.

    wget -r -H --level=1 will get all the links (from all domains) to the first level.

    Does anyone have any ideas on how I could combine these, to get the entire main site and go one level deep into external sites? I've been banging my head against the manual all afternoon.

    Thanks