How to download an HTTP directory with all files and sub-directories as they appear on the online files/folders list?


Solution 1

Solution:

wget -r -np -nH --cut-dirs=3 -R index.html http://hostname/aaa/bbb/ccc/ddd/

Explanation:

  • It downloads all files and subfolders under the ddd directory
  • -r : recursive
  • -np : no parent; does not ascend to upper directories such as ccc/…
  • -nH : does not save files into a hostname/ folder
  • --cut-dirs=3 : saves straight into ddd by omitting the first 3 path components (aaa, bbb, ccc)
  • -R index.html : excludes index.html files
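
A fuller variant, combining these flags with tips from the comments further down, might look like this (a sketch, not part of the original answer; hostname/aaa/bbb/ccc/ddd/ is a placeholder):

# -e robots=off  ignores a robots.txt that would otherwise block the download
# -w 1           waits one second between requests, to go easy on the server
wget -r -np -nH --cut-dirs=3 -R "index.html*" -e robots=off -w 1 http://hostname/aaa/bbb/ccc/ddd/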

Reference: http://bmwieczorek.wordpress.com/2008/10/01/wget-recursively-download-all-files-from-certain-directory-listed-by-apache/

Solution 2

I was able to get this to work thanks to this post utilizing VisualWGet. It worked great for me. The important part is to check the -recursive flag in the options dialog.

I also found that the -no-parent flag is important; otherwise it will try to download everything.


Solution 3

You can use lftp, the Swiss army knife of downloading. If you have bigger files, add --use-pget-n=10 to the command to split each file into 10 parallel segments:

lftp -c 'mirror --parallel=100 https://example.com/files/ ;exit'
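
For example, a sketch with the segment option added (the URL is a placeholder; --use-pget-n=10 splits each large file into 10 parallel segments):

lftp -c 'mirror --parallel=10 --use-pget-n=10 https://example.com/files/ ; exit'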

Solution 4

wget -r -np -nH --cut-dirs=3 -R index.html http://hostname/aaa/bbb/ccc/ddd/

From man wget

‘-r’ ‘--recursive’ Turn on recursive retrieving. See Recursive Download, for more details. The default maximum depth is 5.

‘-np’ ‘--no-parent’ Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded. See Directory-Based Limits, for more details.

‘-nH’ ‘--no-host-directories’ Disable generation of host-prefixed directories. By default, invoking Wget with ‘-r http://fly.srk.fer.hr/’ will create a structure of directories beginning with fly.srk.fer.hr/. This option disables such behavior.

‘--cut-dirs=number’ Ignore number directory components. This is useful for getting a fine-grained control over the directory where recursive retrieval will be saved.

Take, for example, the directory at ‘ftp://ftp.xemacs.org/pub/xemacs/’. If you retrieve it with ‘-r’, it will be saved locally under ftp.xemacs.org/pub/xemacs/. While the ‘-nH’ option can remove the ftp.xemacs.org/ part, you are still stuck with pub/xemacs. This is where ‘--cut-dirs’ comes in handy; it makes Wget not “see” number remote directory components. Here are several examples of how ‘--cut-dirs’ option works.

No options        -> ftp.xemacs.org/pub/xemacs/
-nH               -> pub/xemacs/
-nH --cut-dirs=1  -> xemacs/
-nH --cut-dirs=2  -> .
--cut-dirs=1      -> ftp.xemacs.org/xemacs/
...

If you just want to get rid of the directory structure, this option is similar to a combination of ‘-nd’ and ‘-P’. However, unlike ‘-nd’, ‘--cut-dirs’ does not lose with subdirectories: for instance, with ‘-nH --cut-dirs=1’, a beta/ subdirectory will be placed to xemacs/beta, as one would expect.

Solution 5

No Software or Plugin required!

(only usable if you don't need recursive depth)

Use a bookmarklet: drag this link into your bookmarks, then edit it and paste in this code:

(function () {
  // Ask for an extension or substring; every link whose URL contains it gets downloaded.
  var ext = prompt("Select extension for download (all links containing it will be downloaded)", ".mp3");
  if (ext === null) return; // user pressed Cancel
  var links = document.links;
  for (var i = 0; i < links.length; i++) {
    // indexOf returns -1 when the substring is not found
    if (links[i].href.indexOf(ext) !== -1) {
      links[i].setAttribute("download", links[i].text);
      links[i].click();
    }
  }
})();

Then go to the page you want to download files from and click the bookmarklet.


Comments

  • Omar
    Omar about 4 years

    There is an online HTTP directory that I have access to. I have tried to download all sub-directories and files via wget. But the problem is that when wget downloads sub-directories it downloads the index.html file, which contains the list of files in that directory, without downloading the files themselves.

    Is there a way to download the sub-directories and files without a depth limit (as if the directory I want to download were just a folder that I want to copy to my computer)?

  • John
    John about 9 years
    Thank you! Also, FYI according to this you can use -R like -R css to exclude all CSS files, or use -A like -A pdf to only download PDF files.
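
    For instance, a quick sketch (hostname/docs/ is a placeholder URL):

        wget -r -np -A pdf http://hostname/docs/   # download only PDF files
        wget -r -np -R css http://hostname/docs/   # exclude all CSS files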
  • jgrump2012
    jgrump2012 almost 8 years
    Thanks! Additional advice taken from the wget man page: When downloading from Internet servers, consider using the ‘-w’ option to introduce a delay between accesses to the server. The download will take a while longer, but the server administrator will not be alarmed by your rudeness.
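
    For example (placeholder URL):

        wget -r -np -w 2 http://hostname/aaa/   # wait 2 seconds between requests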
  • hamish
    hamish about 7 years
    I get this error 'wget' is not recognized as an internal or external command, operable program or batch file.
  • Mingjiang Shi
    Mingjiang Shi about 7 years
    @hamish you may need to install wget first, or wget is not in your $PATH.
  • Benoît Latinier
    Benoît Latinier almost 7 years
    Some explanations would be great.
  • SDsolar
    SDsolar over 6 years
    Just found this - Dec 2017. It works fine. I got it at sourceforge.net/projects/visualwget
  • coder3521
    coder3521 over 6 years
    Worked fine on a Windows machine; don't forget to check the options mentioned in the answer, else it won't work
  • Daniel Hershcovich
    Daniel Hershcovich about 6 years
    Great answer, but note that if there is a robots.txt file disallowing the downloading of files in the directory, this won't work. In that case you need to add -e robots=off . See unix.stackexchange.com/a/252564/10312
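
    For example (placeholder URL):

        wget -r -np -e robots=off http://hostname/aaa/bbb/   # ignore robots.txt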
  • MilkyTech
    MilkyTech over 5 years
    I've installed wget and can't get this to work. Not at all with cmd.exe, but somewhat in Windows PowerShell. If I just enter "wget someurl" it gives me a bunch of info, but if I try to add any of the parameters I get an error that a parameter cannot be found that matches parameter name 'r'.
  • user305883
    user305883 over 5 years
    On Mac: Warning: Invalid character is found in given range. A specified range MUST have only digits in 'start'-'stop'. The server's response to this request is uncertain. curl: no URL specified! curl: try 'curl --help' or 'curl --manual' for more information. No result.
  • Mingjiang Shi
    Mingjiang Shi over 5 years
    @user305883 the warning message you posted is from curl?
  • user305883
    user305883 over 5 years
    @MingjiangShi from wget (the command line from your answer). I also tried curl -O 'http://example.com/directory/' but it does not go through: curl: Remote file name has no length! There is an html page with <pre> <a href="name.pdf">name.pdf</a> <a href="name2.pdf">name2.pdf</a> <a href="image1.png">image1.png</a> <a href="name3.pdf">name3.pdf</a>...</pre> and I wish to download all the listed documents (in the href).
  • Yannis Dran
    Yannis Dran about 5 years
    Doesn't work with certain HTTPS sites. @DaveLucre, if you tried the wget solution in cmd you would be able to download as well, but some servers do not allow it, I guess.
  • Yannis Dran
    Yannis Dran about 5 years
    What about HTTPS? I get the warning: OpenSSL: error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure. Unable to establish SSL connection.
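
    One hedged workaround for such handshake failures, assuming a reasonably recent wget build and a server that still speaks TLS 1.2 (placeholder URL):

        wget -r -np --secure-protocol=TLSv1_2 https://hostname/aaa/   # force TLS 1.2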
  • T.Todua
    T.Todua almost 5 years
    What does checking --no-parent do?
  • mateuscb
    mateuscb almost 5 years
    it's the same setting as wget (as one of the other answers here): ‘-np’ ‘--no-parent’ Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded. See Directory-Based Limits, for more details.
  • Jolly1234
    Jolly1234 about 4 years
    To get rid of all the different types of index files (index.html?... etc) you need to ensure you add: -R index.html*
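
    For example, with the pattern quoted so the shell does not expand it (placeholder URL):

        wget -r -np -nH --cut-dirs=3 -R "index.html*" http://hostname/aaa/bbb/ccc/ddd/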
  • Mr Programmer
    Mr Programmer about 4 years
    Working in March 2020!
  • Admin
    Admin almost 4 years
    What about downloading a specific file type using VisualWget? Is it possible to download only mp3 files in a directory and its sub-directories in VisualWget?
  • n13
    n13 almost 4 years
    Worked perfectly and really fast; this maxed out my internet line downloading thousands of small files. Very good.
  • Mujtaba
    Mujtaba over 3 years
    Can anybody help me out? I am only getting one file, index.html.tmp, and a blank folder. What is the issue?
  • Namo
    Namo over 3 years
    I recommend the option below: --reject-regex "(.*)\?(.*)"
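
    For example (placeholder URL; the regex rejects any URL containing a query string, such as the sort links in Apache listings):

        wget -r -np --reject-regex "(.*)\?(.*)" http://hostname/aaa/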
  • leetbacoon
    leetbacoon over 3 years
    Please explain what these parameters do.
  • nwgat
    nwgat over 3 years
    -c tells lftp to execute the quoted commands and exit; mirror mirrors the remote content locally; --parallel=100 downloads up to 100 files at once; exit quits the program; --use-pget-n splits bigger files into segments and downloads the segments in parallel
  • Hassen Ch.
    Hassen Ch. over 3 years
    I had issues with this command. Some videos I was trying to download were broken. If I download them normally and individually from the browser it works perfectly.
  • Hassen Ch.
    Hassen Ch. over 3 years
    The most voted solution has no problem with any file. All good!
  • Jahan Zinedine
    Jahan Zinedine about 3 years
    Thanks @nwgat it worked like a charm, and matched my requirements.
  • a3k
    a3k almost 3 years
    Does this open the save as dialog for every file?
  • ßiansor Å. Ålmerol
    ßiansor Å. Ålmerol almost 3 years
    PHP files are all blank
  • MadHatter
    MadHatter almost 3 years
    This command works for me. Just one more thing: if file names contain other UTF-8 characters, we can add one more parameter, "--restrict-file-names=nocontrol".
  • corl
    corl over 2 years
    This worked really well for me, exactly what I needed for my problem. Plus it is blindingly fast, especially with the --use-pget switch set. Thanks @nwgat
  • 0script0
    0script0 over 2 years
    Unfortunately, it doesn't work for the above case. It follows the parent directory regardless of the --no-parent flag.
  • MindRoasterMir
    MindRoasterMir over 2 years
    This add-on is not doing anything for me. Thanks.
  • Akhil Raj
    Akhil Raj over 2 years
    Note that since the default recursion depth limit is 5, you have to increase it with '-l <number>' to set the depth limit as desired. Use 'inf' or '0' for infinite depth.
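
    For example (placeholder URL):

        wget -r -l inf -np http://hostname/aaa/bbb/   # recurse with no depth limit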
  • Mark Miller
    Mark Miller almost 2 years
    Does this work from the command line in Windows 10?
  • Dave
    Dave almost 2 years
    Latest version of vwget (2.4.105.0) uses wget version 1.11, which does not work with HTTPS sites. See this post for more info; I could not get it to work at all, unfortunately. stackoverflow.com/questions/28757232/…