Wget: downloading files selectively and recursively?


This command will download only images and movies from a given website:

wget -nd -r -P /save/location -A jpeg,jpg,bmp,gif,png,mov "http://www.somedomain.com"

According to the wget manual:

-nd prevents the creation of a directory hierarchy (i.e. no directories).

-r enables recursive retrieval. See Recursive Download for more information.

-P sets the directory prefix where all files and directories are saved to.

-A sets a whitelist for retrieving only certain file types. Strings and patterns are accepted, and both can be used in a comma separated list (as seen above). See Types of Files for more information.
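
Note that -A accepts glob patterns as well as plain suffixes. As a small sketch (the domain is a placeholder), quote the pattern list so the shell does not expand the wildcards before wget sees them:

wget -nd -r -P /save/location -A "*.jpe?g,*.png,*.mov" "http://www.example.com"

Here "*.jpe?g" matches both .jpeg and .jpg, since ? stands for a single character.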

If you would like to download from a folder and its subfolders without ascending to the parent directory, use the --no-parent flag, with a command similar to this:

wget -r -l1 --no-parent -P /save/location -A jpeg,jpg,bmp,gif,png,mov "http://www.somedomain.com"

-r: recursive retrieval
-l1: sets the maximum recursion depth to 1
--no-parent: never ascends to the parent directory; downloads only from the specified subdirectory and below
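
Applied to the "travels/" layout described in the comments below (a hypothetical website.com/travels/ with one level of subfolders), raising the depth to 2 lets wget descend into each subfolder; this is a sketch, not a tested command:

wget -r -l2 --no-parent -P /save/location -A jpg,mov "http://website.com/travels/"

With -l2, files directly inside travels/ sit at depth 1 and files inside its subfolders at depth 2, so both are within reach.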

Regarding the index.html page: it will be excluded once the -A flag is included in the wget command, because -A tells wget to keep only the listed file types. wget still has to download each HTML page in order to extract the links it recurses into, but since html is not in the accept list, it deletes the page afterwards and prints a message like the following in the terminal:

Removing /save/location/default.htm since it should be rejected.
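
Conversely, if you do want to keep the HTML pages as well, a minimal tweak (an untested sketch) is simply to add html and htm to the accept list:

wget -nd -r -P /save/location -A jpeg,jpg,bmp,gif,png,mov,html,htm "http://www.somedomain.com"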

wget can download specific types of files (e.g. jpg, jpeg, png, mov, avi, mpeg, etc.) when those files exist at the URL provided to it. For example:

Let's say we would like to download .zip and .chd files from the archive.org listing used below.

At that link there are both folders and .zip files (scroll to the end). Now, suppose we run this command:

wget -r --no-parent -P /save/location -A chd,zip "https://archive.org/download/MAME0.139_MAME2010_Reference_Set_ROMs_CHDs_Samples/roms/"

This command will download the .zip files, but at the same time it will create only empty folders for the .chd files.

In order to download the .chd files, we need to extract the names of those empty folders, convert the folder names to their actual URLs, and put all the URLs of interest in a text file, file.txt. Finally, we feed this text file to wget, as follows:

wget -r --no-parent -P /save/location -A chd,zip -i file.txt

The previous command will download all the .chd files.
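
For completeness, here is a minimal sketch of the extraction step, assuming GNU find and folder names that need no URL-encoding (both are assumptions, so check your listing):

# run from the save location; adjust the path if wget created a deeper tree
cd /save/location
# print the bare names of the empty folders wget created
find . -type d -empty -printf '%f\n' | while IFS= read -r dir; do
  # rebuild each folder's URL on the listing we mirrored from
  echo "https://archive.org/download/MAME0.139_MAME2010_Reference_Set_ROMs_CHDs_Samples/roms/$dir/"
done > file.txt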




Comments

  • T. Caio
    T. Caio over 1 year

    Question about wget, subfolder, and index.html.

    Let's say I am inside the "travels/" folder and this is on "website.com": "website.com/travels/".

    Folder "travels/" contains a lot of files and other (sub)folders: "website.com/travels/list.doc" , "website.com/travels/cover.png" , "website.com/travels/[1990] America/" , "website.com/travels/[1994] Japan/", and so on...

    How can I download only the ".mov" and ".jpg" files that reside in the subfolders? I don't want to pick files from "travels/" itself (e.g. not "website.com/travels/list.doc").

    I found a wget command (on Unix & Linux Stack Exchange; I don't remember which discussion it was) capable of downloading only each subfolder's "index.html" and none of the other contents. Why download only index files?

    • Admin
      Admin over 5 years
      Hi @T. Caio, would you please correct your link? It doesn't seem to be the right one!
    • T. Caio
      T. Caio over 5 years
      Hi @Goro, what link should I correct? Sorry, I'm not an English speaker and I'm quite new to Linux.
    • Admin
      Admin over 5 years
      In the question you said "Here on https://unix.stackexchange.com ..." but there is no question about wget at that link! You probably copied/pasted the Unix website link.
    • Admin
      Admin over 5 years
      So you would like to know how to download (only) images and videos from a website's subfolders, is this correct?
    • T. Caio
      T. Caio over 5 years
      @Goro Correct! There is more than one subfolder.
    • Admin
      Admin over 5 years
      Please try the command below and let me know if you have additional questions ;-) I encourage you to read the wget manual at gnu.org/software/wget/manual/…, it is very useful!
    • T. Caio
      T. Caio over 5 years
      I'll try in a few hours. But from what I see, your suggestion seems similar to the one I already tried, which grabbed only the index.html files inside the subfolders.
    • Admin
      Admin over 5 years
      The -A flag in the command means accept only the listed extensions. You can add the -R flag with a list of file extensions that you don't want to download, e.g. *.html. The -l 1 flag means download from subdirectories at depth 1 from the parent folder; you can adjust the subfolder depth with this flag.
    • T. Caio
      T. Caio over 5 years
      @Goro I'll let you know. See you later, thanks!
  • T. Caio
    T. Caio over 5 years
    wget -r -l2 --no-parent -P /my/local/path/ -A jpg https://website.com/remotefolder/ is NOT working (for my needs). wget "entered" all the subfolders, but for each one it only downloaded the respective "index.html" file (then removed it because rejected). It didn't even try to download any further contents!
  • T. Caio
    T. Caio over 5 years
    @Guru with your last try, wget keeps entering all the subfolders, but the only thing it does is (try to) download "index.html" (rejected), and only that. Nothing more. It seems like wget is blind to whatever is inside each subfolder...
  • T. Caio
    T. Caio over 5 years
    @Guru I tried omitting the -A option: only "index.html" is downloaded, no other files (and no sub-subfolders). GNU Wget 1.19.5 on Arch Linux x86_64.
  • T. Caio
    T. Caio over 5 years
    wget -r --no-parent -P /local/path -A chd https://archive.org/download/MAME0.139_MAME2010_Reference_Set_ROMs_CHDs_Samples/roms/
  • T. Caio
    T. Caio over 5 years
    Your last command is still NOT working. Could you check whether the reason (in my case) might be related to this?: unix.stackexchange.com/questions/293283/…
  • Admin
    Admin over 5 years
    I don't know how to prove to you that it is working. But I know that I got png files from Bing. I also tried www.google.com and I got images!
  • T. Caio
    T. Caio over 5 years
    Hmm... Did you try my exact command? Maybe that website has some weird configuration?
  • T. Caio
    T. Caio over 5 years
    With -e robots=off added after the -r flag, the behavior still remains the same: wget keeps "entering" each subfolder, but the only thing it sees and tries to download is "index.html"...
  • Admin
    Admin over 5 years
    @T. Caio Please see my edits. Let's chat!
  • T. Caio
    T. Caio over 5 years
    Yes! It worked! I am left with empty "somename.zip" folders corresponding to the *.zip files inside the parent. I also have all the subfolders that I need, each one with its *.chd content! wget -r --no-parent -P /local/path/ -A chd https://archive.org/download/MAME0.139_MAME2010_Reference_Set_ROMs_CHDs_Samples/roms/