Using Wget to Recursively Crawl a Site and Download Images

Solution 1

Why don't you try using wget -A jpg,jpeg -r http://example.com?
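
A rough sketch of that command (example.com is a placeholder; the wait/rate flags from your question are optional and just keep the crawl polite):

    # recurse from the start page, keeping only files whose names end in .jpg or .jpeg
    wget -r -A jpg,jpeg --wait=10 --limit-rate=100K http://example.com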

Solution 2

How do you expect wget to know the contents of subpage13.html (and so the JPEGs it links to) if it is not allowed to download it? I suggest you allow html, get what you want, then remove what you don't want.


I'm not quite sure why your CGI pages are getting rejected... does wget print any error output? Perhaps make wget verbose (-v) and see. This might be best as a separate question.

That said, if you don't care about bandwidth, download generously and remove what you don't want afterwards; it doesn't matter.
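
As a rough sketch of that approach, reusing the flags from your question (somedomain/images/page1.html is your placeholder start page):

    # allow html so wget can follow the links down to the images
    wget --recursive --no-parent --wait=10 --limit-rate=100K \
         --accept=jpg,jpeg,html --no-directories \
         http://somedomain/images/page1.html
    # then throw away the HTML files that were only needed for crawling
    # (--no-directories puts everything in the current directory)
    find . -maxdepth 1 -name '*.html' -delete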


Also check out --html-extension

From the man page:

-E

--html-extension

If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp .[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename. This is useful, for instance, when you're mirroring a remote site that uses .asp pages, but you want the mirrored pages to be viewable on your stock Apache server. Another good use for this is when you're downloading CGI-generated materials. A URL like http://site.com/article.cgi?25 will be saved as article.cgi?25.html.

Note that filenames changed in this way will be re-downloaded every time you re-mirror a site, because Wget can't tell that the local X.html file corresponds to remote URL X (since it doesn't yet know that the URL produces output of type text/html or application/xhtml+xml). To prevent this re-downloading, you must use -k and -K so that the original version of the file will be saved as X.orig.


--restrict-file-names=unix might also be useful because of those CGI URLs...
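
Putting those options together, it might look something like this (the URL is taken from your question; the accept list follows the suggestion above of allowing html; in newer wget versions the long form of -E is --adjust-extension):

    # -E: append .html to anything served as text/html (e.g. article.cgi?25 -> article.cgi?25.html)
    # -k/-K: convert links for local viewing, keeping .orig copies so a re-mirror
    #        doesn't re-download everything
    # --restrict-file-names=unix: only escape characters unsafe in unix filenames,
    #        leaving the '?' in the CGI URLs alone
    wget -r -E -k -K --restrict-file-names=unix \
         --no-parent --accept=jpg,jpeg,html \
         http://somedomain/images/page1.html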

Comments

  • Cerin
    Cerin over 1 year

    How do you instruct wget to recursively crawl a website and only download certain types of images?

    I tried using this to crawl a site and only download Jpeg images:

    wget --no-parent --wait=10 --limit-rate=100K --recursive --accept=jpg,jpeg --no-directories http://somedomain/images/page1.html
    

    However, even though page1.html contains hundreds of links to subpages, which themselves have direct links to images, wget reports things like "Removing subpage13.html since it should be rejected", and never downloads any images, since none are directly linked to from the starting page.

    I'm assuming this is because my --accept is being used to both direct the crawl and filter content to download, whereas I want it used only to direct the download of content. How can I make wget crawl all links, but only download files with certain extensions like *.jpeg?

    EDIT: Also, some pages are dynamic, and are generated via a CGI script (e.g. img.cgi?fo9s0f989wefw90e). Even if I add cgi to my accept list (e.g. --accept=jpg,jpeg,html,cgi) these still always get rejected. Is there a way around this?

    • CJ7
      CJ7 about 4 years
      How did you go with this issue? Is this just a limitation of wget?
  • Cerin
    Cerin about 13 years
    That downloads all linked media. The only way to use wget to download images is to download ALL content on a page?!
  • Pricey
    Pricey about 13 years
    I should stop linking wget options... I was about to point out --no-parent, but I will stop there.
  • Charles Stewart
    Charles Stewart over 11 years
    The question states that some of the images are of the form /url/path.cgi?query, so your suggestion will not fetch those.