wget - How to download recursively and only specific mime-types/extensions (i.e. text only)


Solution 1

I tried a totally different approach, using Scrapy, but it has the same problem! Here's how I solved it: SO: Python Scrapy - mimetype based filter to avoid non-text file downloads?

The solution is to set up a Node.js proxy and configure Scrapy to use it through the http_proxy environment variable.

What the proxy should do is:

  • Take HTTP requests from Scrapy, send them to the server being crawled, and relay the server's response back to Scrapy, i.e. intercept all HTTP traffic.
  • For binary files (based on a heuristic you implement), send a 403 Forbidden error to Scrapy and immediately close the request/response. This saves time and traffic, and Scrapy won't crash.

Sample proxy code that actually works:

var http = require('http');

http.createServer(function(clientReq, clientRes) {
    // Forward the incoming request to the target server.
    var options = {
        host: clientReq.headers['host'],
        port: 80,
        path: clientReq.url,
        method: clientReq.method,
        headers: clientReq.headers
    };

    var proxyReq = http.request(options, function(proxyRes) {
        var contentType = proxyRes.headers['content-type'] || '';

        // Heuristic: anything that is not text/* counts as binary.
        if (!contentType.startsWith('text/')) {
            // Abort the upstream response and answer 403 so the
            // body is never transferred to the crawler.
            proxyRes.destroy();
            clientRes.writeHead(403);
            clientRes.end('Binary download is disabled.');
            return;
        }

        // Text response: relay status, headers and body unchanged.
        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
    });

    proxyReq.on('error', function(e) {
        console.log('problem with proxy request: ' + e.message);
        clientRes.writeHead(502);
        clientRes.end();
    });

    // Forward any request body and finish the upstream request.
    clientReq.pipe(proxyReq);

}).listen(8080);
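
To run it, start the proxy with Node.js and point the crawler at it through http_proxy; both Scrapy's proxy middleware and wget honor that variable. A minimal sketch, assuming the code above is saved as proxy.js (the filename and the spider name myspider are placeholders):

  node proxy.js &
  http_proxy=http://localhost:8080 scrapy crawl myspider
  # or, for the original wget use case:
  http_proxy=http://localhost:8080 wget -r -l 0 http://www.omardo.com/blog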

Solution 2

You could specify a list of allowed and disallowed filename patterns:

Allowed:

-A LIST
--accept LIST

Disallowed:

-R LIST
--reject LIST

LIST is a comma-separated list of filename patterns/extensions.

You can use the following reserved characters to specify patterns:

  • *
  • ?
  • [
  • ]

Examples:

  • only download PNG files: -A png
  • don't download CSS files: -R css
  • don't download PNG files that start with "avatar": -R avatar*.png
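
Putting this together for the question's goal of skipping binaries, a sketch using the blog URL from the question (the extension lists are illustrative, not exhaustive):

  # keep only text-like files
  wget -r -l 0 -A "html,htm,txt" http://www.omardo.com/blog
  # or, conversely, reject common binary extensions
  wget -r -l 0 -R "png,jpg,jpeg,gif,zip,pdf,mp3,mp4" http://www.omardo.com/blog

Note that in recursive mode wget still downloads HTML pages it needs for link traversal, and deletes the ones the lists reject afterwards.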

If the file has no extension, or the file name has no pattern you could make use of, you'd need MIME type parsing, I guess (see Lars Kotthoff's answer).

Solution 3

You could try patching wget with this (also here) to filter by MIME type. This patch is quite old now though, so it might not work anymore.

Solution 4

The new Wget (Wget2) already has this feature:

--filter-mime-type    Specify a list of mime types to be saved or ignored

--filter-mime-type=list

Specify a comma-separated list of MIME types that will be downloaded.  Elements of list may contain wildcards.
If a MIME type starts with the character '!' it won't be downloaded, this is useful when trying to download
something with exceptions. For example, download everything except images:

  wget2 -r https://<site>/<document> --filter-mime-type=*,\!image/*

It is also useful to download files that are compatible with an application of your system. For instance,
download every file that is compatible with LibreOffice Writer from a website using the recursive mode:

  wget2 -r https://<site>/<document> --filter-mime-type=$(sed -r '/^MimeType=/!d;s/^MimeType=//;s/;/,/g' /usr/share/applications/libreoffice-writer.desktop)
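
Applied to the question's text-only case, the wildcard support quoted above suggests the following (a sketch, not tested against a current Wget2 build):

  wget2 -r http://www.omardo.com/blog --filter-mime-type='text/*'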

Wget2 has not been released as of this writing, but it will be soon. Debian unstable already ships an alpha version.

Look at https://gitlab.com/gnuwget/wget2 for more info. You can post questions/comments directly to [email protected].




Comments

  • Omar Al-Ithawi, almost 2 years ago

    How to download a full website, but ignoring all binary files.

    wget has this functionality with the -r flag, but it downloads everything, and some websites are just too much for a low-resource machine; most of it is of no use for the specific reason I'm downloading the site.

    Here is the command line I use: wget -P 20 -r -l 0 http://www.omardo.com/blog (my own blog)

  • James Andino, over 11 years ago
    Giving this a shot... ftp.gnu.org/gnu/wget I rolled the dice on just patching the newest version of wget with this, but no luck (of course). I would try to update the patch, but I frankly don't yet have the chops in C++ for it not to be a time sink. I did manage to grab the version of wget the patch was written for and get it running, though I had trouble compiling with SSL support because I couldn't figure out which version of OpenSSL I needed.
  • David Portabella, almost 8 years ago
    This looks great. Any idea why this patch hasn't been accepted yet (four years later)?