How to make a recursive wget download combining --accept with --exclude-directories?


Rather than trying to do this with wget, I'd suggest using a tool better suited to downloading complex "sets" of files with filters.

You can use httrack to download entire directories of files (essentially mirroring everything from a site), or you can give it a filter together with specific file extensions, such as downloading only .pdf files.

You can read more about httrack's filter capability, which is what you'd need if you were only interested in downloading files that are named in a specific way.

Here are some examples of the wildcard capability:

  • *[file] or *[name] - any filename or name, i.e. not the /, ? and ; characters
  • *[path] - any path (and filename), i.e. not the ? and ; characters
  • *[a,z,e,r,t,y] - any letters among a,z,e,r,t,y
  • *[a-z] - any letters
  • *[0-9,a,z,e,r,t,y] - any characters among 0..9 and a,z,e,r,t,y

Example

$ httrack http://url.com/files/ -* +1_[a-z].doc -O /dir/to/output

The switches are as follows:

  • -* - remove everything from list of things to download
  • +1_[a-z].doc - download files named 1_a.doc, 1_b.doc, etc.
  • -O /dir/to/output - write results here
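
For the specific case described in the question below (all the .txt and .jpg files, skipping anything under tmp), a rough sketch of what the httrack invocation might look like, reusing the placeholder URL from the question; the exact filter patterns are assumptions and may need tuning for your server's paths:

$ httrack "http://<host>/pub/somedir/" "-*" "+*.txt" "+*.jpg" "-*/tmp/*" -O /dir/to/output

As in the example above, -* first drops everything, the two + filters add back the .txt and .jpg files, and -*/tmp/* is intended to keep anything inside a tmp directory out of the mirror.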



Comments

  • Twisted89 4 months

    I'm trying to download some directories from an Apache server, but I need to ignore some directories that contain huge files I don't care about.

    The dir structure in the server is somewhat like this (simplified):

    somedir/
    ├── atxt.txt
    ├── big_file.pdf
    ├── image.jpg
    └── tmp
        └── tempfile.txt
    

    So, I want to get all the .txt and .jpg files, but I DON'T want the .pdf files or anything inside a tmp directory.

    I've tried using --exclude-directories together with --accept and then with --reject, but in both attempts it keeps downloading the tmp dir and its contents.

    These are the commands I've tried:

    # with --reject
    wget -nH --cut-dirs=2 -r --reject=pdf --exclude-directories=tmp \
             --no-parent  http://<host>/pub/somedir/
    # with --accept
    wget -nH --cut-dirs=2 -r --accept=txt,jpg --exclude-directories=tmp \
             --no-parent  http://<host>/pub/somedir/
    

    Is there a way to do this?

    How exactly is --exclude-directories supposed to work?

  • S edwards almost 9 years
    httrack is definitely a better way.
  • Admin almost 9 years
    httrack -W is always recommended.
  • slm almost 9 years
    @elias - the man page says it takes wildcards, so perhaps you need to define the "directories" using something like */tmp/* (a sketch of that command follows these comments).
  • Stéphane Gourichon about 8 years
    httrack does not support custom headers (needed for authentication). Wget does.
  • MattBianco about 8 years
    There is also cURL (curl.haxx.se), which is very powerful.
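
Following up on slm's wildcard suggestion, an untested sketch of the wget variant; the host and path placeholders come from the question, and whether --exclude-directories matches */tmp/* or needs a variant such as */tmp* depends on how your wget version compares the pattern to the directory path, so treat this as a starting point rather than a confirmed fix:

# variant of the --accept attempt, with a wildcard exclude pattern
wget -nH --cut-dirs=2 -r --accept=txt,jpg --exclude-directories='*/tmp*' \
         --no-parent  http://<host>/pub/somedir/

Quoting the pattern keeps the shell from expanding the * before wget sees it.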