How to download with wget without following links with parameters

Solution 1

wget --reject-regex '(.*)\?(.*)' http://example.com

(The regex type is POSIX by default; see --regex-type.) This works only with recent versions of wget (>= 1.14), according to other comments.

Beware that --reject-regex can apparently be used only once per wget call. If you want to match several patterns, combine them with | in a single regex:

wget --reject-regex 'expr1|expr2|…' http://example.com
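
Applied to the wiki in the question, a combined regex might look roughly like this (a sketch, assuming wget >= 1.14; the exact alternatives to reject depend on which query strings the wiki actually uses):

wget -r -k -np -nv --reject-regex 'action=edit|action=history|action=diff|oldid=|diff=' http://www.boinc-wiki.info/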

Solution 2

The documentation for wget says:

Note, too, that query strings (strings at the end of a URL beginning with a question mark (‘?’)) are not included as part of the filename for accept/reject rules, even though these will actually contribute to the name chosen for the local file. It is expected that a future version of Wget will provide an option to allow matching against query strings.

It looks like this functionality has been on the table for a while and nothing has been done with it.

I haven't used it, but httrack looks like it has a more robust filtering feature set than wget and may be a better fit for what you're looking for (read about its filters at http://www.httrack.com/html/fcguide.html).
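
As a rough sketch of what that could look like with httrack for the wiki in the question (the output directory and the filter patterns here are placeholders; see the filter guide linked above for the exact scan-rule syntax):

httrack 'http://www.boinc-wiki.info/' -O ./boinc-wiki '-*action=edit*' '-*action=history*' '-*oldid=*'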

Solution 3

A newer version of wget (1.14) solves this problem.

You have to use the new --reject-regex=… option to handle query strings.

Note that I couldn't find an updated manual covering these new options, so you have to rely on the built-in help: wget --help > help.txt
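
To check whether your build supports the option before relying on it, something like this should do (a quick sketch using standard tools):

wget --version | head -n 1
wget --help | grep -i 'reject-regex'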

Solution 4

Pavuk should be able to do it:

http://pavuk.sourceforge.net/man.html#sect39

MediaWiki example:

[...]

-skip_url_pattern 'oldid=, action=edit, action=history, diff=, limit=, [/=]User:, [/=]User_talk:, [^p]/Special:, =Special:[^R], .php/Special:[^LUA][^onl][^nul], MediaWiki:, Search:, Help:'

[...]
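
Applied to the wiki in the question, a minimal invocation might look roughly like this (a sketch: -skip_url_pattern comes from the manual excerpt above, but the recursion/mirroring options Pavuk needs are omitted here, so check the manual for those):

pavuk -skip_url_pattern 'action=edit, action=history, oldid=, diff=' http://www.boinc-wiki.info/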

Solution 5

It looks like you are trying to avoid downloading MediaWiki's special pages. I once solved this problem by excluding the index.php page:

wget  -R '*index.php*'  -r ... <wiki link>

However, that wiki used Wikipedia-style URLs (http://<wiki>/en/Theme) rather than the pattern I have seen elsewhere (http://<wiki>/index.php?title=Theme). Since the link you gave uses Wikipedia-style URLs, I think this solution can work for you too.
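
Combining the flags from the question with this answer's index.php rejection might look like this (a sketch; note that, per the documentation quoted in Solution 2, -R patterns are matched against the local filename, not the query string):

wget -r -k -np -nv -R 'jpg,jpeg,gif,png,tif,*index.php*' http://www.boinc-wiki.info/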

Comments

  • Anna Reed
    Anna Reed over 1 year

    I'm trying to download two sites for inclusion on a CD:

    http://boinc.berkeley.edu/trac/wiki
    http://www.boinc-wiki.info
    

    The problem I'm having is that these are both wikis. So when downloading with e.g.:

    wget -r -k -np -nv -R jpg,jpeg,gif,png,tif http://www.boinc-wiki.info/
    

    The problem is that I get far too many files, because wget also follows links like ...?action=edit and ...?action=diff&version=...

    Does somebody know a way to get around this?

    I just want the current pages, without images, and without diffs etc.

    P.S.:

    wget -r -k -np -nv -l 1 -R jpg,jpeg,png,gif,tif,pdf,ppt http://boinc.berkeley.edu/trac/wiki/TitleIndex
    

    This worked for the Berkeley site, but boinc-wiki.info is still giving me trouble :/

    P.P.S:

    I got what appears to be the most relevant pages with:

    wget -r -k -nv  -l 2 -R jpg,jpeg,png,gif,tif,pdf,ppt http://www.boinc-wiki.info
    
    • Kcmamu
      Kcmamu almost 14 years
      No need to cross-post between Super User and Server Fault: superuser.com/questions/158318/…
    • Anna Reed
      Anna Reed almost 14 years
      Where should I have posted it?
  • Spence
    Spence almost 14 years
    This works on query strings? Every version of wget I've used only applies reject list patterns to the file portion of the URL. I'll give it a shot and see.
  • Joshua Enfield
    Joshua Enfield almost 14 years
    I haven't tested it; I just looked up the documentation. I did find that it uses shell globbing conventions, but your experience speaks more than mine as to how the matching actually works.
  • Spence
    Spence almost 14 years
    Escaping the "?" doesn't seem to get wget to do what the OP would like on my CentOS 5.3 box running wget 1.11.4.
  • Stefan Lasiewski
    Stefan Lasiewski almost 14 years
    +1 for pointing me to httrack. It looks better than wget, and wget is looking stagnant.
  • Anna Reed
    Anna Reed almost 14 years
    I've tried WinHTTrack, but it behaves oddly. It downloads files and traverses directories it should not :/
  • joeytwiddle
    joeytwiddle over 12 years
    Maybe one day wget will be fixed. For now httrack and pavuk both look good.