How to download with wget without following links with parameters
Solution 1
wget --reject-regex '(.*)\?(.*)' http://example.com
(--reject-type is posix by default.) According to other comments, this works only with recent (>= 1.14) versions of wget, though.
Beware that it seems you can use --reject-regex only once per wget call. That is, if you want to match several patterns, you have to combine them with | in a single regex:
wget --reject-regex 'expr1|expr2|…' http://example.com
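As a sketch of what that combined regex might look like for a MediaWiki site (the patterns below are illustrative assumptions, not from the original answer), you can keep the regex in a variable and sanity-check it locally with grep before handing it to wget:

```shell
# Hypothetical combined reject regex for MediaWiki-style query strings
# (edit/history/diff/oldid links are the usual noise on a wiki mirror):
REJECT='action=edit|action=history|diff=|oldid='

# The wget call would then be (example.com is a placeholder, not fetched here):
# wget -r --reject-regex "$REJECT" http://example.com/wiki/

# Verify locally that the regex rejects the URLs we want to skip
# and passes the ones we want to keep:
echo 'http://example.com/index.php?title=Main&action=edit' | grep -qE "$REJECT" && echo rejected
echo 'http://example.com/index.php?title=Main' | grep -qE "$REJECT" || echo kept
```

Since wget applies --reject-regex to the complete URL, testing the pattern against sample URLs this way is a cheap check before a long recursive crawl.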
Solution 2
The documentation for wget says:
Note, too, that query strings (strings at the end of a URL beginning with a question mark (‘?’)) are not included as part of the filename for accept/reject rules, even though these will actually contribute to the name chosen for the local file. It is expected that a future version of Wget will provide an option to allow matching against query strings.
It looks like this functionality has been on the table for a while, and nothing has been done about it.
I haven't used it, but httrack looks like it has a more robust filtering feature set than wget and may be a better fit for what you're looking for (its filters are documented at http://www.httrack.com/html/fcguide.html).
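I haven't verified these exact rules, but based on httrack's documented filter syntax (glob-style scan rules where a leading '-' excludes and '+' includes, with -O naming the output directory), a call that skips the wiki's edit/diff query-string URLs might look like the following. The URL and patterns are illustrative, so the command is only printed here, not executed:

```shell
# Sketch of an httrack invocation excluding edit/diff URLs.
# The filter patterns and target URL are illustrative assumptions.
cmd='httrack "http://www.boinc-wiki.info/" -O ./boinc-mirror "-*action=*" "-*diff=*"'
echo "$cmd"
```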
Solution 3
The new version of wget (v1.14) solves all these problems. You have to use the new option --reject-regex=… to handle query strings.
Note that I couldn't find a new manual that includes these new options, so you have to use the help command: wget --help > help.txt
Solution 4
Pavuk should be able to do it:
http://pavuk.sourceforge.net/man.html#sect39
Mediawiki example:
[...]
-skip_url_pattern 'oldid=, action=edit, action=history, diff=, limit=, [/=]User:, [/=]User_talk:, [^p]/Special:, =Special:[^R], .php/Special:[^LUA][^onl][^nul], MediaWiki:, Search:, Help:'
[...]
Solution 5
It looks like you are trying to avoid downloading the special pages of MediaWiki. I once solved this problem by avoiding the index.php page:
wget -R '*index.php*' -r ... <wiki link>
However, that wiki used URLs as seen on Wikipedia (http://<wiki>/en/Theme) and not the pattern I have seen in other places (http://<wiki>/index.php?title=Theme). Since the link you gave uses URLs in the Wikipedia pattern, though, I think this solution can work for you too.
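Since -R matches shell-style glob patterns against the local file name (not the full URL), it can help to check locally which names a pattern like '*index.php*' would reject. The helper function below is my own illustration, not part of wget:

```shell
# wget -R/--reject takes comma-separated suffixes or glob patterns matched
# against the file name. This hypothetical helper mimics how the
# '*index.php*' glob from the answer would classify candidate file names:
matches() { case "$1" in *index.php*) echo reject ;; *) echo keep ;; esac; }

matches 'index.php?title=Theme&action=edit'   # → reject
matches 'Theme'                               # → keep
```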
Anna Reed
Updated on September 17, 2022
Comments
-
Anna Reed over 1 year
I'm trying to download two sites for inclusion on a CD:
http://boinc.berkeley.edu/trac/wiki http://www.boinc-wiki.info
The problem I'm having is that these are both wikis. So when downloading with e.g.:
wget -r -k -np -nv -R jpg,jpeg,gif,png,tif http://www.boinc-wiki.info/
I do get a lot of unwanted files because wget also follows links like ...?action=edit and ...?action=diff&version=...
Does somebody know a way to get around this?
I just want the current pages, without images, and without diffs etc.
P.S.:
wget -r -k -np -nv -l 1 -R jpg,jpeg,png,gif,tif,pdf,ppt http://boinc.berkeley.edu/trac/wiki/TitleIndex
This worked for Berkeley, but boinc-wiki.info is still giving me trouble :/
P.P.S.:
I got what appears to be the most relevant pages with:
wget -r -k -nv -l 2 -R jpg,jpeg,png,gif,tif,pdf,ppt http://www.boinc-wiki.info
-
Kcmamu almost 14 years: No need to cross-post between Super User and Server Fault: superuser.com/questions/158318/…
-
Anna Reed almost 14 years: Where should I have posted it?
-
Spence almost 14 years: This works on query strings? Every version of wget I've used only applies reject list patterns to the file portion of the URL. I'll give it a shot and see.
-
Joshua Enfield almost 14 years: I haven't tested it; I just looked up the documentation. I did find that it uses shell conventions, but your experience speaks more than mine as to how the matching works in practice.
-
Spence almost 14 years: Escaping the "?" doesn't seem to get wget to do what the OP would like on my CentOS 5.3 box running wget 1.11.4.
-
Stefan Lasiewski almost 14 years: +1 for pointing me to httrack. It looks better than wget, and wget is looking stagnant.
-
Anna Reed almost 14 years: I've tried WinHTTrack, but it behaves funny. It downloads files and traverses directories it should not :/
-
joeytwiddle over 12 years: Maybe one day wget will be fixed. For now httrack and pavuk both look good.