How to download with wget without following links with parameters

Solution 1

wget --reject-regex '(.*)\?(.*)' http://example.com

(The regex type is POSIX by default; see --regex-type.) This works only with recent versions of wget (>= 1.14), according to other comments.

Beware that --reject-regex can apparently be used only once per wget call. If you want to match several patterns, combine them with | in a single regex:

wget --reject-regex 'expr1|expr2|…' http://example.com
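
Applied to the wiki in the question, a combined regex might look roughly like this (a sketch, assuming wget >= 1.14; the exact alternatives to reject depend on which query strings the wiki actually uses):

wget -r -k -np -nv --reject-regex 'action=edit|action=history|action=diff|oldid=|diff=' http://www.boinc-wiki.info/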

Solution 2

The documentation for wget says:

Note, too, that query strings (strings at the end of a URL beginning with a question mark (‘?’)) are not included as part of the filename for accept/reject rules, even though these will actually contribute to the name chosen for the local file. It is expected that a future version of Wget will provide an option to allow matching against query strings.

It looks like this functionality has been on the table for a while and nothing has been done with it.

I haven't used it, but httrack looks like it has a more robust filtering feature set than wget and may be a better fit for what you're looking for (read about its filters at http://www.httrack.com/html/fcguide.html).
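
As a rough sketch of what that could look like with httrack for the wiki in the question (the output directory and the filter patterns here are placeholders; see the filter guide linked above for the exact scan-rule syntax):

httrack 'http://www.boinc-wiki.info/' -O ./boinc-wiki '-*action=edit*' '-*action=history*' '-*oldid=*'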

Solution 3

A newer version of wget (1.14) solves this problem.

You have to use the new --reject-regex=… option to handle query strings.

Note that I couldn't find an updated manual covering these new options, so you have to rely on the built-in help: wget --help > help.txt
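
To check whether your build supports the option before relying on it, something like this should do (a quick sketch using standard tools):

wget --version | head -n 1
wget --help | grep -i 'reject-regex'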

Solution 4

Pavuk should be able to do it:

http://pavuk.sourceforge.net/man.html#sect39

MediaWiki example:

[...]

-skip_url_pattern 'oldid=, action=edit, action=history, diff=, limit=, [/=]User:, [/=]User_talk:, [^p]/Special:, =Special:[^R], .php/Special:[^LUA][^onl][^nul], MediaWiki:, Search:, Help:'

[...]
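
Applied to the wiki in the question, a minimal invocation might look roughly like this (a sketch: -skip_url_pattern comes from the manual excerpt above, but the recursion/mirroring options Pavuk needs are omitted here, so check the manual for those):

pavuk -skip_url_pattern 'action=edit, action=history, oldid=, diff=' http://www.boinc-wiki.info/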

Solution 5

It looks like you are trying to avoid downloading MediaWiki's special pages. I once solved this problem by excluding the index.php page:

wget  -R '*index.php*'  -r ... <wiki link>

However, that wiki used Wikipedia-style URLs (http://<wiki>/en/Theme) rather than the pattern I have seen elsewhere (http://<wiki>/index.php?title=Theme). Since the link you gave uses Wikipedia-style URLs, I think this solution can work for you too.
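
Combining the flags from the question with this answer's index.php rejection might look like this (a sketch; note that, per the documentation quoted in Solution 2, -R patterns are matched against the local filename, not the query string):

wget -r -k -np -nv -R 'jpg,jpeg,gif,png,tif,*index.php*' http://www.boinc-wiki.info/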

Comments

  • Anna Reed
    Anna Reed over 1 year

    I'm trying to download two sites for inclusion on a CD:

    http://boinc.berkeley.edu/trac/wiki
    http://www.boinc-wiki.info
    

    The problem I'm having is that these are both wikis. So when downloading with e.g.:

    wget -r -k -np -nv -R jpg,jpeg,gif,png,tif http://www.boinc-wiki.info/
    

    The problem is that I get far too many files, because wget also follows links like ...?action=edit and ...?action=diff&version=...

    Does somebody know a way to get around this?

    I just want the current pages, without images, and without diffs etc.

    P.S.:

    wget -r -k -np -nv -l 1 -R jpg,jpeg,png,gif,tif,pdf,ppt http://boinc.berkeley.edu/trac/wiki/TitleIndex
    

    This worked for the Berkeley site, but boinc-wiki.info is still giving me trouble :/

    P.P.S:

    I got what appears to be the most relevant pages with:

    wget -r -k -nv  -l 2 -R jpg,jpeg,png,gif,tif,pdf,ppt http://www.boinc-wiki.info
    
    • Kcmamu
      Kcmamu almost 14 years
      No need to cross-post between Super User and Server Fault: superuser.com/questions/158318/…
    • Anna Reed
      Anna Reed almost 14 years
      Where should I have posted it?
  • Spence
    Spence almost 14 years
    This works on query strings? Every version of wget I've used only applies reject list patterns to the file portion of the URL. I'll give it a shot and see.
  • Joshua Enfield
    Joshua Enfield almost 14 years
    I haven't tested it; I just looked up the documentation. I did find that it uses shell globbing conventions, but your experience speaks more than mine as to how the matching actually works.
  • Spence
    Spence almost 14 years
    Escaping the "?" doesn't seem to get wget to do what the OP would like on my CentOS 5.3 box running wget 1.11.4.
  • Stefan Lasiewski
    Stefan Lasiewski almost 14 years
    +1 for pointing me to httrack. It looks better than wget, and wget is looking stagnant.
  • Anna Reed
    Anna Reed almost 14 years
    I've tried WinHTTrack, but it behaves oddly. It downloads files and traverses directories it should not :/
  • joeytwiddle
    joeytwiddle over 12 years
    Maybe one day wget will be fixed. For now httrack and pavuk both look good.