How to download with wget without following links with parameters

Solution 1

The new version of wget (v1.14) solves these problems.

You have to use the new option --reject-regex=... to handle query strings.

Note that I couldn't find a manual that includes these new options, so you have to use the built-in help: wget --help > help.txt
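
For example (a minimal sketch reusing the flags from the question; the pattern is not tested against those exact sites), rejecting every URL that contains a query string would look like this:

wget -r -k -np -nv -R jpg,jpeg,gif,png,tif --reject-regex '\?' http://www.boinc-wiki.info/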

Solution 2

wget --reject-regex '(.*)\?(.*)' http://example.com

(--regex-type posix by default.) This only works for recent (>= 1.14) versions of wget, though, according to other comments.

Beware that it seems you can use --reject-regex only once per wget call. That is, you have to use | in a single regex if you want to match several patterns:

wget --reject-regex 'expr1|expr2|…' http://example.com
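
Applied to the wikis in the question, for example, a single alternation can reject the edit and diff links (a sketch; the query-string names are taken from the question and may differ on other wikis):

wget -r -k -np -nv --reject-regex 'action=edit|action=diff' http://www.boinc-wiki.info/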

Comments

  • Tie-fighter over 1 year

    I'm trying to download two sites for inclusion on a CD:

    http://boinc.berkeley.edu/trac/wiki
    http://www.boinc-wiki.info
    

    The problem I'm having is that these are both wikis. So when downloading with e.g.:

    wget -r -k -np -nv -R jpg,jpeg,gif,png,tif http://www.boinc-wiki.info/
    

    I get a lot of unwanted files because wget also follows links like ...?action=edit and ...?action=diff&version=...

    Does somebody know a way to get around this?

    I just want the current pages, without images, and without diffs etc.

    P.S.:

    wget -r -k -np -nv -l 1 -R jpg,jpeg,png,gif,tif,pdf,ppt http://boinc.berkeley.edu/trac/wiki/TitleIndex
    

    This worked for berkeley but boinc-wiki.info is still giving me trouble :/

    P.P.S:

    I got what appears to be the most relevant pages with:

    wget -r -k -nv  -l 2 -R jpg,jpeg,png,gif,tif,pdf,ppt http://www.boinc-wiki.info
    
    • Bryan almost 14 years
      No need to cross post between superuser and serverfault serverfault.com/questions/156045/…
    • Tie-fighter almost 14 years
      Where should I have posted it?
    • David Z almost 14 years
      This is the right place. It's not a server question.
    • Tie-fighter almost 14 years
      Still I got the better answers at serverfault ;)
  • Tie-fighter almost 14 years
    "Note, too, that query strings (strings at the end of a URL beginning with a question mark (‘?’) are not included as part of the filename for accept/reject rules, even though these will actually contribute to the name chosen for the local file. It is expected that a future version of Wget will provide an option to allow matching against query strings."
  • Daisetsu almost 14 years
    Hmm, I must have missed that. It looks like you can't do this with wget then if it doesn't even know that they are different files. I suggest a different program.
  • barlop over 10 years
    There is -w seconds, e.g. -w 5. gnu.org/software/wget/manual/html_node/…
  • yunzen about 10 years
    Could be true about the version requirement. I had v1.12 and the option was not valid. After upgrade to v1.15 it was.
  • Amir Dadgari over 8 years
    Regex alternation using the | ("pipe") symbol isn't working for me with GNU Wget 1.16.
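
If alternation with the default posix engine misbehaves, as reported in the comment above, one thing to try (a sketch; it requires a wget build with PCRE support, and the pattern is only illustrative) is switching the regex engine:

wget --regex-type pcre --reject-regex 'action=(edit|diff)' http://example.com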