How can I make wget rename downloaded files to not include the query string?

Solution 1

If the server is kind, it might be sticking a Content-Disposition header on the download advising your client of the correct filename. Telling wget to listen to that header for the final filename is as simple as:

wget --content-disposition <URL>

You'll need wget 1.11 or later to use this feature; the 1.10.2 mentioned in the question predates it.
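Before relying on this, it may help to check whether the server sends the header at all. A minimal sketch, assuming the response headers are piped in on stdin; the function name has_content_disposition is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: succeed if the response headers on stdin include
# a Content-Disposition line (header names are case-insensitive).
has_content_disposition() {
    grep -qi '^[[:space:]]*content-disposition:'
}
```

One way to feed it is wget --spider --server-response <URL> 2>&1 | has_content_disposition && echo "server suggests a filename". wget prints the headers on stderr, hence the 2>&1. Some servers may only send the header on a real download rather than on the request that --spider makes, so a negative result is not conclusive.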

I have no idea how well it handles a server claiming a filename of '/etc/passwd'.

Solution 2

I realized after processing a large batch that I should have instructed wget to ignore the query strings. I did not want to do it over again so I made this script which worked for me:

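#!/bin/bash
# Rename every file under $1 (default: .), cutting the name at the first "?".
find "${1:-.}" -type f | while IFS= read -r i
do
    mv "$i" "$(echo "$i" | cut -d'?' -f1)"
done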

Put that in a file such as rmqstr and make it executable with chmod +x rmqstr. Syntax: ./rmqstr <directory> (the directory defaults to .)

It will remove the query strings from all filenames recursively.

Solution 3

I think, in order to get wget to save as a filename different than the URL specifies, you need to use the -O filename argument. That only does what you want when you give it a single URL -- with multiple URLs, all downloaded content ends up in filename.

But that's really the answer. Instead of trying to do it all in one wget command, use multiple commands. Now your workflow becomes:

  1. Run wget to get the base HTML file(s) containing your links;
  2. Parse for URLs;
  3. For each URL ending in mp3:
    1. process the URL to get a filename (e.g. turn http://foo/bar/baz.mp3?gargle=blaster into baz.mp3)
    2. (optional) check that filename doesn't exist
    3. run wget <URL> -O <filename>
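Step 3 of the workflow above can be sketched roughly as follows. This assumes the URLs from step 2 arrive one per line on stdin (link extraction is not shown), and the function names url_to_filename and fetch_mp3s are hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: derive a local filename from a URL by dropping the
# query string and then everything up to the last "/".
url_to_filename() {
    local url=$1
    url=${url%%\?*}               # drop "?..." (the query string)
    printf '%s\n' "${url##*/}"    # keep only the last path component
}

# Hypothetical helper: read URLs from stdin, one per line, and download
# each mp3 link under its cleaned-up name.
fetch_mp3s() {
    local url name
    while IFS= read -r url; do
        name=$(url_to_filename "$url")
        case $name in
            *.mp3) ;;             # only handle mp3 links
            *) continue ;;
        esac
        [ -e "$name" ] && continue    # optional: skip names that already exist
        wget -nv "$url" -O "$name"
    done
}
```

With the extracted links in a file (say urls.txt, also an assumption), this would be run as fetch_mp3s < urls.txt. Because each wget call gets its own -O, the clean name is what appears while the download runs.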

That solves your problem, but now you need to figure out how to grab the base files to find your mp3 URLs.

Do you have a particular site/base URL in mind? Steps 1 and 3 will be easier to handle with a concrete example.

Solution 4

My approach is similar to @Gregory Wolf's, but his code printed an error like this for every file without a query string:

mv: './file' and './file' are the same file

Thus I first check if there is a query string in the filename before moving the file:

# Skip names without a "?"; otherwise strip from the first "?" onward.
find "${1:-.}" -type f | while IFS= read -r f; do
    if [ "$f" = "${f%%\?*}" ]; then continue; fi
    mv "$f" "${f%%\?*}"
done

This recursively checks every file and removes the query string from its filename if one is present.
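A variant of the rename loop that also copes with unusual characters in filenames, and that refuses to overwrite when two query strings map to the same name (for example 1.mp3?a=1 and 1.mp3?a=2), might look like this. It is a sketch only, and the function name strip_query_names is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: strip "?..." from the name of every file under a
# directory (default: .), without clobbering existing files.
strip_query_names() {
    local dir=${1:-.} f base target
    # -print0 with read -d '' keeps names containing spaces or newlines intact.
    find "$dir" -type f -name '*\?*' -print0 |
    while IFS= read -r -d '' f; do
        base=${f##*/}                     # filename without its directory
        target=${f%/*}/${base%%\?*}       # same directory, query string removed
        if [ -e "$target" ]; then
            echo "skipping $f: $target already exists" >&2
            continue
        fi
        mv -- "$f" "$target"
    done
}
```

Called as strip_query_names <directory>, it touches only the last path component, so a "?" in a directory name is left alone.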

Solution 5

I use the following two commands to clone a site: run the first to do the clone, then run the second once it has finished.

The second command searches the whole clone for filenames containing "?" and removes the query string from each name.

# Clone entire site.
wget --content-disposition --execute robots=off --recursive --no-parent --continue --no-clobber http://example.com

# Remove query string from a static resource.
find "${1:-.}" -type f -name '*\?*' | while IFS= read -r i; do mv "$i" "${i%%\?*}"; done

(See it in GitHub Gist.)

Author: Keith Twombley

Updated on September 17, 2022

Comments

  • Keith Twombley
    Keith Twombley over 1 year

    I'm downloading a site with wget and a lot of the links have queries attached to them, so when I do this:

    wget -nv -c -r -H -A mp3 -nd http://url.to.old.podcasts.com/
    

    I end up with a lot of files like this:

    1.mp3?foo=bar
    2.mp3?blatz=pow
    3.mp3?fizz=buzz
    

    What I'd like to end up with is:

    1.mp3
    2.mp3
    3.mp3
    

    This is all taking place in ubuntu linux and I've got wget 1.10.2.

    I know I can do this after I get everything via a script to rename everything. However I'd really like a solution from within wget so I can see the correct names as the download is happening.

    Can anyone help me unravel this?

    • Deniz Zoeteman
      Deniz Zoeteman over 14 years
      Post your question at www.stackoverflow.com.
    • quack quixote
      quack quixote over 14 years
      @TutorialPoint why? question is looking for a within-wget-way-to-do-it, SO would just migrate it back here.
    • Walter Kiess
      Walter Kiess over 14 years
      Well, there is no within-wget-way-to-do-it
    • quack quixote
      quack quixote over 14 years
      @ayrnieu: not in one command, no. and not without a helper. but you can certainly do it with as few as n+1 wget commands (if not fewer).
  • Michael Mior
    Michael Mior over 9 years
    This somewhat solves the issue of the filenames being displayed, but the OP also wants the final file name not to have the query string.
  • Ramhound
    Ramhound over 8 years
    Can you please quote the relevant information from the link, so we know which material, you believe answers this question.
  • Arkadiusz 'flies' Rzadkowolski
    Arkadiusz 'flies' Rzadkowolski about 5 years
    I would add ` -name "\?"` to find part to limit only to needed files :)
  • No name
    No name about 5 years
    I have no problem with this answer, as it no doubt works for some situations. Unfortunately, it didn't work for me with respect to some cloudfront-served pages with ?v=blah type versioning in them. There may be some cloudfront-specific way to request a document without these, I don't know, but I failed to find one, so something like one of the other answers may well be necessary in such a case. (If anyone knows of a way to strip - or get Cloudfront not to serve - the v= strings, I'd love to hear about it.)
  • Luis David
    Luis David over 3 years
    This should be the accepted answer as it does exactly what is required.
  • boldnik
    boldnik over 3 years
    Manual says: If this is set to on, experimental (not fully-functional) support for "Content-Disposition" headers is enabled. This can currently result in extra round-trips to the server for a "HEAD" request, and is known to suffer from a few bugs, which is why it is not currently enabled by default. This option is useful for some file-downloading CGI programs that use "Content-Disposition" headers to describe what the name of a downloaded file should be. I don't think it's 100% reliable solution.
  • boldnik
    boldnik over 3 years
    minimalism in its beauty :)