How can I make wget rename downloaded files to not include the query string?
Solution 1
If the server is kind, it might be sticking a Content-Disposition header on the download advising your client of the correct filename. Telling wget to listen to that header for the final filename is as simple as:
wget --content-disposition <URL>
You'll need a newish version of wget to use this feature.
I have no idea how well it handles a server claiming a filename of '/etc/passwd'.
Solution 2
I realized after processing a large batch that I should have instructed wget to ignore the query strings. I did not want to do it all over again, so I made this script, which worked for me:
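#!/bin/bash
# Strip everything from the first "?" in each file name under $1 (default: .).
find "${1:-.}" -type f -print0 |
while IFS= read -r -d '' i
do
    mv -- "$i" "${i%%\?*}"
done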
Put that in a file like rmqstr and chmod +x rmqstr.
Syntax: ./rmqstr <directory (defaults to .)>
It will remove the query strings from all filenames recursively.
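Before renaming a large batch it can help to preview what would change. Here is a minimal sketch of a dry run; the function name preview_rmqstr is made up for illustration and is not part of the script above. It only prints the planned renames and flags the cases where the target name already exists.

```shell
# preview_rmqstr DIR: print the renames that would happen, without touching any file.
# (Hypothetical helper; DIR defaults to the current directory.)
preview_rmqstr() {
    # -exec ... {} + hands the names over as arguments, so spaces survive intact.
    find "${1:-.}" -type f -name '*\?*' -exec sh -c '
        for f do
            target=${f%%\?*}    # drop everything from the first "?"
            if [ -e "$target" ]; then
                printf "SKIP (target exists): %s\n" "$f"
            else
                printf "%s -> %s\n" "$f" "$target"
            fi
        done' sh {} +
}
```

Usage: preview_rmqstr ./downloads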
Solution 3
I think that, to get wget to save under a filename different from the one the URL specifies, you need to use the -O filename argument. That only does what you want when you give it a single URL; with multiple URLs, all downloaded content ends up in filename.
But that's really the answer. Instead of trying to do it all in one wget command, use multiple commands. Now your workflow becomes:
- Run wget to get the base HTML file(s) containing your links;
- Parse for URLs;
- For each URL ending in mp3,
  - process the URL to get a filename (e.g. turn http://foo/bar/baz.mp3?gargle=blaster into baz.mp3)
  - (optional) check that the filename doesn't exist
  - run wget <URL> -O <filename>
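The per-URL part of that workflow could be sketched roughly as follows. This is only an illustration under some assumptions: the URLs arrive on standard input, one per line, and the names url_to_filename and fetch_all are invented here. Extracting the URLs from the base HTML is not covered.

```shell
# url_to_filename URL: print the last path component, minus any query string or fragment.
url_to_filename() {
    local name
    name="${1%%[?#]*}"    # strip "?query" and "#fragment"
    name="${name##*/}"    # keep only what follows the last "/"
    printf '%s\n' "$name"
}

# fetch_all: read URLs from stdin and download each mp3 under its clean name.
fetch_all() {
    local url file
    while IFS= read -r url; do
        # Only handle URLs whose path ends in .mp3, with or without a query string.
        case $url in
            *.mp3|*.mp3\?*) ;;
            *) continue ;;
        esac
        file=$(url_to_filename "$url")
        [ -n "$file" ] || continue
        if [ -e "$file" ]; then
            echo "skipping existing $file" >&2
            continue
        fi
        wget -nv -O "$file" "$url"
    done
}
```

Usage, assuming a file urls.txt that holds the parsed links: fetch_all < urls.txt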
That solves your problem, but now you need to figure out how to grab the base files to find your mp3 URLs.
Do you have a particular site/base URL in mind? Steps 1 and 3 will be easier to handle with a concrete example.
Solution 4
My approach is similar to @Gregory Wolf's, but his code always produced error messages like this:
mv: './file' and './file' are the same file
Thus I first check if there is a query string in the filename before moving the file:
find "${1:-.}" -type f -print0 | while IFS= read -r -d '' f; do
  if [ "$f" = "${f%%\?*}" ]; then continue; fi
  mv -- "${f}" "${f%%\?*}"
done
This will recursively check every file and remove the query string from its filename if one is present.
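In case the ${f%%\?*} expansion is unfamiliar: it removes the longest suffix that matches "?*", meaning everything from the first question mark on. A small demonstration with made-up file names:

```shell
f='./podcasts/1.mp3?foo=bar?x=y'

echo "${f%%\?*}"   # longest match removed:  ./podcasts/1.mp3
echo "${f%\?*}"    # shortest match removed: ./podcasts/1.mp3?foo=bar

# A name without "?" comes back unchanged, which is what the check above relies on.
g='./podcasts/2.mp3'
[ "$g" = "${g%%\?*}" ] && echo "no query string, nothing to rename"
```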
Solution 5
Here are two commands I created to clone a site. Once the clone is done, run the second command.
The second command searches the entire clone for file names containing "?" and removes the query string from each of them.
# Clone entire site.
wget --content-disposition --execute robots=off --recursive --no-parent --continue --no-clobber http://example.com
# Remove query string from a static resource.
find "${1:-.}" -type f -name '*\?*' -exec sh -c 'for f do mv -- "$f" "${f%%\?*}"; done' sh {} +
(See it in GitHub Gist.)
Keith Twombley
Updated on September 17, 2022
Comments
-
Keith Twombley over 1 year
I'm downloading a site with wget and a lot of the links have queries attached to them, so when I do this:
wget -nv -c -r -H -A mp3 -nd http://url.to.old.podcasts.com/
I end up with a lot of files like this:
1.mp3?foo=bar
2.mp3?blatz=pow
3.mp3?fizz=buzz
What I'd like to end up with is:
1.mp3
2.mp3
3.mp3
This is all taking place on Ubuntu Linux, and I've got wget 1.10.2.
I know I can do this after I get everything via a script to rename everything. However I'd really like a solution from within wget so I can see the correct names as the download is happening.
Can anyone help me unravel this?
-
Deniz Zoeteman over 14 years
Post your question at www.stackoverflow.com.
-
quack quixote over 14 years
@TutorialPoint why? question is looking for a within-wget-way-to-do-it, SO would just migrate it back here.
-
Walter Kiess over 14 years
Well, there is no within-wget-way-to-do-it
-
quack quixote over 14 years
@ayrnieu: not in one command, no. and not without a helper. but you can certainly do it with as few as n+1 wget commands (if not fewer).
-
Michael Mior over 9 years
This somewhat solves the issue of the filenames being displayed, but the OP also wants the final file name not to have the query string.
-
Ramhound over 8 years
Can you please quote the relevant information from the link, so we know which material you believe answers this question?
-
Arkadiusz 'flies' Rzadkowolski about 5 years
I would add ` -name "*\?*"` to the find part to limit it to only the needed files :)
-
No name about 5 years
I have no problem with this answer, as it no doubt works for some situations. Unfortunately, it didn't work for me with respect to some cloudfront-served pages with ?v=blah type versioning in them. There may be some cloudfront-specific way to request a document without these, I don't know, but I failed to find one, so something like one of the other answers may well be necessary in such a case. (If anyone knows of a way to strip - or get Cloudfront not to serve - the v= strings, I'd love to hear about it.)
-
Luis David over 3 years
This should be the accepted answer, as it does exactly what is required.
-
boldnik over 3 years
Manual says:
If this is set to on, experimental (not fully-functional) support for "Content-Disposition" headers is enabled. This can currently result in extra round-trips to the server for a "HEAD" request, and is known to suffer from a few bugs, which is why it is not currently enabled by default. This option is useful for some file-downloading CGI programs that use "Content-Disposition" headers to describe what the name of a downloaded file should be.
I don't think it's a 100% reliable solution.
-
boldnik over 3 years
minimalism in its beauty :)