How to download a file using Python in a 'smarter' way?
Solution 1
Download scripts like that tend to push a header telling the user-agent what to name the file:
Content-Disposition: attachment; filename="the filename.ext"
If you can grab that header, you can get the proper filename.
There's another thread that has a little bit of code to offer up for Content-Disposition-grabbing.
import urllib2
remotefile = urllib2.urlopen('http://example.com/somefile.zip')
# Raises KeyError if the server did not send the header
remotefile.info()['Content-Disposition']
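As a minimal sketch of pulling the name out of that header, assuming Python 3 (where urllib2 became urllib.request and response headers are email.message.Message objects), the standard library's get_filename() can do the parsing. The function name filename_from_header is hypothetical, and the header value is the example shown above:

```python
from email.message import Message


def filename_from_header(content_disposition):
    """Return the filename parameter of a Content-Disposition value, or None."""
    msg = Message()
    msg['Content-Disposition'] = content_disposition
    # get_filename() handles the quoting around the value for us
    return msg.get_filename()


print(filename_from_header('attachment; filename="the filename.ext"'))
```

With a live Python 3 response the same parser should be reachable as urllib.request.urlopen(url).headers.get_filename().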
Solution 2
Based on comments and @Oli's answer, I made a solution like this:
import urllib2
from os.path import basename
from urlparse import urlsplit

def url2name(url):
    return basename(urlsplit(url)[2])

def download(url, localFileName = None):
    localName = url2name(url)
    req = urllib2.Request(url)
    r = urllib2.urlopen(req)
    if r.info().has_key('Content-Disposition'):
        # If the response has Content-Disposition, we take file name from it
        localName = r.info()['Content-Disposition'].split('filename=')[1]
        if localName[0] == '"' or localName[0] == "'":
            localName = localName[1:-1]
    elif r.url != url:
        # if we were redirected, the real file name we take from the final URL
        localName = url2name(r.url)
    if localFileName:
        # we can force to save the file as specified name
        localName = localFileName
    f = open(localName, 'wb')
    f.write(r.read())
    f.close()
It takes the file name from Content-Disposition; if that header is not present, it uses the file name from the URL (if a redirect happened, the final URL is used). A name passed in as localFileName overrides both.
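The priority order described here can be sketched as a pure function that needs no network access. This is an illustration in Python 3 (urllib.parse instead of urlparse), and choose_name with its parameters is a hypothetical helper, not part of the answer. Like the answer's code, it does not handle extra parameters that follow filename= in the header:

```python
from os.path import basename
from urllib.parse import urlsplit


def url2name(url):
    """Last path segment of the URL, ignoring query string and fragment."""
    return basename(urlsplit(url).path)


def choose_name(final_url, content_disposition=None, forced_name=None):
    """Pick a local file name: forced name, then header, then final URL."""
    if forced_name:
        return forced_name
    if content_disposition and 'filename=' in content_disposition:
        name = content_disposition.split('filename=')[1].strip('\'"')
        if name:
            return name
    # No usable header: fall back to the URL we ended up at after redirects
    return url2name(final_url)


print(choose_name('http://server.com/!Run.aspx/someoddtext/somemore?id=121&m=pdf',
                  'attachment; filename="accounts.pdf"'))
```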
Solution 3
Combining much of the above, here is a more pythonic solution:
import urllib2
import shutil
import urlparse
import os

def download(url, fileName=None):
    def getFileName(url,openUrl):
        if 'Content-Disposition' in openUrl.info():
            # If the response has Content-Disposition, try to get filename from it
            # (split on the first '=' only, so a '=' inside the name does not break dict())
            cd = dict(map(
                lambda x: x.strip().split('=', 1) if '=' in x else (x.strip(),''),
                openUrl.info()['Content-Disposition'].split(';')))
            if 'filename' in cd:
                filename = cd['filename'].strip("\"'")
                if filename: return filename
        # if no filename was found above, parse it out of the final URL.
        return os.path.basename(urlparse.urlsplit(openUrl.url)[2])

    r = urllib2.urlopen(urllib2.Request(url))
    try:
        fileName = fileName or getFileName(url,r)
        with open(fileName, 'wb') as f:
            shutil.copyfileobj(r,f)
    finally:
        r.close()
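For readers on Python 3, where urllib2 no longer exists, a rough port might look like the following. It is a sketch rather than the answer author's code: it assumes urllib.request and urllib.parse, renames the helpers to get_file_name and file_name, and lets the standard library's get_filename() parse the header instead of the hand-built dict.

```python
import os
import shutil
from urllib.parse import urlsplit
from urllib.request import urlopen


def get_file_name(response):
    """Prefer the Content-Disposition filename, else the final URL's last path segment."""
    name = response.headers.get_filename()
    if name:
        return name
    return os.path.basename(urlsplit(response.url).path)


def download(url, file_name=None):
    """Stream url to disk and return the local file name that was used."""
    with urlopen(url) as response:
        file_name = file_name or get_file_name(response)
        with open(file_name, 'wb') as f:
            # copyfileobj reads and writes in chunks, so the body never sits in memory whole
            shutil.copyfileobj(response, f)
    return file_name
```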
Solution 4
@kender:
if localName[0] == '"' or localName[0] == "'":
    localName = localName[1:-1]
This is not safe: the web server can send a badly formatted name such as ["file.ext] or [file.ext'], or even an empty one, in which case localName[0] raises an exception. Correct code could look like this:
localName = localName.replace('"', '').replace("'", "")
if localName == '':
    localName = SOME_DEFAULT_FILE_NAME
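To see how that replacement behaves on the malformed inputs mentioned, here is a small self-contained sketch. The helper clean_name and the fallback value 'download.bin' are hypothetical stand-ins for whatever SOME_DEFAULT_FILE_NAME would be in a real script:

```python
DEFAULT_FILE_NAME = 'download.bin'  # hypothetical fallback name


def clean_name(raw_name, default=DEFAULT_FILE_NAME):
    """Drop any quote characters and fall back to a default if nothing is left."""
    name = raw_name.replace('"', '').replace("'", '')
    return name if name else default


# Well-formed, half-quoted, and empty header values
for raw in ['"file.ext"', '"file.ext', "file.ext'", '']:
    print(repr(raw), '->', clean_name(raw))
```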
kender
Updated on July 05, 2022
Comments
-
kender almost 2 years
I need to download several files via http in Python.
The most obvious way to do it is just using urllib2:
import urllib2
u = urllib2.urlopen('http://server.com/file.html')
localFile = open('file.html', 'w')
localFile.write(u.read())
localFile.close()
But I'll have to deal with URLs that are nasty in some way, say like this:
http://server.com/!Run.aspx/someoddtext/somemore?id=121&m=pdf
When downloaded via the browser, the file has a human-readable name, e.g. accounts.pdf.
Is there any way to handle that in Python, so I don't need to know the file names and hardcode them into my script?
-
Oli about 15 years
No, they might be redirecting to a plain file. But if it's like most download scripts, they're pushing the Content-Disposition header. By all means check.
-
kender about 15 years
If it redirects me to a plain file, it's easy too: I can access the actual URL via remotefile.url, can't I?
-
u0b34a0f6ae almost 15 years
I found this useful. But to download bigger files without holding their full content in memory, I had to find this out: copy your 'r' to 'f' with shutil.
import shutil
shutil.copyfileobj(r, f)
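A minimal sketch of that chunked copy, with an in-memory io.BytesIO standing in for the HTTP response so it runs without a network. The helper save_stream, the 64 KiB chunk size, and the output path are all assumptions made for illustration:

```python
import io
import os
import shutil
import tempfile


def save_stream(source, path, chunk_size=64 * 1024):
    """Copy a file-like object to path chunk by chunk instead of read()-ing it whole."""
    with open(path, 'wb') as f:
        shutil.copyfileobj(source, f, chunk_size)


fake_response = io.BytesIO(b'x' * 200000)  # stands in for the object urlopen() returns
target = os.path.join(tempfile.mkdtemp(), 'out.bin')
save_stream(fake_response, target)
print(os.path.getsize(target))
```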
-
fjsj about 12 years
Worked very well, but I would wrap urlsplit(url)[2] with a call to urllib.unquote, otherwise the filenames would be percent-encoded. Here is how I'm doing it:
return basename(urllib.unquote(urlsplit(url)[2]))
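The same idea in a self-contained form, assuming Python 3, where unquote lives in urllib.parse rather than urllib. The example URL is made up:

```python
from os.path import basename
from urllib.parse import unquote, urlsplit


def url2name(url):
    """Last path segment of the URL with percent-escapes decoded."""
    return basename(unquote(urlsplit(url).path))


print(url2name('http://example.com/files/annual%20report.pdf?id=121'))  # annual report.pdf
```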
-
koniiiik about 10 years
Even better: local_name.strip('\'"') -- that will only strip from the beginning and end and is also more succinct.
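The difference between the two approaches shows up when the name itself contains a quote character. A short illustration with a made-up header value:

```python
name = '"it\'s here.txt"'  # quoted value whose name contains an apostrophe

replaced = name.replace('"', '').replace("'", '')  # removes the inner apostrophe too
stripped = name.strip('\'"')                       # trims quotes at the two ends only

print(replaced)
print(stripped)
```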