How to download a file using Python in a 'smarter' way?

96,535

Solution 1

Download scripts like that tend to push a header telling the user-agent what to name the file:

Content-Disposition: attachment; filename="the filename.ext"

If you can grab that header, you can get the proper filename.

Another thread offers a little code for grabbing Content-Disposition:

import urllib2
remotefile = urllib2.urlopen('http://example.com/somefile.zip')
# The raw header value, e.g. 'attachment; filename="the filename.ext"'
remotefile.info()['Content-Disposition']
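
For readers on Python 3, where urllib2 has become urllib.request, the header value does not have to be split by hand: the standard library's email.message.Message can parse it. The sketch below is only an illustration. The helper name filename_from_header is made up here, and it assumes the header value is already available as a plain string.

```python
from email.message import Message


def filename_from_header(content_disposition):
    """Return the filename parameter of a Content-Disposition value, or None.

    Hypothetical helper for illustration; not part of the original answer.
    """
    msg = Message()
    msg['Content-Disposition'] = content_disposition
    # get_filename() reads the "filename" parameter and removes the quotes.
    return msg.get_filename()


print(filename_from_header('attachment; filename="the filename.ext"'))
```

If the header carries no filename parameter, the helper returns None, so the caller can fall back to a name derived from the URL.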

Solution 2

Based on comments and @Oli's answer, I made a solution like this:

import urllib2
from os.path import basename
from urlparse import urlsplit

def url2name(url):
    return basename(urlsplit(url)[2])

def download(url, localFileName = None):
    localName = url2name(url)
    req = urllib2.Request(url)
    r = urllib2.urlopen(req)
    if r.info().has_key('Content-Disposition'):
        # If the response has Content-Disposition, we take file name from it
        localName = r.info()['Content-Disposition'].split('filename=')[1]
        if localName[0] == '"' or localName[0] == "'":
            localName = localName[1:-1]
    elif r.url != url: 
        # if we were redirected, the real file name we take from the final URL
        localName = url2name(r.url)
    if localFileName: 
        # we can force to save the file as specified name
        localName = localFileName
    f = open(localName, 'wb')
    f.write(r.read())
    f.close()

It takes the file name from Content-Disposition; if that header is not present, it uses the file name from the URL (if a redirect happened, the final URL is used). A name passed explicitly as localFileName overrides both.
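
The order of precedence described above can be isolated into a small function that needs no network access. This is a sketch in Python 3 (urllib.parse instead of urlparse). The name pick_local_name and its parameters are hypothetical. Unlike the answer's code, it falls through to the URL when the header has no filename= part instead of raising an exception.

```python
from os.path import basename
from urllib.parse import urlsplit


def pick_local_name(requested_url, final_url, content_disposition=None,
                    forced_name=None):
    """Choose a local file name using the precedence described above.

    Hypothetical helper: the pieces of a response are passed in as plain
    arguments so the decision can be exercised without a network call.
    """
    if forced_name:
        # An explicit name from the caller always wins.
        return forced_name
    if content_disposition and 'filename=' in content_disposition:
        # Same simple parsing as the answer: the text after "filename=".
        name = content_disposition.split('filename=')[1].strip('"\'')
        if name:
            return name
    # Otherwise use the path of the final URL, which equals the requested
    # URL when no redirect happened.
    return basename(urlsplit(final_url or requested_url).path)
```

For example, pick_local_name('http://server.com/a?id=1', 'http://server.com/files/accounts.pdf') takes the name from the redirect target, because no header value and no forced name were supplied.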

Solution 3

Combining much of the above, here is a more Pythonic solution:

import urllib2
import shutil
import urlparse
import os

def download(url, fileName=None):
    def getFileName(url,openUrl):
        if 'Content-Disposition' in openUrl.info():
            # If the response has Content-Disposition, try to get filename from it
            cd = dict(map(
                lambda x: x.strip().split('=', 1) if '=' in x else (x.strip(),''),
                openUrl.info()['Content-Disposition'].split(';')))
            if 'filename' in cd:
                filename = cd['filename'].strip("\"'")
                if filename: return filename
        # if no filename was found above, parse it out of the final URL.
        return os.path.basename(urlparse.urlsplit(openUrl.url)[2])

    r = urllib2.urlopen(urllib2.Request(url))
    try:
        fileName = fileName or getFileName(url,r)
        with open(fileName, 'wb') as f:
            shutil.copyfileobj(r,f)
    finally:
        r.close()
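
The shutil.copyfileobj call in this solution is what keeps memory use flat: it reads from the response and writes to the file one chunk at a time instead of calling read() on the whole body. A minimal sketch of that behaviour, using in-memory streams as stand-ins for the HTTP response and the local file (both are assumptions made for illustration):

```python
import io
import shutil

# In-memory stand-ins for the HTTP response and the local file
# (assumptions for illustration only).
source = io.BytesIO(b'x' * 100000)
target = io.BytesIO()

# Copy in 16 KiB chunks; only one chunk is held in memory at a time.
shutil.copyfileobj(source, target, 16 * 1024)

print(len(target.getvalue()))
```

The third argument is the chunk size in bytes. Leaving it out uses the library default.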

Solution 4

To kender:

if localName[0] == '"' or localName[0] == "'":
    localName = localName[1:-1]

This is not safe: a web server can send a badly formatted name such as ["file.ext] or [file.ext'], or even an empty one, in which case localName[0] raises an exception. Correct code could look like this:

localName = localName.replace('"', '').replace("'", "")
if localName == '':
    localName = SOME_DEFAULT_FILE_NAME
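
The same fix can be wrapped in a function so that the fallback is applied in one place. This is a sketch only. clean_name and DEFAULT_FILE_NAME are hypothetical names, the default value is a placeholder, and the extra strip() of surrounding whitespace is an addition not present in the snippet above.

```python
DEFAULT_FILE_NAME = 'download.bin'  # assumed placeholder, pick your own


def clean_name(raw_name, default=DEFAULT_FILE_NAME):
    """Remove quote characters from a header-supplied name.

    Falls back to `default` when nothing usable is left.
    """
    cleaned = raw_name.replace('"', '').replace("'", '').strip()
    return cleaned or default


print(clean_name('"file.ext'))
print(clean_name(''))
```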
Author: kender

Programmer by heart, currently freelance developer on mostly iOS applications and Python web projects. Basic field of interest: Python coding, obj-c, iOS, Java, distributed databases, business usage of Python :) Also interested in usability and integration testing.

Updated on July 05, 2022

Comments

  • kender
    kender almost 2 years

    I need to download several files via http in Python.

    The most obvious way to do it is just using urllib2:

    import urllib2
    u = urllib2.urlopen('http://server.com/file.html')
    localFile = open('file.html', 'w')
    localFile.write(u.read())
    localFile.close()
    

    But I'll have to deal with URLs that are nasty in some way, such as this one: http://server.com/!Run.aspx/someoddtext/somemore?id=121&m=pdf. When downloaded via the browser, the file has a human-readable name, e.g. accounts.pdf.

    Is there any way to handle that in Python, so I don't need to know the file names and hardcode them into my script?

  • Oli
    Oli about 15 years
    No, they might be redirecting to a plain file. But if it's like most download scripts, they're pushing the content-disposition. By all means check.
  • kender
    kender about 15 years
    If it redirects me to a plain file it's easy too, I can access actual url via remotefile.url, can't I?
  • u0b34a0f6ae
    u0b34a0f6ae almost 15 years
    I found this useful. But to download bigger files without storing their full content in memory, I had to work this out: copy your 'r' to 'f' with import shutil; shutil.copyfileobj(r, f)
  • fjsj
    fjsj about 12 years
    Worked very well, but I would wrap urlsplit(url)[2] with a call to urllib.unquote, otherwise the file names would be percent-encoded. Here is how I'm doing it: return basename(urllib.unquote(urlsplit(url)[2]))
  • koniiiik
    koniiiik about 10 years
    Even better: local_name.strip('\'"') -- that will only strip from the beginning and end and is also more succinct.
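
The percent-decoding suggested in the comments can be sketched as follows. This version is written for Python 3, where unquote lives in urllib.parse rather than urllib. The example URL is made up for illustration.

```python
from os.path import basename
from urllib.parse import unquote, urlsplit


def url2name(url):
    """Return the last path segment of a URL with percent-escapes decoded."""
    # Decode first, then take the basename, so an encoded slash (%2F)
    # cannot leave a directory component in the result.
    return basename(unquote(urlsplit(url).path))


print(url2name('http://example.com/files/annual%20report.pdf'))
```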