Multithreading for faster downloading

Solution 1

This looks to me like the classic producer-consumer problem - see Wikipedia.

You may use something like this:

import Queue
import threading
import os
import urlparse
import urllib2
import lxml.html as html
from BeautifulSoup import BeautifulSoup

# create a Queue.Queue here
queue = Queue.Queue()

print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
  url = link.get('href')
  queue.put(url) # produce


def worker():
  while True:
    url = queue.get() # consume
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if not os.path.isdir(dirname):
      os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    open(full_path, 'wb').write(converted)
    print(name)
    queue.task_done()

for i in range(4):  # start a few consumer threads
  t = threading.Thread(target=worker)
  t.daemon = True
  t.start()

queue.join()  # block until every queued url has been processed

Solution 2

I use multiprocessing for parallelizing things -- for some reason I like it better than threading:

from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re
import multiprocessing


print ("downloading and parsing Bibles...")
def download_stuff(link):
  url = link.get('href')
  name = urlparse.urlparse(url).path.split('/')[-1]
  dirname = urlparse.urlparse(url).path.split('.')[-1]
  f = urllib2.urlopen(url)
  s = f.read()
  if not os.path.isdir(dirname):
    os.mkdir(dirname)
  soup = BeautifulSoup(s)
  articleTag = soup.html.body.article
  converted = str(articleTag)
  full_path = os.path.join(dirname, name)
  open(full_path, 'wb').write(converted)
  print(name)

root = html.parse(open('links.html'))
links = root.findall('//a')
pool = multiprocessing.Pool(processes=5)  # use 5 processes to download the data
output = pool.map(download_stuff, links)  # output is a list of [None, None, ...] since download_stuff doesn't return anything
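
Since the work here is mostly waiting on the network rather than using the CPU, the same Pool API is also available backed by threads via multiprocessing.dummy, which avoids pickling the arguments between processes. A minimal sketch, reusing download_stuff and links from above:

from multiprocessing.dummy import Pool  # same interface as multiprocessing.Pool, but backed by threads

pool = Pool(5)                           # 5 worker threads
output = pool.map(download_stuff, links)
pool.close()
pool.join()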

Solution 3

As of 2017 there are some other options, like asyncio and ThreadPoolExecutor.

Here is an example using ThreadPoolExecutor (from the concurrent.futures module):

from concurrent.futures import ThreadPoolExecutor

def download(url, filename):
    # ... your download function ...
    pass

with ThreadPoolExecutor(max_workers=12) as executor:
    future = executor.submit(download, url, filename)
    print(future.result())

The submit() function schedules the task on an internal work queue (queue management is done for you).
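
To download several links at once you would submit one task per URL and collect the results as they finish. A minimal sketch, assuming a urls list and the download function from above:

from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=12) as executor:
    # one future per url; the file name here is just the last path component
    futures = {executor.submit(download, url, url.split('/')[-1]): url
               for url in urls}
    for future in as_completed(futures):
        future.result()              # re-raises any exception from the worker
        print(futures[future], 'done')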

For Python 3.5 and above: if max_workers is None or not given, it will default to the number of processors on the machine, multiplied by 5.

In practice you can set max_workers to a few times the number of CPU cores; run some tests to see how far you can go, depending on the context-switching overhead.

For more info: https://docs.python.org/3/library/concurrent.futures.html
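
For the asyncio option mentioned above, one simple approach is to keep the same blocking download function and run it in threads via asyncio.to_thread (Python 3.9+). A minimal sketch, again assuming a urls list and the download function from the example above:

import asyncio

async def main(urls):
    # each blocking download() call runs in asyncio's default thread pool
    tasks = [asyncio.to_thread(download, url, url.split('/')[-1]) for url in urls]
    await asyncio.gather(*tasks)

asyncio.run(main(urls))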

Author by Blainer

Updated on June 04, 2022

Comments

  • Blainer
    Blainer almost 2 years

    How can I download multiple links simultaneously? My script below works but only downloads one at a time and it is extremely slow. I can't figure out how to incorporate multithreading in my script.

    The Python script:

    from BeautifulSoup import BeautifulSoup
    import lxml.html as html
    import urlparse
    import os, sys
    import urllib2
    import re
    
    print ("downloading and parsing Bibles...")
    root = html.parse(open('links.html'))
    for link in root.findall('//a'):
      url = link.get('href')
      name = urlparse.urlparse(url).path.split('/')[-1]
      dirname = urlparse.urlparse(url).path.split('.')[-1]
      f = urllib2.urlopen(url)
      s = f.read()
      if (os.path.isdir(dirname) == 0): 
        os.mkdir(dirname)
      soup = BeautifulSoup(s)
      articleTag = soup.html.body.article
      converted = str(articleTag)
      full_path = os.path.join(dirname, name)
      open(full_path, 'w').write(converted)
      print(name)
    

    The HTML file called links.html:

    <a href="http://www.youversion.com/bible/gen.1.nmv-fas">http://www.youversion.com/bible/gen.1.nmv-fas</a>
    
    <a href="http://www.youversion.com/bible/gen.2.nmv-fas">http://www.youversion.com/bible/gen.2.nmv-fas</a>
    
    <a href="http://www.youversion.com/bible/gen.3.nmv-fas">http://www.youversion.com/bible/gen.3.nmv-fas</a>
    
    <a href="http://www.youversion.com/bible/gen.4.nmv-fas">http://www.youversion.com/bible/gen.4.nmv-fas</a>
    
    • kindall
      kindall almost 12 years
      You haven't even tried anything yet, so you don't actually have a problem we can help with.
    • User
      User almost 12 years
      open(full_path, 'wb').write(converted) !!! you want to download binary files
  • Blainer
    Blainer almost 12 years
    I am getting this error NameError: name 'queue' is not defined and if I capitalize "Queue" I get this error AttributeError: 'module' object has no attribute 'put'
  • mgilson
    mgilson almost 12 years
    Do you really want to have a function named thread when you've imported a module of the same name?
  • User
    User almost 12 years
    right. this was only a suggestion of structure and not a solution since he knows about multithreading.
  • mgilson
    mgilson almost 12 years
    @Blainer, The first line should probably read from Queue import Queue (for python 2.X) and from threading import Thread, then the last line should be Thread.start_new(thread,()) ... I think. (I don't really use threading so I don't know for sure.)
  • Mike Pennington
    Mike Pennington almost 12 years
    @user1320237, we're here to solve problems. Handwaving and theoretical solutions are not going to help
  • Blainer
    Blainer almost 12 years
    I am getting this error AssertionError: invalid Element proxy at 163319020
  • Blainer
    Blainer almost 12 years
    I am getting this TypeError: unbound method put() must be called with Queue instance as first argument (got str instance instead)
  • User
    User almost 12 years
    @Blainer replace Queue.put with queue.put. Bye, I have to leave.
  • mgilson
    mgilson almost 12 years
    @Blainer : That's a little strange (although I don't know how lxml.html works, so maybe it isn't...). You could try passing the urls instead of the links. It's possible that lxml.html keeps some sort of proxy/handle on the position in the file which isn't being pickled/unpickled properly by multiprocessing. If link.get returns a string, that should play a little more nicely...