Multithreading for faster downloading

Solution 1

This looks to me like the classic producer-consumer problem - see Wikipedia.

You may use something like this:

import Queue
import threading
import os
import urlparse
import urllib2
import lxml.html as html
from BeautifulSoup import BeautifulSoup

# create a Queue.Queue here
queue = Queue.Queue()

print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
  url = link.get('href')
  queue.put(url) # produce


def worker():
  while True:
    url = queue.get() # consume
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if not os.path.isdir(dirname):
      os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    open(full_path, 'wb').write(converted)
    print(name)
    queue.task_done()

for i in range(4):  # start a few consumer threads
  t = threading.Thread(target=worker)
  t.daemon = True
  t.start()

queue.join()  # block until every queued url has been processed

Solution 2

I use multiprocessing for parallelizing things -- for some reason I like it better than threading:

from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re
import multiprocessing


print ("downloading and parsing Bibles...")
def download_stuff(link):
  url = link.get('href')
  name = urlparse.urlparse(url).path.split('/')[-1]
  dirname = urlparse.urlparse(url).path.split('.')[-1]
  f = urllib2.urlopen(url)
  s = f.read()
  if not os.path.isdir(dirname):
    os.mkdir(dirname)
  soup = BeautifulSoup(s)
  articleTag = soup.html.body.article
  converted = str(articleTag)
  full_path = os.path.join(dirname, name)
  open(full_path, 'wb').write(converted)
  print(name)

root = html.parse(open('links.html'))
links = root.findall('//a')
pool = multiprocessing.Pool(processes=5)  # use 5 processes to download the data
output = pool.map(download_stuff, links)  # output is a list of [None, None, ...] since download_stuff doesn't return anything
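
Since the work here is mostly waiting on the network rather than using the CPU, the same Pool API is also available backed by threads via multiprocessing.dummy, which avoids pickling the arguments between processes. A minimal sketch, reusing download_stuff and links from above:

from multiprocessing.dummy import Pool  # same interface as multiprocessing.Pool, but backed by threads

pool = Pool(5)                           # 5 worker threads
output = pool.map(download_stuff, links)
pool.close()
pool.join()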

Solution 3

As of 2017 there are some other options, like asyncio and ThreadPoolExecutor.

Here is an example using ThreadPoolExecutor (from the concurrent.futures module):

from concurrent.futures import ThreadPoolExecutor

def download(url, filename):
    # ... your download function ...
    pass

with ThreadPoolExecutor(max_workers=12) as executor:
    future = executor.submit(download, url, filename)
    print(future.result())

The submit() function schedules the task on an internal work queue (queue management is done for you).
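
To download several links at once you would submit one task per URL and collect the results as they finish. A minimal sketch, assuming a urls list and the download function from above:

from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=12) as executor:
    # one future per url; the file name here is just the last path component
    futures = {executor.submit(download, url, url.split('/')[-1]): url
               for url in urls}
    for future in as_completed(futures):
        future.result()              # re-raises any exception from the worker
        print(futures[future], 'done')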

For Python 3.5 and above: if max_workers is None or not given, it will default to the number of processors on the machine, multiplied by 5.

In practice you can set max_workers to a few times the number of CPU cores; run some tests to see how far you can go, depending on the context-switching overhead.

For more info: https://docs.python.org/3/library/concurrent.futures.html
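
For the asyncio option mentioned above, one simple approach is to keep the same blocking download function and run it in threads via asyncio.to_thread (Python 3.9+). A minimal sketch, again assuming a urls list and the download function from the example above:

import asyncio

async def main(urls):
    # each blocking download() call runs in asyncio's default thread pool
    tasks = [asyncio.to_thread(download, url, url.split('/')[-1]) for url in urls]
    await asyncio.gather(*tasks)

asyncio.run(main(urls))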

Author by Blainer

Updated on June 04, 2022

Comments

  • Blainer
    Blainer almost 2 years

    How can I download multiple links simultaneously? My script below works but only downloads one at a time and it is extremely slow. I can't figure out how to incorporate multithreading in my script.

    The Python script:

    from BeautifulSoup import BeautifulSoup
    import lxml.html as html
    import urlparse
    import os, sys
    import urllib2
    import re
    
    print ("downloading and parsing Bibles...")
    root = html.parse(open('links.html'))
    for link in root.findall('//a'):
      url = link.get('href')
      name = urlparse.urlparse(url).path.split('/')[-1]
      dirname = urlparse.urlparse(url).path.split('.')[-1]
      f = urllib2.urlopen(url)
      s = f.read()
      if (os.path.isdir(dirname) == 0): 
        os.mkdir(dirname)
      soup = BeautifulSoup(s)
      articleTag = soup.html.body.article
      converted = str(articleTag)
      full_path = os.path.join(dirname, name)
      open(full_path, 'w').write(converted)
      print(name)
    

    The HTML file called links.html:

    <a href="http://www.youversion.com/bible/gen.1.nmv-fas">http://www.youversion.com/bible/gen.1.nmv-fas</a>
    
    <a href="http://www.youversion.com/bible/gen.2.nmv-fas">http://www.youversion.com/bible/gen.2.nmv-fas</a>
    
    <a href="http://www.youversion.com/bible/gen.3.nmv-fas">http://www.youversion.com/bible/gen.3.nmv-fas</a>
    
    <a href="http://www.youversion.com/bible/gen.4.nmv-fas">http://www.youversion.com/bible/gen.4.nmv-fas</a>
    
    • kindall
      kindall almost 12 years
      You haven't even tried anything yet, so you don't actually have a problem we can help with.
    • User
      User almost 12 years
      open(full_path, 'wb').write(converted) !!! you want to download binary files
  • Blainer
    Blainer almost 12 years
    I am getting this error NameError: name 'queue' is not defined and if I capitalize "Queue" I get this error AttributeError: 'module' object has no attribute 'put'
  • mgilson
    mgilson almost 12 years
    Do you really want to have a function named thread when you've imported a module of the same name?
  • User
    User almost 12 years
    right. this was only a suggestion of structure and not a solution since he knows about multithreading.
  • mgilson
    mgilson almost 12 years
    @Blainer, The first line should probably read from Queue import Queue (for python 2.X) and from threading import Thread, then the last line should be Thread.start_new(thread,()) ... I think. (I don't really use threading so I don't know for sure.)
  • Mike Pennington
    Mike Pennington almost 12 years
    @user1320237, we're here to solve problems. Handwaving and theoretical solutions are not going to help
  • Blainer
    Blainer almost 12 years
    I am getting this error AssertionError: invalid Element proxy at 163319020
  • Blainer
    Blainer almost 12 years
    I am getting this TypeError: unbound method put() must be called with Queue instance as first argument (got str instance instead)
  • User
    User almost 12 years
    @Blainer replace Queue.put with queue.put. Bye, I have to leave.
  • mgilson
    mgilson almost 12 years
    @Blainer : That's a little strange (although I don't know how lxml.html works, so maybe it isn't...). You could try passing the urls instead of the links. It's possible that lxml.html keeps some sort of proxy/handle on the position in the file which isn't being pickled/unpickled properly by multiprocessing. If link.get returns a string, that should play a little more nicely...