Multithreading for faster downloading
Solution 1
This looks like the producer-consumer problem (see Wikipedia).
You could use something like this:
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os
import urllib2
import Queue
import thread

# create a Queue.Queue and fill it with the URLs
queue = Queue.Queue()

print("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    queue.put(url)  # produce

def download_worker():
    url = queue.get()  # consume
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if not os.path.isdir(dirname):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    open(full_path, 'wb').write(converted)
    print(name)

thread.start_new(download_worker, ())  # start 1 thread
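As written this starts a single worker that handles only one URL. A minimal sketch of draining the whole queue with several threads, using the higher-level threading module, could look like the following; the handle() helper and the thread count of 5 are assumptions, not part of the original answer:

import threading

def worker():
    while True:
        try:
            url = queue.get_nowait()  # consume until the queue is empty
        except Queue.Empty:
            return
        handle(url)  # hypothetical helper wrapping the per-URL download/parse body shown above

threads = [threading.Thread(target=worker) for _ in range(5)]  # thread count chosen arbitrarily
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait until every URL has been processed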
Solution 2
I use multiprocessing for parallelizing things -- for some reason I like it better than threading:
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re
import multiprocessing
print ("downloading and parsing Bibles...")
def download_stuff(link):
url = link.get('href')
name = urlparse.urlparse(url).path.split('/')[-1]
dirname = urlparse.urlparse(url).path.split('.')[-1]
f = urllib2.urlopen(url)
s = f.read()
if (os.path.isdir(dirname) == 0):
os.mkdir(dirname)
soup = BeautifulSoup(s)
articleTag = soup.html.body.article
converted = str(articleTag)
full_path = os.path.join(dirname, name)
open(full_path, 'w').write(converted)
print(name)
root = html.parse(open('links.html'))
links = root.findall('//a')
pool = multiprocessing.Pool(processes=5) #use 5 processes to download the data
output = pool.map(download_stuff,links) #output is a list of [None,None,...] since download_stuff doesn't return anything
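If passing the lxml link elements to the pool causes pickling trouble (see the AssertionError discussed in the comments below), one possible variant is to extract plain URL strings first and map over those. Here download_url is a hypothetical version of download_stuff whose body starts from the URL string rather than the link element:

urls = [link.get('href') for link in root.findall('//a')]  # plain strings pickle cleanly across processes
pool = multiprocessing.Pool(processes=5)
output = pool.map(download_url, urls)  # download_url: hypothetical variant of download_stuff taking a URL string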
Solution 3
In 2017 there are some other options, like asyncio and ThreadPoolExecutor.
Here is an example using ThreadPoolExecutor (part of the concurrent.futures module in the standard library):
from concurrent.futures import ThreadPoolExecutor

def download(url, filename):
    # ... your download function ...
    pass

with ThreadPoolExecutor(max_workers=12) as executor:
    future = executor.submit(download, url, filename)
    print(future.result())
The submit() function puts the task on an internal queue (the queue management is done for you).
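A sketch of submitting one task per URL and printing results as each download finishes (Python 3; the URL list and the trivial download body are placeholders, meant to be replaced with the parsing/saving logic from the question):

from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

def download(url):
    data = urlopen(url).read()  # fetch the page; replace with your parse/save logic
    return url, len(data)

urls = ['http://www.youversion.com/bible/gen.1.nmv-fas',
        'http://www.youversion.com/bible/gen.2.nmv-fas']

with ThreadPoolExecutor(max_workers=12) as executor:
    futures = [executor.submit(download, url) for url in urls]
    for future in as_completed(futures):  # yields each future as soon as it completes
        print(future.result())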
Python version 3.5 and above:
if max_workers is None or not given, it will default to the number of processors on the
machine, multiplied by 5.
In practice you can set max_workers to a few times the number of CPU cores; do some tests to see how far you can go, depending on the context-switching overhead.
For more info: https://docs.python.org/3/library/concurrent.futures.html
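Since asyncio is mentioned above, a minimal sketch of the same download loop with asyncio and the third-party aiohttp library (the library choice and the fetch helper are assumptions; any async HTTP client would do) might look like this:

import asyncio
import aiohttp  # third-party: pip install aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, url) for url in urls))
        for url, page in zip(urls, pages):
            print(url, len(page))

urls = ['http://www.youversion.com/bible/gen.1.nmv-fas',
        'http://www.youversion.com/bible/gen.2.nmv-fas']
asyncio.run(main(urls))  # Python 3.7+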
Blainer
Updated on June 04, 2022

Comments
-
Blainer almost 2 years
How can I download multiple links simultaneously? My script below works but only downloads one at a time and it is extremely slow. I can't figure out how to incorporate multithreading in my script.
The Python script:
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re

print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if (os.path.isdir(dirname) == 0):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    open(full_path, 'w').write(converted)
    print(name)
The HTML file called links.html:

<a href="http://www.youversion.com/bible/gen.1.nmv-fas">http://www.youversion.com/bible/gen.1.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.2.nmv-fas">http://www.youversion.com/bible/gen.2.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.3.nmv-fas">http://www.youversion.com/bible/gen.3.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.4.nmv-fas">http://www.youversion.com/bible/gen.4.nmv-fas</a>
-
kindall almost 12 years
You haven't even tried anything yet, so you don't actually have a problem we can help with.
-
User almost 12 years
open(full_path, 'wb').write(converted) !!! you want to download binary files
-
Blainer almost 12 years
I am getting this error: NameError: name 'queue' is not defined. And if I capitalize "Queue" I get this error: AttributeError: 'module' object has no attribute 'put'
-
mgilson almost 12 years
Do you really want to have a function named thread when you've imported a module of the same name?
-
User almost 12 years
Right. This was only a suggestion of structure and not a solution, since he knows about multithreading.
-
mgilson almost 12 years
@Blainer, the first line should probably read from Queue import Queue (for Python 2.X) and from threading import Thread, then the last line should be Thread.start_new(thread, ()) ... I think. (I don't really use threading so I don't know for sure.)
-
Mike Pennington almost 12 years
@user1320237, we're here to solve problems. Handwaving and theoretical solutions are not going to help.
-
Blainer almost 12 years
I am getting this error: AssertionError: invalid Element proxy at 163319020
-
Blainer almost 12 years
I am getting this: TypeError: unbound method put() must be called with Queue instance as first argument (got str instance instead)
-
User almost 12 years
@Blainer replace Queue.put with queue.put. Bye, I have to leave.
-
mgilson almost 12 years
@Blainer: That's a little strange (although I don't know how lxml.html works, so maybe it isn't...). You could try passing the URLs instead of the link elements. It's possible that lxml.html keeps some sort of proxy/handle on the position in the file which isn't being pickled/unpickled properly by multiprocessing. If link.get returns a string, that should play a little more nicely...