Python urllib2.urlopen() is slow, need a better way to read several urls
Solution 1
I'm rewriting Dumb Guy's code below using modern Python modules like threading and Queue.
import threading, urllib2
import Queue

urls_to_load = [
    'http://stackoverflow.com/',
    'http://slashdot.org/',
    'http://www.archive.org/',
    'http://www.yahoo.co.jp/',
]

def read_url(url, queue):
    data = urllib2.urlopen(url).read()
    print('Fetched %s from %s' % (len(data), url))
    queue.put(data)

def fetch_parallel():
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args=(url, result)) for url in urls_to_load]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def fetch_sequencial():
    result = Queue.Queue()
    for url in urls_to_load:
        read_url(url, result)
    return result
Best time for fetch_sequencial() is 2s. Best time for fetch_parallel() is 0.9s.
Also, it is incorrect to say that threads are useless in Python because of the GIL. This is one of those cases where threads are useful in Python, because the threads are blocked on I/O and the GIL is released while they wait. As you can see in my results, the parallel case is about 2 times faster.
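For anyone on Python 3, roughly the same comparison can be sketched with concurrent.futures. This is my assumption about a modern equivalent, not part of the code above. To keep it runnable offline, fake_fetch is a hypothetical stand-in that sleeps instead of calling urlopen; swap in a real fetch function to time real URLs.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url, delay=0.2):
    # Hypothetical stand-in for urlopen(url).read(): sleeping blocks the
    # thread much like waiting on a socket, and the GIL is released meanwhile.
    time.sleep(delay)
    return "data from %s" % url

def fetch_all(urls, fetch=fake_fetch, max_workers=8):
    # pool.map() yields results in the same order as the input URLs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

def timed(func, *args):
    # Return (result, elapsed seconds) for a single call.
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

if __name__ == "__main__":
    urls = ["url-%d" % i for i in range(4)]
    _, sequential = timed(lambda: [fake_fetch(u) for u in urls])
    _, parallel = timed(fetch_all, urls)
    print("sequential: %.2fs, parallel: %.2fs" % (sequential, parallel))
```

With four simulated 0.2s fetches, the sequential run has to wait for each one in turn, while the threaded run overlaps the waits. Real numbers will depend on the network and the servers.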
Solution 2
Edit: Please take a look at Wai's post for a better version of this code. Note that there is nothing wrong with this code and it will work properly, despite the comments below.
The speed of reading web pages is probably bounded by your Internet connection, not Python.
You could use threads to load them all at once.
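import thread, time, urllib
websites = {}
def read_url(url):
    # urllib.urlopen is the Python 2 function; urllib.open does not exist
    websites[url] = urllib.urlopen(url).read()
# urls_to_load is the list of URLs to fetch (see Solution 1)
for url in urls_to_load:
    thread.start_new_thread(read_url, (url,))
# Compare counts: dict keys have no guaranteed order, so keys() != list may never become False
while len(websites) < len(urls_to_load):
    time.sleep(0.1)
# Now websites will contain the contents of all the web pages in urls_to_load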
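A variation on the same idea, sketched for Python 3 with the threading module, waits with join() instead of polling the dictionary. The names read_all and fetch are hypothetical; fetch is whatever callable you use to download a URL, passed in so the sketch can run without a network.

```python
import threading

def read_all(urls, fetch):
    # `fetch` is a hypothetical callable: it takes a URL and returns its body.
    websites = {}
    lock = threading.Lock()

    def read_url(url):
        data = fetch(url)
        with lock:  # guard the shared dict explicitly
            websites[url] = data

    threads = [threading.Thread(target=read_url, args=(url,)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # blocks until that thread finishes; no sleep loop needed
    return websites

if __name__ == "__main__":
    # Stand-in fetcher so the example runs offline.
    print(read_all(["a", "b", "c"], lambda url: url.upper()))
```

If fetch raises inside a thread, that URL is simply missing from the result, so a caller would probably want to check which keys came back.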
Solution 3
It is maybe not perfect, but when I need the data from a site, I just do this:
import socket

def geturldata(url):
    # Pass the URL without the "http://" prefix, e.g. "example.com/page"
    server, _, path = url.partition("/")
    returndata = str()
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((server, 80))  # plain HTTP on port 80
    # simple HTTP/1.0 request; the server closes the connection when it is done
    s.sendall("GET /%s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, server))
    while 1:
        data = s.recv(1024)  # read in 1 KB chunks
        if not data: break
        returndata = returndata + data
    s.close()
    # headers and body are separated by a blank line (CRLF CRLF)
    return returndata.split("\r\n\r\n", 1)[1]
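The last step, separating the headers from the body, is the easiest part to get wrong, so here is a small Python 3 sketch of just that step. The function name split_http_response and the sample response are made up for illustration. It only handles a simple, complete response; chunked transfer encoding, redirects, and compression are out of scope.

```python
def split_http_response(raw):
    # Headers end at the first blank line, i.e. the first CRLF CRLF.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    status_line = lines[0].decode("iso-8859-1")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(b":")
        key = name.strip().lower().decode("iso-8859-1")
        headers[key] = value.strip().decode("iso-8859-1")
    return status_line, headers, body

# Made-up response; the body deliberately contains "\n\r" to show that
# only the first CRLF CRLF is treated as the separator.
sample = (b"HTTP/1.0 200 OK\r\n"
          b"Content-Type: text/html\r\n"
          b"\r\n"
          b"<html>line one\n\rline two</html>")

status, headers, body = split_http_response(sample)
print(status)
print(headers["content-type"])
print(body)
```

Using partition on the first blank line keeps the whole body intact even when the body itself contains line breaks.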
Solution 4
Scrapy might be useful for you. If you don't need all of its functionality, you might just use Twisted's twisted.web.client.getPage instead. Asynchronous I/O in one thread is likely to perform better and be easier to debug than anything that uses multiple threads and blocking I/O.
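To show what single-threaded asynchronous I/O looks like, here is a minimal sketch that uses Python 3's standard asyncio module rather than Twisted or Scrapy, so it is a swapped-in technique and not the API named above. The coroutine fake_fetch_async is a hypothetical stand-in that sleeps instead of making an HTTP request; a real version would need an asynchronous HTTP client.

```python
import asyncio

async def fake_fetch_async(url, delay=0.2):
    # Hypothetical stand-in for a non-blocking HTTP request: while this
    # coroutine waits, the event loop is free to run the others.
    await asyncio.sleep(delay)
    return "data from %s" % url

async def fetch_all_async(urls):
    # gather() runs the coroutines concurrently and keeps input order.
    return await asyncio.gather(*(fake_fetch_async(u) for u in urls))

def run(urls):
    # Start an event loop, run everything in one thread, return the results.
    return asyncio.run(fetch_all_async(urls))

if __name__ == "__main__":
    print(run(["url-1", "url-2", "url-3"]))
```

All the waiting overlaps inside one thread, so there is no shared state to lock.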
Solution 5
As a general rule, a given construct in any language is not slow until it is measured.
In Python, not only do timings often run counter to intuition but the tools for measuring execution time are exceptionally good.
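As a rough illustration of measuring before optimizing, the standard timeit module can time any callable. The helper best_time and the function simulated_fetch below are hypothetical names; simulated_fetch sleeps in place of a real call such as urlopen(url).read().

```python
import time
import timeit

def best_time(func, repeat=3, number=1):
    # timeit.repeat() returns one total per run; the minimum is usually
    # the least disturbed by other activity on the machine.
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number

def simulated_fetch():
    # Hypothetical stand-in for a slow call such as urlopen(url).read().
    time.sleep(0.05)

if __name__ == "__main__":
    print("best of 3: %.3fs" % best_time(simulated_fetch))
```

Timing each fetch and each parse separately like this should show whether the time goes to the network or to the parsing.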
Admin
Updated on September 01, 2020
Comments
Admin over 3 years
As the title suggests, I'm working on a site written in Python and it makes several calls to the urllib2 module to read websites. I then parse them with BeautifulSoup.
As I have to read 5-10 sites, the page takes a while to load.
I'm just wondering if there's a way to read the sites all at once? Or any tricks to make it faster, like should I close the urllib2.urlopen after each read, or keep it open?
Added: also, if I were to just switch over to PHP, would that be faster for fetching and parsing HTML and XML files from other sites? I just want it to load faster, as opposed to the ~20 seconds it currently takes.