Using Beautiful Soup to parse a URL and fetch data from the URLs it links to
Solution 1
import urllib2
from BeautifulSoup import BeautifulSoup

# Python 2 with BeautifulSoup 3
page = urllib2.urlopen('http://yahoo.com').read()
soup = BeautifulSoup(page)

# Print the href of every anchor that has one
for anchor in soup.findAll('a', href=True):
    print anchor['href']
This gives you the list of URLs. You can then iterate over those URLs, fetch each page, and parse its data.
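The iteration step described above can be sketched roughly as follows. This is a minimal Python 3 sketch that assumes Beautiful Soup 4 (`bs4`) is installed. The names `extract_links`, `crawl_details`, the `fetch` parameter, and the in-memory `FAKE_SITE` pages are all hypothetical, introduced only so the example runs without network access.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in for a real site, so the sketch runs offline.
FAKE_SITE = {
    "http://example.com/events/": (
        '<a href="http://example.com/events/1">Event 1</a>'
        '<a href="/events/2">Event 2</a>'
    ),
    "http://example.com/events/1": "<html><title>Event 1 details</title></html>",
    "http://example.com/events/2": "<html><title>Event 2 details</title></html>",
}


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def crawl_details(list_url, fetch):
    """Fetch the list page, then fetch and parse each linked detail page.

    `fetch` is any callable mapping a URL to HTML text (an assumption of
    this sketch; in practice it could wrap urllib.request.urlopen).
    """
    details = {}
    for url in extract_links(fetch(list_url), list_url):
        page = BeautifulSoup(fetch(url), "html.parser")
        # Pull out whatever is needed from the detail page; the title is
        # used here purely as a placeholder.
        details[url] = page.title.string if page.title else None
    return details


results = crawl_details("http://example.com/events/", FAKE_SITE.__getitem__)
for url, title in results.items():
    print(url, "->", title)
```

Against a real site, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read()`, and the placeholder title lookup would be replaced by whatever fields the detail page actually contains.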
You can also narrow the search to specific elements, for example a div with a given id:
inner_div = soup.findAll("div", {"id": "y-shade"})
The BeautifulSoup tutorials cover more selection options.
Solution 2
For anyone who comes across this later: as of this post, BeautifulSoup has been upgraded to v4, and v3 is no longer being updated.
$ easy_install beautifulsoup4
$ pip install beautifulsoup4
To use in Python...
from bs4 import BeautifulSoup
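For reference, here is a minimal sketch of the Solution 1 calls under Beautiful Soup 4 and Python 3, where the preferred spelling is `find_all` rather than `findAll`. The inline `html` string is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = """
<div id="y-shade"><a href="http://example.com/a">A</a></div>
<div id="other"><a href="http://example.com/b">B</a><a name="no-href">C</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only anchors that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# Attribute filters can be passed as a dict, as in the v3 example.
shade_divs = soup.find_all("div", {"id": "y-shade"})

print(hrefs)
print(len(shade_divs))
```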
Solution 3
Use urllib2 to get the page, then use Beautiful Soup to get the list of links. Also try scraperwiki.com.
Edit:
Recent discovery: using BeautifulSoup through lxml with
from lxml.html.soupparser import fromstring
is miles better than using BeautifulSoup alone. It lets you call dom.cssselect('your selector'), which is a lifesaver. Just make sure you have a good version of BeautifulSoup installed; 3.2.1 works a treat.
dom = fromstring('<html... ...')
navigation_links = [a.get('href') for a in dom.cssselect('#navigation a')]
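If installing lxml is not an option, Beautiful Soup 4 also has a `select()` method that takes CSS selectors. The sketch below uses it in place of `cssselect`, so it is a swapped-in technique rather than the lxml approach described above. The markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Invented markup: two links inside #navigation, one outside it.
html = """
<ul id="navigation">
  <li><a href="/home">Home</a></li>
  <li><a href="/events">Events</a></li>
</ul>
<a href="/elsewhere">Not in navigation</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant selector: every <a> nested anywhere under id="navigation".
navigation_links = [a.get("href") for a in soup.select("#navigation a")]
print(navigation_links)
```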
Solution 4
FULL PYTHON 3 EXAMPLE
Packages
# urllib is part of the Python 3 standard library, so it needs no install
# pip3 install beautifulsoup4
Example:
import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen('https://www.wikipedia.org/') as f:
    data = f.read().decode('utf-8')

# Name the parser explicitly to avoid bs4's "no parser specified" warning
d = BeautifulSoup(data, 'html.parser')
print(d.title.string)
The above should print Wikipedia.
tim
Updated on April 29, 2020
I need to parse a url to get a list of urls that link to a detail page. Then from that page I need to get all the details from that page. I need to do it this way because the detail page url is not regularly incremented and changes, but the event list page stays the same.
Basically:
example.com/events/
    <a href="http://example.com/events/1">Event 1</a>
    <a href="http://example.com/events/2">Event 2</a>
example.com/events/1
    ...some detail stuff I need
example.com/events/2
    ...some detail stuff I need