Beautiful Soup to parse a URL to get another URL's data

Solution 1

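import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

# Download the page and parse it
page = urllib2.urlopen('http://yahoo.com').read()
soup = BeautifulSoup(page)

# Print the target of every <a> tag that has an href attribute
for anchor in soup.findAll('a', href=True):
    print anchor['href']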

It will give you the list of URLs. Now you can iterate over those URLs and parse the data.

  • inner_div = soup.findAll("div", {"id": "y-shade"}) is an example of finding the div elements with a given id. You can go through the BeautifulSoup tutorials for more.
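
The "iterate over those URLs and parse the data" step can be sketched as below. This is a minimal Python 3 / beautifulsoup4 sketch, not the answer's original code: the helper names (fetch, extract_links, scrape_details, page_title) are made up for illustration, the page is assumed to be UTF-8, and page_title is only a placeholder for whatever detail parsing you actually need. Relative links are resolved against the list page URL with urljoin.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch(url):
    """Download a page and return its body as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html, base_url):
    """Return an absolute URL for every <a> tag in html that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def scrape_details(list_url, parse_detail, fetch_page=fetch):
    """Fetch the list page, then fetch and parse every page it links to."""
    results = {}
    for url in extract_links(fetch_page(list_url), list_url):
        results[url] = parse_detail(fetch_page(url))
    return results


def page_title(html):
    """Hypothetical detail parser: return the text of <title>, if present."""
    title = BeautifulSoup(html, "html.parser").title
    return title.get_text() if title else None
```

Calling scrape_details('http://example.com/events/', page_title) would then return a dict mapping each linked URL to its parsed result. Note that extract_links collects every anchor on the page, so on a real site you would probably narrow the search to the element that holds the event links. The fetch_page parameter exists so the download step can be swapped out, for example with a stub when testing.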

Solution 2

For the next group of people who come across this: as of this post, BeautifulSoup has been upgraded to v4, and v3 is no longer being updated.

$ easy_install beautifulsoup4

$ pip install beautifulsoup4

To use in Python...

from bs4 import BeautifulSoup

Solution 3

Use urllib2 to get the page, then use Beautiful Soup to get the list of links. Also try scraperwiki.com.

Edit:

Recent discovery: Using BeautifulSoup through lxml with

from lxml.html.soupparser import fromstring

is miles better than just BeautifulSoup. It lets you do dom.cssselect('your selector') which is a life saver. Just make sure you have a good version of BeautifulSoup installed. 3.2.1 works a treat.

dom = fromstring('<html... ...')
navigation_links = [a.get('href') for a in dom.cssselect('#navigation a')]
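
If you are on beautifulsoup4 rather than 3.2.1, a similar CSS-selector lookup is available without lxml through the select() method. The sketch below is only an illustration of that alternative, not the lxml approach described above; the HTML snippet and its '#navigation' container are made-up sample data.

```python
from bs4 import BeautifulSoup

# Made-up sample markup: two links inside #navigation, one outside it
html = """
<div id="navigation">
  <a href="/home">Home</a>
  <a href="/events">Events</a>
</div>
<a href="/elsewhere">Not in the navigation</a>
"""

soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns the matching tags in document order
navigation_links = [a.get("href") for a in soup.select("#navigation a")]
print(navigation_links)
```

Only the two links inside the navigation container are collected; the third anchor is skipped because it does not match the selector.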

Solution 4

FULL PYTHON 3 EXAMPLE

Packages

# urllib is part of the Python 3 standard library, so there is nothing to install for it
# pip3 install beautifulsoup4

Example:

import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen('https://www.wikipedia.org/') as f:
    data = f.read().decode('utf-8')

d = BeautifulSoup(data, 'html.parser')

print(d.title.string)

The above should print out 'Wikipedia'

Author by

tim

Updated on April 29, 2020

Comments

  • tim, about 4 years

    I need to parse a URL to get a list of URLs that link to detail pages. Then, from each of those detail pages, I need to get all the details. I need to do it this way because the detail page URLs are not regularly incremented and they change, but the event list page stays the same.

    Basically:

    example.com/events/
        <a href="http://example.com/events/1">Event 1</a>
        <a href="http://example.com/events/2">Event 2</a>
    
    example.com/events/1
        ...some detail stuff I need
    
    example.com/events/2
        ...some detail stuff I need