Python and BeautifulSoup Opening pages


Solution 1

I am still not sure where you are getting the HTML from, but if you are trying to extract all of the href values, the following approach should work based on the image you have posted:

import requests
from bs4 import BeautifulSoup

r = requests.get("<add your URL here>")
soup = BeautifulSoup(r.content, "html.parser")

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']

By adding href=True to the find_all(), only a elements that contain an href attribute are returned, removing the need to test for it separately.

Just to warn you: you might find some websites lock you out after one or two attempts, as they can detect that you are trying to access the site via a script rather than as a human. If you feel you are not getting the correct responses, I would recommend printing the HTML you get back to ensure it is still what you expect.
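One common mitigation (a sketch only; the header string is illustrative and what each site checks varies) is to send a browser-like User-Agent header and fail loudly on a non-200 status before parsing anything:

```python
import requests

# An example browser-like header; this exact string is an assumption,
# not something any particular site is known to require.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; my-scraper/1.0)"}

def fetch_html(url):
    """Fetch a page, raising if the site appears to refuse the request."""
    r = requests.get(url, headers=HEADERS)
    if r.status_code != 200:
        # A non-200 status is a strong hint the site is blocking the script.
        raise RuntimeError("unexpected status %s for %s" % (r.status_code, url))
    return r.text

# Usage (with a real URL):
# html = fetch_html("http://www.mywebsite.com/search/")
```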

If you then want to get the HTML for each of the links, the following could be used:

import requests
from bs4 import BeautifulSoup

# Configure this to be your first request URL
r = requests.get("http://www.mywebsite.com/search/")
soup = BeautifulSoup(r.content, "html.parser")

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']

# Configure this to the root of the above website, e.g. 'http://www.mywebsite.com'
base_url = "http://www.mywebsite.com"

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print '-' * 60      # Add a line of dashes
    print 'href: ', a_tag['href']
    request_href = requests.get(base_url + a_tag['href'])
    print request_href.content
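One caveat with base_url + a_tag['href']: plain string concatenation only works when every href is site-relative. The standard library's urljoin (a sketch; in Python 2 it lives in the urlparse module rather than urllib.parse) handles relative and absolute hrefs alike:

```python
from urllib.parse import urljoin  # Python 3; in Python 2 use: from urlparse import urljoin

base_url = "http://www.mywebsite.com"

# A site-relative href is joined onto the base URL...
print(urljoin(base_url, "/listing/123"))            # http://www.mywebsite.com/listing/123
# ...while an already-absolute href is left untouched.
print(urljoin(base_url, "http://other.example/x"))  # http://other.example/x
```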

Tested using Python 2.x; for Python 3.x, add parentheses to the print statements.
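For readers on Python 3, the first snippet could be written as follows (the URL is a placeholder, and the function name is mine, not from the original answer):

```python
import requests
from bs4 import BeautifulSoup

def get_listing_hrefs(url):
    """Return the href of every <a class="listing-name"> tag that has one."""
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    return [a["href"] for a in soup.find_all("a", class_="listing-name", href=True)]

# Usage (with a real URL):
# for href in get_listing_hrefs("http://www.mywebsite.com/search/"):
#     print("href:", href)
```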

Solution 2

  1. I had the same problem and would like to share my findings: I did try the answer above and, for some reason, it did not work for me, but after some research I found something interesting.

  2. You might need to find the element that contains the "href" link itself. You will need the exact class that wraps the link, which in your case looks to be "listing__left-column"; assign the matches to a variable, say listings, for example:

import requests
from bs4 import BeautifulSoup

r = requests.get("<add your URL here>")
soup = BeautifulSoup(r.content, "html.parser")

listings = soup.find_all("div", {"class": "listing__left-column"})
for item in listings:
    for link in item.find_all("a"):
        if 'href' in link.attrs:
            print(link.attrs['href'])

Doing this, I was able to get into another link that was embedded in the home page.
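As an aside, the same two-level lookup can be collapsed into one CSS selector with select() (a sketch against a stand-in page, assuming the class name from the answer above):

```python
from bs4 import BeautifulSoup

# A small stand-in for the real page, using the class name from the answer.
html = """
<div class="listing__left-column">
  <a href="/business/1">First listing</a>
  <a>no link here</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "div.listing__left-column a[href]" matches only <a> tags that have an
# href attribute inside the target div, in a single step.
for link in soup.select("div.listing__left-column a[href]"):
    print(link["href"])  # → /business/1
```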

Author: Brendan Cott

Updated on July 09, 2022

Comments

  • Brendan Cott
    Brendan Cott almost 2 years

I am wondering how I would open another page in my list with BeautifulSoup? I have followed this tutorial, but it does not explain how to open another page on the list. Also, how would I open an href that is nested inside a class?

    Here is my code:

    # coding: utf-8
    
    import requests
    from bs4 import BeautifulSoup
    
    r = requests.get("")
    soup = BeautifulSoup(r.content)
    soup.find_all("a")
    
    for link in soup.find_all("a"):
        print link.get("href")
    
        for link in soup.find_all("a"):
            print link.text
    
        for link in soup.find_all("a"):
            print link.text, link.get("href")
    
        g_data = soup.find_all("div", {"class":"listing__left-column"})
    
        for item in g_data:
            print item.contents
    
        for item in g_data:
            print item.contents[0].text
            print link.get('href')
    
        for item in g_data:
            print item.contents[0]
    

I am trying to collect the hrefs from the titles of each business, and then open them and scrape that data.

    • Remi Guan
      Remi Guan over 8 years
First, I don't understand what you are asking. Then, maybe you'd like to see the documentation.
    • Martin Evans
      Martin Evans over 8 years
      You will need to let us know which page you wish to scrape. Something like r = requests.get("http://www.yellowpages.com/") will be needed.
    • Brendan Cott
      Brendan Cott over 8 years
I should have explained it more: what I want to do is open an href inside a div, etc. puu.sh/kmgxZ/15fc324654.png I want to call each href that has a link and open those pages to then start to scrape.
  • Brendan Cott
    Brendan Cott over 8 years
OK, so after a long read it turns out I want to open hrefs or classes using something that can do that. I was told Requests can do this. So if I get Requests to open an href on that page and then scrape that page with BS, it will work.
  • Martin Evans
    Martin Evans over 8 years
The first request call is used to get some HTML for BeautifulSoup to parse. It then displays the hrefs. If you want to get the HTML from each of these, you can use the update I added to my answer.
  • Brendan Cott
    Brendan Cott over 8 years
Thank you. There is not much on scraping sites and the pages nested inside their pages, just a lot of tutorials on scraping one page. What book or tutorial series would you recommend for Python and scraping?
  • Martin Evans
    Martin Evans over 8 years
    The most important thing to understand is how HTML is structured. You would then know what to look for.
  • Brendan Cott
    Brendan Cott over 8 years
Hello Martin, I have now got my HTML and extracted the data, but am now looking at ways to use BeautifulSoup. Can we have multiple classes with a BS attribute? For instance, I have turned my request_href.content into a variable and want to extract content from that. I can see I cannot add such things as newpage.findAll etc.
  • Martin Evans
    Martin Evans over 8 years
I suggest you work through the whole BeautifulSoup tutorial; it explains everything. You might want to click the tick by my answer; you could then raise a new question.
  • zero
    zero almost 3 years
hi there - where do you put the URL!?
  • zero
    zero almost 3 years
    well - Martin - if i run the code i get errors ` runfile('/home/martin/dev/untitled1.py', wdir='/home/martin/dev') File "/home/martin/dev/untitled1.py", line 17 print 'href: ', a_tag['href'] ^ SyntaxError: Missing parentheses in call to 'print'. Did you mean print('href: ', a_tag['href'])?`
  • Martin Evans
    Martin Evans almost 3 years
Correct, in Python 3.x you do need to add parentheses to the print statements; the question, though, was originally asked using Python 2.x.
  • zero
    zero almost 3 years
Thanks for all the great ideas. Regarding the detection that one is trying to access a site via a script: couldn't we work here with the sleep() function? Step 1: import time; Step 2: add time.sleep() - what do you think?!
  • Martin Evans
    Martin Evans almost 3 years
    Definitely worth a try, it really depends on the website
  • temi
    temi over 2 years
    r = requests.get("<your URL here>")
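The time.sleep() idea raised in the comments above can be sketched like this (the URL list and function name are hypothetical; an appropriate delay, and whether it helps at all, depends on the site):

```python
import time

def fetch_politely(urls, delay=2.0):
    """Visit each URL with a pause in between, to look less like burst traffic."""
    visited = []
    for url in urls:
        # A real script would fetch and parse the page here with
        # requests + BeautifulSoup; this sketch just records the URL.
        visited.append(url)
        time.sleep(delay)
    return visited

# Usage, with hypothetical hrefs collected from the search page:
# fetch_politely(["/listing/1", "/listing/2"], delay=2.0)
```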