Using BeautifulSoup to extract the title of a link

Solution 1

It looks like you have put two spaces between s-access-detail-page and a-text-normal in the class string, so the search does not find any matching link. Try it with single spaces between the class names, then print the number of links found. You can also print the tag itself with print(link).

import requests
from bs4 import BeautifulSoup

url = "http://www.amazon.in/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=python"
source_code = requests.get(url)
plain_text = source_code.content
soup = BeautifulSoup(plain_text, "lxml")
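# Single spaces between the class names, so the string matches the attribute exactly
links = soup.find_all('a', {'class': 'a-link-normal s-access-detail-page a-text-normal'})
print(len(links))  # number of matching links found
for link in links:
    title = link.get('title')
    print(title)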
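
To see the effect of the extra space without hitting the network, here is a minimal sketch on an inline snippet. The markup is a shortened, assumed stand-in for one Amazon result link, and html.parser is used only so that lxml is not required:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for one search-result link (assumed markup, for illustration only).
html = ('<a class="a-link-normal s-access-detail-page a-text-normal" '
        'title="Introduction To Computation And Programming Using Python" '
        'href="#">Book</a>')
soup = BeautifulSoup(html, "html.parser")

# Class string with two spaces before a-text-normal, as in the question.
double_space = soup.find_all('a', {'class': 'a-link-normal s-access-detail-page  a-text-normal'})
# The same class string with single spaces only.
single_space = soup.find_all('a', {'class': 'a-link-normal s-access-detail-page a-text-normal'})

print(len(double_space))  # the two-space string should find nothing
print(len(single_space))  # the single-space string should find the link
for link in single_space:
    print(link.get('title'))
```

If the single-space version still finds nothing on the live page, the class names in the downloaded HTML probably differ from what the browser inspector shows.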

Solution 2

You are searching for an exact string here, by using multiple classes. In that case the class string has to match exactly, with single spaces.

See the Searching by CSS class section in the documentation:

You can also search for the exact string value of the class attribute:

css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]

But searching for variants of the string value won’t work:

css_soup.find_all("p", class_="strikeout body")
# []

You'd have a better time searching for individual classes:

soup.find_all('a', class_='a-link-normal')

If you must match more than one class, use a CSS selector:

soup.select('a.a-link-normal.s-access-detail-page.a-text-normal')

and it won't matter in what order you list the classes.
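
As a rough sketch of that difference, the snippet below compares an exact class string, a reordered class string, and a CSS selector with the classes in a different order. The one-link markup and the title "Example title" are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical one-link document carrying the three classes from the question.
html = ('<a class="a-link-normal s-access-detail-page a-text-normal" '
        'title="Example title" href="#">Book</a>')
soup = BeautifulSoup(html, "html.parser")

# Exact string value of the class attribute: matches.
exact = soup.find_all('a', class_='a-link-normal s-access-detail-page a-text-normal')
# Same classes in a different order: treated as a different string, so no match.
reordered = soup.find_all('a', class_='a-text-normal a-link-normal s-access-detail-page')
# CSS selector with the classes in a different order: still matches.
selected = soup.select('a.a-text-normal.a-link-normal.s-access-detail-page')

print(len(exact), len(reordered), len(selected))
```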

Demo:

>>> from bs4 import BeautifulSoup
>>> plain_text = u'<a class="a-link-normal s-access-detail-page a-text-normal" href="http://www.amazon.in/Introduction-Computation-Programming-Using-Python/dp/8120348664" title="Introduction To Computation And Programming Using Python"><h2 class="a-size-medium a-color-null s-inline s-access-title a-text-normal">Introduction To Computation And Programming Using <strong>Python</strong></h2></a>'
>>> soup = BeautifulSoup(plain_text, 'html.parser')
>>> for link in soup.find_all('a', class_='a-link-normal'):
...     print(link.text)
...
Introduction To Computation And Programming Using Python
>>> for link in soup.select('a.a-link-normal.s-access-detail-page.a-text-normal'):
...     print(link.text)
... 
Introduction To Computation And Programming Using Python
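
The demo prints the link text, while the question asks for the title attribute. A minimal sketch of that variant follows, assuming the same kind of markup. The second link without a title is a hypothetical addition to show that link.get('title') returns None when the attribute is missing:

```python
from bs4 import BeautifulSoup

# First link mirrors the question's element; the second (no title) is hypothetical.
html = (
    '<a class="a-link-normal s-access-detail-page a-text-normal" '
    'title="Introduction To Computation And Programming Using Python" href="#">'
    '<h2>Introduction To Computation And Programming Using <strong>Python</strong></h2></a>'
    '<a class="a-link-normal s-access-detail-page a-text-normal" href="#">No title here</a>'
)
soup = BeautifulSoup(html, "html.parser")

titles = []
for link in soup.select('a.a-link-normal.s-access-detail-page.a-text-normal'):
    title = link.get('title')  # None when the tag has no title attribute
    if title is not None:
        titles.append(title)

print(titles)
```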
Author by

Manas Chaturvedi

Software Engineer 2 at Haptik

Updated on July 09, 2022

Comments

  • Manas Chaturvedi almost 2 years

    I'm trying to extract the title of a link using BeautifulSoup. The code that I'm working with is as follows:

    url = "http://www.example.com"
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "lxml")
    for link in soup.findAll('a', {'class': 'a-link-normal s-access-detail-page  a-text-normal'}):
        title = link.get('title')
        print title
    

    Now, an example link element contains the following:

    <a class="a-link-normal s-access-detail-page a-text-normal" href="http://www.amazon.in/Introduction-Computation-Programming-Using-Python/dp/8120348664" title="Introduction To Computation And Programming Using Python"><h2 class="a-size-medium a-color-null s-inline s-access-title a-text-normal">Introduction To Computation And Programming Using <strong>Python</strong></h2></a>
    

    However, nothing gets displayed after I run the above code. How can I extract the value stored inside the title attribute of the anchor tag stored in link?

  • Manas Chaturvedi over 8 years
    print link outputs the above link value that I mentioned in my original post. The class name is indeed correct and is able to find matching links. But I can't seem to extract the value inside the title attribute from link.
  • Vikas Ojha over 8 years
    Please try replacing the .text with .content, i.e., plain_text = source_code.content. Also, could you post a sample url?
  • Manas Chaturvedi over 8 years
    This is the URL I'm working with: http://www.amazon.in/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=python
  • Vikas Ojha over 8 years
    The links with that class are not present in the HTML source of the page.
  • Manas Chaturvedi over 8 years
    Try inspecting the elements which contain the titles of the books.
  • Vikas Ojha over 8 years
    I just edited my answer, and the above code is working perfectly fine. Hope that helps.