BeautifulSoup: extract text from anchor tag

Solution 1

All the answers above really helped me construct my own, so I upvoted every one of them. In the end, though, I put together my own answer to the exact problem I was dealing with:

As the question states, I had to access some siblings and their children in a DOM structure. This solution iterates over the images in the DOM, builds each image's file name from the product title, and saves the image to a local directory.

import os
from urllib import urlretrieve
from BeautifulSoup import BeautifulSoup as bs
import requests

def getImages(url):
    # download the page
    r = requests.get(url)
    html = r.text
    soup = bs(html)
    # expand '~' so the path points at the real home directory
    output_folder = os.path.expanduser('~/amazon')
    # extract the images that are inside div(s) with class "image"
    for div in soup.findAll('div', attrs={'class':'image'}):
        try:
            # get the data div using findNext
            nextDiv = div.findNext('div', attrs={'class':'data'})
            # use findNext again on that div to reach the anchor tag
            fileName = nextDiv.findNext('a').text
            modified_file_name = fileName.replace(' ','-') + '.jpg'
            imageUrl = div.find('img')['src']
        except (AttributeError, TypeError):
            # a div, anchor or img was missing, so skip this entry
            print 'skip'
            continue
        outputPath = os.path.join(output_folder, modified_file_name)
        urlretrieve(imageUrl, outputPath)

if __name__=='__main__':
    url = r'http://www.amazon.com/s/ref=sr_pg_1?rh=n%3A172282%2Ck%3Adigital+camera&keywords=digital+camera&ie=UTF8&qid=1343600585'
    getImages(url)
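
For readers on Python 3, the same sibling traversal can be sketched with bs4. This is only an illustration under stated assumptions: the markup in SAMPLE_HTML is hypothetical (modelled on the question, not taken from the real page), image_names is a made-up helper name, and it uses find_next_sibling rather than the findNext call above. Nothing is downloaded; it only builds the file names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an image div followed by a data div holding the title.
SAMPLE_HTML = """
<div class="image"><img src="http://image.example.com/img1.jpg" /></div>
<div class="data"><a class="title" href="http://www.example.com/eg1">Nikon COOLPIX L26 (Red)</a></div>
<div class="image"><img src="http://image.example.com/img2.jpg" /></div>
"""

def image_names(html):
    """Return (image URL, file name) pairs; entries without a title are skipped."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for div in soup.find_all("div", class_="image"):
        img = div.find("img")
        # look among the following siblings for the div with class "data"
        data_div = div.find_next_sibling("div", class_="data")
        anchor = data_div.find("a", class_="title") if data_div else None
        if img is None or anchor is None:
            continue  # nothing to name the file after
        title = anchor.get_text(strip=True)
        pairs.append((img["src"], title.replace(" ", "-") + ".jpg"))
    return pairs

print(image_names(SAMPLE_HTML))
```

Checking for None explicitly avoids the try/except, at the cost of a few more lines.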

Solution 2

This will help:

from bs4 import BeautifulSoup

data = '''<div class="image">
        <a href="http://www.example.com/eg1">Content1<img  
        src="http://image.example.com/img1.jpg" /></a>
        </div>
        <div class="image">
        <a href="http://www.example.com/eg2">Content2<img  
        src="http://image.example.com/img2.jpg" /> </a>
        </div>'''

# name the parser explicitly; bs4 warns when it has to guess one
soup = BeautifulSoup(data, 'html.parser')

for div in soup.findAll('div', attrs={'class':'image'}):
    print(div.find('a')['href'])
    print(div.find('a').contents[0])
    print(div.find('img')['src'])

If you are looking into Amazon products then you should be using the official API. There is at least one Python package that will ease your scraping issues and keep your activity within the terms of use.

Solution 3

In my case, it worked like this:

import urllib
from BeautifulSoup import BeautifulSoup as bs

url="http://blabla.com"

# fetch the page and print the text of every anchor tag
soup = bs(urllib.urlopen(url))
for link in soup.findAll('a'):
    print link.string

Hope it helps!
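
As a comment on this page points out, link.string is a NavigableString rather than a plain string. A minimal sketch of the conversion, assuming Python 3 and bs4 instead of the Python 2 setup above, with made-up sample markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one anchor with text, one that only wraps an image.
html = (
    '<a href="http://www.example.com/eg1">Content1</a>'
    '<a href="http://www.example.com/eg2"><img src="x.jpg" /></a>'
)
soup = BeautifulSoup(html, "html.parser")

texts = []
for link in soup.find_all("a"):
    # .string is None when the anchor has no single text child
    if link.string is not None:
        texts.append(str(link.string))  # NavigableString -> plain str

print(texts)
```

The None check matters because an anchor that only wraps an image has no string to convert.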

Solution 4

I would suggest going the lxml route and using xpath.

from lxml import etree
# data is the variable containing the html
data = etree.HTML(data)
anchor = data.xpath('//a[@class="title"]/text()')
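
To see what the xpath call returns, here is a small self-contained sketch. The sample markup is an assumption, not the real page; the point is that text() yields a list of strings, one per matching text node.

```python
from lxml import etree

# Hypothetical markup with a single title anchor.
html = '<div class="data"><a class="title" href="http://www.example.com/eg1">Content1</a></div>'

tree = etree.HTML(html)
# text() selects the text nodes directly under each matching anchor
titles = tree.xpath('//a[@class="title"]/text()')

print(titles)
```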

Solution 5

print(link_addres.contents[0])

This prints the first child of the anchor tag, which is its text when the anchor begins with text.

example:

 statement_title = statement.find('h2',class_='briefing-statement__title')
 statement_title_text = statement_title.a.contents[0]
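
One caveat worth illustrating: contents[0] is only the text if the anchor starts with a text node. The sketch below, using hypothetical markup and assuming Python 3 with bs4, shows a case where the first child is a tag and get_text() is the safer call.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the anchor starts with a <span>, not with text.
html = (
    '<h2 class="briefing-statement__title">'
    '<a href="/s/1"><span>New:</span> Statement one</a></h2>'
)
soup = BeautifulSoup(html, "html.parser")

anchor = soup.find("h2", class_="briefing-statement__title").a
first_child = anchor.contents[0]              # the <span> tag, not a string
full_text = anchor.get_text(" ", strip=True)  # all text inside the anchor

print(first_child.name, "|", full_text)
```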

Author: add-semi-colons (Find missing Semicolons;)

Updated on July 09, 2022

Comments

  • add-semi-colons
    add-semi-colons almost 2 years

    I want to extract:

    • the src value of the image tag and
    • the text of the anchor tag inside the div with class data

    I successfully managed to extract the img src, but I am having trouble extracting the text from the anchor tag.

    <a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> 
    

    Here is the link for the entire HTML page.

    Here is my code:

    for div in soup.findAll('div', attrs={'class':'image'}):
        print "\n"
        for data in div.findNextSibling('div', attrs={'class':'data'}):
            for a in data.findAll('a', attrs={'class':'title'}):
                print a.text
        for img in div.findAll('img'):
            print img['src']
    

    What I am trying to do is extract the image src (link) and the title inside the div class=data, so for example:

     <a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> 
    

    should extract:

    Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)

    • add-semi-colons
      add-semi-colons almost 12 years
      I am looking for the anchor text, not the value of href.
    • add-semi-colons
      add-semi-colons almost 12 years
      Did you mean print div.find('a').string? With that I get None.
    • Jon Clements
      Jon Clements almost 12 years
      Posted an answer that does what you want... (I think) - but would personally go for the lxml.html parser route.
  • add-semi-colons
    add-semi-colons almost 12 years
    I am actually looking for the text. For example, in <a class="title" href="rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> I want the "Nikon COOLPIX..." part. Using div.find('a')['href'] only gives the href, not the text.
  • add-semi-colons
    add-semi-colons almost 12 years
    I am getting the following error: 'module' object has no attribute 'html'. It looks like etree doesn't have an html object to call.
  • add-semi-colons
    add-semi-colons almost 12 years
    So I tried print div.find('a', {'class':'title'}).string but got the error: AttributeError: 'NoneType' object has no attribute 'string'
  • Jon Clements
    Jon Clements almost 12 years
    Then it's blank - try/except the block
  • Justin Fay
    Justin Fay almost 12 years
    That's because I had a typo; the line should be: data = etree.HTML(data). I have updated the original answer.
  • smci
    smci over 5 years
    @gauden: python-amazon-product-api is only available for Python 2.x. Do you or anyone have a better package recommendation?
  • daedalus
    daedalus over 5 years
    @smci I haven’t used this for a long while. The Amazon Product Advertising API has a REST interface that seems easy enough to interact with using a library like Requests.
  • rosstex
    rosstex over 4 years
    This is close, but it should be updated to convert link.string from a NavigableString to a normal string.