Scraping text in h3 and div tags using BeautifulSoup, Python

Solution 1

You can use CSS selectors to find the data you need. In your case, div > h3 ~ div finds every div element that follows an h3 sibling, where that h3 is a direct child of a div element.

import bs4

page= """
<div class="box effect">
<div class="row">
<div class="col-lg-10">
    <h3>HEADING</h3>
    <div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>
    <div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE</div>
    <div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER</div>
    <div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;XYZ_ADDRESS</div>
</div>
</div>
</div>
"""

soup = bs4.BeautifulSoup(page, 'lxml')

# find all div elements that come after an h3 sibling,
# where the h3 is a direct child of a div element
selector = 'div > h3 ~ div'

# find elements that contain the data we want
found = soup.select(selector)

# Extract the text of each element; strip=True also removes
# the non-breaking spaces that the &nbsp; entities turn into
data = [x.get_text(strip=True) for x in found]

for x in data:
    print(x)

Edit: To scrape the text of the heading:

heading = soup.find('h3') 
heading_data = heading.text
print(heading_data)
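
As a side note on why the loop in the question prints None: Tag.get() looks up an HTML attribute by name, and an h3 tag has no attribute called h3. The visible text comes from .text or get_text() instead. A minimal sketch of the difference, assuming a cut-down version of the markup above and the built-in html.parser (chosen here only to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

# Cut-down sample based on the markup above (an assumption for illustration).
html = '<h3>HEADING</h3><div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>'
soup = BeautifulSoup(html, "html.parser")

h3 = soup.find("h3")
icon = soup.find("i")

# get() reads attributes: <h3> has none named "h3", so the result is None.
missing = h3.get("h3")

# get() on an attribute that exists returns its value
# (class is multi-valued, so Beautiful Soup returns a list).
icon_classes = icon.get("class")

# The visible text comes from .text / get_text().
heading_text = h3.get_text(strip=True)

print(missing, icon_classes, heading_text)
```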

Edit: Or you can get the heading and the other div elements at once by using a selector like this: div.col-lg-10 > *. This finds all elements that are direct children of a div element with the class col-lg-10.

soup = bs4.BeautifulSoup(page, 'lxml')

# find all direct children of a div element of class col-lg-10
selector = 'div.col-lg-10 > *'

# find elements that contain the data we want
found = soup.select(selector)

# Extract the text of each element; strip=True also removes
# the non-breaking spaces that the &nbsp; entities turn into
data = [x.get_text(strip=True) for x in found]

for x in data:
    print(x)
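
Since the goal in the question is a CSV row of the form Heading,NAME,MOBILE,NUMBER,XYZ_ADDRESS, the extracted values can be passed to the standard csv module. The following is only a sketch under some assumptions: each record sits in its own div.col-lg-10 block, the helper names extract_rows and rows_to_csv are made up for illustration, and html.parser is used instead of lxml to avoid an extra dependency.

```python
import csv
import io

from bs4 import BeautifulSoup

SAMPLE = """
<div class="col-lg-10">
    <h3>HEADING</h3>
    <div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>
    <div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE</div>
    <div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER</div>
    <div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;XYZ_ADDRESS</div>
</div>
"""


def extract_rows(html):
    """Return one list of cell strings per div.col-lg-10 block (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.select("div.col-lg-10"):
        # Direct child tags only, in document order: the h3 first, then the divs.
        cells = [
            child.get_text(strip=True)
            for child in block.find_all(True, recursive=False)
        ]
        # Drop children with no visible text (for example spacer divs).
        rows.append([cell for cell in cells if cell])
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(extract_rows(SAMPLE))
print(csv_text, end="")
```

To write to disk instead, the same csv.writer call can be given a file opened with open("lg.csv", "a", newline=""). On the full page from the question, the "more info" button text would also land in the row, so that child may need to be filtered out.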

Solution 2

This approach worked quite nicely for me:

    # -*- coding: utf-8 -*-
    # by Faguiro
    # run using Python 3.8.6 on Linux
    import requests
    from bs4 import BeautifulSoup

    # insert your site here
    url = input("Enter the url-->")

    # fetch the page with requests
    r = requests.get(url)
    content = r.content

    # soup!
    soup = BeautifulSoup(content, "html.parser")

    # find all h3 tags in the soup
    headings = soup.find_all("h3")

    # print(headings)  <--- raw result, tags included

    # Pythonic iteration: print the stripped text of each heading
    for heading in headings:
        print(heading.text.strip())

Dependencies: install from a terminal (Debian/Ubuntu Linux):

sudo apt-get install python3-bs4
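
Solution 2 prints only the h3 headings. To also collect the div elements that follow each heading, one option is find_next_siblings("div"), which returns the later siblings under the same parent. This is a sketch, not a drop-in answer: the function name records_by_heading and the sample values are invented for illustration, and a page with repeated heading texts would overwrite earlier entries in the dictionary.

```python
from bs4 import BeautifulSoup


def records_by_heading(html):
    """Map each h3 text to the texts of the div siblings that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    records = {}
    for heading in soup.find_all("h3"):
        details = [
            div.get_text(strip=True)
            for div in heading.find_next_siblings("div")
        ]
        # Skip siblings with no visible text (for example spacer divs).
        records[heading.get_text(strip=True)] = [d for d in details if d]
    return records


# Hypothetical two-record page, modelled on the markup in the question.
sample = """
<div class="col-lg-10">
    <h3>FIRST</h3>
    <div>&nbsp;&nbsp;NAME_1</div>
    <div>&nbsp;&nbsp;MOBILE_1</div>
</div>
<div class="col-lg-10">
    <h3>SECOND</h3>
    <div>&nbsp;&nbsp;NAME_2</div>
</div>
"""

for title, details in records_by_heading(sample).items():
    print(title, details)
```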

Author: Revaapriyan

Updated on December 05, 2020

Comments

  • Revaapriyan, over 3 years ago

    I have no experience with Python, BeautifulSoup, Selenium, etc., but I'm eager to scrape data from a website and store it as a CSV file. A single sample of the data I need is coded as follows (a single row of data).

    <div class="box effect">
    <div class="row">
    <div class="col-lg-10">
        <h3>HEADING</h3>
            <div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>
            <div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE</div>
            <div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER</div>
            <div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;XYZ_ADDRESS</div>
        <div class="space">&nbsp;</div>
    
    <div style="padding:10px;padding-left:0px;"><a class="btn btn-primary btn-sm" href="www.link_to_another_page.com"><i class="fa search-plus"></i> &nbsp;more info</a></div>
    
    </div>
    <div class="col-lg-2">
    
    </div>
    </div>
    </div>
    

    The output I need is Heading,NAME,MOBILE,NUMBER,XYZ_ADDRESS

    I found that this data has no id or class and appears on the website as plain text. I tried BeautifulSoup and Python Selenium separately, but got stuck with both, because none of the tutorials I saw showed how to extract text from these h3 and div tags.

    My code using BeautifulSoup

    import urllib2
    from bs4 import BeautifulSoup
    import requests
    import csv
    
    MAX = 2
    
    '''with open("lg.csv", "a") as f:
      w=csv.writer(f)'''
    ##for i in range(1,MAX+1)
    url="http://www.example_site.com"
    
    page=requests.get(url)
    soup = BeautifulSoup(page.content,"html.parser")
    
    for h in soup.find_all('h3'):
        print(h.get('h3'))
    

    My selenium code

    import csv
    from selenium import webdriver
    MAX_PAGE_NUM = 2
    driver = webdriver.Firefox()
    for i in range(1, MAX_PAGE_NUM+1):
      url = "http://www.example_site.com"
      driver.get(url)
      name = driver.find_elements_by_xpath('//div[@class = "col-lg-10"]/h3')
      #contact = driver.find_elements_by_xpath('//span[@class="item-price"]')
    #  phone = 
    #  mobile = 
    #  address =
    #  print(len(buyers))
    #  num_page_items = len(buyers)
    #  with open('res.csv','a') as f:
    #    for i in range(num_page_items):
    #      f.write(buyers[i].text + "," + prices[i].text + "\n")
      print (name)          
    driver.close()