Scraping text in h3 and div tags using BeautifulSoup, Python
Solution 1
You can use CSS selectors to find the data you need.
In your case, the selector div > h3 ~ div matches every div element that follows an h3 sibling, where both are direct children of the same parent div.
import bs4
page= """
<div class="box effect">
<div class="row">
<div class="col-lg-10">
<h3>HEADING</h3>
<div><i class="fa user"></i> NAME</div>
<div><i class="fa phone"></i> MOBILE</div>
<div><i class="fa mobile-phone fa-2"></i> NUMBER</div>
<div><i class="fa address"></i> XYZ_ADDRESS</div>
</div>
</div>
</div>
"""
soup = bs4.BeautifulSoup(page, 'lxml')
# select all div elements that are directly inside a div
# and are preceded by an h3 sibling
selector = 'div > h3 ~ div'
# find the elements that contain the data we want
found = soup.select(selector)
# extract the text from the found elements
data = [x.text.split(';')[-1].strip() for x in found]
for x in data:
    print(x)
Edit: to scrape the text of the heading:
heading = soup.find('h3')
heading_data = heading.text
print(heading_data)
Edit: or you can get the heading and the other div elements in one pass with a selector like div.col-lg-10 > *, which matches every element directly inside a div that has the col-lg-10 class.
soup = bs4.BeautifulSoup(page, 'lxml')
# select every element directly inside a div of class col-lg-10
selector = 'div.col-lg-10 > *'
# find the elements that contain the data we want
found = soup.select(selector)
# extract the text from the found elements
data = [x.text.split(';')[-1].strip() for x in found]
for x in data:
    print(x)
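Since the question ultimately wants one CSV row per listing, the values matched above can be written out with Python's csv module. This is a minimal sketch under the assumption that each listing sits in its own div.col-lg-10 block; the filename out.csv is arbitrary:

```python
import csv
import bs4

# sample listing, reduced from the HTML in the question
page = """
<div class="col-lg-10">
<h3>HEADING</h3>
<div><i class="fa user"></i> NAME</div>
<div><i class="fa phone"></i> MOBILE</div>
<div><i class="fa mobile-phone fa-2"></i> NUMBER</div>
<div><i class="fa address"></i> XYZ_ADDRESS</div>
</div>
"""

soup = bs4.BeautifulSoup(page, 'html.parser')

rows = []
for box in soup.select('div.col-lg-10'):
    # the heading plus every div that follows it inside the same box,
    # in document order
    fields = [el.get_text(strip=True) for el in box.select('h3, h3 ~ div')]
    rows.append(fields)

# one CSV row per listing
with open('out.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)
```

With a real page, replace the inline HTML with the response content from requests; each col-lg-10 block then becomes one row in the file.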
Solution 2
This worked quite nicely for me:
# -*- coding: utf-8 -*-
# by Faguiro
# run using Python 3.8.6 on Linux
import requests
from bs4 import BeautifulSoup

# insert your site here
url = input("Enter the url--> ")

# fetch the page with requests
r = requests.get(url)
content = r.content

# soup!
soup = BeautifulSoup(content, "html.parser")

# find all h3 tags in the soup
headings = soup.find_all("h3")

# print each heading, stripped of surrounding whitespace
for heading in headings:
    print(heading.text.strip())
Dependencies: On terminal (linux):
sudo apt-get install python3-bs4
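If, as in the question's HTML, each h3 is followed by sibling div fields, the loop above can also collect those per heading. A sketch, assuming the structure posted in the question (the inline HTML below is just that sample, with the icon tags omitted):

```python
from bs4 import BeautifulSoup

html = """
<div class="col-lg-10">
<h3>HEADING</h3>
<div>NAME</div>
<div>MOBILE</div>
<div>NUMBER</div>
<div>XYZ_ADDRESS</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for h3 in soup.find_all("h3"):
    # start the row with the heading, then append each div
    # that follows it as a sibling
    row = [h3.get_text(strip=True)]
    row += [div.get_text(strip=True) for div in h3.find_next_siblings("div")]
    print(",".join(row))  # prints: HEADING,NAME,MOBILE,NUMBER,XYZ_ADDRESS
```

find_next_siblings walks only the elements after the h3 inside the same parent, so each heading picks up its own fields even when several listings appear on one page.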
Revaapriyan
Updated on December 05, 2020

Comments
-
Revaapriyan over 3 years
I have no experience with Python, BeautifulSoup, Selenium, etc., but I'm eager to scrape data from a website and store it as a CSV file. A single sample of the data I need is coded as follows (a single row of data):
<div class="box effect">
  <div class="row">
    <div class="col-lg-10">
      <h3>HEADING</h3>
      <div><i class="fa user"></i> NAME</div>
      <div><i class="fa phone"></i> MOBILE</div>
      <div><i class="fa mobile-phone fa-2"></i> NUMBER</div>
      <div><i class="fa address"></i> XYZ_ADDRESS</div>
      <div class="space"> </div>
      <div style="padding:10px;padding-left:0px;"><a class="btn btn-primary btn-sm" href="www.link_to_another_page.com"><i class="fa search-plus"></i> more info</a></div>
    </div>
    <div class="col-lg-2"> </div>
  </div>
</div>
The output I need is
Heading,NAME,MOBILE,NUMBER,XYZ_ADDRESS
I found that those data have no id or class, yet they appear on the website as plain text. I'm trying BeautifulSoup and Python Selenium separately, and I got stuck in both methods, as none of the tutorials I saw showed how to extract text from these h3 and div tags.
My code using BeautifulSoup
import urllib2
from bs4 import BeautifulSoup
import requests
import csv

MAX = 2

'''with open("lg.csv", "a") as f:
    w=csv.writer(f)'''
##for i in range(1,MAX+1)
url = "http://www.example_site.com"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
for h in soup.find_all('h3'):
    print(h.get('h3'))
My selenium code
import csv
from selenium import webdriver

MAX_PAGE_NUM = 2
driver = webdriver.Firefox()
for i in range(1, MAX_PAGE_NUM+1):
    url = "http://www.example_site.com"
    driver.get(url)
    name = driver.find_elements_by_xpath('//div[@class = "col-lg-10"]/h3')
    #contact = driver.find_elements_by_xpath('//span[@class="item-price"]')
    # phone =
    # mobile =
    # address =
    # print(len(buyers))
    # num_page_items = len(buyers)
    # with open('res.csv','a') as f:
    #     for i in range(num_page_items):
    #         f.write(buyers[i].text + "," + prices[i].text + "\n")
    print(name)
driver.close()