How to grab all headers from a website using BeautifulSoup?
Solution 1
Filter by regular expression:
import re
soup.find_all(re.compile('^h[1-6]$'))
This regex matches every tag whose name is h followed by exactly one digit from 1 to 6 and nothing else, so it covers h1 through h6 but not tags such as hr or header.
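As a rough illustration of the regex filter, here is a minimal sketch that runs against an inline HTML string rather than a live site. The sample markup and the variable names are made up for the example, and the built-in "html.parser" is used only so that nothing beyond bs4 needs to be installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample page, used instead of a live request.
html = """
<html><body>
<h1>Business</h1>
<header>Site banner</header>
<h2>Markets</h2>
<p>Some text</p>
<h3>Retail</h3>
<hr>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# The pattern is tested against each tag name. The ^ and $ anchors
# keep it from matching names like "header", "hr" or "html".
heading_pattern = re.compile(r"^h[1-6]$")
headings = soup.find_all(heading_pattern)

for tag in headings:
    print(tag.name, tag.get_text(strip=True))
```

Only the three heading tags should be printed; the header and hr elements are skipped because their names do not fit the anchored pattern.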
Solution 2
If you do not want to use a regex, you can search for one specific heading tag by name, for example every h3 on the page:
from bs4 import BeautifulSoup
import requests
url = "http://nypost.com/business"
page = BeautifulSoup(requests.get(url).text, "lxml")
for headlines in page.find_all("h3"):
    print(headlines.text.strip())
Results:
The epitome of chic fashion is the latest victim of retail's collapse
Rent-a-Center shares soar after rejecting takeover bid
NFL ad revenue may go limp with loss of erectile-dysfunction ads
'Pharma Bro' talked about sex with men to get my money, investor says
(and so on)
Solution 3
When using find or find_all, you can pass either a single tag name as a string or a list of tag names:
soup.find_all([f'h{i}' for i in range(1,7) ])
or
soup.find_all(['h{}'.format(i) for i in range(1,7)])
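The list form also gives a simple way to see which heading levels a page actually uses. The following is a small sketch under that assumption; the HTML snippet and the names heading_names and levels_present are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that skips the h3 level on purpose.
html = "<h1>Top</h1><p>intro</p><h2>First</h2><h4>Detail</h4><h2>Second</h2>"
soup = BeautifulSoup(html, "html.parser")

# HTML defines exactly six heading levels, h1 to h6.
heading_names = [f"h{i}" for i in range(1, 7)]

# A list argument matches any tag whose name is in the list,
# and results come back in document order.
headings = soup.find_all(heading_names)

# Collect the distinct levels that appear on this page.
levels_present = sorted({tag.name for tag in headings})
print(levels_present)
```

Because the list covers every level that HTML defines, there is no need to know in advance which ones a given page contains.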
Solution 4
You need to ask for a real tag name, for example soup.find_all('h1'). To cover several levels, you could do something like:
headers = []
for name in ["h1", "h2"]:
    headers.extend(soup.find_all(name))
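To see why the original attempt came back empty, the sketch below compares the calls side by side on a made-up two-heading fragment. A string argument to find_all is compared against the whole tag name, so "h" is not treated as a prefix.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the real page.
html = "<h1>Business</h1><h2>Markets</h2>"
soup = BeautifulSoup(html, "html.parser")

# No element is literally named "h", so nothing matches.
no_match = soup.find_all("h")

# An exact tag name works as expected.
h1_tags = soup.find_all("h1")

# Attribute access such as soup.h1 returns the first tag with that name.
first_h1 = soup.h1

print(no_match)
print([tag.get_text() for tag in h1_tags])
print(first_h1.get_text())
```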
Author by
hiimarksman
Updated on June 05, 2022
Comments
-
hiimarksman almost 2 years
I'm trying to grab all the headers from a simple website. My attempt:
from bs4 import BeautifulSoup, SoupStrainer
import requests
url = "http://nypost.com/business"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)
soup.find_all('h')
soup.find_all('h') returns [], but if I do something like soup.h1 or soup.h2, it returns that respective data. Am I just calling the method incorrectly?
-
hiimarksman almost 7 years
This seems like the closest to what I was looking for. I noticed that there were h1, h2, h3 and h4 on this site (found by typing them in manually). How would I know, in other cases, the number of "h's" that exist?
-
hiimarksman almost 7 years
I'm finally understanding this library more... is there a way of knowing how many digits there are?
-
phd almost 7 years
My regexp allows only one digit, from 1 to 6. HTML only has headers from H1 to H6. Why do you expect more? Have you ever seen a page with H16?!
-
hiimarksman almost 7 years
I am a complete newbie to this, just found out HTML has a maximum of 6. Thank you for your help~
-
PYA almost 7 years
Well, technically there could be N number of h tags. If you want to automate it completely, you could brute-force through all the h tags in a loop, although that might be very inefficient. I am not sure of a better way to do this.