How to grab all headers from a website using BeautifulSoup?

python web-scraping beautifulsoup python-requests

Solution 1

Filter by regular expression:

soup.find_all(re.compile('^h[1-6]$'))

This regex matches tag names that consist of h followed by exactly one digit from 1 to 6, that is, h1 through h6. Tags such as hr, head, or header do not match. Note that it needs import re.
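To see which names the pattern accepts, here is a small sketch that uses only the standard library. The list of candidate names is made up for illustration; the pattern is applied with search, which, as far as I know, is also how BeautifulSoup tests a compiled regex against a tag name.

```python
import re

# Same pattern as in the answer: "h" followed by exactly one digit 1-6.
heading_pattern = re.compile(r"^h[1-6]$")

# Hypothetical tag names, chosen to include some near misses.
candidates = ["h1", "h3", "h6", "h7", "h16", "hr", "head", "header", "html"]

matched = [name for name in candidates if heading_pattern.search(name)]
print(matched)  # ['h1', 'h3', 'h6']
```

Without the ^ and $ anchors, a name such as h16 would also match, because the pattern would only need to find h1 somewhere inside it.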

Solution 2

If you do not want to use a regex, you can ask for a single heading level by name:

from bs4 import BeautifulSoup
import requests

url = "http://nypost.com/business"

page = BeautifulSoup(requests.get(url).text, "lxml")
for headlines in page.find_all("h3"):
    print(headlines.text.strip())

Results:

The epitome of chic fashion is the latest victim of retail's collapse
Rent-a-Center shares soar after rejecting takeover bid
NFL ad revenue may go limp with loss of erectile-dysfunction ads
'Pharma Bro' talked about sex with men to get my money, investor says

(and so on)

Solution 3

When using find or find_all, you can pass either a single tag name as a string or a list of tag names:

soup.find_all([f'h{i}' for i in range(1, 7)])

or, without f-strings:

soup.find_all(['h{}'.format(i) for i in range(1, 7)])
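A minimal runnable sketch of the list approach follows. The HTML sample is made up so that no network request is needed, and the built-in "html.parser" is used instead of lxml only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used so the example runs offline.
SAMPLE_HTML = """
<html><body>
  <h1>Business</h1>
  <p>Intro paragraph.</p>
  <h2>Markets</h2>
  <h3>Retail</h3>
  <header>Not a heading tag</header>
</body></html>
"""

# Build the list of heading tag names: ['h1', 'h2', ..., 'h6'].
HEADING_TAGS = [f"h{i}" for i in range(1, 7)]

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all with a list returns every tag whose name is in the list.
headings = [
    (tag.name, tag.get_text(strip=True))
    for tag in soup.find_all(HEADING_TAGS)
]

for name, text in headings:
    print(name, text)
```

The header element is skipped because its name is not in the list, which is the same effect the anchored regex in Solution 1 achieves.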

Solution 4

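You need to give the full tag name, for example soup.find_all('h1'); there is no tag called h, which is why soup.find_all('h') returns an empty list.

To cover several levels, you could do something like:

for a in ["h1", "h2"]:
    # print the matches; otherwise the result of each call is discarded
    print(soup.find_all(a))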

Author by

hiimarksman

Updated on June 05, 2022

Comments

hiimarksman almost 2 years
I'm trying to grab all the headers from a simple website. My attempt:
```
from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "http://nypost.com/business"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)
soup.find_all('h')
```
soup.find_all('h') returns [], but if I do something like soup.h1 or soup.h2, it returns that respective data. Am I just calling the method incorrectly?
hiimarksman almost 7 years

This seems like the closest to what I was looking for. I noticed that there was h1, h2, h3, h4 for this site (by typing in manually). How would I know, in other cases, the number of "h's" that exist?
hiimarksman almost 7 years

I'm finally understanding this library more... is there a way of knowing how many digits there are?
phd almost 7 years

My regexp allows only one, from 1 to 6. HTML only has headers from H1 to H6. Why do you expect more? Have you ever seen a page with H16?!
hiimarksman almost 7 years

I am a complete newbie to this, just found out HTML has maximum 6. Thank you for your help~
PYA almost 7 years

Well, technically there could be any number of h tags. If you want to automate it completely, you could brute-force through all the h tags in a loop, although that might be very inefficient. I am not sure of a better way to do this.
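On the question of how to tell which heading levels a page actually uses: one option is to collect every h1 to h6 tag in a single pass and count the tag names, rather than trying each level by hand. This is only a sketch; the HTML string is a made-up sample and level_counts is a name chosen for illustration.

```python
import re
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical sample markup; a real page would come from requests.get(url).text.
SAMPLE_HTML = "<h1>A</h1><h2>B</h2><h2>C</h2><h4>D</h4><hr><header>E</header>"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Match only the six heading tags defined by HTML.
heading_pattern = re.compile(r"^h[1-6]$")

# Count how many times each heading level appears in the document.
level_counts = Counter(tag.name for tag in soup.find_all(heading_pattern))

for name in sorted(level_counts):
    print(name, level_counts[name])
```

Levels that never occur, such as h3 in this sample, are simply absent from the counter, so there is no need to know in advance which ones the site uses.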