How to grab all headers from a website using BeautifulSoup?

15,749

Solution 1

Filter by regular expression:

import re
soup.find_all(re.compile('^h[1-6]$'))

This regex matches every tag whose name is h followed by exactly one digit from 1 to 6 and nothing else, that is, h1 through h6.
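
To see what the pattern accepts without touching a real page, here is a minimal sketch that applies it to a handful of tag names using only the standard library. The list candidate_names is made up for illustration. The Beautiful Soup documentation says a regex filter is applied to each tag name with search(), and the sketch imitates that.

```python
import re

heading_pattern = re.compile(r"^h[1-6]$")

# Hypothetical tag names, chosen to probe the edges of the pattern.
candidate_names = ["h1", "h6", "h7", "h10", "h", "hr", "header", "th1"]

# Mimic the filter: keep a name only if the pattern is found in it.
matched = [name for name in candidate_names if heading_pattern.search(name)]
print(matched)
```

Only h1 and h6 survive here. The anchors ^ and $ are what keep hr, header and h10 out. A plain string such as 'h' is compared against the whole tag name as well, which would explain why soup.find_all('h') returns [].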

Solution 2

If you do not want to use a regex, you can search for one heading tag at a time, for example h3:

from bs4 import BeautifulSoup
import requests

url = "http://nypost.com/business"

page = BeautifulSoup(requests.get(url).text, "lxml")
for headline in page.find_all("h3"):
    print(headline.text.strip())

Results:

The epitome of chic fashion is the latest victim of retail's collapse
Rent-a-Center shares soar after rejecting takeover bid
NFL ad revenue may go limp with loss of erectile-dysfunction ads
'Pharma Bro' talked about sex with men to get my money, investor says

...and so on.

Solution 3

When using find or find_all, you can pass either a single tag name as a string or a list of tag names:

soup.find_all([f'h{i}' for i in range(1, 7)])

or

soup.find_all(['h{}'.format(i) for i in range(1, 7)])
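
A small self-contained sketch of the list approach, assuming bs4 is installed. The html string is invented sample markup standing in for a downloaded page, and the built-in "html.parser" is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from any real site.
html = """
<html><body>
<h1>Business</h1>
<h3>First story</h3>
<p>Not a heading</p>
<h2>Markets</h2>
<h3>Second story</h3>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Build ['h1', ..., 'h6'] and let find_all match any of those names.
heading_names = [f"h{i}" for i in range(1, 7)]
found = [(tag.name, tag.get_text(strip=True)) for tag in soup.find_all(heading_names)]

for name, text in found:
    print(name, text)
```

The headings come back in document order rather than grouped by level, so h3 can appear before h2 if the page is written that way.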

Solution 4

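You need to pass an exact tag name, for example soup.find_all('h1'), because a string is matched against the whole tag name.

To cover several tags, you could do something like:

for a in ["h1", "h2"]:
    print(soup.find_all(a))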
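
On the follow-up question of how to know which heading levels a page actually uses: one possible sketch is to collect all h1 to h6 tags and count them by name. It assumes bs4 is installed, and the html string is again made-up sample markup.

```python
import re
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from any real site.
html = """
<html><body>
<h1>Business</h1>
<h3>First story</h3>
<h2>Markets</h2>
<h3>Second story</h3>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Count how many times each heading level occurs on the page.
level_counts = Counter(tag.name for tag in soup.find_all(re.compile(r"^h[1-6]$")))

for name in sorted(level_counts):
    print(name, level_counts[name])
```

Levels that never occur simply do not show up in level_counts, so there is no need to guess them by hand.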
Author by

hiimarksman

Updated on June 05, 2022

Comments

  • hiimarksman
    hiimarksman almost 2 years

    I'm trying to grab all the headers from a simple website. My attempt:

    from bs4 import BeautifulSoup, SoupStrainer
    import requests
    
    url = "http://nypost.com/business"
    page = requests.get(url)
    data = page.text
    soup = BeautifulSoup(data)
    soup.find_all('h')
    

    soup.find_all('h') returns [], but if I do something like soup.h1 or soup.h2, it returns that respective data. Am I just calling the method incorrectly?

  • hiimarksman
    hiimarksman almost 7 years
    This seems like the closest to what I was looking for. I noticed that there was h1, h2, h3, h4 for this site (by typing in manually). How would I know, in other cases, the number of "h's" that exist?
  • hiimarksman
    hiimarksman almost 7 years
    I'm finally understanding this library more... is there a way of knowing how many digits there are?
  • phd
    phd almost 7 years
    My regexp allows only one, from 1 to 6. HTML only has headers from H1 to H6. Why do you expect more? Have you ever seen a page with H16?!
  • hiimarksman
    hiimarksman almost 7 years
    I am a complete newbie to this, just found out HTML has maximum 6. Thank you for your help~
  • PYA
    PYA almost 7 years
    well, technically there could be N number of h tags, if you want to automate it completely, then you could brute force through all the h tags in a loop although that might be very very inefficient. I am not sure of a better way to do this.