Using BeautifulSoup to find specific text on a webpage

Locate the element by its text, move to its next font sibling, then iterate over the b tags directly inside that font element. Each b tag holds an event title, and the text node just before it (its previous sibling) holds the event dates.

Complete working code:

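from bs4 import BeautifulSoup
import requests


url = "http://sunsettheatre.com/"
response = requests.get(url)
# html5lib is a lenient parser, which helps with loosely structured markup
soup = BeautifulSoup(response.content, "html5lib")

# Find the <b> tag containing the heading text, then the <font> tag after it.
# (string= is called text= in Beautiful Soup versions before 4.4.0.)
font = soup.find("b", string="Past Movies:").find_next_sibling("font")

# Only direct <b> children are event titles; the dates precede each one as text.
for event in font.find_all("b", recursive=False):
    event_date = event.previous_sibling.strip()
    event_text = event.get_text(strip=True)
    print(event_date, event_text)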

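Prints (this output is from Python 2, where print(a, b) displays a tuple; Python 3 prints the same values separated by a space, without the parentheses and u'' prefixes):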

(u'January 1, 2 & 3:', u'Alvin and the Chipmunks: The Road Chip')
(u'January 8, 9 & 10:', u"Daddy's Home")
(u'January 15, 16 & 17:', u'Star Wars: The Force Awakens')
(u'January 22, 23 & 24:', u'Star Wars: The Force Awakens 3D')
(u'January 29, 30 & 31:', u'Norm of the North')
(u'February 5, 6 & 7:', u'The Forest')
(u'February 12, 13 & 14', u'Kung Fu Panda 3')
(u'February 19, 20 & 21', u'Kung Fu Panda 3 3D')
(u'February 26, 27 & 28', u'Ride Along 2')
(u'March 4, 5 & 6', u'Deadpool')
(u'March 11, 12 & 13', u'Gods of Egypt')
(u'March 18, 19 & 20', u'Zootopia')
(u'March 25, 26 & 27', u'Zootopia 3D')
(u'April 1, 2 & 3', u'The Divergent Series: Allegiant')
(u'April 8, 9 & 10', u'Miracles From Heaven')
(u'April 29, 30 & May 1', u'Batman v Superman')
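
One thing to watch: find() returns None when nothing matches, so the chained call raises AttributeError if the heading text ever changes. Here is a minimal defensive sketch of the same technique. The SAMPLE_HTML string is a hypothetical, simplified stand-in for the real page (an assumption, not its actual markup), the past_movies helper name is made up for illustration, and the built-in "html.parser" is used only so the example runs without html5lib installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page's markup (an assumption).
SAMPLE_HTML = """
<b>Past Movies:</b><br>
<font>
January 1, 2 &amp; 3: <b>Alvin and the Chipmunks: The Road Chip</b><br>
January 8, 9 &amp; 10: <b>Daddy's Home</b><br>
</font>
"""


def past_movies(html):
    """Return (dates, title) pairs, or an empty list if the layout differs."""
    soup = BeautifulSoup(html, "html.parser")

    # find() returns None when no <b> tag has exactly this text.
    heading = soup.find("b", string="Past Movies:")
    if heading is None:
        return []

    # find_next_sibling() also returns None when there is no such sibling.
    font = heading.find_next_sibling("font")
    if font is None:
        return []

    events = []
    for event in font.find_all("b", recursive=False):
        date_node = event.previous_sibling
        # The dates are expected to be a plain text node just before the <b>.
        event_date = date_node.strip() if isinstance(date_node, str) else ""
        events.append((event_date, event.get_text(strip=True)))
    return events


movies = past_movies(SAMPLE_HTML)
for event_date, event_text in movies:
    print(event_date, event_text)
```

Returning an empty list keeps the caller simple; raising a descriptive exception instead would be an equally reasonable choice if a changed page layout should be treated as an error.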

Author: Nathan R

Updated on May 12, 2020

Comments

  • Nathan R, almost 4 years

    I'm trying to save a list of movies from a website using Python 3 and Beautiful Soup 4. The problem is, I'm pretty new to Python and BS and I don't really know where to begin.

    The website is http://sunsettheatre.com and the list of movies is right after "Past Movies:". I don't know how to extract that chunk. I've been googling this and it seems Beautiful Soup works best when trying to find tags, but I just need it to find that one list of text, which is not in any particular tag (the website is not professionally designed).

    Is there any way to make Beautiful Soup and Python extract the text BETWEEN "Past Movies:" and "For a complete list of movies we've played click here"?
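
    Taken literally, "the text between two markers" can also be handled as a plain string problem once the page has been reduced to text (for example with soup.get_text("\n")). The sketch below is only an illustration under assumptions: text_between is a hypothetical helper name, page_text is made-up sample text rather than the site's real content, and both markers are assumed to appear verbatim in the text.

```python
def text_between(text, start_marker, end_marker):
    """Return the stripped text between two markers, or None if either is missing."""
    # partition() splits on the first occurrence; the middle element is empty
    # when the marker was not found.
    _, found_start, rest = text.partition(start_marker)
    if not found_start:
        return None
    middle, found_end, _ = rest.partition(end_marker)
    if not found_end:
        return None
    return middle.strip()


# Made-up sample text standing in for the page's extracted text (an assumption).
page_text = (
    "Now Showing: Example Film\n"
    "Past Movies:\n"
    "January 1, 2 & 3: Alvin and the Chipmunks: The Road Chip\n"
    "January 8, 9 & 10: Daddy's Home\n"
    "For a complete list of movies we've played click here"
)

movies = text_between(page_text, "Past Movies:", "For a complete list of movies")
lines = movies.splitlines()
for line in lines:
    print(line)
```

    This ignores the HTML structure entirely, so it depends on the marker wording staying the same; navigating by tags is usually more robust when the markup allows it.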