Extract the text from `p` within `div` with BeautifulSoup


The following code finds every div with the class "content-question" and prints the text of the first p element inside each one:

from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen

# Download and parse the page
content = urlopen('http://meinparlament.diepresse.com/').read()
soup = BeautifulSoup(content, 'lxml')

# Each matching div holds one question; print the text of its first p element
table = soup.find_all('div', attrs={"class": "content-question"})
for x in table:
    print(x.find('p').text)

# Another way to retrieve the divs:
# table = soup.select('div[class="content-question"]')

The following is the printed text of the first p element in table:

[...] Die Verhandlungen über die mögliche Visabefreiung für türkische Staatsbürger per Ende Juni sind noch nicht abgeschlossen, sodass nicht mit Sicherheit gesagt werden kann, ob es zu diesem Zeitpunkt bereits zu einer Visabefreiung kommt. Auch die genauen Modalitäten einer solchen Visaliberalisierung sind noch nicht ausverhandelt. Prinzipiell ist es jedoch so, dass Visaerleichterungen bzw. -liberalisierungen eine Frage von Reziprozität sind, d.h. dass diese für beide Staaten gelten müssten. [...]
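
As a minimal sketch of why the subscript lookup fails while find() works: subscripting a tag reads one of its HTML attributes (such as class or href), not a child element, so x['p'] raises a KeyError. The inline HTML below is a made-up stand-in for the real page, and the built-in 'html.parser' is used here only so that the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that mimics the structure shown in the question.
html = """
<div class="content-question">
<p>First question text <a href="http://example.com/frage/1/" title="zur Antwort">mehr</a></p>
</div>
<div class="content-question">
<p>Second question text</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", attrs={"class": "content-question"})
div = divs[0]

# Subscripting a tag looks up an HTML attribute, so div["p"] fails
# (the div has no attribute named "p"), while div["class"] succeeds.
try:
    div["p"]
    raised = False
except KeyError:
    raised = True

# find() returns the first matching descendant tag; get_text() returns
# its text, including the text of nested tags such as the link.
texts = [d.find("p").get_text(" ", strip=True) for d in divs]

# A CSS selector reaches the same p elements in one step.
selected = [p.get_text(" ", strip=True) for p in soup.select("div.content-question p")]

print(raised)
print(texts)
```

Note that the link text ("mehr") is part of the paragraph's text. If it should be excluded, one option is to read only the paragraph's direct strings instead of calling get_text().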

Author by

Johannes Schwaninger

Updated on April 19, 2020

Comments

  • Johannes Schwaninger
    Johannes Schwaninger about 4 years

    I am very new to web-scraping with Python, and I am really having a hard time with extracting nested text from within HTML (p within div, to be exact). Here is what I got so far:

    from bs4 import BeautifulSoup
    import urllib
    
    url = urllib.urlopen('http://meinparlament.diepresse.com/')
    content = url.read()
    soup = BeautifulSoup(content, 'lxml')
    

    This works fine:

    links=soup.findAll('a',{'title':'zur Antwort'})
    for link in links:
        print(link['href'])
    

    This extraction works fine:

    table = soup.findAll('div',attrs={"class":"content-question"})
    for x in table:
        print(x)
    

    This is the output:

    <div class="content-question">
    <p>[...] Die Verhandlungen über die mögliche Visabefreiung für    
    türkische Staatsbürger per Ende Ju...
    <a href="http://meinparlament.diepresse.com/frage/10144/" title="zur 
    Antwort">mehr »</a>
    </p>
    </div>
    

    Now, I want to extract the text between <p> and </p>. This is the code I use:

    table = soup.findAll('div',attrs={"class":"content-question"})
    for x in table:
        print(x['p'])
    

    However, Python raises a KeyError.

  • jpp1
    jpp1 over 5 years
    This solution assumes that the HTML on the page properly encloses every paragraph in a pair of opening and closing p tags. That is often not the case. Sometimes empty p elements are used to split the text. Sometimes there is initial text, followed by a run of paragraphs, followed by trailing text, where the initial or trailing text is not enclosed in a paragraph of its own. The solution above only returns the text enclosed by a pair of opening and closing p tags, not the text in between. Is there any way to get everything?
  • zero
    zero about 4 years
    Hi there Philip, I ran this code on MX Linux in Atom, but unfortunately I get absolutely no results. Any idea?
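
On jpp1's comment about text that is not wrapped in p elements: one possible approach, sketched here under assumptions, is to read the text of the whole div instead of its p children. The HTML is a made-up example of the irregular markup described in that comment, and 'html.parser' is used only to keep the sketch free of the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: text before, between and after the p elements,
# plus an empty p used as a separator.
html = """
<div class="content-question">
Leading text
<p>First paragraph</p>
<p></p>
Text between paragraphs
<p>Second paragraph</p>
Trailing text
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="content-question")

# Only the text inside p elements; the loose text is missed.
only_paragraphs = [p.get_text(strip=True) for p in div.find_all("p")]

# Every non-empty text chunk inside the div, in document order.
all_chunks = list(div.stripped_strings)

# The same chunks joined into a single string.
all_text = div.get_text(" ", strip=True)

print(only_paragraphs)
print(all_chunks)
```

The trade-off is that this also picks up text from any other nested element in the div, such as link labels, so some filtering may still be needed on a real page.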