BeautifulSoup: How do I extract all the <li>s from a list of <ul>s that contains some nested <ul>s?


Solution 1

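.findAll() searches all descendants, not just direct children, so it also returns nested li elements (in Beautiful Soup 4 the preferred spelling is .find_all()):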

for ul in uls:
    for li in ul.findAll('li'):
        print(li)

Output:

<li>List items</li>
<li>Etc...</li>
<li>List items</li>
<li>Nested list items</li>
<li>Nested list items</li>
<li>List items</li>

Solution 2

A list comprehension could work, too.

lis = [li for ul in uls for li in ul.findAll('li')]
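
To check Solutions 1 and 2 end to end, here is a minimal sketch that runs the list comprehension against the sample markup from the question. The `html` string, the "html.parser" choice, and the way `uls` is built here (keeping only top-level lists via `find_parent`) are assumptions made for the sake of a self-contained example; the question builds `uls` differently.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<h3>Header3 (Start here)</h3>
<ul>
    <li>List items</li>
    <li>Etc...</li>
</ul>
<h3>Header 3</h3>
<ul>
    <li>List items</li>
    <ul>
        <li>Nested list items</li>
        <li>Nested list items</li></ul>
    <li>List items</li>
</ul>
<h2>Header 2 (end here)</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# Assumption for this sketch: keep only top-level lists, so the nested
# <ul> is reached through its parent instead of being visited twice.
uls = [ul for ul in soup.find_all("ul") if ul.find_parent("ul") is None]

# Same comprehension as Solution 2, using the bs4 spelling find_all.
lis = [li for ul in uls for li in ul.find_all("li")]
texts = [li.get_text(strip=True) for li in lis]
print(texts)
```

If `uls` also contained the nested list itself, its items would show up twice, which is why the sketch filters on `find_parent`.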

Solution 3

import requests
from bs4 import BeautifulSoup

# Fetch the page and parse it (the "lxml" parser must be installed separately)
r = requests.get("https://www.w3schools.com/tags/tryit.asp?filename=tryhtml_list_test")
soup = BeautifulSoup(r.content, "lxml")

# find_all('li') searches every descendant, so nested items are included
for body in soup.find_all('body'):
    for li in body.find_all('li'):
        print(li)

Note: this prints every li inside the page's body element, including nested ones.

Author by

danneu

Updated on November 19, 2021

Comments

    danneu over 2 years

    I'm a newbie programmer trying to jump in to Python by building a script that scrapes http://en.wikipedia.org/wiki/2000s_in_film and extracts a list of "Movie Title (Year)". My HTML source looks like:

    <h3>Header3 (Start here)</h3>
    <ul>
        <li>List items</li>
        <li>Etc...</li>
    </ul>
    <h3>Header 3</h3>
    <ul>
        <li>List items</li>
        <ul>
            <li>Nested list items</li>
            <li>Nested list items</li></ul>
        <li>List items</li>
    </ul>
    <h2>Header 2 (end here)</h2>
    

    I'd like all the li tags following the first h3 tag and stopping at the next h2 tag, including all nested li tags.

    firstH3 = soup.find('h3')
    

    ...correctly finds the place I'd like to start.

    firstH3 = soup.find('h3') # Start here
    uls = []
    for nextSibling in firstH3.findNextSiblings():
        if nextSibling.name == 'h2':
            break
        if nextSibling.name == 'ul':
            uls.append(nextSibling)
    

    ...gives me a list uls, each with li contents that I need.
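
The sibling walk above can be packaged as a small function and tried on the sample markup. This is only a sketch: the helper name `uls_between`, its `stop_name` parameter, and the extra list placed after the h2 are all made up for illustration, and it uses `find_next_siblings`, the Beautiful Soup 4 spelling of `findNextSiblings`.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, plus one extra list after the <h2>
# (an addition for this sketch) to show that the walk stops there.
html = """
<h3>Header3 (Start here)</h3>
<ul><li>List items</li><li>Etc...</li></ul>
<h3>Header 3</h3>
<ul><li>List items</li></ul>
<h2>Header 2 (end here)</h2>
<ul><li>Should not be collected</li></ul>
"""


def uls_between(start, stop_name="h2"):
    """Return the <ul> siblings after `start`, up to the first `stop_name` tag."""
    found = []
    for sibling in start.find_next_siblings():
        if sibling.name == stop_name:
            break  # reached the end marker, ignore everything after it
        if sibling.name == "ul":
            found.append(sibling)
    return found


soup = BeautifulSoup(html, "html.parser")
uls = uls_between(soup.find("h3"))
print(len(uls))
```

Here `break` is the right tool, because nothing after the h2 should be looked at.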

    Excerpt of the uls list:

    <ul>
    ...
        <li><i><a href="/wiki/Agent_Cody_Banks" title="Agent Cody Banks">Agent Cody Banks</a></i> (2003)</li>
        <li><i><a href="/wiki/Agent_Cody_Banks_2:_Destination_London" title="Agent Cody Banks 2: Destination London">Agent Cody Banks 2: Destination London</a></i> (2004)</li>
        <li>Air Bud series:
            <ul>
                <li><i><a href="/wiki/Air_Bud:_World_Pup" title="Air Bud: World Pup">Air Bud: World Pup</a></i> (2000)</li>
                <li><i><a href="/wiki/Air_Bud:_Seventh_Inning_Fetch" title="Air Bud: Seventh Inning Fetch">Air Bud: Seventh Inning Fetch</a></i> (2002)</li>
                <li><i><a href="/wiki/Air_Bud:_Spikes_Back" title="Air Bud: Spikes Back">Air Bud: Spikes Back</a></i> (2003)</li>
                <li><i><a href="/wiki/Air_Buddies" title="Air Buddies">Air Buddies</a></i> (2006)</li>
            </ul>
        </li>
        <li><i><a href="/wiki/Akeelah_and_the_Bee" title="Akeelah and the Bee">Akeelah and the Bee</a></i> (2006)</li>
    ...
    </ul>
    

    But I'm unsure of where to go from here.


    Update:

    Final Code:

    lis = []
    for ul in uls:
        for li in ul.findAll('li'):
            # Skip any li that wraps a nested ul; its nested li's are
            # returned separately by findAll
            if li.find('ul'):
                continue
            lis.append(li)

    for li in lis:
        print(li.text)
    

    The if...continue throws out the LIs that contain ULs, since the text of their nested LIs would otherwise be duplicated.
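
As a quick check of this filtering idea, here is a minimal sketch on a trimmed version of the excerpt above (links removed and the list shortened for brevity, so the markup is not the real page). It skips a wrapping li with `continue`; a `break` at that point would stop scanning the rest of the list and drop every item after the first nested one.

```python
from bs4 import BeautifulSoup

# Trimmed version of the excerpt from the question (links removed).
html = """
<ul>
    <li><i>Agent Cody Banks</i> (2003)</li>
    <li>Air Bud series:
        <ul>
            <li><i>Air Bud: World Pup</i> (2000)</li>
            <li><i>Air Buddies</i> (2006)</li>
        </ul>
    </li>
    <li><i>Akeelah and the Bee</i> (2006)</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

titles = []
for li in soup.find_all("li"):
    # An li that wraps a nested ul is only a group heading; its
    # children are visited on their own, so skip the wrapper.
    if li.find("ul"):
        continue
    titles.append(li.get_text(" ", strip=True))

print(titles)
```

The item after the "Air Bud series" group is still collected, which is the case a `break` would miss.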

    Print output is now:

    • 102 Dalmatians(2000)
    • 10th & Wolf(2006)
    • 11:14(2006)
    • 12:08 East of Bucharest(2006)
    • 13 Going on 30(2004)
    • 1408(2007)
    • ...