Creating a JSON structure out of HTML with BeautifulSoup


Here's one way to do it:

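from bs4 import BeautifulSoup

data = """your html goes here"""
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, 'html.parser')

# <ul class="innerUl"> wraps the "Inner List" item and its nested <ul>.
inner_ul = soup.find('ul', class_='innerUl')
inner_items = [li.text.strip() for li in inner_ul.ul.find_all('li')]

# The first <span> under each list holds that list's label.
outer_ul_text = soup.ul.span.text.strip()
inner_ul_text = inner_ul.span.text.strip()

result = {outer_ul_text: {inner_ul_text: inner_items}}
print(result)

With the HTML from the question as data, this prints (under Python 2; Python 3 omits the u prefixes):

{u'Outer List': {u'Inner List': [u'info 1', u'info 2', u'info 3']}}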
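
Since the stated goal is JSON rather than a Python dict, the result can be serialized with the standard json module. This is a minimal sketch; it assumes result has the shape printed above and hard-codes it so the snippet runs on its own:

```python
import json

# Assumed: the dictionary built by the snippet above, written out literally here.
result = {"Outer List": {"Inner List": ["info 1", "info 2", "info 3"]}}

# indent=4 pretty-prints the structure; keys keep their insertion order.
json_text = json.dumps(result, indent=4)
print(json_text)
```

json.dumps emits double-quoted strings and no u prefixes, so the text matches the layout asked for in the question.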
Author by

tumultous_rooster


Updated on June 04, 2022

Comments

  • tumultous_rooster
    tumultous_rooster almost 2 years

    I am working with BeautifulSoup, trying to parse a simple HTML document made of nested lists; its markup appears in the code below.

    (Naturally, the structure continues in this basic manner a few levels deeper. However, occasionally there is an extra bullet that provides no information, and I would like to ignore it.)

    My goal is to parse this into JSON format.

    I would like the end result to look like:

    {
        "Outer List": {
            "Inner List" : [
                "info 1", 
                "info 2", 
                "info 3"
            ]
        }
    }
    

    The code below (all thanks due to @Zero Piraeus, from "Parsing nested HTML list with BeautifulSoup") comes very close to working:

    from bs4 import BeautifulSoup
    from pprint import pprint
    soup = BeautifulSoup("""
       <html>
         <body>
          <ul class="rootList">
           <li class="liItem endPlus">
            <span class="itemToBeAdded">
             Outer List
            </span>
           </li>
           <li class="noBulletsLi ">
            <ul class="innerUl">
             <li class="liItem crossPlus">
              <span class="itemToBeAdded">
               Inner List
              </span>
              <ul class="grayStarUl ">
               <li class="">
                <span class="phrasesToBeAdded">
                 info 1
                </span>
               </li>
               <li class="">
                <span class="phrasesToBeAdded">
                 info 2
                </span>
               </li>
               <li class="">
                <span class="phrasesToBeAdded">
                 info 3
                </span>
               </li>
              </ul>
             </li>
              </ul>
             </li>
            </ul>
         </body>
        </html>
    """)
    
    ul = soup.body.ul
    
    def dictify(ul):
        result = {}
        for li in ul.find_all("li", recursive=False):
            # The first non-empty string inside the <li> serves as the key.
            key = next(li.stripped_strings)
            sub_ul = li.find("ul")
            if sub_ul:
                result[key] = dictify(sub_ul)
            else:
                result[key] = None
        return result
    
    pprint(dictify(ul))
    

    This bit of code returns:

    {u'Inner List': {u'Inner List': {u'info 1': None,
                                     u'info 2': None,
                                     u'info 3': None}},
     u'Outer List': None}
    

    Whereas I'm trying to reach:

    {u'Outer List': {u'Inner List': [u'info 1',
                                     u'info 2',
                                     u'info 3']}}
    

    How can I get my 'info X' items into a list as the value, in a reasonably scalable manner, and ignore those pesky bullets that may be present?

    Thank you for reading this far!

  • tumultous_rooster
    tumultous_rooster about 10 years
    This is fantastic! I'm wondering, could it be generalized such that I could read in the same list, given it had more sublists?
  • WitchGod
    WitchGod about 10 years
    You have to be more specific: sublist inside which list? Outer List or Inner List, etc?
  • tumultous_rooster
    tumultous_rooster about 10 years
    Say info 2 had a sublist, such that info2_sublist_1, info2_sublist_2, and info2_sublist_3 were beneath it in the HTML? Since this is JSON, wouldn't it be nice to end up with {u'Outer List': {u'Inner List': [u'info 1', {u'info 2': [u'info2_sublist_1', u'info2_sublist_2', u'info2_sublist_3']}, u'info 3']}}?
  • tumultous_rooster
    tumultous_rooster about 10 years
    OK guys, thanks for your help! I asked a follow-up question here: stackoverflow.com/questions/22672292/…
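
The generalization discussed in the comments can be approached recursively. The following is only a sketch under stated assumptions, not the method from the linked follow-up: parse_ul is a hypothetical helper, the HTML is a shortened version of the question's markup with a made-up sublist under info 2, every label is assumed to sit in a <span> that is a direct child of its <li>, and a label-less <li> is assumed to be either an empty bullet (skipped) or a wrapper whose nested <ul> belongs to the preceding label.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical variant of the question's markup.
html = """
<ul class="rootList">
  <li><span>Outer List</span></li>
  <li class="noBulletsLi">
    <ul class="innerUl">
      <li><span>Inner List</span>
        <ul>
          <li><span>info 1</span></li>
          <li><span>info 2</span>
            <ul>
              <li><span>info2_sublist_1</span></li>
              <li><span>info2_sublist_2</span></li>
            </ul>
          </li>
          <li></li>
          <li><span>info 3</span></li>
        </ul>
      </li>
    </ul>
  </li>
</ul>
"""

def parse_ul(ul):
    """Turn a <ul> into a dict (all items have children) or a list (otherwise)."""
    entries = []  # [label, children] pairs; children is None for a leaf
    for li in ul.find_all("li", recursive=False):
        span = li.find("span", recursive=False)
        sub_ul = li.find("ul", recursive=False)
        children = parse_ul(sub_ul) if sub_ul else None
        if span is None:
            # Label-less <li>: attach its nested list to the previous label,
            # or skip it entirely when it is just an empty bullet.
            if children is not None and entries and entries[-1][1] is None:
                entries[-1][1] = children
            continue
        entries.append([span.get_text(strip=True), children])
    if entries and all(children is not None for _, children in entries):
        return {label: children for label, children in entries}
    return [label if children is None else {label: children}
            for label, children in entries]

soup = BeautifulSoup(html, "html.parser")
result = parse_ul(soup.find("ul"))
print(result)
```

For this sample the result is {'Outer List': {'Inner List': ['info 1', {'info 2': ['info2_sublist_1', 'info2_sublist_2']}, 'info 3']}}. Returning a dict only when every item has children is a design choice made here to reproduce the shape requested in the comments; other conventions are equally possible.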