Creating a JSON structure out of HTML with BeautifulSoup
Here's one way to do it:
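from bs4 import BeautifulSoup

data = """your html goes here"""
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(data, 'html.parser')

# The nested <ul class="innerUl"> holds the inner label and its items
inner_ul = soup.find('ul', class_='innerUl')
inner_items = [li.text.strip() for li in inner_ul.ul.find_all('li')]

# The first <span> under each <ul> carries that list's label
outer_ul_text = soup.ul.span.text.strip()
inner_ul_text = inner_ul.span.text.strip()

result = {outer_ul_text: {inner_ul_text: inner_items}}
print(result)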
This prints (the u prefixes appear under Python 2 only):
{u'Outer List': {u'Inner List': [u'info 1', u'info 2', u'info 3']}}
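The snippet above hard-codes two levels. For deeper nesting, a recursive walk may be easier to extend. The following is only a sketch under a few assumptions that are not stated in the question: Python 3, every label sits in a `<span>` that is a direct child of its `<li>`, and a `<li>` without its own `<span>` is one of the "extra bullets" whose nested list belongs to the label just before it. The name `parse_ul` is made up for illustration.

```python
import json
from bs4 import BeautifulSoup


def parse_ul(ul):
    """Turn a <ul> into a list of labels, or a dict if every entry has children."""
    items = []
    for li in ul.find_all("li", recursive=False):
        span = li.find("span", recursive=False)    # label of this bullet, if any
        child_ul = li.find("ul", recursive=False)  # nested list, if any
        children = parse_ul(child_ul) if child_ul else None
        if span is None:
            # Assumed "extra bullet": attach its nested list to the previous
            # plain label; a wrapper with nothing to attach to is ignored.
            if children is not None and items and isinstance(items[-1], str):
                items[-1] = {items[-1]: children}
            continue
        label = span.get_text(strip=True)
        items.append(label if children is None else {label: children})
    # If every entry is a {label: children} mapping, merge them into one dict
    if items and all(isinstance(item, dict) for item in items):
        merged = {}
        for item in items:
            merged.update(item)
        return merged
    return items


html = """
<ul class="rootList">
  <li class="liItem endPlus"><span class="itemToBeAdded"> Outer List </span></li>
  <li class="noBulletsLi ">
    <ul class="innerUl">
      <li class="liItem crossPlus"><span class="itemToBeAdded"> Inner List </span>
        <ul class="grayStarUl ">
          <li class=""><span class="phrasesToBeAdded"> info 1 </span></li>
          <li class=""><span class="phrasesToBeAdded"> info 2 </span></li>
          <li class=""><span class="phrasesToBeAdded"> info 3 </span></li>
        </ul>
      </li>
    </ul>
  </li>
</ul>
"""

result = parse_ul(BeautifulSoup(html, "html.parser").ul)
print(json.dumps(result))
# {"Outer List": {"Inner List": ["info 1", "info 2", "info 3"]}}
```

With this rule, a bullet that has its own sublist becomes a one-key dict inside the surrounding list, so mixed levels stay in document order. Whether merging all-dict levels into a single dict is the right choice depends on the real data; duplicate labels at one level would overwrite each other.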
Comments
-
tumultous_rooster almost 2 years
I am working with BeautifulSoup, trying to parse a simple HTML document that looks like this:
(Naturally, the structure continues in this basic manner a few levels deeper. Occasionally, however, there is an extra bullet that carries no information, which I would like to ignore.)
My goal is to parse this into JSON format.
I would like the end result to look like:
{ "Outer List": { "Inner List" : [ "info 1", "info 2", "info 3" ] } }
The code below (all thanks due to @Zero Piraeus, from "Parsing nested HTML list with BeautifulSoup") comes very close to working:
from bs4 import BeautifulSoup
from pprint import pprint

soup = BeautifulSoup("""
<html>
<body>
<ul class="rootList">
  <li class="liItem endPlus">
    <span class="itemToBeAdded"> Outer List </span>
  </li>
  <li class="noBulletsLi ">
    <ul class="innerUl">
      <li class="liItem crossPlus">
        <span class="itemToBeAdded"> Inner List </span>
        <ul class="grayStarUl ">
          <li class="">
            <span class="phrasesToBeAdded"> info 1 </span>
          </li>
          <li class="">
            <span class="phrasesToBeAdded"> info 2 </span>
          </li>
          <li class="">
            <span class="phrasesToBeAdded"> info 3 </span>
          </li>
        </ul>
      </li>
    </ul>
  </li>
</ul>
</body>
</html>
""")

ul = soup.body.ul

def dictify(ul):
    result = {}
    for li in ul.find_all("li", recursive=False):
        list = []
        key = next(li.stripped_strings)
        ul = li.find("ul")
        if ul:
            result[key] = dictify(ul)
        else:
            result[key] = None
    return result
This bit of code returns:
{u'Inner List': {u'Inner List': {u'info 1': None, u'info 2': None, u'info 3': None}}, u'Outer List': None}
Whereas I'm trying to reach:
{u'Outer List': {u'Inner List': [u'info 1', u'info 2', u'info 3']}}
How can I get my 'info X' items into a list as the value, in a hopefully reasonably scalable manner, and ignore those pesky bullets that may be present? Thank you for reading this far!
-
tumultous_rooster about 10 years
This is fantastic! I'm wondering, could it be generalized so that I could read in the same list if it had more sublists?
-
WitchGod about 10 years
You have to be more specific: a sublist inside which list? Outer List, Inner List, or something else?
-
tumultous_rooster about 10 years
Say info 2 had a sublist such that info2_sublist_1, info2_sublist_2, and info2_sublist_3 were beneath it in the HTML? Since this is JSON, wouldn't it be nice to end up with {u'Outer List': {u'Inner List': [u'info 1', {u'info 2': [u'info2_sublist_1', u'info2_sublist_2', u'info2_sublist_3']}, u'info 3']}}?
-
tumultous_rooster about 10 years
OK guys, thanks for your help! I asked a follow-up question here: stackoverflow.com/questions/22672292/…