How to convert an HTML table into a Python dictionary
13,004
Solution 1
from bs4 import BeautifulSoup
from collections import OrderedDict
from pprint import pprint
soup = BeautifulSoup(data, 'lxml')
d = OrderedDict()
for th, td in zip(soup.select('th'), soup.select('td')[::2]):
d[th.text.strip()] = td.text.strip().splitlines()
pprint(d)
Prints:
OrderedDict([('Mon', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
('Tue', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
('Wed', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
('Thu', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
('Fri', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
('Sat', ['5:00 pm - 10:00 pm']),
('Sun', ['Closed'])])
Solution 2
Here's a solution which first reads into Pandas DataFrame, and then converts to dictionary as in your desired output:
import pandas as pd
dfs = pd.read_html(html_string)
df = dfs[0] # pd.read_html reads in all tables and returns a list of DataFrames
Giving:
0 1 2
0 Mon 2:00 pm - 3:00 pm 5:00 pm - 10:00 pm NaN
1 Tue 2:00 pm - 3:00 pm 5:00 pm - 10:00 pm NaN
2 Wed 2:00 pm - 3:00 pm 5:00 pm - 10:00 pm Open now
3 Thu 2:00 pm - 3:00 pm 5:00 pm - 10:00 pm NaN
4 Fri 2:00 pm - 3:00 pm 5:00 pm - 10:00 pm NaN
5 Sat 5:00 pm - 10:00 pm NaN
6 Sun Closed NaN
Then use groupby
and a dictionary comprehension:
summary = {k: v.iloc[0, 1].split(' ') for k, v in df.groupby(0)}
Giving:
{'Fri': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
'Mon': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
'Sat': ['5:00 pm - 10:00 pm'],
'Sun': ['Closed'],
'Thu': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
'Tue': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
'Wed': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']}
You may need to edit slightly if splitting on exactly two spaces won't always work for your opening times data format.
Solution 3
Use a library to parse the HTML, something like this:
import pandas as panda
url = r'https://en.wikipedia.org/wiki/List_of_sovereign_states'
tables = panda.read_html(url)
sp500_table = tables[0] #Selecting the first table (for example)
Author by
mysl
Updated on June 04, 2022Comments
-
mysl almost 2 years
I have the following HTML excerpt in a format of a Python list that I'd like to turn into a dictionary. It is a timetable for everyday of the week.
[u' <table class="hours table">\n <tbody>\n <tr>\n <th scope="row">Mon</th>\n <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span> <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n <td class="extra">\n </td>\n </tr>\n\n <tr>\n <th scope="row">Tue</th>\n <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span> <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n <td class="extra">\n </td>\n </tr>\n\n <tr>\n <th scope="row">Wed</th>\n <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span> <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n <td class="extra">\n <span class="nowrap open">Open now</span>\n </td>\n </tr>\n\n <tr>\n <th scope="row">Thu</th>\n <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span> <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n <td class="extra">\n </td>\n </tr>\n\n <tr>\n <th scope="row">Fri</th>\n <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span> <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n <td class="extra">\n </td>\n </tr>\n\n <tr>\n <th scope="row">Sat</th>\n <td>\n <span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n <td class="extra">\n </td>\n </tr>\n\n <tr>\n <th scope="row">Sun</th>\n <td>\n Closed\n </td>\n <td class="extra">\n </td>\n </tr>\n\n </tbody>\n </table>']
The wishful output is:
{ 'Mon': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 'Tue': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 'Wed': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 'Thu': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 'Fri': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 'Sat': '5:00pm - 10:00pm', 'Sun': 'Closed' }
How would you achieve this in Python 3.x? I would not mind if the 'Sat' and 'Sun' keys have values in a list format if that'd help at all. Thank you for your thoughts in advance.