Using pandas to read a downloaded HTML file

python html import pandas

22,100

Solution 1

I think you are on the right track by using an HTML parser like Beautiful Soup. pandas.read_html() is built to read HTML tables, not whole HTML pages, so it can help to extract the table first.

You would want to do something like this...

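from io import StringIO
from bs4 import BeautifulSoup
import pandas as pd

# Parse the saved page and keep only the first <table> element
with open('C:/age0.html', 'r') as f:
    table = BeautifulSoup(f.read(), 'html.parser').find('table')

# read_html() expects HTML text or a file-like object, not a BeautifulSoup object,
# and it returns a list of DataFrames, so take the first one
df = pd.read_html(StringIO(str(table)))[0]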
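
To try the same idea without a file on disk, here is a minimal sketch on an inline page. The HTML string, column names, and values are made up for illustration; only the pattern (find the table, turn it into text, hand that to read_html) comes from the answer above. It assumes bs4 and at least one pandas parser backend (lxml, or html5lib) are installed.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical page: a script block on top, then a single table.
page = """
<html>
<head><script>console.log("hello");</script></head>
<body>
<table>
  <tr><th>name</th><th>age</th></tr>
  <tr><td>Ann</td><td>31</td></tr>
  <tr><td>Bob</td><td>45</td></tr>
</table>
</body>
</html>
"""

# Isolate the <table> element so pandas never sees the script block.
soup = BeautifulSoup(page, "html.parser")
table = soup.find("table")

# str(table) is the table's HTML; StringIO makes it file-like for read_html.
tables = pd.read_html(StringIO(str(table)))
df = tables[0]
print(df)
```

If the page has several tables, soup.find_all("table") returns all of them and you can pick the one you need before calling read_html.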

Solution 2

First, install the packages below, which pandas uses for parsing:
- pip install BeautifulSoup4
- pip install lxml
- pip install html5lib

Then use read_html to read the HTML tables on any HTML page.

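import pandas as pds
# read_html returns a list of DataFrames, one per table found in the file
pds_df = pds.read_html('C:/age0.html')
pds_df[0]  # the first table on the page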
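
Because read_html returns a list, it can be worth checking how many tables it found and narrowing the result. The sketch below is only an illustration: the page content, file name, and column names are invented, and it writes to a temporary file rather than C:/age0.html. It uses the match parameter of read_html, which keeps only tables whose text matches the given string or regex.

```python
import os
import tempfile

import pandas as pd

# Hypothetical page with two tables.
html = """
<html><body>
<table>
  <tr><th>city</th><th>country</th></tr>
  <tr><td>Oslo</td><td>Norway</td></tr>
</table>
<table>
  <tr><th>name</th><th>age</th></tr>
  <tr><td>Ann</td><td>31</td></tr>
</table>
</body></html>
"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)

    # One DataFrame per <table> on the page.
    all_tables = pd.read_html(path)
    # Only the tables whose text matches "age".
    age_tables = pd.read_html(path, match="age")

print(len(all_tables))
print(age_tables[0])
```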

I hope this will help.

Good Luck!!


Author by

lokheart

Updated on January 05, 2020

Comments

lokheart over 4 years
As the title says, I tried using read_html, but it gives me the following error:
```
In [17]:temp = pd.read_html('C:/age0.html',flavor='lxml')
  File "<string>", line unknown
XMLSyntaxError: htmlParseStartTag: misplaced <html> tag, line 65, column 6
```
What have I done wrong?

update 01

The HTML contains some JavaScript at the top, followed by an HTML table. In R, I parsed the HTML with the XML package to get a data frame. I want to do the same in Python. Should I use something else, like BeautifulSoup, before handing it to pandas?

Admin over 7 years

I couldn't get this solution to work (though I also couldn't install lxml, which probably had something to do with it). However, df = pd.read_html('path/to/file.html', flavor='bs4') worked fine.
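
If you are not sure which parser is available or which one copes with a given page, one option is to try the flavors in turn. This is a rough sketch, not a pandas feature: read_tables is a hypothetical helper, the sample HTML is invented, and it assumes at least one of lxml or bs4 plus html5lib is installed.

```python
from io import StringIO

import pandas as pd


def read_tables(html_text, flavors=("lxml", "bs4")):
    """Try each parser flavor in order and return the first successful result.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    errors = {}
    for flavor in flavors:
        try:
            # A fresh StringIO per attempt so each parser reads from the start.
            return pd.read_html(StringIO(html_text), flavor=flavor)
        except Exception as exc:  # missing parser, parse error, or no tables found
            errors[flavor] = exc
    raise ValueError(f"no flavor could parse the input: {errors}")


sample = (
    "<table>"
    "<tr><th>name</th><th>age</th></tr>"
    "<tr><td>Ann</td><td>31</td></tr>"
    "</table>"
)
tables = read_tables(sample)
print(tables[0])
```

The pandas documentation describes the default flavor=None as trying lxml first and falling back to bs4 plus html5lib, so an explicit loop like this is mainly useful when you want to see which parser failed and why.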