Using pandas to read downloaded html file
Solution 1
I think you are on the right track by using an HTML parser like BeautifulSoup. pandas.read_html() only extracts <table> elements; it is not a general-purpose HTML page parser.
You would want to do something like this...
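from io import StringIO
from bs4 import BeautifulSoup
import pandas as pd

# Parse the saved page and pick out its first <table> element
with open('C:/age0.html', 'r') as f:
    table = BeautifulSoup(f.read(), 'html.parser').find('table')

# Pass the table's markup as text; read_html returns a list of DataFrames
df = pd.read_html(StringIO(str(table)))[0]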
Solution 2
First, install the packages below, which pandas uses for parsing:
- pip install BeautifulSoup4
- pip install lxml
- pip install html5lib
Then use read_html to read the HTML tables on any HTML page.
import pandas as pds
# read_html returns a list of DataFrames, one per table found
pds_df = pds.read_html('C:/age0.html')
pds_df[0]
I hope this will help.
Good Luck!!
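To illustrate what read_html hands back, here is a minimal sketch that runs on an in-memory page instead of the file on disk. The markup is a made-up stand-in for the saved page (a script block followed by one table), and the column names are invented for the example. It assumes one of the parsers listed above is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the saved page: a script block, then one table.
html = """
<html>
<head><script>var x = 1;</script></head>
<body>
<table>
  <tr><th>age</th><th>count</th></tr>
  <tr><td>0</td><td>12</td></tr>
  <tr><td>1</td><td>15</td></tr>
</table>
</body>
</html>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
df = tables[0]

print(len(tables))
print(df)
```

Wrapping the markup in StringIO lets read_html treat it as a file-like object, which avoids any ambiguity between a literal HTML string and a path.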
Author by
lokheart
Updated on January 05, 2020
Comments
-
lokheart over 4 years
As the title says, I tried using
read_html
but it gives me the following error:
In [17]: temp = pd.read_html('C:/age0.html',flavor='lxml')
  File "<string>", line unknown
XMLSyntaxError: htmlParseStartTag: misplaced <html> tag, line 65, column 6
What have I done wrong?
update 01
The HTML contains some JavaScript at the top and then an HTML table. In R, I processed it by parsing the HTML with the XML package, which gave me a data frame. I want to do the same in Python. Should I use something else, like BeautifulSoup, before passing it to pandas?
-
Admin over 7 years
Couldn't get this solution to work (but I also couldn't install lxml, which probably had something to do with it). However,
df = pd.read_html('path/to/file.html', flavor='bs4')
worked fine.
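If none of the third-party parsers can be installed, the rows of a simple table can also be pulled out with Python's built-in html.parser module. The following is only a rough sketch, not what pandas does internally: the TableExtractor class and the sample markup are hypothetical, it collects rows from every table on the page, and it ignores rowspan, colspan and nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <tr> in the document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (including script contents) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up page: JavaScript on top, then a table.
sample = (
    "<html><head><script>var x = 1;</script></head><body>"
    "<table><tr><th>age</th><th>count</th></tr>"
    "<tr><td>0</td><td>12</td></tr></table></body></html>"
)

parser = TableExtractor()
parser.feed(sample)
header, *body = parser.rows
print(header)
print(body)
```

The header and body lists can then be handed to pandas with pd.DataFrame(body, columns=header). Note that every cell arrives as a string, so numeric columns would still need converting.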