Using pandas to read a downloaded HTML file

Solution 1

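I think you are on the right track with an HTML parser like Beautiful Soup. pandas.read_html() is meant for extracting HTML tables, and the rest of the page (here, apparently a misplaced <html> tag) can trip up its parser, so it can help to isolate the table first.

You would want to do something like this: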

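from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

# Parse the whole page, then pull out the first <table> element.
with open('C:/age0.html', 'r') as f:
    table = BeautifulSoup(f.read(), 'html.parser').find('table')

# read_html() expects HTML text or a file, not a BeautifulSoup object,
# and it returns a list of DataFrames, so take the first one.
df = pd.read_html(StringIO(str(table)))[0]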

Solution 2

  1. First, install the packages below, which pandas uses for parsing:

    • pip install beautifulsoup4
    • pip install lxml
    • pip install html5lib
  2. Then use read_html to read the HTML tables on any HTML page. It returns a list of DataFrames, one per table:


    import pandas as pds
    pds_df = pds.read_html('C:/age0.html')
    pds_df[0]
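
To make the "list of tables" point concrete, here is a minimal sketch. The sample page, its contents, and the file name sample.html are made up for illustration (a script block followed by one table, similar to what the question describes), and it assumes at least lxml from step 1 is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample page: some JavaScript on top, then one table.
SAMPLE_HTML = """<html>
<head><script>var x = 1;</script></head>
<body>
<table>
  <tr><th>name</th><th>age</th></tr>
  <tr><td>Ann</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
</body>
</html>"""

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.html"
    path.write_text(SAMPLE_HTML, encoding="utf-8")
    # read_html() returns a list with one DataFrame per table found.
    tables = pd.read_html(str(path))

df = tables[0]  # the first (and here the only) table
print(len(tables))
print(df)
```

If the page holds several tables, you would pick the one you need by its position in the list.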
    

I hope this will help.

Good Luck!!

Author: lokheart

Updated on January 05, 2020

Comments

  • lokheart, over 4 years ago

    As the title says, I tried using read_html, but it gives me the following error:

    In [17]:temp = pd.read_html('C:/age0.html',flavor='lxml')
      File "<string>", line unknown
    XMLSyntaxError: htmlParseStartTag: misplaced <html> tag, line 65, column 6
    

    What have I done wrong?

    Update 01

    The HTML contains some JavaScript at the top and then an HTML table. In R, I processed it by parsing the HTML with the XML package, which gave me a data frame. I want to do the same in Python; should I use something else, like BeautifulSoup, before passing it to pandas?

  • Admin, over 7 years ago
    I couldn't get this solution to work (but I also couldn't install lxml, which probably had something to do with it). However, df = pd.read_html('path/to/file.html', flavor='bs4') worked fine.
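
Building on the flavor='bs4' tip, one possible way to cope with a missing or failing parser is to try each flavor in turn. This is only a sketch: read_tables_with_fallback is a hypothetical helper (not part of pandas), the tiny sample file is invented for the demo, and it assumes at least one of lxml or bs4 + html5lib is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_tables_with_fallback(path, flavors=("lxml", "bs4")):
    """Hypothetical helper: try each parser flavor, return the first result."""
    errors = []
    for flavor in flavors:
        try:
            return pd.read_html(path, flavor=flavor)
        # ImportError: the parser is not installed.
        # ValueError: e.g. no tables were found.
        # SyntaxError: lxml's XMLSyntaxError derives from it.
        except (ImportError, ValueError, SyntaxError) as exc:
            errors.append(f"{flavor}: {exc}")
    raise RuntimeError("no parser could read the file: " + "; ".join(errors))


with tempfile.TemporaryDirectory() as tmp:
    # Invented one-column table, just to exercise the helper.
    sample = Path(tmp) / "sample.html"
    sample.write_text(
        "<table><tr><th>a</th></tr><tr><td>1</td></tr></table>",
        encoding="utf-8",
    )
    tables = read_tables_with_fallback(str(sample))

print(tables[0])
```

The flavor order is a judgment call; putting "bs4" first would prefer the more lenient parser at some cost in speed.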