How to read an HTML table in pandas and output it as a DataFrame, not a list


The pandas read_html() function returns a list of DataFrames, one for each table found on the page. Taking Stack Overflow's leagues page as an example, there are two tables on the right side of the page. As the output below shows, read_html() returns a list.

import pandas as pd

url = 'https://stackexchange.com/leagues/1/alltime/stackoverflow'
df_list = pd.read_html(url)
print(df_list)
# [  Rep Change*   Users <-- first table
# 0     10,000+   15477
# 1      5,000+   33541
# 2      2,500+   68129
# 3      1,000+  155430
# 4        500+  272683
# 5        250+  429742
# 6        100+  458600
# 7         50+  458600
# 8          1+  458600,
#    Total Rep*     Users <-- second table
# 0    100,000+       697
# 1     50,000+      1963
# 2     25,000+      5082
# 3     10,000+     15477
# 4      5,000+     33541
# 5      3,000+     56962
# 6      2,000+     84551
# 7      1,000+    155430
# 8        500+    272683
# 9        200+    458600
# 10         1+  10381503]

print(len(df_list))
# 2

From here, you just need to pick the table you want to work with by indexing into the list. If there's only one table, it is the first element, df_list[0].

df = df_list[0]
print(df)
#   Rep Change*   Users
# 0     10,000+   15477
# 1      5,000+   33541
# 2      2,500+   68129
# 3      1,000+  155430
# 4        500+  272683
# 5        250+  429742
# 6        100+  458600
# 7         50+  458600
# 8          1+  458600
print(df.shape)
# (9, 2)
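
The same idea can be sketched without depending on a live page by parsing an inline HTML string. The HTML content, column names and variable names below are made up for illustration, and the sketch assumes a parser backend for pandas (lxml, or BeautifulSoup with html5lib) is installed. It also shows that read_html() still returns a list when the match argument narrows the result to a single table, so indexing with [0] is needed either way.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two tables; only the second contains "Inventory".
html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>a</td><td>1</td></tr>
  <tr><td>b</td><td>2</td></tr>
</table>
<table>
  <tr><th>Inventory item</th><th>Qty</th></tr>
  <tr><td>bolt</td><td>10</td></tr>
  <tr><td>nut</td><td>20</td></tr>
  <tr><td>washer</td><td>30</td></tr>
</table>
"""

# Without match, every table on the page is returned.
all_tables = pd.read_html(StringIO(html))
print(len(all_tables))  # 2

# With match, only tables containing the text are returned, still as a list.
matched = pd.read_html(StringIO(html), match="Inventory")
print(type(matched).__name__)  # list

# Index into the list to get an actual DataFrame.
df = matched[0]
print(df.shape)  # (3, 2)
```

Wrapping the string in StringIO keeps the call compatible with recent pandas versions, which deprecate passing literal HTML strings directly.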
Author: Abhinav Kumar

Updated on August 03, 2022

Comments

  • Abhinav Kumar
    Abhinav Kumar almost 2 years

    I am reading an HTML table from an HTML file into pandas and want to get it as a DataFrame, not a list, so that I can perform general DataFrame operations.

    I get the error below whenever I try anything other than printing the whole DataFrame.

    print(dfdefault.shape())
    AttributeError: 'list' object has no attribute 'shape'
    
    • Cohan
      Cohan about 5 years
      How are you importing the html file?
    • anky
      anky about 5 years
      Use df = dfdefault[0] and then df.shape (shape is an attribute, so no parentheses).
    • piRSquared
      piRSquared about 5 years
      To expand on @anky_91's suggestion: pd.read_html returns a list of DataFrames. If the HTML you parsed contains only one table, that list has only one element. That is why they suggested dfdefault[0]: it gets the first item in the list, which is a DataFrame.
    • Abhinav Kumar
      Abhinav Kumar about 5 years
      @BrianCohan I am importing it with: dfdefault = pd.read_html(file, header=0, match='Client Inventory Details')
  • Abhinav Kumar
    Abhinav Kumar about 5 years
    I am able to read HTML tables into pandas now, but only very small ones, not tables with some 10,000 records. Any suggestions?
  • Cohan
    Cohan about 5 years
    Is this a page that you can share with me? I can try to see what I can figure out. If read_html() doesn't work, the next thing I would try is parsing the page with BeautifulSoup and adding one row at a time to the DataFrame. I have not tried that, though, so I don't know whether it will solve your issue, or why your program isn't able to read 10,000 records from the table. Another thing I have done in the past is check whether the table is generated from a JSON file and, if so, pull that instead of the rendered HTML.
  • Abhinav Kumar
    Abhinav Kumar about 5 years
    Thanks for the help, I will try. Sorry, the file can't be shared, otherwise I would have shared it already.
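
The row-by-row fallback Cohan describes above could look roughly like the sketch below. It is not the approach from the thread verbatim: it swaps BeautifulSoup for the standard library's html.parser so that it needs nothing beyond pandas, and it collects all rows first and builds the DataFrame once instead of appending one row at a time. The class name TableRowCollector, the column names and the generated 10,000-row table are all hypothetical stand-ins for the file that could not be shared, and the sketch assumes a single, non-nested table.

```python
from html.parser import HTMLParser

import pandas as pd


class TableRowCollector(HTMLParser):
    """Collect the cell text of every <tr> into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical stand-in for a large table read from a file.
body = "".join(f"<tr><td>client{i}</td><td>{i}</td></tr>" for i in range(10000))
html = f"<table><tr><th>Client</th><th>Units</th></tr>{body}</table>"

parser = TableRowCollector()
parser.feed(html)
parser.close()

# First row is the header; the rest are records.
header, *records = parser.rows
df = pd.DataFrame(records, columns=header)
df["Units"] = pd.to_numeric(df["Units"])  # cells arrive as strings
print(df.shape)  # (10000, 2)
```

Building the DataFrame in one call from a list of rows is usually much faster than growing it row by row, which matters at this table size.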