How to read html table in pandas and output to dataframe not list
The pandas read_html() function returns a list of DataFrames, one for each table found on the page. On Stack Overflow's leagues page, for example, there are two tables on the right side of the page. As the output below shows, read_html() returns a list.
import pandas as pd

url = 'https://stackexchange.com/leagues/1/alltime/stackoverflow'
df_list = pd.read_html(url)
print(df_list)
# [ Rep Change* Users <-- first table
# 0 10,000+ 15477
# 1 5,000+ 33541
# 2 2,500+ 68129
# 3 1,000+ 155430
# 4 500+ 272683
# 5 250+ 429742
# 6 100+ 458600
# 7 50+ 458600
# 8 1+ 458600,
# Total Rep* Users <-- second table
# 0 100,000+ 697
# 1 50,000+ 1963
# 2 25,000+ 5082
# 3 10,000+ 15477
# 4 5,000+ 33541
# 5 3,000+ 56962
# 6 2,000+ 84551
# 7 1,000+ 155430
# 8 500+ 272683
# 9 200+ 458600
# 10 1+ 10381503]
print(len(df_list))
# 2
From here, you just need to pick the table you want by indexing into the list. If there's only one table, it is the first (and only) element.
df = df_list[0]
print(df)
# Rep Change* Users
# 0 10,000+ 15477
# 1 5,000+ 33541
# 2 2,500+ 68129
# 3 1,000+ 155430
# 4 500+ 272683
# 5 250+ 429742
# 6 100+ 458600
# 7 50+ 458600
# 8 1+ 458600
print(df.shape)
# (9, 2)
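When the page holds several tables, the match argument of read_html() can narrow the list instead of guessing an index. The following is a minimal sketch, not the answerer's code: the inline HTML and its contents are made up for illustration, the match string is borrowed from the asker's comment, and it assumes an HTML parser that pandas can use (such as lxml) is installed. Even with match, the result is still a list, so it still needs indexing.

```python
from io import StringIO

import pandas as pd

# Made-up HTML with two tables; only the second contains the text we match on.
html = """
<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>Widget</td><td>3</td></tr>
  <tr><td>Gadget</td><td>5</td></tr>
</table>
<table>
  <tr><th>Client Inventory Details</th><th>Count</th></tr>
  <tr><td>Alpha</td><td>10</td></tr>
  <tr><td>Beta</td><td>20</td></tr>
</table>
"""

# match keeps only tables whose text matches the given string/regex.
tables = pd.read_html(StringIO(html), match="Client Inventory Details", header=0)

# read_html() always returns a list, even when a single table matches.
df = tables[0]

print(type(tables).__name__, len(tables))
print(type(df).__name__, df.shape)
```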
Author: Abhinav Kumar
Updated on August 03, 2022
Comments
-
Abhinav Kumar almost 2 years
I am reading an HTML table from an HTML file into pandas and want to get it as a DataFrame, not a list, so that I can perform general DataFrame operations.
I get the error below whenever I try anything other than printing the whole result.
print(dfdefault.shape())
AttributeError: 'list' object has no attribute 'shape'
-
Cohan about 5 years
How are you importing the HTML file?
-
anky about 5 years
Use df = dfdefault[0] and then df.shape (shape is an attribute, so it takes no parentheses).
-
piRSquared about 5 years
To expand on @anky_91's suggestion: pd.read_html returns a list of DataFrames. If only one table existed in the HTML you parsed, then that list will have only one element in it. That is why they suggested dfdefault[0]: it gets the first item in the list, which will be a DataFrame.
-
Abhinav Kumar about 5 years
@BrianCohan I am importing it with:
dfdefault = pd.read_html(file, header=0, match='Client Inventory Details')
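To make the error from the question concrete, here is a small sketch that reproduces it without any HTML. The list below is a hypothetical stand-in for what read_html() returns (a list holding one DataFrame); the column names and values are invented.

```python
import pandas as pd

# Stand-in for the result of pd.read_html(...): a list with one DataFrame.
dfdefault = [pd.DataFrame({"client": ["Alpha", "Beta", "Gamma"], "items": [1, 2, 3]})]

# A list has no .shape, which is the AttributeError from the question.
try:
    dfdefault.shape
except AttributeError as exc:
    message = str(exc)
    print(message)

# Indexing into the list gives the DataFrame itself.
df = dfdefault[0]

# shape is an attribute (a tuple), not a method, so no parentheses.
shape = df.shape
print(shape)
```

Calling df.shape() with parentheses would fail as well, because the tuple that shape returns is not callable.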
-
Abhinav Kumar about 5 years
I am able to read HTML tables into pandas now, but only very small ones, not tables with some 10,000 records. Any suggestions?
-
Cohan about 5 years
Is this a page that you can share with me? I can try to see what I can figure out. If read_html() doesn't work, the next thing I would try is using BeautifulSoup to parse it and add one row at a time to the DataFrame. I have not tried that, though, so I don't know whether it will solve your issue, or why your program isn't able to read 10,000 records from the table. Another thing I have done in the past is check whether the table is generated from some JSON file and, if so, pull that instead of the rendered HTML.
-
Abhinav Kumar about 5 years
Thanks for the help, I will try that. Sorry, the file can't be shared, or I would have shared it already.
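The BeautifulSoup fallback suggested in the comments could look roughly like the sketch below. It is untested against the asker's file, which was never shared, so the inline HTML, column names, and values are invented, and nothing here shows it would fix the large-table problem. It assumes the bs4 package is installed and uses Python's built-in html.parser. Rather than appending to the DataFrame one row at a time, it collects rows in a plain list and builds the DataFrame once, which is usually cheaper for thousands of rows.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML standing in for the file that could not be shared.
html = """
<table>
  <tr><th>Client</th><th>Items</th></tr>
  <tr><td>Alpha</td><td>10</td></tr>
  <tr><td>Beta</td><td>20</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")   # first table in the document
rows = table.find_all("tr")

# Treat the first row as the header and the remaining rows as data.
header = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]
records = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in rows[1:]
]

# Build the DataFrame in one step; every cell is still a string here.
df = pd.DataFrame(records, columns=header)
print(df.shape)
```

Since every value arrives as text, numeric columns would need converting afterwards, for example with pd.to_numeric.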