Trouble using Pandas read_html
First you can read the data from the file and then parse it using
from bs4 import BeautifulSoup
import pandas as pd
import lxml
import html5lib
path = 'file.html'
with open(path, 'rt') as myfile:
data = myfile.read().replace("<br>", '\n')
df = pd.read_html(data)
This will give you a list of dataframes. At df[1] you will get the dataframe you want
df[1]
As this is not a reliable way to parse data from an html I would advice you to go by parsing and creating your data frame using Beautiful Soup.
from bs4 import BeautifulSoup
path = 'file.html'
ecj_data = open(path,'r').read()
soup = BeautifulSoup(ecj_data)
tabulka = soup.find("table", {"class" : "MsoNormalTable"})
column_headers = ['ID','Available Quota', 'Live Weight Pounds', 'Price', 'Date Posted']
records = []
for idy, row in enumerate(tabulka.findAll('tr')):
if idy == 0:
continue
cols = row.findAll('td')
record = {}
for idx, col in enumerate(cols):
record[column_headers[idx]] = col.text.strip()
records.append(record)
df = pd.DataFrame.from_dict(records)
df[column_headers]
output will look like this:
theprowler
Updated on June 04, 2022Comments
-
theprowler almost 2 years
My goal is to extract data, specifically the data corresponding to the most recent date (in this case 5/20), from an HTML table
Here is the corresponding HTML code:
<html> <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <title>FW: NEFS 2 Available Quota 5/21</title> <link rel="important stylesheet" href=""> <style>div.headerdisplayname {font-weight:bold;}</style></head> <body> <table border=0 cellspacing=0 cellpadding=0 width="100%" class="header-part1"><tr><td><b>Subject: </b>FW: NEFS 2 Available Quota 5/21</td></tr><tr><td><b>From: </b>Claire Fitz-Gerald <[email protected]></td></tr><tr><td><b>Date: </b>5/21/2014 10:08 AM</td></tr></table><br> <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><META HTTP-EQUIV="Content-Type" CONTENT="text/html; "><meta name=Generator content="Microsoft Word 12 (filtered medium)"><!--[if !mso]><style>v\:* {behavior:url(#default#VML);} o\:* {behavior:url(#default#VML);} w\:* {behavior:url(#default#VML);} .shape {behavior:url(#default#VML);} </style><![endif]--><style><!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4;} @font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4;} @font-face {font-family:Tahoma; panose-1:2 11 6 4 3 5 4 4 2 4;} @font-face {font-family:"Franklin Gothic Book"; panose-1:2 11 5 3 2 1 2 2 2 4;} @font-face {font-family:"Franklin Gothic Demi"; panose-1:2 11 7 3 2 1 2 2 2 4;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {margin:0in; margin-bottom:.0001pt; font-size:11.0pt; font-family:"Calibri","sans-serif";} a:link, span.MsoHyperlink {mso-style-priority:99; color:blue; text-decoration:underline;} a:visited, span.MsoHyperlinkFollowed {mso-style-priority:99; color:purple; text-decoration:underline;} span.EmailStyle17 {mso-style-type:personal; font-family:"Calibri","sans-serif"; color:windowtext;} span.title1 {mso-style-name:title1; font-family:"Arial","sans-serif"; color:#1F487E; font-weight:normal;} span.EmailStyle19 {mso-style-type:personal-reply; font-family:"Calibri","sans-serif"; color:#1F497D;} .MsoChpDefault {mso-style-type:export-only; font-size:10.0pt;} @page WordSection1 {size:8.5in 11.0in; margin:1.0in 1.0in 1.0in 1.0in;} div.WordSection1 {page:WordSection1;} --></style><!--[if gte mso 9]><xml> <o:shapedefaults v:ext="edit" spidmax="1026" /> </xml><![endif]--><!--[if gte mso 9]><xml> <o:shapelayout v:ext="edit"> <o:idmap v:ext="edit" data="1" /> </o:shapelayout></xml><![endif]--></head><body lang=EN-US link=blue vlink=purple><div class=WordSection1><p class=MsoNormal><span style='color:#1F497D'>Please see the below quota listings.<o:p></o:p></span></p><p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span style='color:#1F497D'>Thanks,<o:p></o:p></span></p><p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p><div><p class=MsoNormal><span style='font-size:12.0pt;font-family:"Franklin Gothic Book","sans-serif";color:#1F497D'>Claire Fitz-Gerald<o:p></o:p></span></p><p class=MsoNormal><i><span style='font-size:10.0pt;font-family:"Franklin Gothic Book","sans-serif";color:#1F497D'><o:p> </o:p></span></i></p><p class=MsoNormal><b><span style='font-family:"Franklin Gothic Demi","sans-serif";color:#002776'>Cape Cod Commercial Fishermen's Alliance<o:p></o:p></span></b></p><p class=MsoNormal><b><span style='font-family:"Franklin Gothic Book","sans-serif";color:#DE3500'>~ Small Boats. Big Ideas. ~</span></b><b><span style='color:#DE3500'><o:p></o:p></span></b></p></div><p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p><div><div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in'><p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> David Leveille [mailto:[email protected]] <br><b>Sent:</b> Wednesday, May 21, 2014 8:50 AM<br><b>To:</b> David Leveille<br><b>Subject:</b> NEFS 2 Available Quota 5/21<o:p></o:p></span></p></div></div><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><span style='font-size:12.0pt;font-family:"Arial","sans-serif";color:#1F487E'>AVAILABLE QUOTA FY 2014</span><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p><table class=MsoNormalTable border=0 cellspacing=0 cellpadding=0 width="71%" style='width:71.28%'><tr><td width=220 style='width:164.95pt;border:none;border-bottom:solid windowtext 1.0pt;background:#8BCDFF;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><b><span style='font-size:9.0pt;font-family:"Arial","sans-serif";color:black'>ID <o:p></o:p></span></b></p></td><td width=161 style='width:120.75pt;border:none;border-bottom:solid windowtext 1.0pt;background:#8BCDFF;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='mso-line-height-alt:15.0pt'><b><span style='font-size:18.0pt;font-family:"Arial","sans-serif";color:black'>Available Quota <o:p></o:p></span></b></p></td><td width=189 style='width:141.75pt;border:none;border-bottom:solid windowtext 1.0pt;background:#8BCDFF;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='mso-line-height-alt:15.0pt'><b><span style='font-size:18.0pt;font-family:"Arial","sans-serif";color:black'>Live Weight Pounds <o:p></o:p></span></b></p></td><td width=126 style='width:94.55pt;border:none;border-bottom:solid windowtext 1.0pt;background:#8BCDFF;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='mso-line-height-alt:15.0pt'><b><span style='font-size:18.0pt;font-family:"Arial","sans-serif";color:black'>Price <o:p></o:p></span></b></p></td><td width=168 style='width:125.95pt;border:none;border-bottom:solid windowtext 1.0pt;background:#8BCDFF;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='mso-line-height-alt:15.0pt'><b><span style='font-size:18.0pt;font-family:"Arial","sans-serif";color:black'>Date Posted <o:p></o:p></span></b></p></td></tr><tr><td width=220 style='width:164.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>1724<o:p></o:p></span></p></td><td width=161 style='width:120.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>GOM COD<br>GOM HADD<br>GOM BB<br>GREYSOLE<br>DABS<br>GOM YT<o:p></o:p></span></p></td><td width=189 style='width:141.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>2328<br>445<br>3007<br>850<br>3101<br>1995<o:p></o:p></span></p></td><td width=126 style='width:94.55pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>Package<o:p></o:p></span></p><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'><o:p> </o:p></span></p><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>$9,000<o:p></o:p></span></p></td><td width=168 style='width:125.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>5/20<o:p></o:p></span></p></td></tr><tr><td width=220 style='width:164.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>1578<o:p></o:p></span></p></td><td width=161 style='width:120.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>GBE COD<br>GBW COD<br>GB BB<br>GB YT<br>SNE BB<br>SNE YT<br>GOM BB<br>Whake<br>POLL<br>RED<o:p></o:p></span></p></td><td width=189 style='width:141.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>538<br>5894<br>1755<br>243<br>490<br>153<br>3965<br>2727<br>9227<br>15060<o:p></o:p></span></p></td><td width=126 style='width:94.55pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>$1.00<br>$0.40<br>$0.20<br>$1.00<br>$0.45<br>$0.50<br>$0.15<br>$0.20<br>$0.01<br>$0.01<o:p></o:p></span></p></td><td width=168 style='width:125.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>5/20<o:p></o:p></span></p></td></tr><tr><td width=220 style='width:164.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>310<o:p></o:p></span></p></td><td width=161 style='width:120.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>GBE COD<br>GBW COD<br>DABS<br>WHAKE<br>POLL<br>RED<br>SNE BB<br>GOM BB<o:p></o:p></span></p></td><td width=189 style='width:141.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>825<br>9033<br>1241<br>3120<br>65234<br>76610<br>1688<br>1195<br>2121<br>7285<o:p></o:p></span></p></td><td width=126 style='width:94.55pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>Package<o:p></o:p></span></p><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'><o:p> </o:p></span></p><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>$15,000<o:p></o:p></span></p></td><td width=168 style='width:125.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>5/20<o:p></o:p></span></p></td></tr><tr style='height:23.25pt'><td width=220 style='width:164.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt;height:23.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>347<o:p></o:p></span></p></td><td width=161 style='width:120.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt;height:23.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>SNE BB<o:p></o:p></span></p></td><td width=189 style='width:141.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt;height:23.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>8,000<o:p></o:p></span></p></td><td width=126 style='width:94.55pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt;height:23.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>$0.50<o:p></o:p></span></p></td><td width=168 style='width:125.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt;height:23.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>5/7<o:p></o:p></span></p></td></tr><tr><td width=220 style='width:164.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>1878A<o:p></o:p></span></p></td><td width=161 style='width:120.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>GOM COD<br>GOM HADD<br>SNE BB<br>GOM BB<br>GB BB<br>GREYSOLE<br>GOM YT<br>SNE YT<br>POLL<o:p></o:p></span></p></td><td width=189 style='width:141.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>6188<br>635<br>3916<br>7873<br>6762<br>3358<br>9776<br>271<br>186550<o:p></o:p></span></p></td><td width=126 style='width:94.55pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>$1.95<br>$1.35<br>$0.50<br>$0.50<br>$0.20<br>$1.40<br>$1.20<br>$0.50<br>$0.01<o:p></o:p></span></p></td><td width=168 style='width:125.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>5/12<o:p></o:p></span></p></td></tr><tr><td width=220 style='width:164.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>1878B<o:p></o:p></span></p></td><td width=161 style='width:120.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>GBE COD<br>GBW COD<br>GB YT<o:p></o:p></span></p></td><td width=189 style='width:141.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>1113<br>12186<br>850<o:p></o:p></span></p></td><td width=126 style='width:94.55pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>Package<br>$10,000<o:p></o:p></span></p></td><td width=168 style='width:125.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>5/12<o:p></o:p></span></p></td></tr></table><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>David Leveille<o:p></o:p></p><p class=MsoNormal>II Northeast Fishery Sector Inc.<o:p></o:p></p><p class=MsoNormal>10 Witham Street<o:p></o:p></p><p class=MsoNormal>Gloucester, MA. 01930<o:p></o:p></p><p class=MsoNormal>Cell 978 375 3509<o:p></o:p></p><p class=MsoNormal>Fax 978 281 1555<o:p></o:p></p><p class=MsoNormal>Web <a href="http://nefs2.com/">http://nefs2.com/</a><o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><div class=MsoNormal align=center style='text-align:center'><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'></body></html> </body> </html>
I tried using BeautifulSoup to access the data inside each cell. I would see 'if date 5/20 occurs below, capture all data corresponding to it'. That actually worked quite well but when I would go to put the data into a Pandas DataFrame it would fail. After many failed attempts I was told using Pandas read_html would be a wiser choice for this task.
So far my code is:
from bs4 import BeautifulSoup import pandas as pd import lxml import html5lib path = 'Z:\\blub' df = pd.pandas.read_html(path) print (df)
But when I run it the line
df = pd.pandas.read_html(path)
produces the errorValueError: No tables found
.Does that mean that the read_html command does not recognize the data table? Is using
read_html
the wisest approach for parsing this data table neatly? I say neatly because the final step after parsing is to export it all to an Oracle Database.I would appreciate any and all advice solving this problem. Thanks.
-
theprowler over 7 yearsRemoving the quotes returns the same error unfortunately
-
Marjan Moderc over 7 yearsMaybe you should put the 'r' in front of the string in order to escape \b special character. Then, make sure this file exist. I tried to read the exact html and with success. But, as @Vikash already mentioned, this is not very reliable method since I only work for very easy html tables...
-
theprowler over 7 yearsYour second method is actually similar to the way I tried to do it. The output you got looks perfect....however I keep running into the same error when I try both your methods above: I receive
Errno 13 Permission denied: 'Z:\\blub'
. I have seen this error in the past before but it still confuses me. I'm not sure what specific thing my code does differently that avoids this error but it is theopen()
lines that are causing it. I'll try to combine your code with mine and see what I can do -
Vikash Singh over 7 yearstry putting the html file and the python program in the same file.
-
theprowler over 7 yearsI put them both into the folder "Blub" and it still fails with the permission denied. I succeeded in accessing the file by using
for filename in os.listidr(path): \n file_path = os.path.join(path, filename) \n if os.path.isfile(file_path): \n with open(file_path, 'r') as f:
-
theprowler over 7 yearsSo it worked! I combined your code with mine and I succeeded in putting all the data into a DataFrame. Here's the thing though...I only want the most recent date's data, in this case just the data that goes with 5/20. Now that it's in a dataframe, is it easily parsed? I have 100's of these HTML data tables so ideally I need code that is powerful enough to capture ONLY the most recent date's corresponding data. Any ideas?
-
Vikash Singh over 7 yearsConvert the date column to date datatype and filter. Or while reading through the table using beautiful soup, add an if coundition and filter out the rows you don't want.
-
theprowler over 7 yearsWould you mind explaining what you mean by converting the dates to datetypes and filtering? I think I did something similar before when I did
df['DATE'] = pd.to_datetime(df['DATE']) latest_date = df['DATE'].max() latest_rows = df[df['DATE'] == latest_date]
But that method isn't working here for some reason -
Vikash Singh over 7 yearsLet us continue this discussion in chat.
-
Jim Chang over 2 yearsFor the multi-line cells, pandas simply combines all lines with spaces. Is there any way to keep the "\n"? I am currently replacing all "\n" in the original html with a special character and then replace it back. It looks awkward. Any better way?