python parse html table using lxml
You could do something like this:
>>> doc = """<TABLE>
... <TR>
... <TD><P>Name</P></TD>
... <TD><P>Fees</P></TD>
... <TD><P>Awards</P></TD>
... <TD><P>Total</P></TD>
... </TR>
... <TR>
... <TD><P>Tony</P></TD>
... <TD >7,800</TD>
... <TD >7</TD>
... <TD>15,400</TD>
... </TR>
... <TR>
... <TD><P>Paul</FONT></P></TD>
... <TD >7,800</TD>
... <TD >7</TD>
... <TD>15,400</TD>
... </TR>
... <TR>
... <TD><P>Richard</P></TD>
... <TD >7,800</TD>
... <TD >7</TD>
... <TD>15,400</TD>
... </TR>
...
... </TR>
... </TABLE>"""
>>> import lxml.html
>>> root = lxml.html.fromstring(doc)
>>> root.xpath('//tr/td//text()')
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
>>>
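The flat list above loses the row boundaries. If you need one list per row, a minimal sketch (assuming every cell is a td and using a shortened copy of the table) is to iterate over the tr elements and call text_content() on each cell. The variable name rows is only illustrative.

```python
import lxml.html

# Shortened copy of the table used in the example above (assumed data)
doc = """<TABLE>
<TR><TD><P>Name</P></TD><TD><P>Fees</P></TD><TD><P>Awards</P></TD><TD><P>Total</P></TD></TR>
<TR><TD><P>Tony</P></TD><TD >7,800</TD><TD >7</TD><TD>15,400</TD></TR>
<TR><TD><P>Paul</P></TD><TD >7,800</TD><TD >7</TD><TD>15,400</TD></TR>
</TABLE>"""

root = lxml.html.fromstring(doc)

# One inner list per <tr>; text_content() joins all text inside a cell,
# so it does not matter whether the text sits in <td> or in a nested <p>.
rows = [
    [td.text_content().strip() for td in tr.xpath('./td')]
    for tr in root.xpath('//tr')
]

for row in rows:
    print(row)
```

Each inner list then corresponds to one table row, with the header row first.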
If the document contains 2 tables, you can first loop over the tables and then use a relative XPath expression (with a leading .) to select the descendant text nodes of each table:
>>> doc = """<TABLE>
... <TR>
... <TD><P>Name</P></TD>
... <TD><P>Fees</P></TD>
... <TD><P>Awards</P></TD>
... <TD><P>Total</P></TD>
... </TR>
... <TR>
... <TD><P>Tony</P></TD>
... <TD >7,800</TD>
... <TD >7</TD>
... <TD>15,400</TD>
... </TR>
... <TR>
... <TD><P>Paul</FONT></P></TD>
... <TD >7,800</TD>
... <TD >7</TD>
... <TD>15,400</TD>
... </TR>
... <TR>
... <TD><P>Richard</P></TD>
... <TD >7,800</TD>
... <TD >7</TD>
... <TD>15,400</TD>
... </TR>
...
... </TR>
... </TABLE>
... <TABLE>
... <TR>
... <TD><P>Name</P></TD>
... <TD><P>Fees</P></TD>
... <TD><P>Awards</P></TD>
... <TD><P>Total</P></TD>
... </TR>
... <TR>
... <TD><P>Tony</P></TD>
... <TD >7,800</TD>
... <TD >7</TD>
... <TD>15,400</TD>
... </TR>
... <TR>
... <TD><P>Paul</FONT></P></TD>
... <TD >7,800</TD>
... <TD >7</TD>
... <TD>15,400</TD>
... </TR>
... <TR>
... <TD><P>Richard</P></TD>
... <TD >7,800</TD>
... <TD >7</TD>
... <TD>15,400</TD>
... </TR>
...
... </TR>
... </TABLE>"""
>>> import lxml.html
>>> root = lxml.html.fromstring(doc)
>>> root.xpath('//tr/td//text()')
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400', 'Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
>>> for tbl in root.xpath('//table'):
...     elements = tbl.xpath('.//tr/td//text()')
...     print(elements)
...
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
>>>
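Building on the per-table loop, here is a hedged sketch that turns each table into a list of dicts keyed by its header row. It assumes the first tr of every table is the header and that tables are not nested. The helper name table_records and the sample data are hypothetical, not part of lxml.

```python
import lxml.html

# Two small tables wrapped in a <div> (assumed sample data)
doc = """<div>
<TABLE>
<TR><TD><P>Name</P></TD><TD><P>Fees</P></TD></TR>
<TR><TD><P>Tony</P></TD><TD >7,800</TD></TR>
</TABLE>
<TABLE>
<TR><TD><P>Name</P></TD><TD><P>Awards</P></TD></TR>
<TR><TD><P>Paul</P></TD><TD >7</TD></TR>
</TABLE>
</div>"""


def table_records(table):
    """Return one dict per data row, keyed by the cells of the first row."""
    # The leading "." keeps the search inside this table only.
    rows = [
        [td.text_content().strip() for td in tr.xpath('./td')]
        for tr in table.xpath('.//tr')
    ]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


root = lxml.html.fromstring(doc)
tables = [table_records(tbl) for tbl in root.xpath('//table')]

for records in tables:
    print(records)
```

The values stay as strings such as '7,800'; converting them to numbers is left out on purpose, since the right conversion depends on the data.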
Author by
Kishore K
Updated on September 16, 2022
Comments
-
Kishore K over 1 year
I've an html table like this:
<TABLE>
<TR>
<TD><P>Name</P></TD>
<TD><P>Fees</P></TD>
<TD><P>Awards</P></TD>
<TD><P>Total</P></TD>
</TR>
<TR>
<TD><P>Tony</P></TD>
<TD >7,800</TD>
<TD >7</TD>
<TD>15,400</TD>
</TR>
<TR>
<TD><P>Paul</FONT></P></TD>
<TD >7,800</TD>
<TD >7</TD>
<TD>15,400</TD>
</TR>
<TR>
<TD><P>Richard</P></TD>
<TD >7,800</TD>
<TD >7</TD>
<TD>15,400</TD>
</TR>
</TR>
</TABLE>
I want to extract the values from the table. I tried the following:
import lxml.html
html = lxml.html.parse('html_table')
text_value = html.xpath('//tr/td/text()')
packages = html.xpath('//tr/td/p')
p_content = [p.text_content() for p in packages]
Is there any way to extract both the <p> text and the <td> text into a single list?
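For reference, a small sketch of why the attempt in the question yields two separate lists: in XPath, td/text() selects only text nodes that are direct children of td, while td//text() selects text nodes at any depth, including those inside p. The one-row table and the variable names below are made up for illustration.

```python
import lxml.html

# One row: the first cell wraps its text in <p>, the second does not
doc = "<table><tr><td><p>Tony</p></td><td>7,800</td></tr></table>"
root = lxml.html.fromstring(doc)

# Single slash: only text that is a direct child of <td>
direct = root.xpath('//tr/td/text()')

# Double slash: text at any depth below <td>, so <p> content is included
descendant = root.xpath('//tr/td//text()')

print(direct)
print(descendant)
```

With the double slash, both kinds of cell end up in one list in document order.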
-
cog_n1t1v3 over 10 years
Also, explore the BeautifulSoup module for parsing HTML: pythonforbeginners.com/python-on-the-web/beautifulsoup-4-python
-
paul trmbrth over 10 years
@kishorekdty, I just added an example that loops over the tables