BeautifulSoup HTML table parsing
Solution 1
start = cols[1].find('a').string
or simpler
start = cols[1].a.string
or better
start = str(cols[1].find(text=True))
and
entry = [str(col.find(text=True)) for col in cols]
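The variants above can be tried without the live page. The following is a minimal sketch, assuming Python 3 and BeautifulSoup4 (`bs4`) rather than the BeautifulSoup 3 import used in the question. The two cells are copied from the question's HTML; the `<table><tr>` wrapper is added only for illustration. In bs4 the `text=` argument is spelled `string=`, and `findAll` is spelled `find_all`.

```python
from bs4 import BeautifulSoup

# Two cells copied from the question; the <table><tr> wrapper is illustrative.
html = """
<table><tr>
<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td>
<td headers="start" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
cols = soup.find("tr").find_all("td")

# Variant 1: look up the nested <a> tag explicitly.
via_find = cols[1].find("a").string

# Variant 2: attribute access is shorthand for find("a").
via_attr = cols[1].a.string

# Variant 3: take the first text node under the cell, link or no link.
first_text = str(cols[1].find(string=True))

# Apply variant 3 to every cell so plain and linked cells are handled alike.
entry = [str(col.find(string=True)) for col in cols]

print(entry)
```

Variant 3 is the only one that works for both kinds of cell: `cols[0].a` is `None` because the first cell has no link, so `cols[0].a.string` would raise an `AttributeError`.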
Solution 2
I tried to reproduce your error, but the source HTML page has changed.
As for the error itself, I had a similar problem when reproducing an example
with the proposed URL swapped for a Wikipedia table.
I fixed it by moving to BeautifulSoup4:
from bs4 import BeautifulSoup
and replacing .string with .get_text():
start = cols[1].get_text()
I couldn't test this with your example (as I said, I couldn't reproduce the error), but I think it could be useful for people who are looking for a solution to this problem.
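As a rough illustration of why `.get_text()` can succeed where `.string` does not, here is a minimal sketch assuming Python 3 and BeautifulSoup4. The HTML snippet is hypothetical, not taken from the page in the question: the cell holds whitespace around the link, so it has more than one child and `.string` is `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical cell: the whitespace around <a> gives the <td> three children.
html = """<td headers="start">
  <a href="#viewmap">Big Stony Ck Rd</a>
</td>"""

cell = BeautifulSoup(html, "html.parser").td

# .string is only defined when a tag has exactly one child.
direct = cell.string

# get_text() concatenates every text node under the tag;
# strip=True trims each piece and drops the whitespace-only ones.
text = cell.get_text(strip=True)

print(repr(direct), repr(text))
```

When a cell mixes several non-empty text fragments, passing a separator such as `cell.get_text(" ", strip=True)` keeps them from running together.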
Stephen Tanner
Updated on July 09, 2022
I am trying to parse information (html tables) from this site: http://www.511virginia.org/RoadConditions.aspx?j=All&r=1
Currently I am using BeautifulSoup and the code I have looks like this
from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()
url = "http://www.511virginia.org/RoadConditions.aspx?j=All&r=1"
page = mech.open(url)
html = page.read()

soup = BeautifulSoup(html)
table = soup.find("table")
rows = table.findAll('tr')[3]
cols = rows.findAll('td')

roadtype = cols[0].string
start = cols[1].string
end = cols[2].string
condition = cols[3].string
reason = cols[4].string
update = cols[5].string

entry = (roadtype, start, end, condition, reason, update)
print entry
The issue is with the start and end columns: they just get printed as "None".
Output:
(u'Rt. 613N (Giles County)', None, None, u'Moderate', u'snow or ice', u'01/13/2010 10:50 AM')
I know that they get stored in the cols list, but it seems that the extra link tag is messing up the parsing. The original HTML looks like this:
<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td>
<td headers="start" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
<td headers="end" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)</a></td>
<td headers="condition" class="ConditionsCellText">Moderate</td>
<td headers="reason" class="ConditionsCellText">snow or ice</td>
<td headers="update" class="ConditionsCellText">01/13/2010 10:50 AM</td>
so what should be printed is:
(u'Rt. 613N (Giles County)', u'Big Stony Ck Rd; Rt. 635E/W (Giles County)', u'Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)', u'Moderate', u'snow or ice', u'01/13/2010 10:50 AM')
Any suggestions or help is appreciated, and thank you in advance.