Python: Extracting specific data with html parser
Solution 1
It looks like you forgot to reset self.inLink to False at the start of handle_starttag, so the flag stays True once the first matching link has been seen:
# Python 2 import; in Python 3 the module is html.parser
from HTMLParser import HTMLParser


class AllLanguages(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.inLink = False
        self.dataArray = []
        self.countLanguages = 0
        self.lasttag = None
        self.lastname = None
        self.lastvalue = None

    def handle_starttag(self, tag, attrs):
        # Reset the flag on every start tag so non-matching tags are skipped
        self.inLink = False
        if tag == 'a':
            for name, value in attrs:
                if name == 'class' and value == 'Vocabulary':
                    self.countLanguages += 1
                    self.inLink = True
                    self.lasttag = tag

    def handle_endtag(self, tag):
        if tag == "a":
            # Was self.inlink (lowercase l), which set an unrelated attribute
            self.inLink = False

    def handle_data(self, data):
        # Print only non-blank text found inside a matching link
        if self.lasttag == 'a' and self.inLink and data.strip():
            print data


parser = AllLanguages()
parser.feed("""
<html>
<head><title>Test</title></head>
<body>
<a href="http://wold.livingsources.org/vocabulary/1" title="Swahili" class="Vocabulary">Swahili</a>
<a href="http://wold.livingsources.org/contributor#schadebergthilo" title="Thilo Schadeberg" class="Contributor">Thilo Schadeberg</a>
<a href="http://wold.livingsources.org/vocabulary/2" title="English" class="Vocabulary">English</a>
<a href="http://wold.livingsources.org/vocabulary/2" title="Russian" class="Vocabulary">Russian</a>
</body>
</html>""")
prints:
Swahili
English
Russian
Also, take a look at:
Hope that helps.
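For anyone on Python 3, here is a minimal sketch of the same idea. It is not the code from the answer: the class name VocabularyLinks and the attributes in_link and names are made up for illustration, the module is html.parser instead of HTMLParser, and the texts are collected in a list rather than printed.

```python
from html.parser import HTMLParser


class VocabularyLinks(HTMLParser):
    """Collect the text of every <a class="Vocabulary"> element.

    Hypothetical helper for illustration, not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.in_link = False  # True only while inside a matching <a>
        self.names = []       # collected link texts

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a" and dict(attrs).get("class") == "Vocabulary":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link and data.strip():
            self.names.append(data.strip())


parser = VocabularyLinks()
parser.feed(
    '<a href="/vocabulary/1" class="Vocabulary">Swahili</a>'
    '<a href="/contributor#x" class="Contributor">Thilo Schadeberg</a>'
    '<a href="/vocabulary/2" class="Vocabulary">English</a>'
)
print(parser.names)  # ['Swahili', 'English']
```

Because the flag is only switched on for a matching start tag and switched off at every closing a tag, the Contributor link never reaches the list.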
Solution 2
You may try HTQL (http://htql.net). The query for:
"the tags called 'a' with the attribute class="Vocabulary" and I want the data within the tag"
is:
<a (class='Vocabulary')>:tx
The Python code would be something like this:
import htql
# page is assumed to hold the HTML source as a string
a = htql.query(page, "<a (class='Vocabulary')>:tx")
print(a)
IssnKissn
Updated on September 10, 2020
Comments
-
IssnKissn over 3 years
I started using the HTMLParser in Python to extract data from a website. I get everything I wanted, except the text within two tags of HTML. Here is an example of the HTML tag:
<a href="http://wold.livingsources.org/vocabulary/1" title="Swahili" class="Vocabulary">Swahili</a>
There are also other tags starting with <a. They have other attributes and values, so I do not want their data:
<a href="http://wold.livingsources.org/contributor#schadebergthilo" title="Thilo Schadeberg" class="Contributor">Thilo Schadeberg</a>
The tag is embedded in a table. I don't know whether that makes any difference compared with other tags. I only want the information in the 'a' tags with the attribute class="Vocabulary", and I want the data within the tag; in the example it would be "Swahili". So what I did is:
class AllLanguages(HTMLParser):
    '''
    classdocs
    '''
    #counter for the languages
    #countLanguages = 0
    def __init__(self):
        HTMLParser.__init__(self)
        self.inLink = False
        self.dataArray = []
        self.countLanguages = 0
        self.lasttag = None
        self.lastname = None
        self.lastvalue = None
        #self.text = ""

    def handle_starttag(self, tag, attr):
        #print "Encountered a start tag:", tag
        if tag == 'a':
            for name, value in attr:
                if name == 'class' and value == 'Vocabulary':
                    self.countLanguages += 1
                    self.inLink = True
                    self.lasttag = tag
                    #self.lastname = name
                    #self.lastvalue = value
                    print self.lasttag
                    #print self.lastname
                    #print self.lastvalue
                    #return tag
                    print self.countLanguages

    def handle_endtag(self, tag):
        if tag == "a":
            self.inlink = False
            #print "".join(self.data)

    def handle_data(self, data):
        if self.lasttag == 'a' and self.inLink and data.strip():
            #self.dataArray.append(data)
            #
            print data
The program prints the data of every <a> tag, but I only want the data from the tags with the right attributes. How do I get this specific data?
-
IssnKissn almost 11 years
Thanks a lot. I expected it to be something small ;). I tried BeautifulSoup too, and that also works perfectly. Thanks again for your help.
-
alecxe almost 11 years
You're welcome. Consider accepting the answer if it was helpful, thanks!
-
IssnKissn almost 11 years
Do you have a recommendation for a particular parser? I need the data from the HTML file and want to write it to an XML file. Which one would you use? Or what are the advantages of each parser?
-
alecxe almost 11 years
Well, BeautifulSoup and lxml are decent HTML parsers. lxml is famous for its speed; BeautifulSoup is pretty handy but doesn't support XPath expressions. See more: blog.ianbicking.org/2008/03/30/python-html-parser-performance, stackoverflow.com/questions/3577641/…, stackoverflow.com/questions/6494199/….
-
IssnKissn almost 11 years
Well, I have to parse a lot of data, so BeautifulSoup is pretty slow for that. But I think I will try lxml. Thanks a lot.
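The HTML-to-XML step asked about in the comments could also be sketched with the standard library alone, before reaching for lxml. This is only an illustration under assumptions: the class VocabularyLinks, the function to_xml, and the element names languages and language are hypothetical choices, not something specified in the thread.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class VocabularyLinks(HTMLParser):
    """Collect (href, text) pairs for <a class="Vocabulary"> elements."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the link being read, else None
        self._text = []    # text chunks of the link being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "Vocabulary":
            self._href = attrs.get("href") or ""
            self._text = []

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)


def to_xml(links):
    """Serialize the pairs as XML; the element names are made up."""
    root = ET.Element("languages")
    for href, name in links:
        item = ET.SubElement(root, "language", href=href)
        item.text = name
    return ET.tostring(root, encoding="unicode")


parser = VocabularyLinks()
parser.feed(
    '<a href="http://wold.livingsources.org/vocabulary/1" '
    'title="Swahili" class="Vocabulary">Swahili</a>'
    '<a href="http://wold.livingsources.org/contributor#schadebergthilo" '
    'class="Contributor">Thilo Schadeberg</a>'
)
xml_text = to_xml(parser.links)
print(xml_text)
```

Whether this is fast enough for a large number of pages is something to measure; lxml, as suggested above, is the usual choice when speed matters.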