Python: Extracting specific data with html parser

python html python-2.7 html-parsing html-parser

37,636

Solution 1

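Looks like you forgot to reset self.inLink to False by default in handle_starttag. In your code, handle_endtag assigns to self.inlink (lowercase l), which is a different attribute, so the flag is never cleared once it is set: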

from HTMLParser import HTMLParser


class AllLanguages(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.inLink = False
        self.dataArray = []
        self.countLanguages = 0
        self.lasttag = None
        self.lastname = None
        self.lastvalue = None

    def handle_starttag(self, tag, attrs):
        self.inLink = False
        if tag == 'a':
            for name, value in attrs:
                if name == 'class' and value == 'Vocabulary':
                    self.countLanguages += 1
                    self.inLink = True
                    self.lasttag = tag

    def handle_endtag(self, tag):
        if tag == "a":
            # Capital L: assigning to self.inlink would create a new, unused attribute
            self.inLink = False

    def handle_data(self, data):
        if self.lasttag == 'a' and self.inLink and data.strip():
            print data


parser = AllLanguages()
parser.feed("""
<html>
<head><title>Test</title></head>
<body>
<a href="http://wold.livingsources.org/vocabulary/1" title="Swahili" class="Vocabulary">Swahili</a>
<a href="http://wold.livingsources.org/contributor#schadebergthilo" title="Thilo Schadeberg" class="Contributor">Thilo Schadeberg</a>
<a href="http://wold.livingsources.org/vocabulary/2" title="English" class="Vocabulary">English</a>
<a href="http://wold.livingsources.org/vocabulary/2" title="Russian" class="Vocabulary">Russian</a>
</body>
</html>""")

prints:

Swahili
English
Russian
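For anyone on Python 3, the same idea can be sketched roughly as below. This is only an illustration, not the code from the question: the names VocabularyLinks, extract_vocabulary and in_link are made up here, the module is html.parser instead of HTMLParser, and the texts are collected in a list instead of being printed. It also assumes the class attribute is exactly "Vocabulary" (no multiple class names).

```python
from html.parser import HTMLParser  # Python 3 location of the parser class


class VocabularyLinks(HTMLParser):
    """Collect the text of every <a class="Vocabulary"> element."""

    def __init__(self):
        super().__init__()
        self.in_link = False  # True only while inside a matching <a> element
        self.names = []       # collected link texts

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a" and dict(attrs).get("class") == "Vocabulary":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False  # same spelling as in __init__

    def handle_data(self, data):
        if self.in_link and data.strip():
            self.names.append(data.strip())


def extract_vocabulary(html):
    """Hypothetical helper: return the link texts found in an HTML string."""
    parser = VocabularyLinks()
    parser.feed(html)
    parser.close()
    return parser.names


sample = """
<a href="http://wold.livingsources.org/vocabulary/1" title="Swahili" class="Vocabulary">Swahili</a>
<a href="http://wold.livingsources.org/contributor#schadebergthilo" title="Thilo Schadeberg" class="Contributor">Thilo Schadeberg</a>
<a href="http://wold.livingsources.org/vocabulary/2" title="English" class="Vocabulary">English</a>
"""
print(extract_vocabulary(sample))  # ['Swahili', 'English']
```

Because the flag is cleared in handle_endtag under the same name it was set with, there is no need to reset it at every start tag.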

Hope that helps.

Solution 2

You may try HTQL (http://htql.net). The query for:

"the tags called 'a' with the attribute class="Vocabulary" and I want the data within the tag"

is:

<a (class='Vocabulary')>:tx

The Python code is something like this:

import htql

# page is assumed to hold the HTML source as a string
a = htql.query(page, "<a (class='Vocabulary')>:tx")
print(a)

Author by

IssnKissn

Updated on September 10, 2020

Comments

IssnKissn over 3 years

I started using the HTMLParser in Python to extract data from a website. I get everything I want, except the text between an opening and a closing HTML tag. Here is an example of the HTML tag:

<a href="http://wold.livingsources.org/vocabulary/1" title="Swahili" class="Vocabulary">Swahili</a>

There are also other tags starting with <a. They have other attributes and values, and therefore I do not want their data:

<a href="http://wold.livingsources.org/contributor#schadebergthilo" title="Thilo Schadeberg" class="Contributor">Thilo Schadeberg</a>

The tag is embedded within a table. I don't know whether that makes any difference compared to other tags. I only want the information in the 'a' tags with the attribute class="Vocabulary", and I want the data within the tag; in the example it would be "Swahili". So what I did is:

class AllLanguages(HTMLParser):
    '''
    classdocs
    '''
    #counter for the languages
    #countLanguages = 0
    def __init__(self):
        HTMLParser.__init__(self)
        self.inLink = False
        self.dataArray = []
        self.countLanguages = 0
        self.lasttag = None
        self.lastname = None
        self.lastvalue = None
        #self.text = ""


    def handle_starttag(self, tag, attr):
        #print "Encountered a start tag:", tag      
        if tag == 'a':
            for name, value in attr:
                if name == 'class' and value == 'Vocabulary':
                    self.countLanguages += 1
                    self.inLink = True
                    self.lasttag = tag
                    #self.lastname = name
                    #self.lastvalue = value
                    print self.lasttag
                    #print self.lastname
                    #print self.lastvalue
                    #return tag
                    print self.countLanguages




    def handle_endtag(self, tag):
        if tag == "a":
            self.inlink = False
            #print "".join(self.data)

    def handle_data(self, data):
        if self.lasttag == 'a' and self.inLink and data.strip():
            #self.dataArray.append(data)
            #
            print data

The program prints all the data contained in any <a> tag, but I only want the data contained in the tags with the right attributes. How do I get this specific data?

IssnKissn almost 11 years

Thanks a lot. I expected it to be something small ;). I tried BeautifulSoup too, and that also works perfectly. Thanks again for your help.

alecxe almost 11 years

You're welcome. Consider accepting the answer if it was helpful, thanks!

IssnKissn almost 11 years

Can you recommend a particular parser? I need the data from the HTML file and want to write it to an XML file. Which one would you use? Or what are the advantages of each parser?

alecxe almost 11 years

Well, BeautifulSoup and lxml are decent HTML parsers. lxml is famous for its speed; BeautifulSoup is pretty handy but doesn't support XPath expressions. See more: blog.ianbicking.org/2008/03/30/python-html-parser-performance, stackoverflow.com/questions/3577641/…, stackoverflow.com/questions/6494199/….

IssnKissn almost 11 years

Well, I have to parse a lot of data, so BeautifulSoup is pretty slow for that. But I think I will try lxml. Thanks a lot.
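On the follow-up about writing the extracted data to an XML file: one option that needs no extra install is the standard library's xml.etree.ElementTree. A minimal sketch, assuming Python 3 and assuming the language names have already been collected into a list; the function name names_to_xml and the element names languages and language are invented for illustration, not taken from the thread.

```python
import xml.etree.ElementTree as ET


def names_to_xml(names):
    """Build <languages><language>...</language></languages> and return it as a string."""
    root = ET.Element("languages")
    for name in names:
        # One child element per extracted name; special characters are escaped on output
        ET.SubElement(root, "language").text = name
    return ET.tostring(root, encoding="unicode")


xml_text = names_to_xml(["Swahili", "English", "Russian"])
print(xml_text)
```

To write a file instead of a string, ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True) should do it. lxml offers a largely compatible etree interface, so a sketch like this is likely to carry over if speed becomes the deciding factor.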