SGML Parser in Python

Solution 1

Beautiful Soup is one way to parse this nicely (and it's what I would always use myself, unless there were some extremely good reason not to). It's a lot simpler and more readable than using SGMLParser.

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('''<post id='100'> <title> new title </title> <text> <p> new text </p> </text> </post>''')
>>> soup('post')  # soup.findAll('post') is equivalent
[<post id="100"> <title> new title </title> <text> <p> new text </p> </text> </post>]
>>> for post in soup('post'):
...     print post.findChild('text')
...
<text> <p> new text </p> </text>

Once you've got this far, you can do various things with it, depending on what you want.

>>> post = soup.find('post')
>>> post
<post id="100"> <title> new title </title> <text> <p> new text </p> </text> </post>
>>> post_text = post.findChild('text')
>>> post_text
<text> <p> new text </p> </text>

You might want to strip out the HTML.

>>> post_text.text
u'new text'

Or perhaps look at the contents...

>>> post_text.renderContents()
' <p> new text </p> '
>>> post_text.contents
[u' ', <p> new text </p>, u' ']

There are all sorts of things you could want to do. It helps if you're more specific, in particular by providing real data.

When it comes to manipulating the tree, you can do that too.

>>> post
<post id="100"> <title> new title </title> <text> <p> new text </p> </text> </post>
>>> post.title  # Just as good as post.findChild('title')
<title> new title </title>
>>> post.title.extract()  # Removes it from the tree and returns it, though we have no need for it
<title> new title </title>
>>> post  # title is gone!
<post id="100">  <text> <p> new text </p> </text> </post>
>>> post.findChild('text').replaceWithChildren()  # Throws away the <text> wrapper, keeping its children
>>> post
<post id="100">   <p> new text </p>  </post>

And so, finally, you'd have something like this:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('''
... <post id='100'> <title> new title 100 </title> <text> <p> new text 100 </p> </text> </post>
... <post id='101'> <title> new title 101 </title> <text> <p> new text 101 </p> </text> </post>
... <post id='102'> <title> new title 102 </title> <text> <p> new text 102 </p> </text> </post>
... ''')
>>> for post in soup('post'):
...     post.title.extract()
...     post.findChild('text').replaceWithChildren()
... 
<title> new title 100 </title>
<title> new title 101 </title>
<title> new title 102 </title>
>>> soup

<post id="100">   <p> new text 100 </p>  </post>
<post id="101">   <p> new text 101 </p>  </post>
<post id="102">   <p> new text 102 </p>  </post>
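The session above is Python 2 (note the `print` statement and the `BeautifulSoup` 3 import). If Beautiful Soup isn't an option, the same clean-up can be roughly sketched with the Python 3 standard library's `html.parser`. This swaps the tree-based approach for a streaming one; the names `PostRewriter` and `rewrite_posts` are made up for illustration, and it assumes the input looks like the sample posts (no nested `<title>` elements, nothing exotic inside the tags).

```python
import html
from html.parser import HTMLParser


class PostRewriter(HTMLParser):
    """Re-emit the input, dropping <title> elements and unwrapping <text>."""

    def __init__(self):
        super().__init__()
        self.out = []          # pieces of the rewritten document
        self.in_title = False  # True while inside a <title> element

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag != "text":
            # get_starttag_text() returns the start tag as it appeared in the input
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag != "text":
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Skip text inside <title>; re-escape everything else, because
        # HTMLParser decodes character references by default.
        if not self.in_title:
            self.out.append(html.escape(data, quote=False))


def rewrite_posts(source):
    """Return `source` with titles removed and <text> wrappers stripped."""
    parser = PostRewriter()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)


sample = ("<post id='100'> <title> new title 100 </title> "
          "<text> <p> new text 100 </p> </text> </post>")
print(rewrite_posts(sample))
```

Unlike Beautiful Soup, this keeps the attribute quoting of the input and never builds a tree, so it only suits simple, one-pass rewrites like this one.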

Solution 2

Your code resets the "title" attribute every time end_title() is called. The title you end up with is therefore the last title in the document.

What you need to do is store a list of all the titles you find. In the following, I also reset data to None at the end of each title (so you don't collect text data outside of title elements), and I used "".join instead of string.join because the latter is considered old-fashioned.

import sgmllib  # Python 2 standard library; the module was removed in Python 3

class ExtractTitle(sgmllib.SGMLParser):
  def __init__(self, verbose=0):
    sgmllib.SGMLParser.__init__(self, verbose)
    self.titles = []
    self.data = None

  def handle_data(self, data):
    if self.data is not None:
      self.data.append(data)

  def start_title(self, attrs):
    self.data = []

  def end_title(self):
    self.titles.append("".join(self.data))
    self.data = None

and here it is in use:

>>> parser = ExtractTitle()
>>> parser.feed("<doc><rec><title>Spam and Eggs</title></rec>" +
...             "<rec><title>Return of Spam and Eggs</title></rec></doc>")
>>> parser.close()
>>> parser.titles
['Spam and Eggs', 'Return of Spam and Eggs']
>>> 
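Since sgmllib no longer exists in Python 3, here is a minimal sketch of the same idea on top of `html.parser.HTMLParser`, which calls `handle_starttag` and `handle_endtag` for every tag rather than dispatching to `start_title` / `end_title` methods. The class name `TitleCollector` is just an illustrative choice, and this assumes the document is regular enough for `HTMLParser` to tokenize.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text of every <title> element, in document order."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.data = None  # None means "not currently inside a <title>"

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.data = []  # start collecting text for this title

    def handle_data(self, data):
        if self.data is not None:
            self.data.append(data)

    def handle_endtag(self, tag):
        # Ignore a stray </title> that has no matching start tag
        if tag == "title" and self.data is not None:
            self.titles.append("".join(self.data))
            self.data = None


parser = TitleCollector()
parser.feed("<doc><rec><title>Spam and Eggs</title></rec>"
            "<rec><title>Return of Spam and Eggs</title></rec></doc>")
parser.close()
print(parser.titles)  # ['Spam and Eggs', 'Return of Spam and Eggs']
```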

Solution 3

Use lxml instead of SGMLParser:

>>> posts = """
... <post id='100'> <title> xxxx </title> <text> <p> yyyyy </p> </text> </post>
... <post id='101'> <title> new title1 </title> <text> <p> new text1 </p> </text> </post>
... <post id='102'> <title> new title2 </title> <text> <p> new text2 </p> </text> </post>
... """
>>> from lxml import html
>>> parsed = html.fromstring(posts)
>>> new_file = html.Element('div')
>>> for post in parsed:
...     post_id = post.attrib['id']
...     post_text = post.find('text').text_content()
...     new_post = html.Element('post', id=post_id)
...     new_post.text = post_text
...     new_file.append(new_post)
... 
>>> html.tostring(new_file)
'<div><post id="100"> yyyyy  </post><post id="101"> new text1  </post><post id="102"> new text2  </post></div>'
>>> 

Author by

afg102

Updated on May 15, 2022

Comments

  • afg102
    afg102 about 2 years

    I am completely new to Python. I have the following code:

    class ExtractTitle(sgmllib.SGMLParser):

        def __init__(self, verbose=0):
            sgmllib.SGMLParser.__init__(self, verbose)
            self.title = self.data = None

        def handle_data(self, data):
            if self.data is not None:
                self.data.append(data)

        def start_title(self, attrs):
            self.data = []

        def end_title(self):
            self.title = string.join(self.data, "")
            raise FoundTitle # abort parsing!
    

    which extracts the title element from SGML, however it only works for a single title. I know I have to overload the unknown_starttag and unknown_endtag in order to get all titles but I keep getting it wrong. Help me please!!!

  • virhilo
    virhilo over 13 years
    Beautiful Soup is slow and dead;)
  • afg102
    afg102 over 13 years
    thanks for your reply. I am trying to extract from a file so I did : filexy = open(fileurl) and posts = filexy.read() and then your code. However for some reason it is only showing the same text (i.e. it is not looping through all the tags) Do you have any idea? Thanks
  • Chris Morgan
    Chris Morgan over 13 years
    @virhilo: "slow"? Perhaps in processing it is, but in development time it tends to be brilliantly fast. And that's generally what matters now. And "dead"? It's got practically all that's desired, there's not anything much extra to do for it. The fact that it doesn't have any active development going on (which I will grant you) doesn't bother me much at all.
  • virhilo
    virhilo over 13 years
    could you paste some example document?
  • afg102
    afg102 over 13 years
    Thanks guys, it's working now :) any ideas on how to write the results to an external file pls?
  • Chris Morgan
    Chris Morgan over 13 years
    @afg102: with what I've got, you can then write it to a file with outfile = open('filename', 'w'), outfile.write(soup.renderContents()) (unicode(soup) would work just as well, too)
  • afg102
    afg102 over 13 years
    I was wondering if maybe you guys were into NLTK. I'm using the function FreqDist to get the frequency of the words in my text obtained from the file I generated. I tried this: filey = open(fileurl") p= filey.read() fdist = FreqDist(p) vocab = fdist.keys() vocab[:30] -> but the result is a list of single letters whereas in the example from the nltk website, that should result in a list of whole words. Any help please?
  • Andrew Dalke
    Andrew Dalke over 13 years
    How did it not work? What's your test case and how did it fail? I added an example to show that it does work for me.
  • afg102
    afg102 over 13 years
    ok great :D i had a slight error in my code. Thanks a lot! Do you have any idea about another question I posted pls? stackoverflow.com/questions/4634787/freqdist-with-nltk
  • TryPyPy
    TryPyPy over 13 years
    That'd be a new question, no?