Using Python, remove HTML tags/formatting from a string

50,999

Solution 1

If you are going to use regex:

import re
def striphtml(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

>>> striphtml('<a href="foo.com" class="bar">I Want This <b>text!</b></a>')
'I Want This text!'
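A minimal sketch of a slightly sturdier variant, assuming the input is reasonably well-formed: the pattern `<[^>]+>` also matches tags that span several lines (which `.` does not without `re.DOTALL`), and `html.unescape` decodes entities such as `&gt;` after the tags are gone. The names `TAG_RE` and `striphtml_unescape` are made up for this illustration.

```python
import html
import re

# Hypothetical names for illustration. [^>]+ matches newlines too,
# so tags broken across lines are removed as well.
TAG_RE = re.compile(r'<[^>]+>')

def striphtml_unescape(data):
    """Remove anything that looks like a tag, then decode HTML entities."""
    # Strip first, unescape second, so escaped markup such as &lt;b&gt;
    # is kept as literal text instead of being treated as a tag.
    return html.unescape(TAG_RE.sub('', data))

print(striphtml_unescape('<p>hello &gt; world</p>'))  # hello > world
# Limitation shared with the original regex: bare < and > in plain text
# are treated as a tag, so the comparison below is partly eaten.
print(repr(striphtml_unescape('if 3 < 5 then 5 > 3')))  # 'if 3  3'
```

The second call shows why this approach is only safe when the text outside the tags contains no unescaped `<` or `>`.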

Solution 2

AFAIK, using regex to parse HTML is a bad idea. You would be better off using an HTML/XML parser like Beautiful Soup.

Solution 3

Use lxml.html. It's much faster than BeautifulSoup, and getting the raw text takes a single call.

>>> import lxml.html
>>> page = lxml.html.document_fromstring('<!DOCTYPE html>...</html>')
>>> page.cssselect('body')[0].text_content()
'...'

Solution 4

Use SGMLParser. A regex works in simple cases, but HTML has a lot of intricacies you would rather not deal with yourself. Note that sgmllib is a Python 2 module; it was removed in Python 3.

>>> from sgmllib import SGMLParser
>>>
>>> class TextExtracter(SGMLParser):
...     def __init__(self):
...         self.text = []
...         SGMLParser.__init__(self)
...     def handle_data(self, data):
...         self.text.append(data)
...     def getvalue(self):
...         return ''.join(self.text)
...
>>> ex = TextExtracter()
>>> ex.feed('<html>hello &gt; world</html>')
>>> ex.getvalue()
'hello > world'
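Since sgmllib is not available on Python 3, a rough equivalent can be sketched with the standard library's html.parser.HTMLParser. This is an adaptation of the class above, not the original author's code; the helper name `strip_tags` is an assumption made for the example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes and ignore tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &gt;
        # before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.text = []

    def handle_data(self, data):
        self.text.append(data)

    def getvalue(self):
        return ''.join(self.text)

def strip_tags(markup):
    # Hypothetical convenience wrapper around the parser class.
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text the parser may still be buffering
    return parser.getvalue()

print(strip_tags('<html>hello &gt; world</html>'))  # hello > world
```

Like the SGMLParser version, this needs no external dependency.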

Solution 5

Depending on whether the text will contain '>' or '<' I would either just make a function to remove anything between those, or use a parsing lib

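def cleanString(inStr):
  # Find the first '<' and the first '>' that follows it.
  a = inStr.find('<')
  b = inStr.find('>', a + 1)
  if a < 0 or b < 0:
    return inStr  # no complete tag left
  # Keep the text before the tag, then clean the rest of the string.
  return inStr[:a] + cleanString(inStr[b + 1:])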

Author: Blankman

Updated on July 09, 2022

Comments

  • Blankman
    Blankman almost 2 years

    I have a string that contains html markup like links, bold text, etc.

    I want to strip all the tags so I just have the raw text.

    What's the best way to do this? regex?

  • Will McCutchen
    Will McCutchen almost 14 years
    This will only work reliably on well-formed HTML (ie, no unescaped < or > outside of actual tags, no malformed tags like <b class="forgot-to-close", etc.). That being said, this is the first approach I'd use, depending on the source data.
  • derekerdmann
    derekerdmann almost 14 years
    +1 for Beautiful Soup
  • Blankman
    Blankman almost 14 years
    I am using beautifulsoup, but I want to be able to strip html tags manually also. thanks!
  • volting
    volting almost 14 years
    @Blankman it would have been a good idea to mention that in your question
  • Trufa
    Trufa about 13 years
    Please add more clarification as to the very limited situations where that would be a good idea and I'll remove my down-vote. Thank you.
  • Shaokan
    Shaokan almost 13 years
    Plus, this will also remove the following text => "if 3 < 5 then 5 > 3"
  • hasienda
    hasienda almost 13 years
    Thanks, have been looking a while for such a solution requiring no external dependency. Changing ''.join(ex.text) into ''.join(self.text) made it suitable even as a stand-alone class.
  • Adam
    Adam almost 10 years
    Great solution, thanks! Use this snippet for extracting text from HTML fragments: lxml.html.fromstring('some HTML fragment').text_content()
  • syzygy
    syzygy almost 10 years
    He's not parsing HTML, he's removing tags. Parsing HTML/XML is very slow, often the slowest aspect of applications that use it, so I would not recommend BeautifulSoup for this. HTML parsing cannot be done with regex because regexes do not have stacks (LIFOs), and HTML can be arbitrarily nested, which requires a stack to parse.
  • tommy.carstensen
    tommy.carstensen over 8 years
    Why is beautiful soup better for html parsing? I use regexes myself. Have I missed the light? Thanks.
  • Homunculus Reticulli
    Homunculus Reticulli over 6 years
    This should be the accepted answer. Using regex to parse HTML (especially directly off the internet) is a VERY bad idea!
  • Ice Bear
    Ice Bear over 3 years
    well kind of depends on the situation tho...
  • Cyber Axe
    Cyber Axe about 3 years
    This simply strips all HTML code and replaces it with nothing. It would be nice if it inserted appropriate line breaks so you didn't end up with a single line of nonsense.
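
One way to address the line-break concern raised in the last comment, as a minimal sketch using the standard library's html.parser: emit a newline whenever a block-level tag opens or closes, then drop empty lines. The names `BLOCK_TAGS`, `BlockAwareTextExtractor`, and `html_to_text` are hypothetical, and the tag set is illustrative rather than exhaustive.

```python
from html.parser import HTMLParser

# Assumption: an illustrative, non-exhaustive set of tags that start a new line.
BLOCK_TAGS = {'p', 'div', 'br', 'li', 'tr', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

class BlockAwareTextExtractor(HTMLParser):
    """Collect text and insert a newline around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append('\n')

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append('\n')

    def handle_data(self, data):
        self.parts.append(data)

    def getvalue(self):
        # Trim each line and drop the empty ones left by adjacent block tags.
        lines = (line.strip() for line in ''.join(self.parts).splitlines())
        return '\n'.join(line for line in lines if line)

def html_to_text(markup):
    parser = BlockAwareTextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text the parser may still be buffering
    return parser.getvalue()

print(html_to_text('<p>First paragraph</p><p>Second <b>bold</b> one</p>'))
# First paragraph
# Second bold one
```

Inline tags such as `<b>` are removed without adding a break, so the text inside a paragraph stays on one line.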