Python, remove all html tags from string

Solution 1

You could call get_text() on each tag returned by find_all(), instead of converting the result to a string:

for i in content:
    print(i.get_text())

The example below is from the docs:

>>> markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> soup.get_text()
'\nI linked to example.com\n'
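
To connect this to the question, here is a minimal sketch of calling get_text() on each paragraph tag while it is still a tag object. The sample markup and the variable names are made up for illustration, and it assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
html = (
    "<div><p>First <b>bold</b> paragraph.</p>"
    "<p>Second <a href='#'>linked</a> one.</p></div>"
)
soup = BeautifulSoup(html, "html.parser")

# Keep the Tag objects: calling str() on the result of find_all()
# would serialize the markup, tags included, back into one string.
paragraphs = soup.find_all("p")
texts = [p.get_text() for p in paragraphs]

for line in texts:
    print(line)
```

Each entry in texts holds only the text of one paragraph, with the nested b and a tags dropped.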

Solution 2

Using a regular expression:

import re
text = re.sub('<[^<]+?>', '', text)
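
A small sketch of the regular-expression approach wrapped in a function, with entity decoding added through the standard library's html.unescape. The function name strip_tags and the sample string are illustrative, not part of the original answer.

```python
import re
from html import unescape

# Same pattern as above: "<", then one or more non-"<" characters, then ">".
TAG_RE = re.compile(r"<[^<]+?>")


def strip_tags(text):
    """Remove anything that looks like a tag, then decode HTML entities."""
    # Decode after stripping, so an escaped "&lt;b&gt;" survives as literal text.
    return unescape(TAG_RE.sub("", text))


sample = "<p>Fish &amp; <b>chips</b></p>"
print(strip_tags(sample))
```

This is only reasonable for simple, well-formed fragments. The pattern knows nothing about HTML, so a ">" inside an attribute value, a comment, or the contents of a script element will not be handled correctly, and plain text such as "1 < 2 and 3 > 2" loses everything between the angle brackets.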

Using BeautifulSoup (solution from here):

from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
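
To see what the whitespace clean-up does on its own, here is the same three-step logic as a stand-alone function applied to a made-up string. The name tidy_text and the sample input are assumptions for illustration.

```python
def tidy_text(text):
    """Collapse the ragged whitespace that get_text() tends to leave behind."""
    # Strip leading and trailing space on every line.
    lines = (line.strip() for line in text.splitlines())
    # Treat a run of two spaces as a separator between phrases.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop the empty pieces and put one phrase on each line.
    return "\n".join(chunk for chunk in chunks if chunk)


raw = "  Headline one    Headline two \n\n\n   Body text.  \n"
print(tidy_text(raw))
```

The two headlines separated by several spaces end up on separate lines, and the blank lines disappear.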

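Using NLTK (older releases only: nltk.clean_html() was removed in NLTK 3, where calling it raises NotImplementedError and points to BeautifulSoup's get_text() instead):

import nltk
from urllib import urlopen  # Python 2 import, matching NLTK 2.x
url = "https://stackoverflow.com/questions/tagged/python"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print(raw)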
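
Because nltk.clean_html() raises NotImplementedError in NLTK 3 and later, a replacement is needed on current installations. The sketch below is not NLTK's implementation; it is a swapped-in alternative built on the standard library's html.parser, and the names TextExtractor and html_to_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of script and style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside script/style elements.
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        return "".join(self._parts)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


sample = "<p>Tom &amp; <b>Jerry</b></p><script>var x = 1;</script>"
print(html_to_text(sample))
```

For real pages BeautifulSoup remains the more forgiving choice, since it copes better with broken markup.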

Solution 3

Use the .strings generator. It is available on a single tag or on the soup itself, not on the list returned by find_all():

for text in content.strings:
    print(text)

Solution 4

Pyparsing makes it easy to write an HTML stripper by defining a pattern matching all opening and closing HTML tags, and then transforming the input using that pattern as a suppressor. This still leaves the &xxx; HTML entities to be converted - you can use xml.sax.saxutils.unescape to do that:

source = """
<p><strong>Editors' Pick: Originally published March 22.<br /> <br /> Apple</strong> <span class=" TICKERFLAT">(<a href="/quote/AAPL.html">AAPL</a> - <a href="http://secure2.thestreet.com/cap/prm.do?OID=028198&amp;ticker=AAPL">Get Report</a><a class=" arrow" href="/quote/AAPL.html"><span class=" tickerChange" id="story_AAPL"></span></a>)</span> is waking up the echoes with the reintroduction of a&nbsp;4-inch iPhone, a model&nbsp;its creators hope will lead the company to victory not just in emerging markets, but at home as well.</p> 
<p>&quot;There's significant pent-up demand within Apple's base of iPhone owners who want a smaller iPhone with up-to-date specs and newer features,&quot; Jackdaw Research Chief Analyst Jan Dawson said in e-mailed comments.</p> 
<p>The new model, dubbed the iPhone SE, &quot;should unleash a decent upgrade cycle over the coming months,&quot; Dawson said.&nbsp;Prior to the iPhone 6 and 6 Plus, introduced in 2014, Apple's iPhones were small, at 3.5 inches and 4 inches tall, respectively, compared with models by Samsung and others that approached 6 inches.</p>
<div class=" butonTextPromoAd">
 <div class=" ym" id="ym_44444440"></div>"""

from pyparsing import anyOpenTag, anyCloseTag
from xml.sax.saxutils import unescape

def unescape_xml_entities(s):
    # unescape() handles &amp;, &lt; and &gt; itself; the dict adds the rest
    return unescape(s, {"&apos;": "'", "&quot;": '"', "&nbsp;": " "})

stripper = (anyOpenTag | anyCloseTag).suppress()

print(unescape_xml_entities(stripper.transformString(source)))

gives:

Editors' Pick: Originally published March 22.  Apple (AAPL - Get Report) is waking up the echoes with the reintroduction of a 4-inch iPhone, a model its creators hope will lead the company to victory not just in emerging markets, but at home as well. 
"There's significant pent-up demand within Apple's base of iPhone owners who want a smaller iPhone with up-to-date specs and newer features," Jackdaw Research Chief Analyst Jan Dawson said in e-mailed comments. 
The new model, dubbed the iPhone SE, "should unleash a decent upgrade cycle over the coming months," Dawson said. Prior to the iPhone 6 and 6 Plus, introduced in 2014, Apple's iPhones were small, at 3.5 inches and 4 inches tall, respectively, compared with models by Samsung and others that approached 6 inches.
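
For the entity step, Python 3's standard library also offers html.unescape, which knows the named HTML entities and so needs no replacement table. The sketch below is an alternative to the saxutils call, not part of the pyparsing answer; decode_entities and its plain_spaces parameter are hypothetical names.

```python
from html import unescape


def decode_entities(text, plain_spaces=True):
    """Decode HTML entities; optionally turn non-breaking spaces into spaces."""
    decoded = unescape(text)
    if plain_spaces:
        # &nbsp; decodes to U+00A0, which is not the same as an ASCII space.
        decoded = decoded.replace("\u00a0", " ")
    return decoded


print(decode_entities("a&nbsp;4-inch iPhone, &quot;SE&quot; &amp; more"))
```

Whether to replace U+00A0 depends on the use case; keeping it preserves the original line-breaking intent, while replacing it makes later whitespace splitting behave as expected.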

(And in future, please do not provide sample text or code as non-copy-pasteable images.)

Solution 5

If you are restricted from using any library, you can use the code below to remove the HTML tags.

I just corrected what you tried. Thanks for the idea.

content="<h4 style='font-size: 11pt; color: rgb(67, 67, 67); font-family: arial, sans-serif;'>Sample text for display.</h4> <p>&nbsp;</p>"


' '.join([word for line in [item.strip() for item in content.replace('<',' <').replace('>','> ').split('>') if not (item.strip().startswith('<') or (item.strip().startswith('&') and item.strip().endswith(';')))] for word in line.split() if not (word.strip().startswith('<') or (word.strip().startswith('&') and word.strip().endswith(';')))])

Author: Mustard Tiger

Updated on March 29, 2020

Comments

  • Mustard Tiger, about 4 years

    I am trying to extract the article content from a website using BeautifulSoup, with the code below:

    site= 'www.example.com'
    page = urllib2.urlopen(req)
    soup = BeautifulSoup(page)
    content = soup.find_all('p')
    content=str(content)
    

    The content object contains all of the main text from the page that is within the 'p' tags. However, other tags are still present in the output, as can be seen in the image below. I would like to remove everything enclosed in a matching pair of < >, along with the angle brackets themselves, so that only the text remains.

    I have tried the following method, but it does not seem to work.

    ' '.join(item for item in content.split() if not (item.startswith('<') and item.endswith('>')))
    

    What is the best way to remove substrings in a string that begin and end with a certain pattern, such as < and >?

    (image not included)