Remove all inline styles using BeautifulSoup
Solution 1
You don't need to parse any CSS if you just want to remove it all. BeautifulSoup provides a way to remove entire attributes like so:
for tag in soup():
    for attribute in ["class", "id", "name", "style"]:
        del tag[attribute]
Also, if you just want to delete entire tags (and their contents), you don't need extract(), which returns the tag. You just need decompose():
[tag.decompose() for tag in soup("script")]
Not a big difference, but just something else I found while looking at the docs. You can find more details about the API in the BeautifulSoup documentation, with many examples.
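For anyone who wants to try the two snippets above end to end, here is a minimal sketch run against the sample HTML from the question. It assumes BeautifulSoup 4 (the bs4 package) with the built-in "html.parser" backend; the extra script tag and the variable names html and cleaned are only there for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, plus a <script> tag for illustration.
html = (
    '<p class="author" id="author_id" name="author_name" '
    'style="color:red;">Text</p> '
    '<img class="some_image" href="somewhere.com">'
    '<script>alert(1)</script>'
)
soup = BeautifulSoup(html, "html.parser")

# Drop <script> tags together with their contents.
for tag in soup("script"):
    tag.decompose()

# Remove the unwanted attributes from every remaining tag.
for tag in soup():
    for attribute in ["class", "id", "name", "style"]:
        del tag[attribute]

cleaned = str(soup)
print(cleaned)
```

The href attribute survives because it is not in the list of attributes to delete.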
Solution 2
I wouldn't do this in BeautifulSoup: you'll spend a lot of time trying, testing, and working around edge cases.
Bleach does exactly this for you. http://pypi.python.org/pypi/bleach
If you were to do this in BeautifulSoup, I'd suggest you go with the "whitelist" approach, like Bleach does. Decide which tags may have which attributes, and strip every tag/attribute that doesn't match.
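If you do go the BeautifulSoup route, the whitelist idea could look roughly like the sketch below. It assumes BeautifulSoup 4; the ALLOWED policy and the sanitize name are hypothetical, made up for this example, and this is not a substitute for a real sanitizer when the HTML is untrusted.

```python
from bs4 import BeautifulSoup

# Hypothetical policy: tag name -> attributes allowed on that tag.
ALLOWED = {
    "p": set(),
    "a": {"href", "title"},
    "img": {"src", "alt"},
}

def sanitize(html):
    """Strip every tag/attribute that the ALLOWED policy does not list."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED:
            # Remove the tag itself but keep its children in place.
            tag.unwrap()
            continue
        # Copy the attribute names first, since we delete while looping.
        for attr in list(tag.attrs):
            if attr not in ALLOWED[tag.name]:
                del tag[attr]
    return str(soup)

out = sanitize('<div class="x"><a href="/a" onclick="evil()" style="c">link</a></div>')
print(out)
```

Note that unwrap() keeps the text inside a disallowed tag, which is one of the edge cases you would have to think about (script contents, for example).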
Solution 3
Here's my solution for Python3 and BeautifulSoup4:
def remove_attrs(soup, whitelist=tuple()):
    for tag in soup.findAll(True):
        for attr in [attr for attr in tag.attrs if attr not in whitelist]:
            del tag[attr]
    return soup
It supports a whitelist of attributes which should be kept. :) If no whitelist is supplied all the attributes get removed.
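A quick usage sketch, assuming BeautifulSoup 4 with the "html.parser" backend. The function is repeated here so the snippet runs on its own (using find_all, the current spelling of findAll), and the names html, kept and stripped are just for this example.

```python
from bs4 import BeautifulSoup

def remove_attrs(soup, whitelist=tuple()):
    for tag in soup.find_all(True):
        # Build the list first so we are not deleting while iterating attrs.
        for attr in [attr for attr in tag.attrs if attr not in whitelist]:
            del tag[attr]
    return soup

html = (
    '<p class="author" style="color:red;">Text</p> '
    '<img class="some_image" href="somewhere.com">'
)

# Keep only href; everything else is removed.
kept = str(remove_attrs(BeautifulSoup(html, "html.parser"), whitelist=("href",)))

# No whitelist: every attribute is removed.
stripped = str(remove_attrs(BeautifulSoup(html, "html.parser")))

print(kept)
print(stripped)
```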
Solution 4
What about lxml's Cleaner?
from lxml.html.clean import Cleaner
content_without_styles = Cleaner(style=True).clean_html(content)
Solution 5
Based on jmk's function, I use this function to remove attributes based on a whitelist:
It works in Python 2 with BeautifulSoup 3.
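def clean(tag, whitelist=()):
    # Immutable default: a mutable [] default would be shared across calls.
    tag.attrs = None
    for e in tag.findAll(True):
        # Iterate over a copy, because del e[...] mutates e.attrs.
        for attribute in list(e.attrs):
            if attribute[0] not in whitelist:
                del e[attribute[0]]
        #e.attrs = None #delete all attributes
    return tag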
#example to keep only title and href
clean(soup,["title","href"])
Ila
Updated on April 20, 2021
Comments
-
Ila about 3 years
I'm doing some HTML cleaning with BeautifulSoup. Noob to both Python & BeautifulSoup. I've got tags being removed correctly as follows, based on an answer I found elsewhere on Stackoverflow:
[s.extract() for s in soup('script')]
But how to remove inline styles? For instance the following:
<p class="author" id="author_id" name="author_name" style="color:red;">Text</p> <img class="some_image" href="somewhere.com">
Should become:
<p>Text</p> <img href="somewhere.com">
How to delete the inline class, id, name & style attributes of all elements?
Answers to other similar questions I could find all mentioned using a CSS parser to handle this, rather than BeautifulSoup, but as the task is simply to remove rather than manipulate the attributes, and is a blanket rule for all tags, I was hoping to find a way to do it all within BeautifulSoup.
-
jmk over 11 years
Cool, I didn't know about Bleach. I wasn't thinking of the use case, but if the goal is to sanitize untrusted HTML, then this definitely seems like a better approach. You get my upvote!
-
Jonathan Vanasco over 11 years
Bleach is pretty great. I really like it.
-
Ila over 11 years
I was using extract() in case I decided to generate a list of removed code at any point, but decompose() works just as well for completely removing & destroying tags & content. Thanks for the attribute-delete snippet, works like a charm!
-
jmk over 11 years
Makes sense. I'll leave the note about decompose() for anyone else who might stumble across this.
-
Can Bascil over 10 years
You shouldn't be passing mutable structures as default function parameter values. As seen here.
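To illustrate the point about mutable defaults, here is a small stdlib-only sketch. The function names append_bad and append_good are hypothetical and not taken from any answer above.

```python
def append_bad(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # and is shared by every call that omits `bucket`.
    bucket.append(item)
    return bucket

def append_good(item, bucket=None):
    # Use None as a sentinel and build a fresh list on each call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

first_bad = append_bad("href")
second_bad = append_bad("title")    # same list object as first_bad

first_good = append_good("href")
second_good = append_good("title")  # independent list

print(second_bad)
print(second_good)
```

For a read-only whitelist, an immutable default such as an empty tuple avoids the problem as well, which is what Solution 3 does.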