Sanitizing HTML input
Solution 1
You will have to decide between good and lightweight. The recommended choice is 'HTMLPurifier', because it provides no-fuss secure defaults. As a faster alternative, 'htmLawed' is often advised.
See also this quite objective overview from the HTMLPurifier author: http://htmlpurifier.org/comparison
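As a minimal sketch of what those defaults look like in practice, the following runs a string through HTML Purifier with an untouched default configuration. The include path and the `$dirty` sample are assumptions for illustration; adjust them to your own installation.

```php
<?php
// Assumed location of the library's autoloader stub; change to match your setup.
require_once '/path/to/htmlpurifier/library/HTMLPurifier.auto.php';

// Hypothetical untrusted input, e.g. from a rich text editor.
$dirty = '<p onclick="evil()">Hello <b>world</p><script>alert(1)</script>';

// Default configuration: no directives changed.
$config   = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);

// purify() returns a cleaned copy of the input; the input string is left untouched.
$clean = $purifier->purify($dirty);

echo $clean;
```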
Solution 2
I really like HTML Purifier, which allows you to specify which tags and attributes are allowed in your HTML code -- and generates valid HTML.
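For instance, a whitelist can be set through the `HTML.Allowed` directive. This is only a sketch: the include path and the particular list of tags and attributes are assumptions, chosen as a plausible minimum for a rich text editor.

```php
<?php
// Assumed include path; change to match your setup.
require_once '/path/to/htmlpurifier/library/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();

// Example whitelist: elements separated by commas, allowed attributes in brackets.
// Anything not listed here is stripped from the output.
$config->set('HTML.Allowed', 'p,br,b,i,ul,ol,li,a[href]');

$purifier = new HTMLPurifier($config);

// Hypothetical editor output containing a tag and an attribute that are not whitelisted.
$clean = $purifier->purify('<p style="color:red">Some <b>bold</b> text<img src="x.png"></p>');

echo $clean;
```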
Solution 3
Use BBCode (or Markdown, as here on SO); otherwise the chances of ending up with safe, valid markup are very slim. Example function...
// Convert a small set of BBCode tags to HTML.
// Note: nothing here escapes or validates the input, so do not use this on untrusted text.
function parse($string) {
    // BBCode patterns; each entry lines up with the same index in $replacement.
    $pattern = array(
        "/\[url\](.*?)\[\/url\]/",
        "/\[img\](.*?)\[\/img\]/",
        "/\[img\=(.*?)\](.*?)\[\/img\]/",
        "/\[url\=(.*?)\](.*?)\[\/url\]/",
        "/\[red\](.*?)\[\/red\]/",
        "/\[b\](.*?)\[\/b\]/",
        "/\[h(.*?)\](.*?)\[\/h(.*?)\]/",
        "/\[p\](.*?)\[\/p\]/",
        "/\[php\](.*?)\[\/php\]/is"
    );
    // HTML output for each pattern; \\1, \\2, \\3 refer to the captured groups.
    $replacement = array(
        '<a href="\\1">\\1</a>',
        '<img alt="" src="\\1"/>',
        '<img alt="" class="\\1" src="\\2"/>',
        '<a rel="nofollow" target="_blank" href="\\1">\\2</a>',
        '<span style="color:#ff0000;">\\1</span>',
        '<span style="font-weight:bold;">\\1</span>',
        '<h\\1>\\2</h\\3>',
        '<p>\\1</p>',
        '<pre><code class="php">\\1</code></pre>'
    );
    $string = preg_replace($pattern, $replacement, $string);
    // Turn remaining newlines into <br /> tags.
    $string = nl2br($string);
    return $string;
}
...
echo parse("[h2]Lorem Ipsum[/h2][p]Dolor sit amet[/p]");
Result...
<h2>Lorem Ipsum</h2><p>Dolor sit amet</p>
Or just use HTML Purifier :)
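If you do stay with BBCode, the parser above needs at least two additions before it sees untrusted input: escape all HTML first, and validate URLs before they are placed in an attribute. The sketch below shows one way to do that for a reduced tag set. The names `parse_safe()` and `is_safe_url()` are hypothetical and not part of the original answer, and the http/https-only URL rule is an assumption about what you want to allow.

```php
<?php
// Hypothetical helper: accept only absolute http/https URLs.
function is_safe_url(string $url): bool
{
    // The text was already HTML-escaped, so decode it before inspecting it.
    $decoded = html_entity_decode($url, ENT_QUOTES, 'UTF-8');
    // Reject whitespace, quotes and angle brackets anywhere in the URL.
    return preg_match('~^https?://[^\s"\'<>]+$~i', $decoded) === 1;
}

// Hypothetical replacement for parse(), covering only [b], [p] and [img].
function parse_safe(string $input): string
{
    // 1. Neutralise every piece of HTML in the input, including quotes.
    $text = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // 2. Convert simple tags that carry no attributes.
    $text = preg_replace('~\[b\](.*?)\[/b\]~s', '<strong>$1</strong>', $text);
    $text = preg_replace('~\[p\](.*?)\[/p\]~s', '<p>$1</p>', $text);

    // 3. Convert [img] only when the URL passes validation;
    //    otherwise leave the (already escaped) BBCode text as it is.
    $text = preg_replace_callback(
        '~\[img\](.*?)\[/img\]~s',
        function (array $m): string {
            return is_safe_url($m[1]) ? '<img alt="" src="' . $m[1] . '"/>' : $m[0];
        },
        $text
    );

    return $text;
}

echo parse_safe('[b]<script>alert(1)</script>[/b]');
```

This narrows the attack surface but is not a substitute for a real purifier; nested or malformed tags are still not balanced.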
Solution 4
Both HTML Purifier and htmLawed are good. htmLawed has the advantage of a much smaller footprint and high configurability. Besides doing the standard work of balancing tags, filtering specific HTML tags or their attributes or attribute content (through white or black lists), etc., it also allows the use of custom functions.
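A minimal htmLawed call might look like the sketch below. The include path, the sample input and the element list are assumptions; `safe` and `elements` are htmLawed configuration parameters, but check the htmLawed documentation for how they interact before relying on this.

```php
<?php
// Assumed location of the single-file library; change to match your setup.
require_once '/path/to/htmLawed.php';

// Hypothetical untrusted input.
$dirty = '<p>Hello <b>world<script>alert(1)</script></p>';

$config = array(
    'safe'     => 1,                               // enable htmLawed's stricter "safe" settings
    'elements' => 'a, b, i, em, strong, p, br, ul, ol, li', // example element whitelist
);

// htmLawed() returns the filtered markup with tags balanced.
$clean = htmLawed($dirty, $config);

echo $clean;
```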
James P.
Updated on June 04, 2022
Comments
-
James P. almost 2 years
I'm thinking of adding a rich text editor to allow a non-programmer to change the appearance of text. However, one issue is that it's possible to distort the layout of a rendered page if the markup is incorrect. What's a good lightweight way to sanitize HTML?
-
James P. about 13 years
Good suggestion. I'm wondering why an animated dragon appeared when upvoting you though :p .
-
Lauren about 13 years
For BBCode to be secure, you would have to run it through a purifier such as HTMLPurifier anyway, so there's really no point. Naive BBCode is wide open to exploits: consider what the input string
[img]http://picture.of.a/pony.png" onload="execute(); arbitrary(); javascript();[/img]
would produce using the above parser.
-
Dejan Marjanović about 13 years
Yup, definitely not for public usage. I ignored the security aspect completely; I thought it was for private usage. @James P., use HTMLPurifier ;)
-
James P. about 13 years
Thanks. I got HTMLPurifier working. The documentation isn't easy to read, but I managed to get it to filter some rich text down to a minimum and changed the charset to ISO to avoid accents getting removed.
-
ymakux over 7 years
To anyone considering htmLawed: first look at the code - you'll cry. There's no alternative to HTMLPurifier at this moment. Just to save you time.
-
ymakux over 7 years
+ nice things like $GLOBALS['C'] = $C;
-
user594694 about 7 years
What's wrong with the code? Just because you cannot understand it does not make it bad. htmLawed is so much faster, smaller and more efficient than HTMLPurifier that it should not be dismissed just because it is not written the way you like.
-
DennisK over 5 years
The htmLawed author seems to have no sense of security. The website and forum are not using HTTPS, and the website urges you to disable Composer's secure-http, as he cannot be arsed to move to HTTPS or a Git repository. I wouldn't trust that person with anything security-related.