How do I strip all javascript out of an HTML document using PHP?


Solution 1

echo preg_replace('/<script\b[^>]*>(.*?)<\/script>/is', "", $var); 

As shown here.
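A quick way to see what this pattern does and does not catch is sketched below. The sample inputs and variable names are made up for illustration (the nested-tag input is the one quoted in the comments further down); only the pattern itself comes from the answer.

```php
<?php
// Pattern from the answer above: removes <script ...>...</script> blocks, contents included.
$pattern = '/<script\b[^>]*>(.*?)<\/script>/is';

// Hypothetical sample: one script block plus one inline event handler.
$var = '<p onclick="alert(1)">Hi</p><script type="text/javascript">alert(2);</script>';
$clean = preg_replace($pattern, '', $var);
echo $clean, PHP_EOL;
// prints: <p onclick="alert(1)">Hi</p>   (the onclick handler survives)

// Nested-tag input: a single pass stitches a new <script> tag together.
$nested = '<scrip<script></script>t>alert(1337)</script>';
$stitched = preg_replace($pattern, '', $nested);
echo $stitched, PHP_EOL;
// prints: <script>alert(1337)</script>
```

So the one-liner is probably fine for trusted input in a known format, but it should not be relied on as protection against hostile markup.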

Solution 2

You can use strip_tags, passing the tags you wish to allow (a whitelist) as the second parameter, but that will not remove inline JS, which might be present in attributes such as onclick.

echo strip_tags($html, '<p><a><small>');
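To also drop the inline handlers that strip_tags leaves behind, one option is to parse the markup and remove them from the tree. The following is only a minimal sketch, not part of the original answer: it assumes PHP's DOM extension is available, the helper name removeInlineJs is hypothetical, and it deliberately ignores other vectors such as data: URLs or CSS expression(...).

```php
<?php
// Sketch: remove <script> elements, on* attributes and javascript: URLs via the DOM.
// removeInlineJs is a hypothetical helper name; requires PHP's DOM extension.
function removeInlineJs(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $previous = libxml_use_internal_errors(true); // tolerate malformed real-world markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName returns a live list, so copy it before removing nodes.
    foreach (iterator_to_array($doc->getElementsByTagName('script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    // Walk every attribute in the document.
    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('//@*')) as $attr) {
        $name  = strtolower($attr->nodeName);
        $value = strtolower(preg_replace('/\s+/', '', $attr->nodeValue));
        // Drops any attribute whose name starts with "on" (event handlers)
        // and any attribute whose value starts with "javascript:".
        if (strpos($name, 'on') === 0 || strpos($value, 'javascript:') === 0) {
            $attr->ownerElement->removeAttribute($attr->nodeName);
        }
    }

    // The parser wraps fragments in <html><body>; return only the body's children.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$html = '<p>Hi <a href="#" onclick="alert(1)">Click</a><script>alert(2)</script></p>';
echo removeInlineJs($html), PHP_EOL;
```

Working on a parsed tree avoids the nested-tag tricks that defeat plain string replacement, but for untrusted input a maintained sanitizer is still the safer choice.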

Solution 3

See the article "Create a regex to strip javascript from Html", and its Part 2.

Solution 4

There's no guarantee with this (see below), but I tried to write my own lightweight solution because HTML Purifier (http://htmlpurifier.org) is rather large for my small goal. My goal is to prevent XSS and nothing more, so the output for XSS attempts will look messy with this code, BUT I think it will be SAFE:

<?php
// Patterns this function tries to neutralize:
//   href="javascript:
//   style="....expression
//   style="....behavior
//   <script
//   on*="
$str = '
    asd 
    <a STyLE="asd; expression" hRef=" javascript:" onx="asd">asd</a>
    asd
    <code><a href="javascript:">asd</a></code>
    <scr<script></script>ipt ... >asd</script>
    <a style="hey:good boy;" href="javascript:">asd</a>';

function stripteaser($str, $StripHTMLTags = true, $AllowableTags = NULL) {
    $str = explode('<code>', $str);
    $codes = array();
    if (count($str) > 1) {
        foreach ($str as $idx => $val) {
            $val = explode('</code>', $val);
            if (count($val) > 1) {
                $uid = md5(uniqid(mt_rand(), true));
                $codes[$uid] = htmlentities(array_shift($val), ENT_QUOTES, 'UTF-8');
                $str[$idx] = "##$uid##" . implode('', $val);
            }
        }
    }
    $str = implode('', $str);
    while (stripos($str, '<script') !== false) {
        $str = str_ireplace('<script', '&lt;script', $str);
    }
    $rptjob = function(&$str, $regexp) {
                while (preg_match($regexp, $str, $matches)) {
                    $str = str_ireplace($matches[0], htmlentities($matches[0], ENT_QUOTES, 'UTF-8'), $str);
                }
            };
    $rptjob($str, '/href[\s\n\t]*=[\s\n\t]*[\"\'][\s\n\t]*(javascript:|data:)/i'); //href = "javascript:
    $rptjob($str, '/style[\s\n\t]*=[\s\n\t]*[\"][^\"]*expression/i'); //style = "...expression
    $rptjob($str, '/style[\s\n\t]*=[\s\n\t]*[\'][^\']*expression/i'); //style = '...expression
    $rptjob($str, '/style[\s\n\t]*=[\s\n\t]*[\"][^\"]*behavior/i'); //style = "...behavior
    $rptjob($str, '/style[\s\n\t]*=[\s\n\t]*[\'][^\']*behavior/i'); //style = '...behavior
    $rptjob($str, '/on\w+[\s\n\t]*=[\s\n\t]*[\"\']/i'); //onasd = "
    if ($StripHTMLTags)
        $str = strip_tags($str, $AllowableTags);
    foreach ($codes as $idx => $code) {
        $str = str_replace("##$idx##", $code, $str);
    }
    return $str;
}

echo stripteaser($str);
exit;
?>

:D Dirty code, I know. It's not a great job (the many while loops cost some CPU time), but for my tiny goal it's better than pulling in another huge component like HTML Purifier.

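RESULT WILL BE (with $StripHTMLTags set to false; with the default arguments, strip_tags also removes every remaining tag):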

asd 
<a STyLE=&quot;asd; expression" hRef=&quot; javascript:" onx=&quot;asd">asd</a>
asd
&lt;a href=&quot;javascript:&quot;&gt;asd&lt;/a&gt;
<scr&lt;script></script>ipt ... >asd</script>
<a style="hey:good boy;" href=&quot;javascript:">asd</a>

I have no experience with CSS expressions, but I know that behavior is used in IE to load JS/VML for rounded corners, so it can be dangerous. AND FINALLY, THERE IS ABSOLUTELY NO GUARANTEE.

I hope it is useful to someone ;)

Author: Etienne Marais


Updated on June 15, 2022

Comments

  • Etienne Marais almost 2 years

    In my email program I use Tidy to clean up the HTML before I send out the emails. A recurring problem is that when I send a mail whose HTML is fetched from a URL on the web, the document may contain some JavaScript.

    I want to clean up this HTML document even more by stripping out all JavaScript (embedded, referenced, or in any other form) so that the mail consists only of HTML.

    I want to use PHP's preg_replace() to strip out all JavaScript from a mail, and I need some help with the best regex because it's not my strongest point, I must confess.

  • Hannes over 13 years
    +1 clean and easy, though I never get why people always use / as delimiters
  • Mike Samuel over 13 years
    This won't strip out javascript in javascript: URLs or data: URLs, or in event handlers, or javascript in CSS expression(...) or other schemes. It probably won't handle <script with embedded NULs.
  • Mike Samuel over 13 years
    And this will also fail badly on some trivial inputs like <scrip<script></script>t>alert(1337)</script>.
  • Etienne Marais over 13 years
    Thank you for your help. The javascript I am getting is in a set format, so additional checks are not necessary. I'm getting a book on regular expressions today! Powerful little pieces of code, but way over my head at the moment.
  • garkin almost 11 years
    Welcome to the <img src='road to nowhere' onerror='alert(4311)'>
  • Positivity over 10 years
    Then, on an allowed <a>, we may have <a href="#" onclick="alert('Zombie!')">Click Me!</a>
  • DrLightman over 7 years
    This solution does not remove the possible javascript code that was enclosed by the stripped script tags.