Strip all HTML attributes except for src


Solution 1

Alright, here's what I used that seems to be working well:

<([A-Z][A-Z0-9]*)(\b[^>src]*)(src\=[\'|"|\s]?[^\'][^"][^\s]*[\'|"|\s]?)?(\b[^>]*)>

Feel free to poke any holes in it.
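One hole worth poking, offered as an observation and not a full analysis: [^>src] is a negated character class, so it rejects the individual characters >, s, r and c instead of the word src. A minimal check, using an isolated pattern that is only an illustration and not the full expression above:

```php
<?php
// [^>src] excludes the single characters '>', 's', 'r' and 'c'.
// It does not mean "anything except the substring src".
$pattern = '/^[^>src]*$/';

var_dump(preg_match($pattern, 'width')); // 'width' has no s, r or c
var_dump(preg_match($pattern, 'class')); // 'class' contains c and s
```

The first call matches and the second does not, so attribute names such as class or style stop that group early even though they have nothing to do with src.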

Solution 2

This might work for your needs:

$text = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

echo preg_replace("/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i",'<$1$2$3>', $text);

// <p>This is a paragraph with an image <img src="/path/to/image.jpg"/></p>

The RegExp broken down:

/              # Start Pattern
 <             # Match '<' at beginning of tags
 (             # Start Capture Group $1 - Tag Name
  [a-z]         # Match 'a' through 'z'
  [a-z0-9]*     # Match 'a' through 'z' or '0' through '9' zero or more times
 )             # End Capture Group
 (?:           # Start Non-Capture Group
  [^>]*         # Match anything other than '>', Zero or More Times
  (             # Start Capture Group $2 - ' src="...."'
   \s            # Match one whitespace
   src=          # Match 'src='
   ['"]          # Match ' or "
   [^'"]*        # Match anything other than ' or ", zero or more times
   ['"]          # Match ' or "
  )             # End Capture Group 2
 )?            # End Non-Capture Group, match group zero or one time
 [^>]*?        # Match anything other than '>', zero or more times, non-greedy (won't eat the /)
 (\/?)         # Capture Group $3 - '/' if it is there
 >             # Match '>'
/i            # End Pattern - Case Insensitive

Add some quoting and use the replacement text <$1$2$3>, and it should strip every attribute other than src= from well-formed HTML tags.

Please note: this isn't necessarily going to work on ALL input, as the anti-HTML-plus-RegExp people are so cleverly noting below. There are a few failure cases, most notably <p style=">"> would end up as <p>">, plus a few other breakages. I would recommend looking at Zend_Filter_StripTags as a foolproof tags/attributes filter in PHP.
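To see both the intended behaviour and the failure case side by side, here is a small sketch that wraps the pattern from this answer in a function. The function name stripAttributesRegex and the test strings are made up for illustration; only the pattern and replacement come from the answer.

```php
<?php
// Hypothetical wrapper; the pattern and replacement are the ones shown above.
function stripAttributesRegex(string $html): string
{
    return preg_replace(
        "/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i",
        '<$1$2$3>',
        $html
    );
}

// src is kept even when it is not the first attribute.
echo stripAttributesRegex('<img width="50" src="/a.jpg" height="75">'), "\n";
// prints: <img src="/a.jpg">

// A '>' inside an attribute value ends the match too early.
echo stripAttributesRegex('<p style=">">text</p>'), "\n";
// prints: <p>">text</p>
```

The second call shows why the pattern is only safe on input where attribute values never contain >.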

Solution 3

You usually should not parse HTML using regular expressions.

Instead, you should call DOMDocument::loadHTML.
You can then recurse through the elements in the document and call removeAttribute.
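A minimal sketch of that approach follows. The function name stripAttributesExceptSrc and the wrapping <div> are my own assumptions, not part of the answer; the wrapper just gives the fragment a single known root so that only the original markup is serialized back. It also assumes the input is plain ASCII or otherwise safe for loadHTML's default encoding handling.

```php
<?php
// Hypothetical helper, not from the original answer.
function stripAttributesExceptSrc(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings about imperfect markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: wrap the fragment in a <div> so we can find it again later.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('*') as $element) {
        // Walk backwards because the attribute map shrinks as we remove items.
        for ($i = $element->attributes->length - 1; $i >= 0; $i--) {
            $attr = $element->attributes->item($i);
            if (strtolower($attr->nodeName) !== 'src') {
                $element->removeAttributeNode($attr);
            }
        }
    }

    // Serialize only the children of the wrapper <div> (the first div in document order).
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo stripAttributesExceptSrc(
    '<p id="paragraph" class="green">This is a paragraph with an image '
    . '<img src="/path/to/image.jpg" width="50" height="75"/></p>'
);
```

Because the parser understands quoted attribute values, input such as <p style=">"> is handled as one tag with one attribute, which is the case the regular expressions above get wrong. Note that saveHTML serializes according to HTML rules, so the output may not match the input character for character (for example, how the img tag is closed).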

Author by

Ian McIntyre Silber

Updated on June 05, 2022

Comments

  • Ian McIntyre Silber about 2 years

    I'm trying to remove all tag attributes except for the src attribute. For example:

    <p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>
    

    Would be returned as:

    <p>This is a paragraph with an image <img src="/path/to/image.jpg" /></p>
    

    I have a regular expression to strip all attributes, but I'm trying to tweak it to leave in src. Here's what I have so far:

    <?php preg_replace('/<([A-Z][A-Z0-9]*)(\b[^>]*)>/i', '<$1>', '<html><goes><here>');