Strip all classes from p tags

Solution 1

A fairly naive regex will probably work for you:

$html=preg_replace('/class=".*?"/', '', $html);

I say naive because it would fail if your body text happened to contain class="something" for some reason. It could be made a little more robust by only looking for class="" inside angle-bracketed tags, if need be.
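One way the "only inside tags" idea could look is sketched below. The helper name strip_p_class_regex is made up for illustration, and the sketch assumes no attribute value contains a literal > character; it is not a full HTML parser.

```php
<?php
// Hypothetical helper (not part of the original answer): remove class
// attributes only inside <p ...> opening tags, leaving body text alone.
function strip_p_class_regex(string $html): string
{
    return preg_replace_callback(
        '/<p\b[^>]*>/i', // each opening <p> tag; assumes no ">" inside attribute values
        function (array $m): string {
            // Drop class="...", class='...' or an unquoted class=value from this tag only.
            return preg_replace(
                '/\s+class\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)/i',
                '',
                $m[0]
            );
        },
        $html
    );
}

echo strip_p_class_regex('<p class="intro">Use class="x" in CSS.</p>');
// prints: <p>Use class="x" in CSS.</p>
```

Because the inner replacement only ever sees the text of an opening p tag, a class="..." that appears in the body text or on other elements is left untouched.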

Solution 2

It may be overkill for your needs, but the best tool I know for parsing, validating and cleaning HTML data is HTML Purifier.

It lets you define which tags and which attributes are allowed and which are not, and it gives valid, clean (X)HTML as output.

(Using regexes to "parse" HTML seems OK at the beginning. Then, when you want to add specific cases, it generally becomes hell to understand and maintain.)

Solution 3

You load the HTML into a DOMDocument and import that into SimpleXML. Then you run an XPath query for all p elements and loop through them. On each iteration, you set the class attribute to a placeholder value such as "killmeplease".

When that's done, output the SimpleXML object as XML again (which, by the way, may change the HTML, but usually only for the better), and you will have an HTML string where each p has a class of "killmeplease". Use str_replace to actually remove them.

Example:

$html_file = "somehtmlfile.html";

$dom = new DOMDocument();
$dom->loadHTMLFile($html_file);

$xml = simplexml_import_dom($dom);

$paragraphs = $xml->xpath("//p");

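// Overwrite (or create) the class attribute on every <p> with a placeholder.
foreach ($paragraphs as $paragraph) {
    $paragraph['class'] = "killmeplease";
}

$new_html = $xml->asXML();

// Remove the placeholder together with its leading space, so no "<p >" is left behind.
$better_html = str_replace(' class="killmeplease"', "", $new_html);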

Or, if you want simpler code and don't mind tangling with preg_replace, you could go with:

$html_file = "somehtmlfile.html";
$html_string = file_get_contents($html_file);

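// Lazily capture the start of the <p ...> tag up to the class attribute,
// then match the attribute itself (double-quoted, single-quoted or unquoted).
$bad_p_class = '/(<p\b[^>]*?)\s+class\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)/i';

// Keep only the captured start of the tag; the rest of the tag follows unchanged.
$better_html = preg_replace($bad_p_class, '$1', $html_string);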

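The tricky part with regular expressions is that they are greedy by default, and . does not match a line break unless the s modifier is set. A pattern built on .* can therefore swallow too much on one line, or miss a p tag that has a line break in it. Matching with a negated class such as [^>]* avoids both problems. But give either of those a shot.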
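If the DOM route is taken anyway, the placeholder-and-replace step can be skipped by removing the attribute on the DOM nodes directly. The following is a minimal sketch, not the answerer's code: the function name strip_p_classes is made up, and it assumes the input is a non-empty ASCII or Latin-1 HTML fragment.

```php
<?php
// Hypothetical helper name; only standard ext-dom and libxml calls are used.
function strip_p_classes(string $html): string
{
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup

    $dom = new DOMDocument();
    // Assumption: $html is a non-empty ASCII/Latin-1 fragment. loadHTML()
    // wraps it in <html><body> when those tags are missing.
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select only the <p> elements that actually carry a class attribute.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//p[@class]') as $p) {
        $p->removeAttribute('class');
    }

    // Serialize only the children of <body>, so the wrapper tags are left out.
    $out = '';
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

echo strip_p_classes('<p class="a b" id="x">Hi</p><div class="keep">x</div>');
```

Here the class attribute on the div survives, since the XPath expression only selects p elements. As with the SimpleXML version, the parser may normalize the markup along the way.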

Solution 4

HTML Purifier

HTML can be very tricky to regex because of the hundreds of different ways code can be written or formatted.

HTML Purifier is a mature open-source library for cleaning up HTML. I would advise using it in this case.

HTML Purifier's configuration documentation describes how to specify which classes and attributes should be allowed and what the purifier should do if it finds them.

http://htmlpurifier.org/docs/
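As a rough illustration of such a configuration, the sketch below forbids the class attribute on p elements through the HTML.ForbiddenAttributes directive. It assumes HTML Purifier was installed with Composer (package ezyang/htmlpurifier) and that the autoloader lives at the path shown; both are assumptions, not something stated in the answers. Note that HTML Purifier also applies its own default whitelist, so other markup may be cleaned as well.

```php
<?php
// Assumption: HTML Purifier was installed via Composer (ezyang/htmlpurifier).
// Adjust the path if the library is loaded some other way.
require_once __DIR__ . '/vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();
// Forbid the class attribute on <p> only (format: tag@attribute).
$config->set('HTML.ForbiddenAttributes', 'p@class');

$purifier = new HTMLPurifier($config);
echo $purifier->purify('<p class="intro">Hello</p><div class="keep">x</div>');
```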

Solution 5

$html = "<p id='fine' class='r3e1 b4d 1' style='widows: inherit;'>";
// Assign the result: preg_replace returns the new string and leaves $html untouched.
$html = preg_replace('/\sclass=[\'"][^\'"]+[\'"]/', '', $html);

If you are up against Microsoft Office-exported HTML you'll need more than class removal, but HTML Tidy has a config option (word-2000) just for Microsoft Office!

Otherwise, this should be safer than some of the other answers, which are a little greedy, given that you don't know which quote character (' or ") will be used.

Note: The pattern is actually /\sclass=['"][^'"]+['"]/ but, as it contains both inverted commas (") and apostrophes ('), I had to escape all occurrences of one (\') to wrap the pattern in a PHP string. Inside a character class no | is needed: ['"] already means "either quote character".

Author by

SoulieBaby

Updated on July 29, 2022

Comments

  • SoulieBaby
    SoulieBaby almost 2 years

    I was just wondering if anyone knew a function to remove ALL classes from a string in PHP. Basically I only want <p> tags rather than <p class="...">, if that makes sense :)

  • kliron
    kliron almost 15 years
    Not sure how that could be better without knowing why the OP wanted to do this.
  • Teknotica
    Teknotica almost 15 years
    Not better, just another way to do it :)
  • joebert
    joebert almost 15 years
    Correct me if I'm wrong, but don't the lexical analyzers that true XML parsers use pick the XML apart with regex anyway? I think the real issue is that when people try to write regex parsers themselves, they try to jump to the middle or end of a string instead of starting at the beginning of the string like a true parser does.
  • Pascal MARTIN
    Pascal MARTIN almost 15 years
    I don't think they do -- not sure about it, but... seems odd. Anyway, even if they do, they are probably more tested (because they are widely used) than the regex you will write yourself for your own project.
  • Jon Winstanley
    Jon Winstanley almost 15 years
    Does the code work with upper/lower case, single/double/no quotes, spaces in between, and spaces before and after the class?
  • kliron
    kliron almost 15 years
    No - only the cases indicated by the OP. Anything else is left as an exercise for the reader :)
  • Bhargav Nanekalva
    Bhargav Nanekalva almost 11 years
    Don't use regex for HTML. Instead use PHP Simple HTML DOM Parser library.