Minifying final HTML output using regular expressions with CodeIgniter

Solution 1

For those curious about how Alan Moore's regex works (and yes, it does work), I've taken the liberty of commenting it so it can be read by mere mortals:

function process_data_alan($text)
{
    $re = '%# Collapse ws everywhere but in blacklisted elements.
        (?>             # Match all whitespans other than single space.
          [^\S ]\s*     # Either one [\t\r\n\f\v] and zero or more ws,
        | \s{2,}        # or two or more consecutive-any-whitespace.
        ) # Note: The remaining regex consumes no text at all...
        (?=             # Ensure we are not in a blacklist tag.
          (?:           # Begin (unnecessary) group.
            (?:         # Zero or more of...
              [^<]++    # Either one or more non-"<"
            | <         # or a < starting a non-blacklist tag.
              (?!/?(?:textarea|pre)\b)
            )*+         # (This could be "unroll-the-loop"ified.)
          )             # End (unnecessary) group.
          (?:           # Begin alternation group.
            <           # Either a blacklist start tag.
            (?>textarea|pre)\b
          | \z          # or end of file.
          )             # End alternation group.
        )  # If we made it here, we are not in a blacklist tag.
        %ix';
    $text = preg_replace($re, " ", $text);
    return $text;
}

I'm new around here, but I can see right off that Alan is quite good at regex. I would only add the following suggestions.

  1. There is an unnecessary (non-capturing) group which can be removed.
  2. Although the OP did not say so, the <SCRIPT> element should be added to the <PRE> and <TEXTAREA> blacklist.
  3. Adding the 'S' PCRE "study" modifier speeds up this regex by about 20%.
  4. There is an alternation group in the lookahead which is ripe for applying Friedl's "unrolling-the-loop" efficiency construct.
  5. On a more serious note, this same alternation group, (?:[^<]++|<(?!/?(?:textarea|pre)\b))*+, is susceptible to excessive PCRE recursion on large target strings. This can overflow the stack and cause the Apache/PHP executable to seg-fault and crash silently, with no warning. (The Win32 build of Apache httpd.exe is particularly susceptible because it has only a 256KB stack, whereas the *nix executables are typically built with a stack of 8MB or more.) Philip Hazel (the author of the PCRE regex engine used in PHP) discusses this issue in the documentation: PCRE DISCUSSION OF STACK USAGE. Although Alan has correctly applied the same fix that Philip shows in this document (applying a possessive plus to the first alternative), there will still be a lot of recursion if the HTML file is large and contains many non-blacklisted tags. For example, on my Win32 box (with an executable having a 256KB stack), the script blows up with a test file of only 60KB. Note also that PHP unfortunately does not follow the recommendations and sets the default recursion limit far too high, at 100000. (According to the PCRE docs, it should be set to the stack size divided by 500.)
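The rule of thumb in point 5 can be sketched as a small calculation. This is only an illustration: the helper name pcre_recursion_limit_for_stack is hypothetical, and the 500-bytes-per-recursion figure is the estimate quoted above, not something measured here.

```php
<?php
// Hypothetical helper: derive a pcre.recursion_limit value from the stack size
// using the "stack size / 500" rule of thumb quoted above.
function pcre_recursion_limit_for_stack(int $stackBytes): int
{
    return intdiv($stackBytes, 500);
}

$win32Limit = pcre_recursion_limit_for_stack(256 * 1024);      // 256KB stack
$nixLimit   = pcre_recursion_limit_for_stack(8 * 1024 * 1024); // 8MB stack

// Apply the limit before running the regex (here: the *nix value).
ini_set('pcre.recursion_limit', (string) $nixLimit);
```

The stack sizes are assumptions about the build of the executable; check your own server before picking a value.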

Here is an improved version which is faster than the original, handles larger input, and gracefully fails with a message if the input string is too large to handle:

// Set PCRE recursion limit to sane value = STACKSIZE / 500
// ini_set("pcre.recursion_limit", "524"); // 256KB stack. Win32 Apache
ini_set("pcre.recursion_limit", "16777");  // 8MB stack. *nix
function process_data_jmr1($text)
{
    $re = '%# Collapse whitespace everywhere but in blacklisted elements.
        (?>             # Match all whitespans other than single space.
          [^\S ]\s*     # Either one [\t\r\n\f\v] and zero or more ws,
        | \s{2,}        # or two or more consecutive-any-whitespace.
        ) # Note: The remaining regex consumes no text at all...
        (?=             # Ensure we are not in a blacklist tag.
          [^<]*+        # Either zero or more non-"<" {normal*}
          (?:           # Begin {(special normal*)*} construct
            <           # or a < starting a non-blacklist tag.
            (?!/?(?:textarea|pre|script)\b)
            [^<]*+      # more non-"<" {normal*}
          )*+           # Finish "unrolling-the-loop"
          (?:           # Begin alternation group.
            <           # Either a blacklist start tag.
            (?>textarea|pre|script)\b
          | \z          # or end of file.
          )             # End alternation group.
        )  # If we made it here, we are not in a blacklist tag.
        %Six';
    $text = preg_replace($re, " ", $text);
    if ($text === null) exit("PCRE Error! File too big.\n");
    return $text;
}
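If calling exit() is too drastic for a production page, one possible variation is to fall back to the unminified HTML when PCRE reports an error. The sketch below is an assumption about how that could look, not part of the function above: minify_or_passthrough is a hypothetical name, and the short pattern in the usage line is only a stand-in for the full regex.

```php
<?php
// Hypothetical wrapper: run a whitespace-collapsing regex and fall back to the
// unmodified input if PCRE reports an error (preg_replace() returns null).
function minify_or_passthrough(string $re, string $html): string
{
    $result = preg_replace($re, ' ', $html);
    if ($result === null) {
        // preg_last_error() gives the reason, e.g. PREG_RECURSION_LIMIT_ERROR.
        error_log('PCRE error code ' . preg_last_error() . '; serving unminified HTML');
        return $html;
    }
    return $result;
}

// Stand-in pattern for illustration only; substitute the full regex from above.
$out = minify_or_passthrough('/\s{2,}/', "<p>a   b</p>\n\n<p>c</p>");
```

The page is then served a little larger than intended instead of not at all, and the error code ends up in the log.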

p.s. I am intimately familiar with this PHP/Apache seg-fault problem, as I was involved with helping the Drupal community while they were wrestling with this issue. See: Optimize CSS option causes php cgi to segfault in pcre function "match". We also experienced this with the BBCode parser on the FluxBB forum software project.

Hope this helps.

Solution 2

Sorry for not commenting, reputation missing ;)

I want to urge everybody not to implement such a regex without checking for performance penalties. Shopware implemented the first regex (from Alan/ridgerunner) for their HTML minification and "blew up" every shop with bigger pages.

If possible, a combined solution (regex plus some other logic) is usually faster and more maintainable for complex problems (unless you are Damian Conway).

I also want to mention that most minifiers can break your code (JavaScript and HTML) when a script block itself contains another script block, e.g. via document.write.

Attached is my solution (an optimized version of user2677898's snippet). I simplified the code and ran some tests. Under PHP 7.2, my version was ~30% faster for my particular test case. Under PHP 7.3 and 7.4, the old variant gained a lot of speed and is only ~10% slower. My version is also still easier to maintain because the code is less complex.

function filterHtml($content)
{
    // List of untouchable HTML-tags.
    $unchanged = 'script|pre|textarea';

    // It is assumed that this placeholder could not appear organically in your
    // output. If it can, you may have an XSS problem.
    $placeholder = "@@<'-pLaChLdR-'>@@";

    // Some helper variables.
    $unchangedBlocks  = [];
    $unchangedRegex   = "!<($unchanged)[^>]*?>.*?</\\1>!is";
    $placeholderRegex = '!' . preg_quote($placeholder, '!') . '!';

    // Replace all the tags (including their content) with a placeholder, and keep their contents for later.
    $content = preg_replace_callback(
        $unchangedRegex,
        function ($match) use (&$unchangedBlocks, $placeholder) {
            array_push($unchangedBlocks, $match[0]);
            return $placeholder;
        },
        $content
    );

    // Remove HTML comments, but not SSI
    $content = preg_replace('/<!--[^#](.*?)-->/s', '', $content);

    // Remove whitespace (spaces, newlines and tabs)
    $content = trim(preg_replace('/[ \n\t]{2,}|[\n\t]/m', ' ', $content));

    // Replace the placeholders with the original content.
    $content = preg_replace_callback(
        $placeholderRegex,
        function ($match) use (&$unchangedBlocks) {
            // I am paranoid.
            if (count($unchangedBlocks) == 0) {
                throw new \RuntimeException("Found too many placeholders in input string");
            }
            return array_shift($unchangedBlocks);
        },
        $content
    );

    return $content;
}
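
A variation on the same "regex plus some other logic" idea, without placeholders, is to split the document on the protected blocks and collapse whitespace only in the pieces between them. This is a rough sketch under stated assumptions, not the code above: collapse_outside_blocks is a hypothetical name, comments are not stripped, and the closing tag is not tied to the opening tag, which is a deliberate simplification.

```php
<?php
// Hypothetical alternative to the placeholder approach: split on protected
// blocks and collapse whitespace only in the text between them.
function collapse_outside_blocks(string $html): string
{
    $parts = preg_split(
        '!(<(?:script|pre|textarea)\b[^>]*>.*?</(?:script|pre|textarea)>)!is',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );
    if ($parts === false) {
        return $html; // PCRE error: leave the input untouched.
    }
    foreach ($parts as $i => $part) {
        // With a single capture group, even indexes hold the text between
        // protected blocks and odd indexes hold the blocks themselves.
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace('/\s{2,}|[\t\r\n]/', ' ', $part) ?? $part;
        }
    }
    return implode('', $parts);
}

$demo = collapse_outside_blocks("<p>a   b</p>\n<pre>x\n  y</pre>\n\n<p>c</p>");
```

In this example the content of the pre element keeps its newline and indentation, while the whitespace around it is collapsed to single spaces.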
Author by

Aman

Updated on June 08, 2022

Comments

  • Aman
    Aman almost 2 years

    Google's pages suggest that you minify HTML, that is, remove all the unnecessary spaces. CodeIgniter does have a feature for gzipping output, or it can be done via .htaccess. But I would still like to remove unnecessary spaces from the final HTML output as well.

    I played a bit with this piece of code to do it, and it seems to work. This does indeed result in HTML that is without excess spaces and removes other tab formatting.

    class Welcome extends CI_Controller 
    {
        function _output($output)
        {
            echo preg_replace('!\s+!', ' ', $output);
        }
    
        function index(){
        ...
        }
    }
    

    The problem is that there may be tags like <pre>, <textarea>, etc. which may have spaces in them, and a regular expression should not remove those. So, how do I remove excess space from the final HTML, without affecting spaces or formatting for these certain tags, using a regular expression?

    Thanks to @Alan Moore I got the answer; this worked for me:

    echo preg_replace('#(?ix)(?>[^\S ]\s*|\s{2,})(?=(?:(?:[^<]++|<(?!/?(?:textarea|pre)\b))*+)(?:<(?>textarea|pre)\b|\z))#', ' ', $output);
    

    ridgerunner did a very good job of analyzing this regular expression. I ended up using his solution. Cheers to ridgerunner.

  • Aman
    Aman about 13 years
    Wow, that was quite an in-depth analysis; I didn't know all these details. Thanks a lot, I will try your regex.
  • Aman
    Aman about 13 years
    Could I have the test file that you were using?
  • ridgerunner
    ridgerunner about 13 years
    @Aman Yes, but it will be some time before I post it (the file is an article in progress (in HTML)...)
  • william
    william almost 12 years
    am I the only one who gets an error 324 when render this regex via php? My error log says: child pid 4736 exit signal Segmentation fault (11) ?? :S
  • ridgerunner
    ridgerunner almost 12 years
    @william - "render"? error 324 from what - httpd.exe? php.exe? Will need more information to proceed. First try setting pcre.recursion_limit to 524 (the script currently sets it to 16777). Just comment out the one line and uncomment the other.
  • william
    william almost 12 years
    Oh sorry - From apache server. At least I get the error info: "PCRE Error! File too big.".
  • superhero
    superhero about 11 years
    @ridgerunner I'm using this code in a c++ project I'm working on. Some small changes and then it worked just fine. But I notice that we end up with "> <". Do you think it's wise to extend the already existing regex to prevent this in some way, or would you run a new one after the first regex has occurred?
  • ridgerunner
    ridgerunner about 11 years
    @Erik Landvall - without seeing the changes you made, there is no way for me to answer your question. (Also, the solution above is PHP and you say you are using C++). Maybe you can post a new question specific to the issues you are having. Note that I don't have access to a C++ compiler that does regex, so I won't be able to help much.
  • superhero
    superhero about 11 years
    @ridgerunner I usually code in PHP but trying out something new :) If you would be so kind and have a look at: stackoverflow.com/q/16134469/570796 - The difference shouldn't be overwhelming.
  • ridgerunner
    ridgerunner about 11 years
    @Erik Landvall - Ok, I'll take a look, but I'm busy today. However, after a quick glance just now, I'd have to agree that a proper HTML parser would be your best solution.
  • superhero
    superhero about 11 years
    @ridgerunner Yea, I'm looking into a parser by boost as we speak. The original issue about the > < combination has already been explained though!
  • chocolata
    chocolata over 10 years
    Hi, very nice script. I've just implemented it in my project. Would you say it is good practice to minify HTML like this? Or would the server load be heavier, thus slowing down the entire process anyway? My pages are very, very long. I don't get the memory limit warning, though... What do you think?
  • ridgerunner
    ridgerunner over 10 years
    @maartenmachiels - Sorry but I can't offer you an opinion one way or the other. If you do use regex, be sure to read and take safeguards as recommended in my answer to a similar question. Stack overflows and silent crashing of executables is not good!
  • Umair Hamid
    Umair Hamid over 9 years
    How to add code to remove js comments in this RegEx?
  • Jürgen Hörmann
    Jürgen Hörmann about 5 years
    Be careful. This minification will break JS code that is wrapped inside a CDATA wrapper. The example should be extended to exclude not only pre and textarea but additionally the CDATA blocks.
  • Rodger
    Rodger about 4 years
    You are close. Keep this one open and change it to a comment when you can. I gave you an upvote so you only will need one more when you get another. Then change this to a comment and delete your answer that isn't an answer please. ;-)