Converting HTML to plain text in PHP for e-mail

189,817

Solution 1

Use html2text (example HTML to text), licensed under the Eclipse Public License. It uses PHP's DOM methods to load from HTML, and then iterates over the resulting DOM to extract plain text. Usage:

// when installed using the Composer package
$text = Html2Text\Html2Text::convert($html);

// usage when installed using html2text.php
require('html2text.php');
$text = convert_html_to_text($html);

Although incomplete, it is open source and contributions are welcome.
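
To illustrate the general idea described above (load the HTML with PHP's DOM methods, then walk the tree to build plain text), here is a minimal sketch. It is not html2text's actual code: the function names node_to_text() and dom_html_to_text() are hypothetical, and the formatting rules (underscores for italics, the link target in square brackets, blank lines around block elements) are assumptions chosen for the example.

```php
<?php
// Illustrative sketch only: this is NOT html2text's actual code.
// node_to_text() and dom_html_to_text() are hypothetical names.

function node_to_text(DOMNode $node): string
{
    if ($node instanceof DOMText) {
        // Collapse whitespace runs roughly the way a browser would
        return preg_replace('/[ \t\r\n]+/', ' ', $node->nodeValue);
    }
    if (!($node instanceof DOMElement)) {
        return ''; // comments, processing instructions, ...
    }

    $tag = strtolower($node->nodeName);
    if (in_array($tag, array('head', 'script', 'style'), true)) {
        return ''; // nothing a reader of the e-mail should see
    }
    if ($tag === 'br') {
        return "\n";
    }

    // Recurse into the children and concatenate their text
    $text = '';
    foreach ($node->childNodes as $child) {
        $text .= node_to_text($child);
    }

    if ($tag === 'i' || $tag === 'em') {
        return '_' . $text . '_';
    }
    if ($tag === 'a' && $node->getAttribute('href') !== '') {
        return $text . ' [' . $node->getAttribute('href') . ']';
    }
    $blocks = array('p', 'div', 'li', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6');
    if (in_array($tag, $blocks, true)) {
        return "\n\n" . trim($text) . "\n\n";
    }
    return $text;
}

function dom_html_to_text(string $html): string
{
    if (trim($html) === '') {
        return '';
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence markup warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if ($doc->documentElement === null) {
        return '';
    }

    $text = node_to_text($doc->documentElement);
    // Drop spaces around newlines, then collapse runs of blank lines
    $text = preg_replace('/[ \t]*\n[ \t]*/', "\n", $text);
    return trim(preg_replace('/\n{3,}/', "\n\n", $text));
}
```

Calling dom_html_to_text($html) returns the text with one blank line between block elements. The sketch makes no attempt to handle tables, lists or character encodings, which is where a maintained library earns its keep.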

Issues with other conversion scripts:

  • html2text (GPL) is not EPL-compatible.
  • lkessler's link (attribution) is incompatible with most open source licenses.

Solution 2

Here is another solution:

$cleaner_input = strip_tags($text);

For other variations of sanitization functions, see:

https://github.com/ttodua/useful-php-scripts/blob/master/filter-php-variable-sanitize.php

Solution 3

There's the trusty strip_tags function. It isn't pretty, though: it only strips the tags and applies no formatting of its own. You could combine it with a string replacement to get your fancy underscores.


<?php
// Strip all tags, wrapping italic text in underscores first
$plain = strip_tags(str_replace(array("<i>", "</i>"), array("_", "_"), $text));

// Strip all tags except anchors (the second argument lists the allowed tags)
$with_links = strip_tags($text, "<a>");

?>
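
Building on the same idea, here is a slightly fuller sketch that also turns line breaks and the ends of block elements into newlines, decodes entities, and collapses blank lines. The function name simple_html_to_text() is hypothetical, and the choice of which tags map to underscores or newlines is an assumption; adjust it to whatever your editor actually emits.

```php
<?php
// Hypothetical helper; the tag-to-text mappings are assumptions.
function simple_html_to_text(string $html): string
{
    // Wrap italic/emphasised text in underscores (\b keeps <img> etc. safe)
    $text = preg_replace('~</?(i|em)\b[^>]*>~i', '_', $html);

    // Turn <br> and the ends of common block elements into newlines
    $text = preg_replace('~<br\s*/?>~i', "\n", $text);
    $text = preg_replace('~</(p|div|h[1-6]|li|tr)>~i', "\n\n", $text);

    // Remove every remaining tag, then decode entities such as &amp;
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Drop spaces around newlines and collapse runs of blank lines
    $text = preg_replace('/[ \t]*\n[ \t]*/', "\n", $text);
    $text = preg_replace('/\n{3,}/', "\n\n", $text);

    return trim($text);
}
```

Because this works on the raw string rather than a parsed document, it will be fooled by malformed markup or a ">" inside an attribute value; for anything beyond simple editor output, a DOM-based converter is the safer route.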

Solution 4

Converting from HTML to text using a DOMDocument is a viable solution. Consider HTML2Text, which requires PHP5.

Regarding UTF-8, the write-up on the "howto" page states:

PHP's own support for unicode is quite poor, and it does not always handle utf-8 correctly. Although the html2text script uses unicode-safe methods (without needing the mbstring module), it cannot always cope with PHP's own handling of encodings. PHP does not really understand unicode or encodings like utf-8, and uses the base encoding of the system, which tends to be one of the ISO-8859 family. As a result, what may look to you like a valid character in your text editor, in either utf-8 or single-byte, may well be misinterpreted by PHP. So even though you think you are feeding a valid character into html2text, you may well not be.

The author provides several approaches to solving this and states that version 2 of HTML2Text (using DOMDocument) has UTF-8 support.

Note the restrictions for commercial use.
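
If you roll your own DOMDocument-based converter, the encoding problem can be sidestepped in a few lines. DOMDocument::loadHTML() is commonly reported to treat input as ISO-8859-1 when the markup declares no charset, so one workaround is to rewrite every non-ASCII character as a numeric entity before parsing. The sketch below does that; it is not taken from HTML2Text, the helper name utf8_html_to_dom() is hypothetical, and it assumes the input is valid UTF-8 and that the mbstring extension is available.

```php
<?php
// Sketch only; utf8_html_to_dom() is a hypothetical helper name and
// the mbstring extension is assumed to be loaded.
function utf8_html_to_dom(string $html): DOMDocument
{
    // Rewrite each non-ASCII character as a numeric entity (e.g. &#233;)
    // so the parser never has to guess the input encoding.
    $map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    $ascii = mb_encode_numericentity($html, $map, 'UTF-8');

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence markup warnings
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Usage: text read back from the DOM is UTF-8 again
$doc = utf8_html_to_dom('<p>Café für 5 €</p>');
$body = $doc->getElementsByTagName('body')->item(0);
echo $body->textContent, "\n";
```

The DOM extension hands strings back as UTF-8 regardless of how the document was parsed, so only the input side needs this treatment.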

Solution 5

You can use lynx with -stdin and -dump options to achieve that:

<?php
$descriptorspec = array(
   0 => array("pipe", "r"),  // stdin is a pipe that the child will read from
   1 => array("pipe", "w"),  // stdout is a pipe that the child will write to
   2 => array("file", "/tmp/htmp2txt.log", "a") // stderr is a file to write to
);

$process = proc_open('lynx -stdin -dump', $descriptorspec, $pipes, '/tmp', NULL);

if (is_resource($process)) {
    // $pipes now looks like this:
    // 0 => writeable handle connected to child stdin
    // 1 => readable handle connected to child stdout
    // Any error output will be appended to htmp2txt.log

    $stdin = $pipes[0];
    fwrite($stdin,  <<<'EOT'
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
 <title>TEST</title>
</head>
<body>
<h1><span>Lorem Ipsum</span></h1>

<h4>"Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit..."</h4>
<h5>"There is no one who loves pain itself, who seeks after it and wants to have it, simply because it is pain..."</h5>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque et sapien ut erat porttitor suscipit id nec dui. Nam rhoncus mauris ac dui tristique bibendum. Aliquam molestie placerat gravida. Duis vitae tortor gravida libero semper cursus eu ut tortor. Nunc id orci orci. Suspendisse potenti. Phasellus vehicula leo sed erat rutrum sed blandit purus convallis.
</p>
<p>
Aliquam feugiat, neque a tempus rhoncus, neque dolor vulputate eros, non pellentesque elit lacus ut nunc. Pellentesque vel purus libero, ultrices condimentum lorem. Nam dictum faucibus mollis. Praesent adipiscing nunc sed dui ultricies molestie. Quisque facilisis purus quis felis molestie ut accumsan felis ultricies. Curabitur euismod est id est pretium accumsan. Praesent a mi in dolor feugiat vehicula quis at elit. Mauris lacus mauris, laoreet non molestie nec, adipiscing a nulla. Nullam rutrum, libero id pellentesque tempus, erat nibh ornare dolor, id accumsan est risus at leo. In convallis felis at eros condimentum adipiscing aliquam nisi faucibus. Integer arcu ligula, porttitor in fermentum vitae, lacinia nec dui.
</p>
</body>
</html>
EOT
    );
    fclose($stdin);

    echo stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    // It is important that you close any pipes before calling
    // proc_close in order to avoid a deadlock
    $return_value = proc_close($process);

    echo "command returned $return_value\n";
}
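
For everyday use, the same steps can be wrapped in a small function. This is only a sketch: pipe_through() and lynx_html_to_text() are hypothetical names, and it assumes lynx is installed and on the PATH. Only stdin and stdout are redirected here, so the child's error output goes wherever the parent's does.

```php
<?php
// Hypothetical wrappers; assume lynx is installed and on the PATH.
function pipe_through(string $command, string $input): string
{
    $descriptorspec = array(
        0 => array("pipe", "r"), // stdin: the child reads what we write
        1 => array("pipe", "w"), // stdout: we read what the child writes
    );

    $process = proc_open($command, $descriptorspec, $pipes);
    if (!is_resource($process)) {
        throw new RuntimeException("Could not start: $command");
    }

    fwrite($pipes[0], $input);
    fclose($pipes[0]); // closing stdin tells the child the input is complete

    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    // Close all pipes before proc_close() to avoid a deadlock
    $status = proc_close($process);
    if ($status !== 0) {
        throw new RuntimeException("Command exited with status $status");
    }

    return $output;
}

function lynx_html_to_text(string $html): string
{
    return pipe_through('lynx -stdin -dump', $html);
}
```

Writing all of the input before reading any output is fine for e-mail-sized documents. For very large inputs, or for a command that starts writing before it has finished reading, the two pipes would need to be serviced alternately.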

Author: Justin Stayton

Sr. Software Developer at Truepic.

Updated on July 08, 2022

Comments

  • Justin Stayton
    Justin Stayton almost 2 years

    I use TinyMCE to allow minimal formatting of text within my site. From the HTML that's produced, I'd like to convert it to plain text for e-mail. I've been using a class called html2text, but it's really lacking in UTF-8 support, among other things. I do, however, like that it maps certain HTML tags to plain text formatting — like putting underscores around text that previously had <i> tags in the HTML.

    Does anyone use a similar approach to converting HTML to plain text in PHP? And if so: Do you recommend any third-party classes that I can use? Or how do you best tackle this issue?

  • Alix Axel
    Alix Axel over 14 years
    Don't forget that strip tags also removes anchors!
  • Nikola Petkanski
    Nikola Petkanski over 11 years
    strip_tags() won't handle a case where you have multiple elements on several lines which are considered by html as 'inline' and will display them on multiple lines. Also, the reverse case - if you have multiple div elements on one line, it will strip the tags and concatenate the content. I've shared my experience here: stackoverflow.com/questions/1930297/…
  • Oliver Moran
    Oliver Moran almost 11 years
    The first script above is released under the GPL, which is not a "non-commercial" license. Depending on context it may be undesirable, but it is not "non-commercial". The second link also allows commercial use - just with attribution. That's not "non-commercial" either.
  • jevon
    jevon almost 11 years
    @OliverMoran You're right, I've edited the answer to more accurately reflect their license limitations.
  • Redzarf
    Redzarf almost 11 years
    A good choice, except for how it handles links. But try the online demo if you're considering it.
  • malcanso
    malcanso over 10 years
    Markdownify is no longer maintained; the online demo throws many warnings and doesn't work. The new version of html2text does work for my email. A late +1 to lkessler.
  • Ninj
    Ninj over 10 years
    Thank you @jevon, I included your work in my project and it works great! Unfortunately, it didn't help to solve my Outlook problem (stackoverflow.com/questions/19135443/…) but I get a clean result that way.
  • Alan M.
    Alan M. over 10 years
    Thanks for this. Worked great for my use (converting HTML for an RSS feed), and provided a simple template for adding two additional cases (&rsquo; and &mdash;).
  • mAsT3RpEE
    mAsT3RpEE over 10 years
    Better version $ClearText = preg_replace( "/\n\s+/", "\n", rtrim(html_entity_decode(strip_tags($HTMLText))) );
  • Sibidharan
    Sibidharan over 7 years
    Link broken. Down-voting.
  • Bill Bell
    Bill Bell over 7 years
    Flagged as low-quality for length and content. I dunno. Maybe the post should say something about how your code can be used to answer the problem, or maybe it should be a comment. The most popular answers seem to show how solutions can be invoked from within PHP code.
  • Rob
    Rob over 7 years
    I'm sorry for writing that library. I've added a little example for you if you don't want to click the link and look at the example..
  • Bill Bell
    Bill Bell over 7 years
    Don't be sorry! :-) I was writing as an SO reviewer. It isn't that I didn't want to click the link. It's that SO answers that require that one do that are considered substandard. I dunno why anyone would down-vote your answer incidentally.
  • Miguel
    Miguel about 7 years
    Please clarify: who will detect whether someone is using it under the GPL or whatever?
  • Brian Leishman
    Brian Leishman about 7 years
    This has some issues in PHP 7
  • Alexis Wilke
    Alexis Wilke almost 7 years
    I have not seen a convert_html_to_text() function, although I was able to make the Html2Text (very first link) work without much of a problem.
  • Himanth
    Himanth almost 7 years
    Don't just add an answer. Please add text explaining why this is the answer.
  • mili
    mili over 5 years
    this is so simple and no need another library. also working very well.......... :)
  • Maxim Mandrik
    Maxim Mandrik almost 2 years
    To remove duplicate line breaks: preg_replace('/\n{2,}/', "\n", strip_tags($htmlText))
  • Maxim Mandrik
    Maxim Mandrik almost 2 years
    To remove duplicate line breaks: preg_replace('/\n{2,}/', "\n", Html2Text::convert($html, ['ignore_errors' => true]))