php: Get plain text from html - simplehtmldom or php strip_tags?


Solution 1

You should probably use simplehtmldom, both for the reason you mentioned and because strip_tags can leave behind non-text content such as the JavaScript or CSS inside script/style blocks.

You would also be able to filter out text from elements that aren't displayed (for example, those with an inline style="display:none").

That said, if the HTML is simple enough, strip_tags may be faster and will accomplish the same task.
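
To make the difference concrete, here is a rough sketch. It uses PHP's built-in DOM extension as a stand-in for simplehtmldom, so it illustrates the idea rather than that library's API. The sample markup and the helper name html_to_text are made up for this example, and the display:none check is a naive match on the inline style attribute only.

```php
<?php
// Hypothetical sample input (not from the question).
$html = '<p>Hello <b>world</b><span style="display: none"> secret</span></p>'
      . '<script>alert("x");</script><style>p { color: red; }</style>';

// strip_tags() removes the tags but keeps everything between them,
// including the script body, the CSS, and the hidden span's text.
$stripped = strip_tags($html);

// Hypothetical helper: drop script/style blocks and inline-hidden
// elements, then read the remaining text from the body.
function html_to_text(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $unwanted = $xpath->query(
        '//script | //style | //*[contains(translate(@style, " ", ""), "display:none")]'
    );

    // Copy the node list to an array before removing nodes from the tree.
    foreach (iterator_to_array($unwanted) as $node) {
        $node->parentNode->removeChild($node);
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    return $body === null ? '' : trim($body->textContent);
}

echo $stripped, PHP_EOL;            // Hello world secretalert("x");p { color: red; }
echo html_to_text($html), PHP_EOL;  // Hello world
```

The DOM version costs a full parse, which is why strip_tags can still be the better fit for small, trusted snippets.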

Solution 2

strip_tags is sufficient for that.

Solution 3

Extracting text from HTML is tricky, so the best option is to use a library like Html2Text. It was built specifically for this purpose.

https://github.com/mtibben/html2text

Install using Composer:

composer require html2text/html2text

Basic usage:

require 'vendor/autoload.php';  // load Composer's autoloader

$html = new \Html2Text\Html2Text('Hello, &quot;<b>world</b>&quot;');

echo $html->getText();  // Hello, "WORLD"
Author: giorgio79

Updated on June 04, 2022

Comments

  • giorgio79
    giorgio79 almost 2 years

    I am looking at getting the plain text from html. Which one should I choose, php strip_tags or simplehtmldom plaintext extraction?

    One advantage of simplehtmldom is its support for invalid HTML. Is that sufficient in itself?

    • Marc B
      Marc B over 12 years
      strip_tags will give you ALL of the text in the provided document. If you want only a small piece of the document, extract that part with DOM.
  • Levi Morrison
    Levi Morrison over 12 years
    I agree on everything except the elements that aren't displayed. That use case is tiny, since nobody should be using inline styles except after JavaScript execution, which it looks like the OP doesn't care about.
  • FosAvance
    FosAvance almost 2 years
    I like this library