php: Get plain text from html - simplehtmldom or php strip_tags?
Solution 1
You should probably use simplehtmldom, both for the reason you mentioned and because strip_tags removes only the tags themselves: the JavaScript or CSS inside script/style blocks is left behind as text.
With a DOM parser you can also filter out text from elements that aren't displayed (inline style="display:none").
That said, if the HTML is simple enough, strip_tags may be faster and will accomplish the same task.
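To make the difference concrete, here is a minimal sketch. It uses PHP's built-in DOMDocument rather than simplehtmldom (a swapped-in technique, so no third-party API has to be assumed). The sample markup and the helper name html_to_text are hypothetical, and the display:none check only looks at inline style attributes, not stylesheets.

```php
<?php
// Hypothetical sample input, not taken from the question.
$html = '<html><head><style>p { color: red; }</style></head><body>'
      . '<p>Hello <b>world</b></p>'
      . '<p style="display: none">Hidden</p>'
      . '<script>alert("hi");</script></body></html>';

// strip_tags() drops the tags but keeps the text inside <style> and <script>.
$naive = strip_tags($html);

// Hypothetical helper: parse the markup, drop non-visible nodes, return text.
function html_to_text(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally so invalid markup does not emit notices.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select script/style blocks and elements hidden via an inline style.
    // translate() strips spaces so "display: none" and "display:none" both match.
    $xpath = new DOMXPath($doc);
    $query = '//script | //style'
           . ' | //*[contains(translate(@style, " ", ""), "display:none")]';

    foreach (iterator_to_array($xpath->query($query)) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    $root = $doc->documentElement;
    $text = $root !== null ? $root->textContent : '';

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$clean = html_to_text($html);
```

Here $naive still contains the CSS rule, the hidden paragraph and the alert() call, while $clean contains only the visible text. Note that loadHTML() assumes ISO-8859-1 unless the document declares another charset, so non-ASCII input may need an explicit encoding declaration.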
Solution 2
strip_tags is sufficient for that.
Solution 3
Extracting text from HTML is tricky, so the best option is to use a library like Html2Text. It was built specifically for this purpose.
https://github.com/mtibben/html2text
Install using composer:
composer require html2text/html2text
Basic usage:
$html = new \Html2Text\Html2Text('Hello, "<b>world</b>"');
echo $html->getText(); // Hello, "WORLD"
giorgio79
Updated on June 04, 2022
Comments
-
giorgio79 almost 2 years
I am looking at getting the plain text from html. Which one should I choose, php strip_tags or simplehtmldom plaintext extraction?
One pro for simplehtmldom is support of invalid html, is that sufficient in itself?
-
Marc B over 12 years
strip_tags will give you ALL of the text in the provided document. If you want only a small piece of the document, extract that part with DOM.
-
Levi Morrison over 12 years
I agree on everything except the elements that aren't displayed. That use case is very small, since nobody should be using inline styles except after JavaScript execution, which it looks like the OP doesn't care about.
-
FosAvance almost 2 years
I like this library