How do you parse and process HTML/XML in PHP?


Solution 1

Native XML Extensions

I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.

DOM

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.

DOM is capable of parsing and modifying real world (broken) HTML and it can do XPath queries. It is based on libxml.

It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API then.

How to use the DOM extension has been covered extensively on StackOverflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching/browsing Stack Overflow.

A basic usage example and a general conceptual overview are available in other answers.
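For illustration, here is a minimal sketch (the markup, class name, and attribute below are made up for the example) of loading broken HTML, querying it with XPath, and modifying it:

libxml_use_internal_errors(true); // collect parse warnings for broken markup instead of printing them
$doc = new DOMDocument();
$doc->loadHTML('<div class="post"><h2>Title</h2><p>Some <b>broken<p>markup</div>');
libxml_clear_errors();

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//div[@class="post"]/h2') as $h2) {
    echo $h2->textContent, PHP_EOL; // "Title"
}

// Modify the tree and serialize it back to HTML
$xpath->query('//div[@class="post"]')->item(0)->setAttribute('data-seen', '1');
echo $doc->saveHTML();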

XMLReader

The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.

XMLReader, like DOM, is based on libxml. I am not aware of a way to trigger its HTML Parser Module, so parsing broken HTML with XMLReader will likely be less robust than with DOM, where you can explicitly tell libxml to use its HTML Parser Module.

A basic usage example is available in another answer.
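As a rough sketch of the pull-parsing style (the file name and element name are placeholders):

$reader = new XMLReader();
$reader->open('feed.xml'); // placeholder file name

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'item') {
        // Expand only the current node into a DOM subtree for convenient access
        $node = $reader->expand();
        echo $node->textContent, PHP_EOL;
    }
}

$reader->close();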

XML Parser

This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.

The XML Parser library is also based on libxml, and implements a SAX style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.
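A minimal sketch of the push-parser style, counting start tags in a small in-memory document:

$counts = [];
$parser = xml_parser_create();

xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$counts) { // called on every start tag
        $counts[$name] = ($counts[$name] ?? 0) + 1;
    },
    function ($parser, $name) {}                        // called on every end tag
);

xml_parse($parser, '<root><a/><a/><b/></root>', true);
xml_parser_free($parser);

print_r($counts); // element names are upper-cased by default: [A => 2, B => 1]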

SimpleXML

The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXML because it will choke.

A basic usage example is available, and there are lots of additional examples in the PHP Manual.
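A minimal sketch, assuming the input is well-formed XML:

$xml = simplexml_load_string(
    '<items><item id="1">First</item><item id="2">Second</item></items>'
);

foreach ($xml->item as $item) {
    echo $item['id'], ': ', (string) $item, PHP_EOL; // "1: First", "2: Second"
}

// XPath works here too
echo (string) $xml->xpath('//item[@id="2"]')[0], PHP_EOL; // "Second"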


3rd Party Libraries (libxml based)

If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.

FluentDom

FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.

HtmlPageDom

Wa72\HtmlPageDom is a PHP library for easy manipulation of HTML documents using DOM. It requires DomCrawler from Symfony2 components for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.

phpQuery

phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library. The library is written in PHP5 and provides additional Command Line Interface (CLI).

This is described as "abandonware and buggy: use at your own risk" but does appear to be minimally maintained.

laminas-dom

The Laminas\Dom component (formerly Zend_DOM) provides tools for working with DOM documents and structures. Currently, we offer Laminas\Dom\Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.

This package is considered feature-complete, and is now in security-only maintenance mode.

fDOMDocument

fDOMDocument extends the standard DOM to throw exceptions on all errors instead of PHP warnings or notices. It also adds various custom methods and shortcuts for convenience and to simplify the use of DOM.

sabre/xml

sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "xml to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require little memory, even on large XML files.

FluidXML

FluidXML is a PHP library for manipulating XML with a concise and fluent API. It leverages XPath and the fluent programming pattern to be fun and effective.


3rd-Party (not libxml-based)

The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below:

PHP Simple HTML DOM Parser

  • An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way!
  • Requires PHP 5+.
  • Supports invalid HTML.
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.

I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Not all jQuery Selectors (such as child selectors) are possible. Any of the libxml based libraries should outperform this easily.

PHP Html Parser

PHPHtmlParser is a simple, flexible HTML parser which allows you to select tags using any CSS selector, like jQuery. The goal is to assist in the development of tools which require a quick, easy way to scrape HTML, whether it's valid or not! This project was originally supported by sunra/php-simple-html-dom-parser, but that support seems to have stopped, so this project is my adaptation of his previous work.

Again, I would not recommend this parser. It is rather slow with high CPU usage. There is also no function to clear the memory of created DOM objects. These problems scale particularly with nested loops. The documentation itself is inaccurate and misspelled, with no responses to fixes since 14 April 2016.


HTML 5

You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you may want to consider using a dedicated parser. Note that these are written in PHP, so suffer from slower performance and increased memory usage compared to a compiled extension in a lower-level language.

HTML5DomDocument

HTML5DOMDocument extends the native DOMDocument library. It fixes some bugs and adds some new functionality.

  • Preserves html entities (DOMDocument does not)
  • Preserves void tags (DOMDocument does not)
  • Allows inserting HTML code that moves the correct parts to their proper places (head elements are inserted in the head, body elements in the body)
  • Allows querying the DOM with CSS selectors (currently available: *, tagname, tagname#id, #id, tagname.classname, .classname, tagname.classname.classname2, .classname.classname2, tagname[attribute-selector], [attribute-selector], div, p, div p, div > p, div + p, and p ~ ul.)
  • Adds support for element->classList.
  • Adds support for element->innerHTML.
  • Adds support for element->outerHTML.

HTML5

HTML5 is a standards-compliant HTML5 parser and writer written entirely in PHP. It is stable and used in many production websites, and has well over five million downloads.

HTML5 provides the following features.

  • An HTML5 serializer
  • Support for PHP namespaces
  • Composer support
  • Event-based (SAX-like) parser
  • A DOM tree builder
  • Interoperability with QueryPath
  • Runs on PHP 5.3.0 or newer
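A minimal usage sketch, assuming the library is installed via Composer (the class and method names below follow its documented API; treat the snippet as an approximation):

use Masterminds\HTML5;

$html5 = new HTML5();
$dom = $html5->loadHTML('<main><article><h1>Hello</h1></article></main>'); // returns a DOMDocument

echo $dom->getElementsByTagName('h1')->item(0)->textContent, PHP_EOL; // "Hello"
echo $html5->saveHTML($dom);                                          // serialize back to HTML5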

Regular Expressions

Last and least recommended, you can extract data from HTML with regular expressions. In general using Regular Expressions on HTML is discouraged.

Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for one very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the regex fail if it is not properly written. You should know what you are doing before using regex on HTML.

HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught those rules with every new pattern you write. Regex is fine in some cases, but it really depends on your use case.

You can write more reliable extractors, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job.
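To make the brittleness concrete with a made-up snippet: a pattern that assumes href is the first attribute silently misses the link once another attribute appears, while a parser does not care about attribute order.

$html = '<a class="ext" href="http://example.com/">Example</a>';

// Brittle: assumes href immediately follows the tag name
preg_match('~<a href="([^"]+)"~', $html, $m);
var_dump($m); // empty -- the class attribute breaks the pattern

// Robust: the parser does not care about attribute order or spacing
$doc = new DOMDocument();
$doc->loadHTML($html);
echo $doc->getElementsByTagName('a')->item(0)->getAttribute('href'); // http://example.com/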

Also see Parsing Html The Cthulhu Way


Books

If you want to spend some money, have a look at

I am not affiliated with PHP Architect or the authors.

Solution 2

Try Simple HTML DOM Parser.

  • An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way!
  • Requires PHP 5+.
  • Supports invalid HTML.
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.
  • Download

Note: as the name suggests, it can be useful for simple tasks. It uses regular expressions instead of an HTML parser, so will be considerably slower for more complex tasks. The bulk of its codebase was written in 2008, with only small improvements made since then. It does not follow modern PHP coding standards and would be challenging to incorporate into a modern PSR-compliant project.

Examples:

How to get HTML elements:

// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');

// Find all images
foreach($html->find('img') as $element)
       echo $element->src . '<br>';

// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>';

How to modify HTML elements:

// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

$html->find('div', 1)->class = 'bar';

$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html;

Extract content from HTML:

// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext;

Scraping Slashdot:

// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Find all article blocks
foreach($html->find('div.article') as $article) {
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);

Solution 3

Just use DOMDocument->loadHTML() and be done with it. libxml's HTML parsing algorithm is quite good and fast, and contrary to popular belief, does not choke on malformed HTML.
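For example, a minimal sketch (the URL is a placeholder):

libxml_use_internal_errors(true); // keep warnings from sloppy markup out of the output
$doc = new DOMDocument();
$doc->loadHTML(file_get_contents('http://www.example.com/'));

foreach ($doc->getElementsByTagName('a') as $link) {
    echo $link->getAttribute('href'), PHP_EOL;
}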

Solution 4

Why you shouldn't, and when you should, use regular expressions

First off, a common misconception: regexes are not for "parsing" HTML. Regexes can, however, "extract" data. Extracting is what they're made for. The major drawbacks of regex HTML extraction over proper SGML toolkits or baseline XML parsers are their syntactic effort and varying reliability.

Consider that a somewhat dependable HTML extraction regex:

<a\s+class="?playbutton\d?[^>]+id="(\d+)".+?    <a\s+class="[\w\s]*title
[\w\s]*"[^>]+href="(http://[^">]+)"[^>]*>([^<>]+)</a>.+?

is way less readable than a simple phpQuery or QueryPath equivalent:

$div->find(".stationcool a")->attr("title");

There are however specific use cases where they can help.

  • Many DOM traversal frontends don't reveal HTML comments <!--, which, however, are sometimes more useful anchors for extraction. In particular, pseudo-HTML variations <$var> or SGML residues are easy to tame with regexps.
  • Oftentimes regular expressions can save post-processing. However, HTML entities often require manual care.
  • And lastly, for extremely simple tasks like extracting <img src= URLs, they are in fact a suitable tool. The speed advantage over SGML/XML parsers mostly only comes into play for these very basic extraction procedures.

It's sometimes even advisable to pre-extract a snippet of HTML using regular expressions /<!--CONTENT-->(.+?)<!--END-->/ and process the remainder using the simpler HTML parser frontends.
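A minimal sketch of that hybrid approach (the markers and markup are invented for the example): pre-extract the marked region with a regex, then hand the fragment to a real parser.

$page = '<html><body>noise<!--CONTENT--><ul><li><img src="/a.png"></li></ul><!--END-->noise</body></html>';

if (preg_match('~<!--CONTENT-->(.+?)<!--END-->~s', $page, $m)) {
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($m[1]); // parse only the extracted fragment

    foreach ($doc->getElementsByTagName('img') as $img) {
        echo $img->getAttribute('src'), PHP_EOL; // "/a.png"
    }
}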

Note: I actually have an app where I employ XML parsing and regular expressions alternately. Just last week the PyQuery parsing broke, and the regex still worked. Yes, weird, and I can't explain it myself. But so it happened.
So please don't vote real-world considerations down just because they don't match the regex=evil meme. But let's also not vote this up too much. It's just a sidenote for this topic.

Solution 5

Note, this answer recommends libraries that have now been abandoned for 10+ years.

phpQuery and QueryPath are extremely similar in replicating the fluent jQuery API. That's also why they're two of the easiest approaches to properly parse HTML in PHP.

Examples for QueryPath

Basically you first create a queryable DOM tree from an HTML string:

 $qp = qp("<html><body><h1>title</h1>..."); // or give filename or URL

The resulting object contains a complete tree representation of the HTML document. It can be traversed using DOM methods. But the common approach is to use CSS selectors like in jQuery:

 $qp->find("div.classname")->children()->...;

 foreach ($qp->find("p img") as $img) {
     print qp($img)->attr("src");
 }

Mostly you want to use simple #id and .class or DIV tag selectors for ->find(). But you can also use XPath statements, which sometimes are faster. Also typical jQuery methods like ->children() and ->text() and particularly ->attr() simplify extracting the right HTML snippets. (And already have their SGML entities decoded.)

 $qp->xpath("//div/p[1]");  // get first paragraph in a div

QueryPath also allows injecting new tags into the stream (->append), and later output and prettify an updated document (->writeHTML). It can not only parse malformed HTML, but also various XML dialects (with namespaces), and even extract data from HTML microformats (XFN, vCard).

 $qp->find("a[target=_blank]")->toggleClass("usability-blunder");


phpQuery or QueryPath?

Generally QueryPath is better suited for manipulation of documents, while phpQuery also implements some pseudo-AJAX methods (just HTTP requests) to more closely resemble jQuery. It is said that phpQuery is often faster than QueryPath (because of fewer overall features).

For further information on the differences see this comparison on the wayback machine from tagbyte.org. (Original source went missing, so here's an internet archive link. Yes, you can still locate missing pages, people.)

Advantages

  • Simplicity and Reliability
  • Simple to use alternatives ->find("a img, a object, div a")
  • Proper data unescaping (in comparison to regular expression grepping)

Comments

  • RobertPitt
    RobertPitt almost 2 years

    How can one parse HTML/XML and extract information from it?

  • Kornel
    Kornel over 15 years
    True. And it works with PHP's built-in XPath and XSLTProcessor classes, which are great for extracting content.
  • Frank Farmer
    Frank Farmer over 14 years
    For really mangled HTML, you can always run it through htmltidy before handing it off to DOM. Whenever I need to scrape data from HTML, I always use DOM, or at least simplexml.
  • Husky
    Husky about 14 years
    Another thing with loading malformed HTML is that it might be wise to call libxml_use_internal_errors(true) to prevent warnings that will stop parsing.
  • Bobby Jack
    Bobby Jack almost 14 years
    Not strictly true (en.wikipedia.org/wiki/Screen_scraping#Screen_scraping). The clue is in "screen"; in the case described, there's no screen involved. Although, admittedly, the term has suffered an awful lot of recent misuse.
  • RobertPitt
    RobertPitt almost 14 years
    I'm not screen scraping; the content that will be parsed will be authorized by the content supplier under my agreement.
  • RobertPitt
    RobertPitt almost 14 years
    Well, firstly there are things I need to prepare for, such as bad DOMs and invalid code, plus JS analysis against a DNSBL engine; this will also be used to look out for malicious sites/content. Also, as I have built my site around a framework of my own, the parser needs to be clean, readable, and well structured. Simple HTML DOM is great, but the code is slightly messy.
  • Gordon
    Gordon almost 14 years
    @Naveed that depends on your needs. I have no need for CSS Selector queries, which is why I use DOM with XPath exclusively. phpQuery aims to be a jQuery port. Zend_Dom is lightweight. You really have to check them out to see which one you like best.
  • RobertPitt
    RobertPitt almost 14 years
    As I said, I have used Simple HTML DOM many times before and it's excellent; I'm just looking for a system with cleaner code that's highly extendible, OO(P|D)-wise, etc.
  • Gordon
    Gordon almost 14 years
    @Robert you might also want to check out htmlpurifier.org for the security related things.
  • Gordon
    Gordon almost 14 years
    DOMComment can read comments, so no reason to use Regex for that.
  • Alohci
    Alohci almost 14 years
    Neither SGML toolkits nor XML parsers are suitable for parsing real-world HTML. For that, only a dedicated HTML parser is appropriate.
  • Gordon
    Gordon almost 14 years
    @Alohci DOM uses libxml and libxml has a separate HTML parser module which will be used when loading HTML with loadHTML() so it can very much load "real-world" (read broken) HTML.
  • Alohci
    Alohci almost 14 years
    @Gordon - thanks. HTML parsers and XML parsers are still different things though, even if they're packaged in the same library. And they're both different from DOM implementations.
  • ircmaxell
    ircmaxell almost 14 years
    Well, just a comment about your "real-world consideration" standpoint. Sure, there ARE useful situations for Regex when parsing HTML. And there are also useful situations for using GOTO. And there are useful situations for variable-variables. So no particular implementation is definitively code-rot for using it. But it is a VERY strong warning sign. And the average developer isn't likely to be nuanced enough to tell the difference. So as a general rule, Regex GOTO and Variable-Variables are all evil. There are non-evil uses, but those are the exceptions (and rare at that)... (IMHO)
  • Gordon
    Gordon almost 14 years
    If you already copy my comments, at least link them properly ;) That should be: Suggested third party alternatives to SimpleHtmlDom that actually use DOM instead of String Parsing: phpQuery, Zend_Dom, QueryPath and FluentDom.
  • johnlemon
    johnlemon almost 14 years
    Good answers are a great source. stackoverflow.com/questions/3606792/…
  • Erik
    Erik almost 14 years
    He's got one valid point: simpleHTMLDOM is hard to extend, unless you use decorator pattern, which I find unwieldy. I've found myself shudder just making changes to the underlying class(es) themselves.
  • Zero
    Zero almost 14 years
    I have used DOMDocument to parse about 1000 HTML sources (in various languages encoded with different charsets) without any issues. You might run into encoding issues with this, but they aren't insurmountable. You need to know 3 things: 1) loadHTML uses the meta tag's charset to determine encoding, 2) this can lead to incorrect encoding detection if the HTML content doesn't include this information, 3) bad UTF-8 characters can trip the parser. In such cases, use a combination of mb_detect_encoding() and Simplepie RSS Parser's encoding/converting/bad-UTF-8-character-stripping code as workarounds.
  • umpirsky
    umpirsky over 13 years
    Yes, but DOMDocument does not support CSS and XPATH queries, just getElementById or getElementsByTagName?
  • tchrist
    tchrist over 13 years
    @mario: Actually, HTML can be ‘properly’ parsed using regexes, although usually it takes several of them to do a fair job at it. It’s just a royal pain in the general case. In specific cases with well-defined input, it verges on trivial. Those are the cases that people should be using regexes on. Big old hungry heavy parsers are really what you need for general cases, though it isn’t always clear to the casual user where to draw that line. Whichever code is simpler and easier, wins.
  • user3346601
    user3346601 over 13 years
    s/HTML5/HTML/g. The syntactical constructs HTML5 allows are mostly already allowed by any previous HTML version.
  • Gordon
    Gordon over 13 years
    @Ms2ger Mostly, but not completely. As already pointed out above, you can use the libxml-based parsers, but there are special cases where those will choke. If you need maximum compatibility, you are better off with a dedicated parser. I prefer to keep the distinction.
  • CurtainDog
    CurtainDog over 13 years
    My problem with loadHTML is the extra nodes it inserts, which are presumably there to "fix" the HTML but aren't actually required by the DOM spec. As such, the result of a loadHTML call is ill defined. Would have been much better to have this sort of thing happen on saveHTML.
  • Saša Šijak
    Saša Šijak over 12 years
    DOM does actually support XPath, take a look at DOMXPath.
  • Petah
    Petah over 12 years
    Your point for not using PHP Simple HTML DOM Parser seems moot.
  • Shiplu Mokaddim
    Shiplu Mokaddim about 12 years
    As of Mar 29, 2012, DOM does not support html5, XMLReader does not support HTML and last commit on html5lib for PHP is on Sep 2009. What to use to parse HTML5, HTML4 and XHTML?
  • Gordon
    Gordon about 12 years
    @Shiplu the answer above lists all the options I know. DOM can parse anything that has a Schema or a DTD. HTML5 doesn't (officially).
  • MB34
    MB34 about 12 years
    What I did was run my html through tidy before sending it to SimpleDOM.
  • cHao
    cHao over 11 years
    curl can get the file, but it won't parse HTML for you. That's the hard part.
  • Nikola Petkanski
    Nikola Petkanski about 11 years
    jquery-like css queries is well said, because there are some things that are missing in w3c documentation, but are present as extra features in jquery.
  • griffin
    griffin almost 11 years
    Just to add some experience: I've used some of them, and am now always recommending ganon, as in most (of my) cases it's actually way faster than even the native versions because of how it works, and also works very well with invalid/damaged/incomplete documents (which none of the others I know of can handle at all). Sometimes it's also worth it to just regress to writing your own or use regex, but thats ONLY if you have very special and simple requirements (e.g. must only support 2 tags in a fixed format)
  • Gordon
    Gordon almost 11 years
    @Jimmy it doesn't include anything about cURL because cURL is not a tool to parse and process HTML/XML with. cURL is a client for various network protocols. For instance, you can fetch websites with it. Most of the libraries above have ways to load remote URLs directly, so you don't need cURL at all, for instance DOM has loadHTMLFile().
  • andig
    andig over 10 years
    Regarding 3rd Party Libraries (libxml based), I've found that QueryPath doesn't work for me as it choked on malformed HTML (even using htmlqp()), and phpQuery is a little harder to approach. Besides, html5lib has a very active Python part, but the PHP port seems to be of low maintenance. If you're looking for a quick-and-dirty solution, I can recommend github.com/hkk12369/php-html-parser
  • hek2mgl
    hek2mgl almost 10 years
    "Most XML parsers cannot see HTML document comments" - I'm not sure which parser you are using, but my parser can "read" comments. -1
  • John Slegers
    John Slegers almost 10 years
    @Gordon I suggest adding Symfony's "CSSSelector" component for adding "CSS selector" based DOM crawling to DOMDocument ( as explained in stackoverflow.com/questions/3577641/… ) and Symfony's "DOMCrawler" component, depending on whether you want low level access to the DOM or a more high level approach.
  • John Slegers
    John Slegers almost 10 years
    My preference goes to using DOMDocument->loadHTML() in combination with Symfony's "CSSSelector" component, which translates CSS Selectors to XPath selectors. It's still very low level and makes DOM a lot easier to use for those with lots of experience in frontend programming ( see stackoverflow.com/questions/3577641/… for more details )
  • Admin
    Admin about 9 years
    Remember? You cannot parse (X)HTML using regular expressions! (It struck me the day I read it; I now believe mentioning regex alongside HTML is a sin.)
  • Gordon
    Gordon about 9 years
    @Nasha I deliberately excluded the infamous Zalgo rant from the list above because it's not too helpful on its own and led to quite some cargo culting since it was written. People were slapped down with that link no matter how appropriate a regex would have been as a solution. For a more balanced opinion, please see the link I did include instead and go through the comments at stackoverflow.com/questions/4245008/…
  • luke_mclachlan
    luke_mclachlan about 8 years
    I'm using this currently, running it as part of a project to process a few hundred urls. It's becoming very slow and regular timeouts persist. It is a great beginners script and intuitively simple to learn, but just too basic for more advanced projects.
  • lithiumlab
    lithiumlab over 7 years
    Looks like the right tool for the job, but it's not loading for me in PHP 5.6.23 in WordPress. Any additional directions on how to include it correctly? I included it with: define("BASE_PATH", dirname(__FILE__)); define("LIBRARY_PATH", BASE_PATH . DIRECTORY_SEPARATOR . 'lib/vendor'); require LIBRARY_PATH . DIRECTORY_SEPARATOR . 'Loader.php'; Loader::init(array(LIBRARY_PATH, USER_PATH)); in functions.php
  • ChrisJJ
    ChrisJJ over 7 years
    I've got good results from Advanced Html Dom, and I think it should be on the list in the accepted answer. An important thing to know though for anyone relying on its "The goal of this project is to be a DOM-based drop-in replacement for PHP's simple html dom library ... If you use file/str_get_html then you don't need to change anything." archive.is/QtSuj#selection-933.34-933.100 is that you may need to make changes to your code to accommodate some incompatibilities. I've noted four known to me in the project's github issues. github.com/monkeysuffrage/advanced_html_dom/issues
  • CubicleSoft
    CubicleSoft over 6 years
    Ultimate Web Scraper Toolkit's TagFilter class is distinctly missing from this list. I used Simple HTML DOM for many years because it was the most reliably consistent thing I could find. TagFilter is something I wrote initially because I needed to be able to cleanly process Word HTML but then I realized I was in reach of replacing both Simple HTML DOM and HTMLPurifier with something far more flexible, scalable (to handle multi-MB HTML files without memory leaks), and much faster. In the case of the 1MB+ HTMLPurifier library, it's much smaller and self-contained. It's also maintained.
  • scott8035
    scott8035 over 3 years
    @Gordon, thank you for the fantastically excellent answer. It's 10 years old now, I wouldn't suppose you could do a refresh for 2021?
  • Gordon
    Gordon over 3 years
    @scott8035 Thanks. I have no idea about current XML libs. I'd probably still use the native DOM extension