Parsing Huge XML Files in PHP

90,587

Solution 1

There are only two php APIs that are really suited for processing large files. The first is the old expat api, and the second is the newer XMLreader functions. These apis read continuous streams rather than loading the entire tree into memory (which is what simplexml and DOM does).

For an example, you might want to look at this partial parser of the DMOZ-catalog:

<?php

class SimpleDMOZParser
{
    protected $_stack = array();
    protected $_file = "";
    protected $_parser = null;

    protected $_currentId = "";
    protected $_current = "";

    public function __construct($file)
    {
        $this->_file = $file;

        $this->_parser = xml_parser_create("UTF-8");
        xml_set_object($this->_parser, $this);
        xml_set_element_handler($this->_parser, "startTag", "endTag");
    }

    public function startTag($parser, $name, $attribs)
    {
        array_push($this->_stack, $this->_current);

        if ($name == "TOPIC" && count($attribs)) {
            $this->_currentId = $attribs["R:ID"];
        }

        if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
            echo $attribs["R:RESOURCE"] . "\n";
        }

        $this->_current = $name;
    }

    public function endTag($parser, $name)
    {
        $this->_current = array_pop($this->_stack);
    }

    public function parse()
    {
        $fh = fopen($this->_file, "r");
        if (!$fh) {
            die("Epic fail!\n");
        }

        while (!feof($fh)) {
            $data = fread($fh, 4096);
            xml_parse($this->_parser, $data, feof($fh));
        }
    }
}

$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();

Solution 2

This is a very similar question to Best way to process large XML in PHP but with a very good specific answer upvoted addressing the specific problem of DMOZ catalogue parsing. However, since this is a good Google hit for large XMLs in general, I will repost my answer from the other question as well:

My take on it:

https://github.com/prewk/XmlStreamer

A simple class that will extract all children to the XML root element while streaming the file. Tested on 108 MB XML file from pubmed.com.

class SimpleXmlStreamer extends XmlStreamer {
    public function processNode($xmlString, $elementName, $nodeIndex) {
        $xml = simplexml_load_string($xmlString);

        // Do something with your SimpleXML object

        return true;
    }
}

$streamer = new SimpleXmlStreamer("myLargeXmlFile.xml");
$streamer->parse();

Solution 3

I would suggest using a SAX based parser rather than DOM based parsing.

Info on using SAX in PHP: http://www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm

Solution 4

This isn't a great solution, but just to throw another option out there:

You can break many large XML files up into chunks, especially those that are really just lists of similar elements (as I suspect the file you're working with would be).

e.g., if your doc looks like:

<dmoz>
  <listing>....</listing>
  <listing>....</listing>
  <listing>....</listing>
  <listing>....</listing>
  <listing>....</listing>
  <listing>....</listing>
  ...
</dmoz>

You can read it in a meg or two at a time, artificially wrap the few complete <listing> tags you loaded in a root level tag, and then load them via simplexml/domxml (I used domxml, when taking this approach).

Frankly, I prefer this approach if you're using PHP < 5.1.2. With 5.1.2 and higher, XMLReader is available, which is probably the best option, but before that, you're stuck with either the above chunking strategy, or the old SAX/expat lib. And I don't know about the rest of you, but I HATE writing/maintaining SAX/expat parsers.

Note, however, that this approach is NOT really practical when your document doesn't consist of many identical bottom-level elements (e.g., it works great for any sort of list of files, or URLs, etc., but wouldn't make sense for parsing a large HTML document)

Solution 5

This is an old post, but first in the google search result, so I thought I post another solution based on this post:

http://drib.tech/programming/parse-large-xml-files-php

This solution uses both XMLReader and SimpleXMLElement :

$xmlFile = 'the_LARGE_xml_file_to_load.xml'
$primEL  = 'the_name_of_your_element';

$xml     = new XMLReader();
$xml->open($xmlFile);

// finding first primary element to work with
while($xml->read() && $xml->name != $primEL){;}

// looping through elements
while($xml->name == $primEL) {
    // loading element data into simpleXML object
    $element = new SimpleXMLElement($xml->readOuterXML());

    // DO STUFF

    // moving pointer   
    $xml->next($primEL);
    // clearing current element
    unset($element);
} // end while

$xml->close();
Share:
90,587
Ian
Author by

Ian

People tell me I'm too verbose.

Updated on February 23, 2021

Comments

  • Ian
    Ian over 3 years

    I'm trying to parse the DMOZ content/structures XML files into MySQL, but all existing scripts to do this are very old and don't work well. How can I go about opening a large (+1GB) XML file in PHP for parsing?