What is the best way to parse (big) XML in C# Code?

70,463

Solution 1

Use XmlReader to parse large XML documents. XmlReader provides fast, forward-only, non-cached access to XML data. (Forward-only means you can read the XML file from beginning to end but cannot move backwards in the file.) XmlReader uses little memory and is comparable to a simple SAX reader, although you pull nodes from it yourself rather than receiving callbacks.

    using (XmlReader myReader = XmlReader.Create(@"c:\data\coords.xml"))
    {
        while (myReader.Read())
        {
           // Process each node (myReader.Value) here
           // ...
        }
    }

You can use XmlReader to process files that are up to 2 gigabytes (GB) in size.

Ref: How to read XML from a file by using Visual C#
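
To make the "process each node" step a little more concrete, here is a minimal sketch that streams through a document and collects one attribute from each matching element. The class name FeatureIdScanner and the `extract` root element in the usage line are made up for illustration; the `feature` element and `featId` attribute come from the sample XML in the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of the original answer.
public static class FeatureIdScanner
{
    // Collects the featId attribute of every <feature> element,
    // visiting one node at a time without building a DOM.
    public static List<string> ReadFeatureIds(TextReader input)
    {
        var ids = new List<string>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.Name == "feature")
                {
                    // GetAttribute returns null when the attribute is absent.
                    ids.Add(reader.GetAttribute("featId"));
                }
            }
        }
        return ids;
    }
}
```

Usage might look like `FeatureIdScanner.ReadFeatureIds(new StringReader("<extract><feature featId=\"1\"/></extract>"))`. Only the list of ids is held in memory, never the coordinate text.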

Solution 2

As at 14 May 2009: I've switched to using a hybrid approach... see code below.

This version has most of the advantages of both:
  * the XmlReader/XmlTextReader (memory efficiency --> speed); and
  * the XmlSerializer (code-gen --> development expediency and flexibility).

It uses the XmlTextReader to iterate through the document, and creates "doclets" which it deserializes using the XmlSerializer and "XML binding" classes generated with XSD.EXE.

I guess this recipe is universally applicable, and it's fast... I'm parsing a 201 MB XML Document containing 56,000 GML Features in about 7 seconds... the old VB6 implementation of this application took minutes (or even hours) to parse large extracts... so I'm lookin' good to go.

Once again, a BIG Thank You to the forumites for donating your valuable time. I really appreciate it.

Cheers all. Keith.

using System;
using System.Reflection;
using System.Xml;
using System.Xml.Serialization;
using System.IO;
using System.Collections.Generic;

using nrw_rime_extract.utils;
using nrw_rime_extract.xml.generated_bindings;

namespace nrw_rime_extract.xml
{
    internal interface ExtractXmlReader
    {
        rimeType read(string xmlFilename);
    }

    /// <summary>
    /// RimeExtractXml provides bindings to the RIME Extract XML as defined by
    /// $/Release 2.7/Documentation/Technical/SCHEMA and DTDs/nrw-rime-extract.xsd
    /// </summary>
    internal class ExtractXmlReader_XmlSerializerImpl : ExtractXmlReader
    {
        private Log log = Log.getInstance();

        public rimeType read(string xmlFilename)
        {
            log.write(
                string.Format(
                    "DEBUG: ExtractXmlReader_XmlSerializerImpl.read({0})",
                    xmlFilename));
            using (Stream stream = new FileStream(xmlFilename, FileMode.Open))
            {
                return read(stream);
            }
        }

        internal rimeType read(Stream xmlInputStream)
        {
            // create an instance of the XmlSerializer class, 
            // specifying the type of object to be deserialized.
            XmlSerializer serializer = new XmlSerializer(typeof(rimeType));
            serializer.UnknownNode += new XmlNodeEventHandler(handleUnknownNode);
            serializer.UnknownAttribute += 
                new XmlAttributeEventHandler(handleUnknownAttribute);
            // use the Deserialize method to restore the object's state
            // with data from the XML document.
            return (rimeType)serializer.Deserialize(xmlInputStream);
        }

        protected void handleUnknownNode(object sender, XmlNodeEventArgs e)
        {
            log.write(
                string.Format(
                    "XML_ERROR: Unknown Node at line {0} position {1} : {2}\t{3}",
                    e.LineNumber, e.LinePosition, e.Name, e.Text));
        }

        protected void handleUnknownAttribute(object sender, XmlAttributeEventArgs e)
        {
            log.write(
                string.Format(
                    "XML_ERROR: Unknown Attribute at line {0} position {1} : {2}='{3}'",
                    e.LineNumber, e.LinePosition, e.Attr.Name, e.Attr.Value));
        }

    }

    /// <summary>
    /// ExtractXmlReader provides bindings to the extract.xml
    /// returned by the RIME server; as defined by:
    ///   $/Release X/Documentation/Technical/SCHEMA and 
    /// DTDs/nrw-rime-extract.xsd
    /// </summary>
    internal class ExtractXmlReader_XmlTextReaderXmlSerializerHybridImpl :
        ExtractXmlReader
    {
        private Log log = Log.getInstance();

        public rimeType read(string xmlFilename)
        {
            log.write(
                string.Format(
                    "DEBUG: ExtractXmlReader_XmlTextReaderXmlSerializerHybridImpl." +
                    "read({0})",
                    xmlFilename));

            using (XmlReader reader = XmlReader.Create(xmlFilename))
            {
                return read(reader);
            }

        }

        public rimeType read(XmlReader reader)
        {
            rimeType result = new rimeType();
            // a deserializer for featureClass, feature, etc, "doclets"
            Dictionary<Type, XmlSerializer> serializers = 
                new Dictionary<Type, XmlSerializer>();
            serializers.Add(typeof(featureClassType), 
                newSerializer(typeof(featureClassType)));
            serializers.Add(typeof(featureType), 
                newSerializer(typeof(featureType)));

            List<featureClassType> featureClasses = new List<featureClassType>();
            List<featureType> features = new List<featureType>();
            while (!reader.EOF)
            {
                if (reader.MoveToContent() != XmlNodeType.Element)
                {
                    reader.Read(); // skip non-element-nodes and unknown-elements.
                    continue;
                }

                // deserialize each featureClass "doclet".
                if (reader.Name.Equals("featureClass"))
                {
                    using (
                        StringReader elementReader =
                            new StringReader(reader.ReadOuterXml()))
                    {
                        XmlSerializer deserializer =
                            serializers[typeof (featureClassType)];
                        featureClasses.Add(
                            (featureClassType)
                            deserializer.Deserialize(elementReader));
                    }
                    continue;
                    // ReadOuterXml advances the reader, so don't read again.
                }

                if (reader.Name.Equals("feature"))
                {
                    using (
                        StringReader elementReader =
                            new StringReader(reader.ReadOuterXml()))
                    {
                        XmlSerializer deserializer =
                            serializers[typeof (featureType)];
                        features.Add(
                            (featureType)
                            deserializer.Deserialize(elementReader));
                    }
                    continue;
                    // ReadOuterXml advances the reader, so don't read again.
                }

                log.write(
                    "WARNING: unknown element '" + reader.Name +
                    "' was skipped during parsing.");
                reader.Read(); // skip non-element-nodes and unknown-elements.
            }
            result.featureClasses = featureClasses.ToArray();
            result.features = features.ToArray();
            return result;
        }

        private XmlSerializer newSerializer(Type elementType)
        {
            XmlSerializer serializer = new XmlSerializer(elementType);
            serializer.UnknownNode += new XmlNodeEventHandler(handleUnknownNode);
            serializer.UnknownAttribute += 
                new XmlAttributeEventHandler(handleUnknownAttribute);
            return serializer;
        }

        protected void handleUnknownNode(object sender, XmlNodeEventArgs e)
        {
            log.write(
                string.Format(
                    "XML_ERROR: Unknown Node at line {0} position {1} : {2}\t{3}",
                    e.LineNumber, e.LinePosition, e.Name, e.Text));
        }

        protected void handleUnknownAttribute(object sender, XmlAttributeEventArgs e)
        {
            log.write(
                string.Format(
                    "XML_ERROR: Unknown Attribute at line {0} position {1} : {2}='{3}'",
                    e.LineNumber, e.LinePosition, e.Attr.Name, e.Attr.Value));
        }
    }
}
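
For anyone adapting this: the same "doclet" idea can probably be written without the intermediate string from ReadOuterXml, by handing the serializer a sub-reader from XmlReader.ReadSubtree(). This is only a sketch under assumptions. FeatureDoclet is a hypothetical stand-in for the XSD.EXE-generated featureType, DocletReader is a made-up name, and I haven't profiled it against the version above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical stand-in for a binding class generated by XSD.EXE.
[XmlRoot("feature")]
public class FeatureDoclet
{
    [XmlAttribute("featId")]
    public string FeatId { get; set; }

    [XmlAttribute("fType")]
    public string FType { get; set; }
}

public static class DocletReader
{
    public static List<FeatureDoclet> ReadFeatures(TextReader input)
    {
        var serializer = new XmlSerializer(typeof(FeatureDoclet));
        var features = new List<FeatureDoclet>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            // ReadToFollowing advances to the next <feature> start tag, if any.
            while (reader.ReadToFollowing("feature"))
            {
                // The sub-reader exposes only this element and its children;
                // disposing it leaves the outer reader at the end of the element.
                using (XmlReader doclet = reader.ReadSubtree())
                {
                    features.Add((FeatureDoclet)serializer.Deserialize(doclet));
                }
            }
        }
        return features;
    }
}
```

Only one feature is materialised at a time while parsing, which is the same trade-off as the hybrid class above; the returned list is still as big as the extract.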

Solution 3

Just to summarise, and make the answer a bit more obvious for anyone who finds this thread via Google.

Prior to .NET 2 the XmlTextReader was the most memory efficient XML parser available in the standard API (thanx Mitch;-)

.NET 2 introduced the XmlReader.Create factory method, which is better again. The reader it returns is a forward-only iterator (a bit like a StAX parser). (thanx Cerebrus;-)

And remember kiddies, if any XML instance has the potential to be bigger than about 500k, DON'T USE DOM!

Cheers all. Keith.

Solution 4

A SAX parser might be what you're looking for. SAX does not require you to read the entire document into memory - it parses through it incrementally and allows you to process the elements as you go. I don't know if there is a SAX parser provided in .NET, but there are a few open-source options that you could look at:

Here's a related post:
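
As a rough illustration of the SAX idea itself, here is a minimal sketch of a push-style layer over XmlReader. Everything named here (IContentHandler, MiniSax, RecordingHandler) is hypothetical and only loosely modelled on SAX; it is not an existing .NET API, and it ignores attributes and namespaces.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical callback interface, loosely modelled on a SAX content handler.
public interface IContentHandler
{
    void StartElement(string name);
    void EndElement(string name);
    void Characters(string text);
}

public static class MiniSax
{
    // Pulls nodes from XmlReader and pushes them to the handler as events.
    public static void Parse(TextReader input, IContentHandler handler)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                    {
                        string name = reader.Name;
                        handler.StartElement(name);
                        // <c/> produces no EndElement node, so raise it here.
                        if (reader.IsEmptyElement) handler.EndElement(name);
                        break;
                    }
                    case XmlNodeType.EndElement:
                        handler.EndElement(reader.Name);
                        break;
                    case XmlNodeType.Text:
                    case XmlNodeType.CDATA:
                        handler.Characters(reader.Value);
                        break;
                }
            }
        }
    }
}

// Example handler that just records the events it receives.
public class RecordingHandler : IContentHandler
{
    public readonly List<string> Events = new List<string>();
    public void StartElement(string name) { Events.Add("start:" + name); }
    public void EndElement(string name) { Events.Add("end:" + name); }
    public void Characters(string text) { Events.Add("text:" + text); }
}
```

Calling `MiniSax.Parse(new StringReader("<a><b>hi</b><c/></a>"), new RecordingHandler())` raises the events in document order, and nothing beyond the current node is kept in memory.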

Solution 5

Just wanted to add this simple extension method as an example of using XmlReader (as Mitch answered):

public static bool SkipToElement (this XmlReader xmlReader, string elementName)
{
    if (!xmlReader.Read ())
        return false;

    while (!xmlReader.EOF)
    {
        if (xmlReader.NodeType == XmlNodeType.Element && xmlReader.Name == elementName)
            return true;

        xmlReader.Skip ();
    }

    return false;
}

And usage:

using (var xml_reader = XmlReader.Create (this.source.Url))
{
    if (!xml_reader.SkipToElement ("Root"))
        throw new InvalidOperationException ("XML element \"Root\" was not found.");

    if (!xml_reader.SkipToElement ("Users"))
        throw new InvalidOperationException ("XML element \"Root/Users\" was not found.");

    ...
}
Author by

corlettk

I'm not a bad hacker... A red hat of course.

Updated on July 08, 2022

Comments

  • corlettk
    corlettk almost 2 years

    I'm writing a GIS client tool in C# to retrieve "features" in a GML-based XML schema (sample below) from a server. Extracts are limited to 100,000 features.

    I guesstimate that the largest extract.xml might get up around 150 megabytes, so obviously DOM parsers are out. I've been trying to decide between XmlSerializer and XSD.EXE generated bindings --OR-- XmlReader and a hand-crafted object graph.

    Or maybe there's a better way which I haven't considered yet? Like XLINQ, or ????

    Please can anybody guide me? Especially with regards to the memory efficiency of any given approach. If not I'll have to "prototype" both solutions and profile them side-by-side.

    I'm a bit of a raw prawn in .NET. Any guidance would be greatly appreciated.

    Thanking you. Keith.


    Sample XML - up to 100,000 of them, of up to 234,600 coords per feature.

    <feature featId="27168306" fType="vegetation" fTypeId="1129" fClass="vegetation" gType="Polygon" ID="0" cLockNr="51598" metadataId="51599" mdFileId="NRM/TIS/VEGETATION/9543_22_v3" dataScale="25000">
      <MultiGeometry>
        <geometryMember>
          <Polygon>
            <outerBoundaryIs>
              <LinearRing>
                <coordinates>153.505004,-27.42196 153.505044,-27.422015 153.503992 .... 172 coordinates omitted to save space ... 153.505004,-27.42196</coordinates>
              </LinearRing>
            </outerBoundaryIs>
          </Polygon>
        </geometryMember>
      </MultiGeometry>
    </feature>
    
  • MrTelly
    MrTelly about 15 years
    It would be interesting to compare the performance of SAX v XmlTextReader - has anyone tried this?
  • Andy White
    Andy White about 15 years
    I'd be interested too, I haven't compared them
  • Cerebrus
    Cerebrus about 15 years
    IIRC, .NET 2.0 onwards, MS recommends using the XmlReader class directly instead of the XmlTextReader.
  • corlettk
    corlettk about 15 years
    .NET doesn't provide a native SAX parser, but I read an article (in slashdot, I think) which showed how easy it was to roll your own SAX parser using the XmlReader "primitives".
  • corlettk
    corlettk about 15 years
    @Cerebrus and Mitch: Thank you gentlemen. That's pretty much what I thought, but it's really very nice to get a second (informed) opinion before wasting days pursuing potentially the wrong path. Greatly appreciated!
  • corlettk
    corlettk almost 11 years
    Nice... One suggested improvement: the absence of the sought element is always terminal to the current operation (it HAS to be, having skipped our reader to EOF), so just throw the exception directly in SkipTo instead of returning false... you've got the sought element name to report, so use it instead of repeating yourself in error messages.
  • Michael Logutov
    Michael Logutov almost 11 years
    Yeah, you're right. It's just that in my specific case I needed to report the full path to the missing element and not just its name.
  • Andrius Bentkus
    Andrius Bentkus almost 10 years
    Is it possible to use this on chunked byte array input?
  • Solomon Duskis
    Solomon Duskis over 8 years
    "files that are up to 2 gigabytes (GB) in size" - I couldn't find a reference explaining this limit, and no-one else seems to mention it. Do you have a link explaining this limit?
  • abrown
    abrown over 7 years
    @Nickolay The 2GB limit is referenced by MSDN here: msdn.microsoft.com/en-us/library/ff647804.aspx : "You can only use XmlTextReader and XmlValidatingReader to process files that are up to 2 gigabytes (GB) in size. If you need to process larger files, divide the source file into multiple smaller files or streams."
  • Iúri dos Anjos
    Iúri dos Anjos almost 6 years
    How can I use it and still check the InnerXml.Length of each node? I tried to use "ReadInnerXml().Length" but I get OutOfMemory on large files.