A lightweight XML parser efficient for large files?

Solution 1

If you are using C, you can use libxml2 from the GNOME project. It offers both DOM and SAX interfaces to your document, plus many additional features developed over the years. If you really want C++, you can use libxml++, a C++ OO wrapper around libxml2.

The library has been proven again and again, is high performance, and can be compiled on almost any platform you can find.

Solution 2

I like Expat: http://expat.sourceforge.net/

It is C-based, but there are several C++ wrappers around to help.

Solution 3

RapidXML is quite a fast parser for XML written in C++.

Solution 4

http://sourceforge.net/projects/wsdlpull is a straight C++ port of the Java XmlPull API (http://www.xmlpull.org/).

I would highly recommend this parser. I had to customize it for use on my embedded device (no STL support), but I have found it to be very fast with very little overhead. I had to write my own string and vector classes, and even with those it compiles to about 60 KB on Windows.

I think that pull parsing is a lot more intuitive than something like SAX. The code much more closely mirrors the XML document, making it easy to correlate the two.

The one downside is that it is forward only, meaning that you need to parse the elements as they come. We have a fairly messed up design for reading our config files: I need to parse a whole subtree, make some checks, set some defaults, and then parse it again. With this parser the only real way to handle something like that is to make a copy of the parser state, parse with the copy, then continue on with the original. It still ends up being a big win in terms of resources compared with our old DOM parser.
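To illustrate the "copy the state, parse with the copy, continue with the original" trick, here is a minimal sketch. It is not the wsdlpull API: `PullParser`, `Event`, and `countNestedElements` are hypothetical names made up for this example, and the parser only understands plain start tags, end tags, and text (no attributes, self-closing tags, comments, or entities).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical event types for a minimal forward-only pull parser.
enum class Event { StartTag, EndTag, Text, EndDocument };

class PullParser {
public:
    // The parser only refers to `input`; the caller must keep it alive.
    explicit PullParser(const std::string& input) : in_(&input) {}

    // Advance to the next event. There is no way to step back.
    Event next() {
        const std::string& s = *in_;
        if (pos_ >= s.size()) return Event::EndDocument;
        if (s[pos_] == '<') {
            const bool closing = pos_ + 1 < s.size() && s[pos_ + 1] == '/';
            const std::size_t start = pos_ + (closing ? 2 : 1);
            std::size_t end = s.find('>', start);
            if (end == std::string::npos) end = s.size();
            value_ = s.substr(start, end - start);      // tag name
            pos_ = end < s.size() ? end + 1 : end;
            return closing ? Event::EndTag : Event::StartTag;
        }
        std::size_t end = s.find('<', pos_);
        if (end == std::string::npos) end = s.size();
        value_ = s.substr(pos_, end - pos_);            // text content
        pos_ = end;
        return Event::Text;
    }

    // Tag name or text belonging to the most recent event.
    const std::string& value() const { return value_; }

private:
    const std::string* in_;   // not owned
    std::size_t pos_ = 0;     // the whole parse state is this offset
    std::string value_;
};

// Look ahead through the subtree whose start tag was just consumed.
// The parser is taken by value, so the caller's parser does not move.
// Returns the number of start tags found inside that subtree.
int countNestedElements(PullParser lookahead) {
    int depth = 1;
    int count = 0;
    while (depth > 0) {
        const Event e = lookahead.next();
        if (e == Event::EndDocument) break;   // truncated input
        if (e == Event::StartTag) { ++depth; ++count; }
        else if (e == Event::EndTag) { --depth; }
    }
    return count;
}
```

In use, you would call `next()` until you reach the start tag of the subtree, pass the parser to `countNestedElements` (or any other check) by value, and then keep calling `next()` on the original, which is still positioned just inside the subtree. Because the state here is only a pointer and an offset, the copy is cheap; a real parser that buffers input from a stream would need more care.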

Solution 5

If your XML structure is very simple, you can consider building a simple lexer/scanner based on lex/yacc (flex/bison). The sources at the W3C may inspire you: http://www.w3.org/XML/9707/parser.y and http://www.w3.org/XML/9707/scanner.l.

See also the SAX2 interface in libxml.
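If you would rather write the scanner by hand than generate it with flex, something along these lines may be enough for a very simple structure. This is only a sketch under stated assumptions: `Handlers` and `scanXml` are hypothetical names (loosely modeled on SAX-style callbacks, not taken from libxml or the W3C sources), the input is assumed to be well formed, and comments, entities, CDATA sections, processing instructions, and namespaces are not handled. It reads the stream one character at a time, so memory use does not grow with file size.

```cpp
#include <cassert>
#include <cctype>
#include <cstdio>
#include <functional>
#include <istream>
#include <map>
#include <sstream>
#include <string>

typedef std::map<std::string, std::string> Attrs;

// Hypothetical SAX-style callbacks; all three must be set before scanning.
struct Handlers {
    std::function<void(const std::string& name, const Attrs& attrs)> startElement;
    std::function<void(const std::string& name)> endElement;
    std::function<void(const std::string& text)> characters;
};

static void skipSpace(std::istream& in) {
    while (std::isspace(in.peek())) in.get();
}

// Read an element or attribute name: everything up to whitespace, '=', '/', or '>'.
static std::string readName(std::istream& in) {
    std::string name;
    for (int c = in.peek();
         c != EOF && !std::isspace(c) && c != '=' && c != '/' && c != '>';
         c = in.peek()) {
        name += static_cast<char>(in.get());
    }
    return name;
}

// Scan the stream and fire callbacks as elements and text are seen.
// Returns false if a tag is malformed or the input ends inside a tag.
bool scanXml(std::istream& in, const Handlers& h) {
    std::string text;
    char c;
    while (in.get(c)) {
        if (c != '<') { text += c; continue; }
        if (!text.empty()) { h.characters(text); text.clear(); }

        if (in.peek() == '/') {                 // closing tag: </name>
            in.get();
            const std::string name = readName(in);
            skipSpace(in);
            if (in.get() != '>') return false;
            h.endElement(name);
            continue;
        }

        const std::string name = readName(in);  // opening tag: <name attr="v" ...>
        Attrs attrs;
        bool selfClosing = false;
        for (;;) {
            skipSpace(in);
            const int p = in.peek();
            if (p == EOF) return false;
            if (p == '>') { in.get(); break; }
            if (p == '/') { in.get(); selfClosing = true; continue; }
            const std::string key = readName(in);
            skipSpace(in);
            if (in.get() != '=') return false;
            skipSpace(in);
            const int quote = in.get();
            if (quote != '"' && quote != '\'') return false;
            std::string value;
            if (!std::getline(in, value, static_cast<char>(quote))) return false;
            attrs[key] = value;
        }
        h.startElement(name, attrs);
        if (selfClosing) h.endElement(name);    // <name/> yields both events
    }
    if (!text.empty()) h.characters(text);
    return true;
}
```

To try it, fill in the three callbacks with lambdas and pass a `std::ifstream` (or a `std::istringstream` for testing) to `scanXml`. Text is accumulated until the next tag, so a single enormous text node would still be held in memory in one piece; flushing `text` in fixed-size chunks would avoid that.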

Author by

Alex Jenter

Author of CintaNotes, a lightweight and fast personal notes manager for Windows.

Updated on July 29, 2022

Comments

  • Alex Jenter
    Alex Jenter almost 2 years

    I need to parse potentially huge XML files, so I guess this rules out DOM parsers.

    Is there any good lightweight SAX parser for C++ out there, comparable with TinyXML in footprint? The structure of the XML is very simple; no advanced things like namespaces and DTDs are needed. Just elements, attributes and cdata.

    I know about Xerces, but its sheer size of over 50 MB gives me shivers.

    Thanks!

  • Alex Jenter
    Alex Jenter almost 15 years
    Thanks for the answer. Is LibXML lightweight? How many kilobytes does it add to the executable?
  • Tony Miller
    Tony Miller almost 15 years
    If you're using a dynamic library (UNIX shared lib / Windows DLL), then the answer is "none". A quick check on my Linux box shows that the shared lib is 1.2 MB and the static library (to be compiled into programs) is 1.5 MB. So if you did a static compile you'd be adding roughly 1.5 MB to your exe.
  • Alex Jenter
    Alex Jenter almost 15 years
    My whole .exe is around 350 KB, so I guess I'll try to find something more lightweight, but thanks anyway.
  • Tony Miller
    Tony Miller almost 15 years
    If you're truly worried about size, try Expat at expat.sourceforge.net. Its shared library size on my Linux box is 133 KB. I'm guessing that a static .a compiled into your code would add about that much.
  • Dolphin
    Dolphin almost 15 years
    It parses a character at a time and uses an int for the character. For element and attribute names, it has a rather restrictive definition of what a valid identifier is (ASCII, basically), but it probably wouldn't take much to change it. It comes with a project that does a parse/serialize test, so it is pretty easy to run it across some representative data to try it out.
  • Alex Jenter
    Alex Jenter almost 15 years
    Thanks for the answer, I'll look into it
  • Stephen J
    Stephen J about 12 years
    Crashes on Android. Can't use exceptions.
  • Nick
    Nick over 11 years
    This is a DOM parser, but it parses "in-situ", i.e. it changes the source XML data, so you have to load all of the data.
  • James McLaughlin
    James McLaughlin over 6 years
    The question asked for a SAX parser. Not really viable to load extremely large XML files into a DOM structure.