Good tools for creating a C/C++ parser/analyzer

67,413

Solution 1

Parsing C++ is extremely hard because the grammar is undecidable. To quote Yossi Kreinin:

Outstandingly complicated grammar

"Outstandingly" should be interpreted literally, because all popular languages have context-free (or "nearly" context-free) grammars, while C++ has undecidable grammar. If you like compilers and parsers, you probably know what this means. If you're not into this kind of thing, there's a simple example showing the problem with parsing C++: is AA BB(CC); an object definition or a function declaration? It turns out that the answer depends heavily on the code before the statement - the "context". This shows (on an intuitive level) that the C++ grammar is quite context-sensitive.

Solution 2

You can look at clang that uses llvm for parsing.

Support C++ fully now link

Solution 3

The ANTLR parser generator has a grammar for C/C++ as well as the preprocessor. I've never used it so I can't say how complete its parsing of C++ is going to be. ANTLR itself has been a useful tool for me on a couple of occasions for parsing much simpler languages.

Solution 4

Depending on your problem GCCXML might be your answer. Basically it parses the source using GCC and then gives you easily digestible XML of parse tree. With GCCXML you are done once and for all.

Solution 5

pycparser is a complete parser for C (C99) written in Python. It has a fully configurable AST backend, so it's being used as a basis for any kind of language processing you might need.

Doesn't support C++, though. Granted, it's much harder than C.


Update (2012): at this time the answer, without any doubt, would be Clang - it's modular, supports the full C++ (with many C++-11 features) and has a relatively friendly code base. It also has a C API for bindings to high-level languages (i.e. for Python).

Share:
67,413
Matt Ball
Author by

Matt Ball

Matt Ball is a software engineer who used to work at Google.

Updated on July 13, 2022

Comments

  • Matt Ball
    Matt Ball almost 2 years

    What are some good tools for getting a quick start for parsing and analyzing C/C++ code?

    In particular, I'm looking for open source tools that handle the C/C++ preprocessor and language. Preferably, these tools would use lex/yacc (or flex/bison) for the grammar, and not be too complicated. They should handle the latest ANSI C/C++ definitions.

    Here's what I've found so far, but haven't looked at them in detail (thoughts?):

    • CScope - Old-school C analyzer. Doesn't seem to do a full parse, though. Described as a glorified 'grep' for finding C functions.
    • GCC - Everybody's favorite open source compiler. Very complicated, but seems to do it all. There's a related project for creating GCC extensions called GEM, but hasn't been updated since GCC 4.1 (2006).
    • PUMA - The PUre MAnipulator. (from the page: "The intention of this project is to provide a library of classes for the analysis and manipulation of C/C++ sources. For this purpose PUMA provides classes for scanning, parsing and of course manipulating C/C++ sources."). This looks promising, but hasn't been updated since 2001. Apparently PUMA has been incorporated into AspectC++, but even this project hasn't been updated since 2006.
    • Various C/C++ raw grammars. You can get c-c++-grammars-1.2.tar.gz, but this has been unmaintained since 1997. A little Google searching pulls up other basic lex/yacc grammars that could serve as a starting place.
    • Any others?

    I'm hoping to use this as a starting point for translating C/C++ source into a new toy language.

    Thanks! -Matt

    (Added 2/9): Just a clarification: I want to extract semantic information from the preprocessor in addition to the C/C++ code itself. I don't want "#define foo 42" to disappear into the integer "42", but remain attached to the name "foo". This, unfortunately, excludes several solutions that run the preprocessor first and only deliver the C/C++ parse tree)