Generate an AST in C++

c++ algorithm parsing abstract-syntax-tree lalr

14,467

Solution 1

I'm going to go against conventional wisdom here and say that you should build your AST manually with natural language-specific data-structures.

A generic "AST data structure" will be too general to work with comfortably -- the code that consumes the AST to do anything useful with it will be obscured by the workarounds to access the data it wants. If you go this route (or use a parser generator), you'll likely end up building a translation layer to go from the generic structure to an AST that actually makes sense for your language. Why not avoid the middleman and build the final data structure directly?

I suggest writing a first syntactic pass that consumes the tokens it needs for each possible construct (probably peeking ahead one token as needed). This syntactic pass would construct the AST out of linked instances of data structures that represent each possible construct in your language. For example, if your program can consist of a series of statements, where each statement can be either a function declaration or a function call, you could create the following data structures:

AST (struct)
    -> std::vector<Statement*> statements

StatementKind (enum class)
    -> Function
    -> Call

Statement (struct)
    -> StatementKind kind

Function : Statement (struct)
    -> std::string name
    -> std::vector<Statement*> body

Call : Statement (struct)
    -> std::string name
    -> Function* called

Then the syntactic pass to build the initial AST might look like:

auto ast = parseAST();

where parseAST calls parseStatement repeatedly, which consumes and/or peeks at tokens to determine whether the statement is a function definition or a function call, then calls parseFunction or parseCall appropriately. This is called a hand-written recursive descent parser, and is in my experience by far the simplest and best type of parser you can write. Yes there is boilerplate to write, but it's also very clear and flexible code.

The syntactic phase generates syntax error messages as it goes, attempting to recover as best as it can when an error is found (this is one argument for semicolon-delimited languages -- the compilers and tools can often recover from errors much more easily, since it often suffices to skip to the next semicolon when an error is encountered before trying to parse the next construct).

If a function is allowed to be called before it is defined, this is simple to handle too -- just parse the call part without checking if a function with that name exists at the time you first parse it, then correlate it later.

A second, semantic, pass through the AST would annotate it with any missing cross-referenced data (e.g. resolving function calls' function names with those functions' definitions, or generating an error if the name is not found). Once that's done, you have a full AST that's guaranteed to represent a valid program (since you did all the semantic error checking in this pass), and can have optimizations applied to it, followed by the code generation phase (and more optimizations along the way).

Note that I completely left out any mention of LL or LR parsers, etc. The theoretical notation and categorization of your language is important (as it can prove it's non-ambiguous, for example), but from an implementation point of view, it's far more important to have clean, easily understood and above all easily modifiable code than to adhere to a specific parsing mechanism. Examples of other compilers that use hand-written parsers include GCC, Clang, and Google's V8, among others.

In terms of efficiency, the AST is generally iterated over several times after it's constructed, and needs to stick around in memory until late in the process (possibly right until the end, depending on how you do your code generation), and so it's best to use an object pool to allocate the records for your AST than to dynamically allocate everything separately on the heap. Finally, in place of strings, it's generally better to use an offset-and-length into the original source code.

Solution 2

You could use some parser generator like bison or ANTLR. Both have good documentations with tutorial part. The action part of your grammar rules (fed to bisonor antlr, which generates C++ code for parsing) would build the abstract syntax tree.

We can't help more without knowing more the syntax (and the semantics) of the formal language you want to parse and interpret.

If your language is an infix calculator, bison has such an example.

You probably should think of a class hierarchy to represent the nodes of your AST. You'll have a class for leafs (e.g. numbers), a class for addition node (with two sons as smart pointers to other nodes), etc...

e.g.

class ASTNode {
/// from http://stackoverflow.com/a/28530559/841108
///... add some things here, e.g. a virtual print method
};

class NumberNode : public ASTNode {
   long number;
   /// etc...
};

class BinaryOpNode : public ASTNode {
   std::unique_ptr<ASTNode> left;
   std::unique_ptr<ASTNode> right;
 /// etc....
};

class AdditionNode : public BinaryOpNode {
/// etc....
};

class CallNode : public ASTNode {
   std::shared_ptr<Function> func;
   std::vector<std::unique_ptr<ASTNode>> args;
};

for nodes of variadic arity (i.e. any number of sons) you'll need a vector of smart pointers, like args above.

To traverse the tree, you'll do a recursive traversal so you better use smart pointers. Read also about the visitor pattern. And with C++11 std::function-s and anonymous closures -i.e lambdas- , you could have a visit virtual function (to which you give a closure visiting each node). The Unix nftw(3) function which visits file trees could be inspirational.

Solution 3

People build ASTs as heap-allocated trees. (Yes, you can do it an array but it just isn't as convenient). I suggest you READ the bison documentation; it will explain to you how to build trees and you can follow that style.

Guessing at your experience level based on your question, if I were you I'd build a flex/bison based parser/AST builder at least once to get a good experience, before you came back to building everything yourself.

Solution 4

I used a set of BNF production methods to generate nodes (struct inheritance) of a specific type for generating the AST. With std::move(), you can transfer the pointer ownership to avoid dangling pointers. Then there is a public recursive visit method (switch-case) that traverses the AST following a certain traversal pattern (post/pre order), checks for the AST node type and executes an accept() for each visit. The accepts are bind to a dispatcher interface which must be implemented by the user (print the tree, execute the tree, etc.). For each visit the corresponding method on user side is called.

View more solutions

14,467

Author by

Francis

Updated on July 21, 2022

Comments

Francis almost 2 years

I'm making an interpreter in C++, so far I've got my lexer to generate tokens. The problem is I'm not sure how to generate an "walk" a parse tree.

I was thinking of making my parse tree using an array of arrays, but I'm not sure how to actually insert the tokens into the parse tree in the correct order.

I'm not sure whether or not to go top-down, left-right or bottom-up, right-left.

Can anyone provide me with a simple LALR(1) algorithm?