Why can't C++ be parsed with a LR(1) parser?

c++ parsing grammar formal-languages

32,356

Solution 1

There is an interesting thread on Lambda the Ultimate that discusses the LALR grammar for C++.

It includes a link to a PhD thesis that includes a discussion of C++ parsing, which states that:

"C++ grammar is ambiguous, context-dependent and potentially requires infinite lookahead to resolve some ambiguities".

It goes on to give a number of examples (see page 147 of the pdf).

The example is:

int(x), y, *const z;

meaning

int x;
int y;
int *const z;

Compare to:

int(x), y, new int;

meaning

(int(x)), (y), (new int));

(a comma-separated expression).

The two token sequences have the same initial subsequence but different parse trees, which depend on the last element. There can be arbitrarily many tokens before the disambiguating one.

Solution 2

LR parsers can't handle ambiguous grammar rules, by design. (Made the theory easier back in the 1970s when the ideas were being worked out).

C and C++ both allow the following statement:

x * y ;

It has two different parses:

It can be the declaration of y, as pointer to type x
It can be a multiply of x and y, throwing away the answer.

Now, you might think the latter is stupid and should be ignored. Most would agree with you; however, there are cases where it might have a side effect (e.g., if multiply is overloaded). but that isn't the point. The point is there are two different parses, and therefore a program can mean different things depending on how this should have been parsed.

The compiler must accept the appropriate one under the appropriate circumstances, and in the absence of any other information (e.g., knowledge of the type of x) must collect both in order to decide later what to do. Thus a grammar must allow this. And that makes the grammar ambiguous.

Thus pure LR parsing can't handle this. Nor can many other widely available parser generators, such as Antlr, JavaCC, YACC, or traditional Bison, or even PEG-style parsers, used in a "pure" way.

There are lots of more complicated cases (parsing template syntax requires arbitrary lookahead, whereas LALR(k) can look ahead at most k tokens), but only it only takes one counterexample to shoot down pure LR (or the others) parsing.

Most real C/C++ parsers handle this example by using some kind of deterministic parser with an extra hack: they intertwine parsing with symbol table collection... so that by the time "x" is encountered, the parser knows if x is a type or not, and can thus choose between the two potential parses. But a parser that does this isn't context free, and LR parsers (the pure ones, etc.) are (at best) context free.

One can cheat, and add per-rule reduction-time semantic checks in the to LR parsers to do this disambiguation. (This code often isn't simple). Most of the other parser types have some means to add semantic checks at various points in the parsing, that can be used to do this.

And if you cheat enough, you can make LR parsers work for C and C++. The GCC guys did for awhile, but gave it up for hand-coded parsing, I think because they wanted better error diagnostics.

There's another approach, though, which is nice and clean and parses C and C++ just fine without any symbol table hackery: GLR parsers. These are full context free parsers (having effectively infinite lookahead). GLR parsers simply accept both parses, producing a "tree" (actually a directed acyclic graph that is mostly tree like) that represents the ambiguous parse. A post-parsing pass can resolve the ambiguities.

We use this technique in the C and C++ front ends for our DMS Software Reengineering Tookit (as of June 2017 these handle full C++17 in MS and GNU dialects). They have been used to process millions of lines of large C and C++ systems, with complete, precise parses producing ASTs with complete details of the source code. (See the AST for C++'s most vexing parse.)

Solution 3

The problem is never defined like this, whereas it should be interesting :

what is the smallest set of modifications to C++ grammar that would be necessary so that this new grammar could be perfectly parsed by a "non-context-free" yacc parser ? (making use only of one 'hack' : the typename/identifier disambiguation, the parser informing the lexer of every typedef/class/struct)

I see a few ones:

Type Type; is forbidden. An identifier declared as a typename cannot become a non-typename identifier (note that struct Type Type is not ambiguous and could be still allowed).

There are 3 types of names tokens :
- types : builtin-type or because of a typedef/class/struct
- template-functions
- identifiers : functions/methods and variables/objects
Considering template-functions as different tokens solves the func< ambiguity. If func is a template-function name, then < must be the beginning of a template parameter list, otherwise func is a function pointer and < is the comparison operator.
Type a(2); is an object instantiation. Type a(); and Type a(int) are function prototypes.
int (k); is completely forbidden, should be written int k;
typedef int func_type(); and typedef int (func_type)(); are forbidden.

A function typedef must be a function pointer typedef : typedef int (*func_ptr_type)();
template recursion is limited to 1024, otherwise an increased maximum could be passed as an option to the compiler.
int a,b,c[9],*d,(*f)(), (*g)()[9], h(char); could be forbidden too, replaced by int a,b,c[9],*d; int (*f)();

int (*g)()[9];

int h(char);

one line per function prototype or function pointer declaration.

An highly preferred alternative would be to change the awful function pointer syntax,

int (MyClass::*MethodPtr)(char*);

being resyntaxed as:

int (MyClass::*)(char*) MethodPtr;

this being coherent with the cast operator (int (MyClass::*)(char*))
typedef int type, *type_ptr; could be forbidden too : one line per typedef. Thus it would become

typedef int type;

typedef int *type_ptr;
sizeof int, sizeof char, sizeof long long and co. could be declared in each source file. Thus, each source file making use of the type int should begin with

#type int : signed_integer(4)

and unsigned_integer(4) would be forbidden outside of that #type directive this would be a big step into the stupid sizeof int ambiguity present in so many C++ headers

The compiler implementing the resyntaxed C++ would, if encountering a C++ source making use of ambiguous syntax, move source.cpp too an ambiguous_syntax folder, and would create automatically an unambiguous translated source.cpp before compiling it.

Please add your ambiguous C++ syntaxes if you know some!

Solution 4

As you can see in my answer here, C++ contains syntax that cannot be deterministically parsed by an LL or LR parser due to the type resolution stage (typically post-parsing) changing the order of operations, and therefore the fundamental shape of the AST (typically expected to be provided by a first-stage parse).

Solution 5

I think you are pretty close to the answer.

LR(1) means that parsing from left to right needs only one token to look-ahead for the context, whereas LR(∞) means an infinite look-ahead. That is, the parser would have to know everything that was coming in order to figure out where it is now.

View more solutions

32,356

Cheery

A programmer hobbyist.

Updated on May 21, 2020

Comments

Cheery almost 4 years

I was reading about parsers and parser generators and found this statement in wikipedia's LR parsing -page:

Many programming languages can be parsed using some variation of an LR parser. One notable exception is C++.

Why is it so? What particular property of C++ causes it to be impossible to parse with LR parsers?

Using google, I only found that C can be perfectly parsed with LR(1) but C++ requires LR(∞).
- Toon Krijthe over 15 years
  
  Just like: you need to understand recursion to learn recursion ;-).
- ilya n. almost 15 years
  
  "You'll understand parsers once you'll parse this phrase."
Cheery over 15 years

It'd be cool to have some summary about the page 147 on this page. I'm going to read that page though. (+1)
rmeador over 15 years

I recall from my compilers class that LR(n) for n > 0 is mathematically reducable to LR(1). Is that not true for n = infinity?
ephemient over 15 years

No, there's an impassable mountain of a difference between n and infinity.
Steve Fallows over 15 years

Isn't the answer: Yes, given an infinite amount of time? :)
Aaron over 15 years

Actually, by my vague recollection of how LR(n) -> LR(1) takes place, it involves creating new intermediate states, so the runtime is some non-constant function of 'n'. Translating LR(inf) -> LR(1) would take infinite time.
ChrisW almost 15 years

"Isn't the answer: Yes, given an infinite amount of time?" -- No: the phrase 'given an infinite amount of time' is just a non-sensical, short-hand way of saying "cannot be done given any finite amount of time". When you see "infinite", think: "not any finite".
Martin Cote almost 15 years

While the 'x * y' example is interesting, the same can happen in C ('y' can be a typedef or a variable). But C can be parsed by a LR(1) parser, so what's the difference with C++?
Ira Baxter almost 15 years

My answser had already observed that C had the same problem, I think you missed that. No, it can't be parsed by LR(1), for the same reason. Er, what do you mean 'y' can be a typedef? Perhaps you meant 'x'? That doesn't change anything.
Ira Baxter over 14 years

Parsing technology that handles ambiguity simply produces both AST variants as they parse, and simply eliminate the incorrect one depending on the type information.
Sam Harwell over 14 years

@Ira: Yes, that is correct. The particular advantage of that is it allows you maintain the separation of the first-stage parse. While it's most commonly known in the GLR parser, there's no particular reason I see that you couldn't hit C++ with a "GLL?" parser as well.
Dour High Arch over 14 years

Parse 2 is not necessarily stupid in C++, as * could be overridden to have side effects.
Ira Baxter over 14 years

In the provided example, it doesn't have any side effects. You are right, there's variations on this theme (say, overloaded "") which could have side effects. As stated, people *might think it was stupid. I've adjusted the text.
Ira Baxter over 14 years

"GLL"? Well, sure, but you'll have to go figure out the theory and write up a paper for the rest of to use. More likely, you could use a top down hand coded parser, or a backtracking LALR() parser (but keep the "rejected") parses, or run an Earley parser. GLR has the advantage of being a pretty damn good solution, is well documented and by now well proved. A GLL technology would have to have some pretty significant advantages to display GLR.
Aaron Altman over 13 years

I would count overloading a stateless operator to include side effects as a dumb use (flame bait but I'll elaborate). I always had that problem with operator overloading: none of the properties of an operator that would make it a reasonable analogy to the original use are guaranteed. For example, the comparison operators < > <= >= absolutely don't have to work as a partial order.
Ira Baxter over 13 years

@altie: agreed, when you have overloading you want something like the Liskov substitution principle, which says the key properties of the overloaded operator are preserved. Alas, none of the OO langauges I know about (except Liskov's CLU) insist on that. So, you can object that the overload in C++ is dumb, but it is legal.
Ira Baxter over 13 years

The Rascal project (Netherlands) is claiming they are building a scannerless GLL parser. Work in progress, might be hard to find any online information. en.wikipedia.org/wiki/RascalMPL
new123456 almost 13 years

I looked at x * y and chuckled - it's amazing how little one thinks of nifty little ambiguities like this.
Sjoerd over 12 years

@IraBaxter There seems to be new developments on GLL: see this 2010 paper about GLL dotat.at/tmp/gll.pdf
Blaisorblade over 12 years

The question is about the difference between C and C++ - C can be parsed by LALR(1) with the lexer hack (which you describe), while it seems (from other answers) that even that is impossible for C++.
Blaisorblade over 12 years

The example is: int(x), y, *const z; //meaning: int x; int y; int *const z; (a sequence of declarations) int(x), y, new int; //meaning: (int(x)), (y), (new int)); (a comma-separated expression) The two token sequences have the same initial subsequence but different parse trees, which depend on the last element. There can be arbitrarily many tokens before the disambiguating one.
Ira Baxter over 12 years

@Blaisorblade: What people do is get some parser technology that works for most cases, and then hack it handle other cases, either by accepting "too much" and eliminating that later, or by bending the parsing machinery (lexer hack as one example, ad hoc look-ahead, etc.) So I think you can do it; its just awful.
Blaisorblade over 12 years

@IraBaxter: indeed, apparently the PhD thesis cited by Rob Walker, about Fog, implements a parser for C++ (in the middle of implementing some other goal) consisting of a LALR(1) superset grammar and defers significant processing to resolve ambiguities to subsequent stages. So in a way you're right - although this is much uglier than what's needed for C. In another sense, though, not just C but many other programming languages have some context-sensitive semantic processing after parsing, like for instance identifier resolution and type-checking (I don't know about scripting languages).
Ira Baxter over 12 years

@Blaisorblad: you don't have a lot of choice. If your parser accepts "too little", you probably reject real programs. You can't make a parser accept exactly a valid string unless your parser is Turing capable. So, you build a context free parser that accepts too much, and hack away the excess that it accepts.
Ira Baxter about 11 years

Infinite lookahead may be necessary, but it isn't sufficient. See my answer on the necessity to use context to resolve the choice.
Ira Baxter about 11 years

C++ is too well entrenched. Nobody will do this in practice. Those folks (like us) that build front ends simply bite the bullet and do the engineering to make the parsers work. And, as long as templates exist in the language, you're not going to get a pure context-free parser.
Troy Daniels about 11 years

@altie Surely no one would overload a bit-shift operator to make it write most variable types to a stream, right?
MauganRa almost 10 years

Well, in that context ∞ means "arbitrarily many" because the lookahead will always be bounded by the input length.
Ira Baxter over 4 years

@Troy: How is that different than Java overloading the add operator "+" to concenate two strings?
Troy Daniels over 4 years

@IraBaxter There are similarities, although it seems like concatenation is a logical interpretation of "adding" when applied to strings, much more than bit-shifting an output stream. If I asked a non-programmer to "add 'abc' to 'def'", I feel I have a decent chance of getting 'defabc' as the answer. If I ask them to "shift the output by 'xyzzy' bits", I feel a very confused expression is about the best I can hope for.
Ira Baxter over 4 years

@Troy: yes, I kind of agree about the intuition (although if you ask most non-C/C++ programmers what "<<" does, you won't get an answer). But in either case, the semantics of the different interpretations are simply... different. So its case of what arbitrary semantics, do you want to attach to what arbitrary syntax?
dodecaplex over 4 years

I'm quite puzzled by the citation extracted from a PhD Thesis. If there is an ambiguïty, then, by definition, NO lookahead may ever "resolve" the ambiguity (i.e. decide which parse is the correct oen, as at least 2 parses are considered correct by the grammar). Moreover, the citation mentions the ambiguity of C but the explanation, does not show an ambiguity, but only a non ambiguous example where parsing decision can only be taken after an arbitrary long look-ahead.
dodecaplex over 4 years

(x)-x is also an example of ambiguity, where (x) may be interpreted as an expression or as a type casting. Moreover, this expression may be used in any meaningful context, avoiding the discussion about overloading *.
Ira Baxter over 4 years

That's the standard hack used by GCC with its former LR parser to handle the ambiguity of things like "x*y;" Alas, there's still the arbitrarily large lookahead requirement to parse other constructs, so LR(k) fails to be a solution any fixed k. (GCC switched to recursive descent with more ad hockery).
Ira Baxter over 2 years

Grasshopper: you are swimming against the current. You can argue that C's grammar is unnecessarily complex; if you follow that argument to the limit you can always define your language as S-expressions as LISP does, and then parsers are basically trivial. If you think that LISP S-expressions are clumsy, then you will design a language with lots of idiomatic forms. Over time, people graft extra stuff in. ....
Ira Baxter over 2 years

... The modern version of pretty much every language I'm familiar with, has things that make recursive descent parsing harder than you'd like. If you use the right parser generator, you stop caring because these variations are all handled by the parser generator without fuss. (Seriously, check out GLL or GLR Parsing). Regarding C, you'll find parsing it harder than you think due to the need to implement an accurate preprocessor, and (if you insist on recursive descent), the need to track types during parsing. That latter for me tangles parsing and symbol tables, and that's just stupid.
Paul B Mann over 2 years

LRSTAR is at sourceforge.net/projects/lrstar