What's the best way to write a parser by hand?

10,522

Solution 1

I would highly recommend the F# language as your language of choice for parsing on the .NET Platform. It's roots in the ML family of languages means it has excellent support for language-oriented programming.

Discriminated unions and pattern-matching allow for a very succinct and powerful specification of your AST. Higher-order functions allow for definition of parse operations and their composition. First-class support for monadic types allows for state management to be handled implicitly greatly simplifying the composition of parsers. Powerful type-inference greatly aides the definition of these (complex) types. And all of this can be specified and executed interactively allowing you to rapidly prototype.

Stephan Tolksdorf has put this into practice with his parser combinator library FParsec

From his examples we see how naturally an AST is specified:

type expr =
    | Val of string
    | Int of int
    | Float of float
    | Decr of expr

type stmt =
    | Assign of string * expr
    | While of expr * stmt
    | Seq of stmt list
    | IfThen of expr * stmt
    | IfThenElse of expr * stmt * stmt
    | Print of expr

type prog = Prog of stmt list

the implementation of the parser (partially elided) is just as succinct:

let stmt, stmtRef = createParserForwardedToRef()

let stmtList = sepBy1 stmt (ch ';')

let assign =
    pipe2 id (str ":=" >>. expr) (fun id e -> Assign(id, e))

let print = str "print" >>. expr |>> Print

let pwhile =
    pipe2 (str "while" >>. expr) (str "do" >>. stmt) (fun e s -> While(e, s))

let seq =
    str "begin" >>. stmtList .>> str "end" |>> Seq

let ifthen =
    pipe3 (str "if" >>. expr) (str "then" >>. stmt) (opt (str "else" >>. stmt))
          (fun e s1 optS2 ->
               match optS2 with
               | None    -> IfThen(e, s1)
               | Some s2 -> IfThenElse(e, s1, s2))

do stmtRef:= choice [ifthen; pwhile; seq; print; assign]


let prog =
    ws >>. stmtList .>> eof |>> Prog

On the second line, as an example, stmt and ch are parsers and sepBy1 is a monadic parser combinator that takes two simple parsers and returns a combination parser. In this case sepBy1 p sep returns a parser that parses one or more occurrences of p separated by sep. You can thus see how quickly a powerful parser can be combined from simple parsers. F#'s support for overridden operators also allow for concise infix notation e.g. the sequencing combinator and the choice combinator can be specified as >>. and <|>.

Best of luck,

Danny

Solution 2

The only kind of parser that can be handwritten by a sane human being is a recursive-descent. That said, writing bottom-up parser by hand is still possible but is very undesirable.

If you're up for RD parser you have to verify that SQL grammar is not left-recursive (and eliminate the recursion if necessary), and then basically write a function for each grammar rule. See this for further reference.

Solution 3

Adding my voice to the chorus in favor of recursive-descent (LL1). They are simple, fast, and IMO, not at all hard to maintain.

However, take a good look at your language to make sure it is LL1. If you have any syntax like C has, like ((((type))foo)[ ]) where you might have to descend multiple layers of parentheses before you even find out if you are looking at a type, variable, or expression, then LL1 will be very difficult, and bottom-up wins.

Solution 4

Recursive descent will give you the simplest way to go, but I would have to agree with mouviciel that flex and bison and definitely worth learning. When you find out you have a mistake in your grammar, fixing a definition of the language in flex /bison will be a hell of a lot easier then rewriting your recursive descent code.

FYI the C# parser is written recursive descent and it tends to be quite robust.

Solution 5

Recursive Descent parsers are indeed the best, maybe only, parsers that can be built by hand. You will still have to bone-up on what exactly a formal, context-free language is and put your language in a normal form. I would personally suggest that you remove left-recursion and put your language in Greibach Normal Form. When you do that, the parser just about writes itself.

For example, this production:

A => aC 
A => bD
A => eF

becomes something simple like:

int A() {
   chr = read();
   switch char
     case 'a': C();
     case 'b': D();
     case 'e': F();
     default: throw_syntax_error('A', chr);
}

And there aren't any much more difficult cases here (what's more difficult is make sure your grammar is in exactly the right form but this allows you the control that you mentioned).

Anton's Link is also seems excellent.

Share:
10,522

Related videos on Youtube

Simon
Author by

Simon

Updated on March 12, 2020

Comments

  • Simon
    Simon about 4 years

    We've used ANTLR to create a parser for a SQL-like grammar, and while the results are satisfactory in most cases, there are a few edge cases that we need to fix; and since we didn't write the parser ourselves we don't really understand it well enough to be able to make sensible changes.

    So, we'd like to write our own parser. What's the best way to go about writing a parser by hand? What sort of parser should we use - recursive descent has been recommended; is that right? We'll be writing it in C#, so any tutorials for writing parsers in that language would be gratefully received.

    UPDATE: I'd also be interested in answers that involve F# - I've been looking for a reason to use that in a project.

    • Sam Harwell
      Sam Harwell over 14 years
      Simon, I'm looking through your posts and you mention that you've "decided to do it by hand." Are you interested in an exercise here to learn about parsing, or are you after a semantically correct, maintainable, fast parser for use? If the latter, I believe your decision was premature. You're going to get so tied up in the parsing logic that you'll soon forget about the "few edge cases" you set out to correct.
    • Simon
      Simon over 14 years
      The latter. We've gone down the parser-generator route and ended up with something we don't understand and therefore can't fix. I'd rather have something that's going to take months of work, but which is fixable, than something quick that isn't.
    • Eric
      Eric over 14 years
      I'm a bit puzzled by this. Are you just not familiar enough with how parser generators work? You shouldn't have to play around with the generated code at all if you've done it correctly.
    • Simon
      Simon over 14 years
      @Eric: Perhaps. I'm confident that I've understood it well enough to make sure that the output is correct, but there may be some art to making it also be fast. All I know is what we have is impenetrable and slow.
    • schoetbi
      schoetbi over 13 years
      I also used ANTLR and now I consider to rewrite the parser by hand since to solve these "edget cases" it needs a vast amount of time.
  • Simon
    Simon about 15 years
    No, we're pretty much decided on writing one by hand. Thanks, though.
  • Stephan Eggermont
    Stephan Eggermont about 15 years
    And the right thing to do is to change the language to LL1
  • Mike Dunlavey
    Mike Dunlavey about 15 years
    @Stephan: I'm inclined to agree, though on one or two occasions I've had no choice - I had to parse C.
  • Simon
    Simon over 14 years
    I'd like to write it by hand - without using a parser generator.
  • Simon
    Simon over 14 years
    Thanks but I'd like to write it by hand.
  • Simon
    Simon over 14 years
    No offence taken. Rewriting the parser as standards change isn't a problem here; it's our own SQL-like grammar, and the parser will be the reference implementation.
  • Ira Baxter
    Ira Baxter over 14 years
    If its a reference implementation then DO NOT code a recursive descent parser. RD parsers are all procedural code, and that's much harder to understand than a specification. You absolutely want a clean, clean specificaiton of your langauge in clearly defined langauge rules so that others ("refererence", remember?) can clearly see what was specified. ANTLR is far better for this than RD. Bison using the GLR option would be better than ANTLR because you can code context-free grammar rules with impunity.
  • Stephen C
    Stephen C over 14 years
    +1 for the point about reference implementations. By using a parser generator you validate the published grammar, and avoid problems that your (hand written) parser doesn't actually implement the grammar.
  • Eric
    Eric over 14 years
    flex, while GPLed, is not a GNU project.
  • mike g
    mike g over 14 years
    Or you just have logic in your parser if(type) ... else if(variable) ...
  • mike g
    mike g over 14 years
    A hell of a lot easier is an exageration. The methods in a recursive descent parser are the grammar (and its simple once this is recognized) ...
  • pjp
    pjp over 14 years
    +1 for Greibach Normal Form - takes me back to Uni days
  • Spence
    Spence over 14 years
    Depends on where your mistake is. WIth recursive descent you may semantically change the meaning of an entire section of the grammar. In a flex/bison definition this would result in a 1 line change. In recursive descent you may have to rewrite half your code to deal with it.
  • Jason D
    Jason D about 14 years
    If your code "looks like" EBNF, then your code is only marginally more difficult to maintain than the EBNF. One should always start with the grammar when crafting a parser, or changing it.
  • James Fremen
    James Fremen almost 9 years
    i think this assumes the language is LL(1).