How should I handle lexical errors in my Flex lexer?

Solution 1

There are lots of options. Which one is best is probably a matter of opinion. (And note that SO does not take kindly to questions whose answers are opinions rather than facts.)

It largely depends on how you handle error messages in your application in general. But here are a couple of possibilities:

  1. Print an error message directly from the lexer. Tell your error-detection system that compilation was unsuccessful: you might use a global error count (yuk, globals!), or a shared data structure passed to yylex as an additional parameter. Then just ignore the character and continue lexing.

  2. Return something like TK_INVALID_STRING to the parser. The parser will need to have appropriate error productions in order to handle and recover from this error appropriately, which is a lot more work but has the advantage of putting all error handling into the parser. However, in the particular case of strings, you'll probably want to finish lexing the string up to the closing quote; otherwise, continuing the parse will be fruitless.
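
Option 1 can be sketched outside flex in plain C. The function below scans a string literal by hand, in the same shape as the STRING_LITERAL rules in the question. The names lex_error_count and scan_string and the value given to TK_STRING are hypothetical, chosen only for illustration; in a real flex lexer the same logic would sit in the rule actions. An invalid escape is reported and counted, the offending characters are dropped, and scanning carries on to the closing quote so that the parser still receives a string token.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum { TK_STRING = 258 };   /* hypothetical token code */

int lex_error_count = 0;    /* hypothetical shared error counter */

/* Scan a string literal that starts at the opening quote in src.
 * Copies the decoded contents into out (capacity cap, always NUL-terminated).
 * Returns TK_STRING even if an escape was invalid, so the parse can go on;
 * returns 0 if the closing quote is missing. */
int scan_string(const char *src, char *out, size_t cap)
{
    assert(src != NULL && out != NULL && cap > 0);
    assert(src[0] == '"');
    size_t n = 0;
    const char *p = src + 1;
    while (*p != '\0' && *p != '"') {
        char c = *p++;
        if (c == '\\') {
            if (*p == '\0')
                break;                  /* input ended right after a backslash */
            char e = *p++;
            if (e == '\\' || e == '"') {
                c = e;                  /* valid escape: keep the escaped char */
            } else {
                fprintf(stderr, "invalid escape \\%c\n", e);
                lex_error_count++;
                continue;               /* drop the bad escape, keep lexing */
            }
        }
        if (n + 1 < cap)
            out[n++] = c;
    }
    out[n] = '\0';
    if (*p != '"') {
        fprintf(stderr, "unterminated string literal\n");
        lex_error_count++;
        return 0;
    }
    return TK_STRING;
}
```

Whatever drives the parser would then look at lex_error_count once parsing has finished and refuse to generate code if it is non-zero.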

As to yyerror: there is nothing magical about yyerror. That function is completely your responsibility. The only thing that bison does is call it with a specified set of arguments. If you find it useful for recording errors noticed in the lexer (and I think it probably is), then go ahead and use it. You're totally responsible for declaring yyerror, so put its declaration in whatever shared header file you #include in both the lexer and the parser. Or fiddle around with bison code generation options to get the declaration included in the header file created with bison. Whatever is easier. Once you've figured out how to declare yyerror, you can define it anywhere you want: in the lexer file, in the bison file, or (my preference) in a separate library of support functions.
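
One possible arrangement, sketched in plain C and squeezed into a single file for brevity: the first part is what would go in the shared header, the second part in exactly one source file. The variadic signature and the counter name error_count are assumptions made for this illustration; bison only needs a function called yyerror that accepts the arguments it passes.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* --- would go in the shared header included by lexer and parser --- */
extern int error_count;                 /* hypothetical name */
void yyerror(const char *fmt, ...);

/* --- would go in exactly one .c file --- */
int error_count = 0;

/* Print a printf-style message to stderr and record that an error occurred. */
void yyerror(const char *fmt, ...)
{
    assert(fmt != NULL);
    va_list ap;
    va_start(ap, fmt);
    fputs("error: ", stderr);
    vfprintf(stderr, fmt, ap);
    fputc('\n', stderr);
    va_end(ap);
    error_count++;
}
```

A lexer action could then call something like yyerror("invalid escape \\%c", yytext[1]) and carry on. One caveat with a variadic yyerror: if the parser's own messages can contain a literal %, it may be safer to pass them through a fixed "%s" format.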

(FWIW, I've tried option 2, and it really seems to me like too much work; option 1 has worked fine for me. But tastes vary, and YMMV; I'm not going to defend my choice here, but I don't mind admitting to it.)

Solution 2

If you are using Bison with C++ output, another option is throwing an exception.

.   throw yy::parser::syntax_error("invalid character: " + std::string(yytext, yyleng));

If you are using Bison 3.6 or later (with any of the target languages, including C), you can also return the special YYerror token. This is similar to rici's suggestion to return TK_INVALID_STRING, but with that approach the parser would complain about the unknown TK_INVALID_STRING, so you get two error messages: one from your call to yyerror, another from yyparse. With YYerror there is no second message, yet you still properly enter error recovery.

In other words, I would suggest in C (if your yyerror supports variadic arguments):

yyerror (yylloc, _("syntax error: invalid character: %c"), c);
return YYerror;

This is an excerpt of the "bistromathic" example in Bison's distro (available in /usr/local/share/doc/bison/examples in typical distros, or on Savannah and GitHub).

Solution 3

The simplest thing is to just have a final rule

. return yytext[0];

This covers all the single special characters and all the illegal ones as well. Use special characters directly in your grammar, as ':', ';', etc. Then if you get an illegal character the parser's error handling is invoked, which gives some prospect of recovery. If you handle them in the lexer all you can do is print an error and ignore them.

It also cuts down the size of your lex file.

Author: Rohan Desai

Updated on June 29, 2022

Comments

  • Rohan Desai
    Rohan Desai almost 2 years

    I'm currently trying to write a small compiler using Flex+Bison but I'm kind of lost in terms of what to do with error handling, especially how to make everything fit together. To motivate the discussion consider the following lexer fragment I'm using for string literals:

    ["]          { BEGIN(STRING_LITERAL); init_string_buffer(); }
    <STRING_LITERAL>{
        \\\\    { add_char_to_buffer('\\'); }
        \\\"    { add_char_to_buffer('\"'); }
        \\.     { /*Invalid escape. How do I treat this error?*/ }
        ["]     { BEGIN(INITIAL); yylval = get_string_buffer(); return TK_STRING; }
    }
    

    How do I handle the situation with invalid escapes? Right now I'm just printing an error message and calling exit but I'd prefer to be able to keep going and detect more than one error per file if possible.

    My questions:

    • What function do I use to print out error messages? The same yyerror expected by bison later on? Where do I put the definition of yyerror if I have separate files for the lexer and parser?
    • What token code should I return from my action? 0 for "end of file"? Some special TK_INVALID_STRING token?
    • How do I make sure the parser can continue parsing after lexical errors (invalid literals, stray punctuation characters, etc)?
  • Rohan Desai
    Rohan Desai over 10 years
    If you use option 1, what do you have the lexer return when it finds an invalid string literal? Do you pretend it's a valid one (return TK_STRING) and then have the high level code responsible for calling the parser check the global error variable?
  • rici
    rici over 10 years
    @missingno: precisely. That's the easiest way to continue the parse. You don't need to check for errors until you're ready to generate code. In this sense, it's not different from an out-of-range integer, for example: you want to make sure the compile fails, but from a parsing viewpoint, you should be able to just continue in order to check the rest of the syntax.
  • akim
    akim over 10 years
    FWIW I often use option 1, but calling yyerror to issue the error message, so that all the "syntactic errors" are handled in a single place.
  • rici
    rici over 10 years
    @akim: Please search for "And I think it probably is" in my answer.
  • rici
    rici over 10 years
    That's a reasonable approach, but how do you then handle the last question in the OP: "How do I make sure the parser can continue parsing after lexical errors?" Presumably, the flex internals are toast once you longjmp out of them, ¿no?
  • akim
    akim over 10 years
    @rici: the syntax_error is caught by the generated parser, which then fires the regular error reporting/recovery sequence. As for the yylex function itself, I don't see what can happen to it: its control flow is designed to be breakable via return;, continue; etc. In practice, I've never had any problem.