Writing a parser like Flex/Bison that is usable on 8-bit embedded systems

parsing embedded bison flex-lexer avr-gcc

74,181

Solution 1

I've implemented a parser for a simple command language targeted for the ATmega328p. This chip has 32k ROM and only 2k RAM. The RAM is definitely the more important limitation -- if you aren't tied to a particular chip yet, pick one with as much RAM as possible. This will make your life much easier.

At first I considered using flex/bison. I decided against this option for two major reasons:

- By default, Flex & Bison depend on some standard library functions (especially for I/O) that aren't available or don't work the same in avr-libc. I'm pretty sure there are supported workarounds, but this is some extra effort that you will need to take into account.
- AVR has a Harvard architecture. C isn't designed to account for this, so even constant variables are loaded into RAM by default. You have to use special macros/functions to store and access data in flash and EEPROM. Flex & Bison create some relatively large lookup tables, and these will eat up your RAM pretty quickly. Unless I'm mistaken (which is quite possible) you will have to edit the output source in order to take advantage of the special Flash & EEPROM interfaces.
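
To make the flash point concrete, here is a minimal sketch of keeping a keyword table in program memory with avr-libc's `PROGMEM` attribute and `strcmp_P`. The keyword names and the `keyword_id` function are hypothetical, and the host fallback macros are an assumption added only so the sketch also builds off-target.

```c
#include <string.h>

#ifdef __AVR__
#include <avr/pgmspace.h>               /* avr-libc: PROGMEM, strcmp_P */
#else
/* Host fallback (assumption): plain RAM strings, plain strcmp. */
#define PROGMEM
#define strcmp_P(a, b) strcmp((a), (b))
#endif

/* Hypothetical keyword strings; on AVR each one stays in flash. */
static const char kw_print[] PROGMEM = "PRINT";
static const char kw_goto[]  PROGMEM = "GOTO";
static const char kw_let[]   PROGMEM = "LET";

/* Return a small keyword id, or -1 if word is not a keyword.
 * word lives in RAM; the keyword it is compared against lives in flash. */
int keyword_id(const char *word)
{
    if (strcmp_P(word, kw_print) == 0) return 0;
    if (strcmp_P(word, kw_goto)  == 0) return 1;
    if (strcmp_P(word, kw_let)   == 0) return 2;
    return -1;
}
```

A hand-written lexer can do this for every constant string it owns, which is much harder to retrofit into generated tables.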

After rejecting Flex & Bison, I went looking for other generator tools. Here are a few that I considered:

You might also want to take a look at Wikipedia's comparison.

Ultimately, I ended up hand coding both the lexer and parser.

For parsing I used a recursive descent parser. I think Ira Baxter has already done an adequate job of covering this topic, and there are plenty of tutorials online.

For my lexer, I wrote up regular expressions for all of my terminals, diagrammed the equivalent state machine, and implemented it as one giant function using gotos for jumping between states. This was tedious, but the results worked great. As an aside, goto is a great tool for implementing state machines -- all of your states can have clear labels right next to the relevant code, there is no function call or state variable overhead, and it's about as fast as you can get. C really doesn't have a better construct for building static state machines.
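
As an illustration of the goto style (not my actual lexer), here is a small sketch that recognizes just two terminals, numbers `[0-9]+` and identifiers `[A-Za-z][A-Za-z0-9]*`. The token names and the `next_token` function are made up for this example.

```c
#include <ctype.h>

/* Hypothetical token kinds for this sketch. */
enum token { TOK_END, TOK_NUMBER, TOK_IDENT, TOK_ERROR };

/* Scan one token starting at *pp and advance *pp past it.
 * Each label is one state of the hand-drawn state machine. */
enum token next_token(const char **pp)
{
    const char *p = *pp;
    enum token result;

start:
    if (*p == ' ')                  { p++; goto start; }
    if (*p == '\0')                 { result = TOK_END; goto done; }
    if (isdigit((unsigned char)*p)) { p++; goto in_number; }
    if (isalpha((unsigned char)*p)) { p++; goto in_ident; }
    p++;                            /* skip the unknown character */
    result = TOK_ERROR;
    goto done;

in_number:
    if (isdigit((unsigned char)*p)) { p++; goto in_number; }
    result = TOK_NUMBER;
    goto done;

in_ident:
    if (isalnum((unsigned char)*p)) { p++; goto in_ident; }
    result = TOK_IDENT;
    goto done;

done:
    *pp = p;
    return result;
}
```

The only state kept between calls is the caller's scan pointer, so the RAM cost is a couple of bytes.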

Something to think about: lexers are really just a specialization of parsers. The biggest difference is that regular grammars are usually sufficient for lexical analysis, whereas most programming languages have (mostly) context-free grammars. So there's really nothing stopping you from implementing a lexer as a recursive descent parser or using a parser generator to write a lexer. It's just not usually as convenient as using a more specialized tool.

Solution 2

If you want an easy way to code parsers, or you are tight on space, you should hand-code a recursive descent parser; these are essentially LL(1) parsers. This is especially effective for languages which are as "simple" as Basic. (I did several of these back in the 70s!) The good news is these don't contain any library code; just what you write.

They are pretty easy to code, if you already have a grammar. First, you have to get rid of left recursive rules (e.g., X = X Y ). This is generally pretty easy to do, so I leave it as an exercise. (You don't have to do this for list-forming rules; see discussion below).

Then if you have BNF rule of the form:

 X = A B C ;

create a subroutine for each item in the rule (X, A, B, C) that returns a boolean saying "I saw the corresponding syntax construct". For X, code:

subroutine X()
     if ~(A()) return false;
     if ~(B()) { error(); return false; }
     if ~(C()) { error(); return false; }
     // insert semantic action here: generate code, do the work, ....
     return true;
end X;

Similarly for A, B, C.
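
For readers targeting C, the pseudocode for X might translate roughly as follows. This is only a sketch: the `cursor` pointer, the `match` helper, the `error_count` tally, and the choice of single characters 'a', 'b', 'c' as stand-ins for A, B, C are all assumptions made for illustration.

```c
#include <stdbool.h>

static const char *cursor;      /* hypothetical scan pointer into the input */
static int error_count;         /* hypothetical tally of reported errors */

static void error(void) { error_count++; }

/* Terminal helper: consume ch if it is the next input character. */
static bool match(char ch)
{
    if (*cursor != ch) return false;
    cursor++;
    return true;
}

/* Stand-ins for the sub-rules; real ones would be full subroutines. */
static bool A(void) { return match('a'); }
static bool B(void) { return match('b'); }
static bool C(void) { return match('c'); }

/* X = A B C ; */
bool X(void)
{
    if (!A()) return false;                 /* not an X at all: no error */
    if (!B()) { error(); return false; }    /* committed once A matched */
    if (!C()) { error(); return false; }
    /* semantic action would go here */
    return true;
}
```

Note the asymmetry: failing on the first element just means "this is not an X", while failing later is a syntax error.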

If a token is a terminal, write code that checks the input stream for the string of characters that makes up the terminal. E.g., for a Number, check that the input stream contains digits and advance the input stream cursor past the digits. This is especially easy if you are parsing out of a buffer (for BASIC, you tend to get one line at a time) by simply advancing or not advancing a buffer scan pointer. This code is essentially the lexer part of the parser.
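
A Number terminal along these lines could look like the sketch below, in C. The `cursor` variable and the `number` signature are assumptions; overflow is deliberately not checked, which matters on AVR where `int` is 16 bits.

```c
#include <stdbool.h>
#include <ctype.h>

static const char *cursor;      /* hypothetical scan pointer into the line buffer */

/* Number terminal: on success store the value and advance the cursor;
 * on failure leave the cursor where it was. No overflow check. */
bool number(int *value)
{
    const char *p = cursor;
    int v = 0;

    if (!isdigit((unsigned char)*p)) return false;
    while (isdigit((unsigned char)*p)) {
        v = v * 10 + (*p - '0');
        p++;
    }
    cursor = p;                 /* commit: move past the digits */
    *value = v;
    return true;
}
```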

If your BNF rule is recursive... don't worry. Just code the recursive call. This handles grammar rules like:

T  =  '('  T  ')' ;

This can be coded as:

subroutine T()
     if ~(left_paren()) return false;
     if ~(T()) { error(); return false; }
     if ~(right_paren()) { error(); return false; }
     // insert semantic action here: generate code, do the work, ....
     return true;
end T;

If you have a BNF rule with an alternative:

 P = Q | R ;

then code P with alternative choices:

subroutine P()
    if ~(Q())
        {if ~(R()) return false;
         return true;
        }
    return true;
end P;
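
In C the same alternative collapses to a couple of lines. As before, `cursor`, `match`, and the single-character stand-ins for Q and R are assumptions for this sketch.

```c
#include <stdbool.h>

static const char *cursor;      /* hypothetical scan pointer */

static bool match(char ch)
{
    if (*cursor != ch) return false;
    cursor++;
    return true;
}

/* Stand-ins for the two alternatives. */
static bool Q(void) { return match('q'); }
static bool R(void) { return match('r'); }

/* P = Q | R ; */
bool P(void)
{
    if (Q()) return true;       /* first alternative matched */
    return R();                 /* otherwise the result is whatever R says */
}
```

This only works without backtracking because Q consumes nothing when it fails.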

Sometimes you'll encounter list forming rules. These tend to be left recursive, and this case is easily handled. The basic idea is to use iteration rather than recursion, and that avoids the infinite recursion you would get doing this the "obvious" way. Example:

L  =  A |  L A ;

You can code this using iteration as:

subroutine L()
    if ~(A()) return false;
    while (A()) { /* loop */ }
    return true;
end L;
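
A C sketch of the same loop, with a counter standing in for the semantic action. The `item_count` variable and the single-character stand-in for A are hypothetical.

```c
#include <stdbool.h>

static const char *cursor;      /* hypothetical scan pointer */
static int item_count;          /* hypothetical semantic action: count items */

/* Stand-in for rule A: one 'a' character. */
static bool A(void)
{
    if (*cursor != 'a') return false;
    cursor++;
    item_count++;
    return true;
}

/* L = A | L A ;   coded as "one A, then as many more as appear" */
bool L(void)
{
    if (!A()) return false;     /* a list needs at least one item */
    while (A()) { /* keep consuming items */ }
    return true;
}
```

The loop uses constant stack regardless of list length, which the left-recursive formulation could never do.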

You can code several hundred grammar rules in a day or two this way. There are more details to fill in, but the basics here should be more than enough.

If you are really tight on space, you can build a virtual machine that implements these ideas. That's what I did back in the 70s, when 8K 16-bit words was what you could get.

If you don't want to code this by hand, you can automate it with a metacompiler (Meta II) that produces essentially the same thing. These are mind-blowing technical fun and really take all the work out of doing this, even for big grammars.

August 2014:

I get a lot of requests for "how to build an AST with a parser". For details on this, which essentially elaborates this answer, see my other SO answer https://stackoverflow.com/a/25106688/120163

July 2015:

There are lots of folks who want to write a simple expression evaluator. You can do this by doing the same kinds of things that the "AST builder" link above suggests; just do arithmetic instead of building tree nodes. Here's an expression evaluator done this way.
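
As a rough idea of what "do arithmetic instead of building tree nodes" means, here is a small self-contained C sketch (not the linked evaluator). The grammar, the function names, and the no-whitespace, integers-only restrictions are all assumptions chosen to keep it short.

```c
#include <stdbool.h>
#include <ctype.h>

static const char *cursor;              /* hypothetical scan pointer */

static bool expr(long *out);

static bool match(char ch)
{
    if (*cursor != ch) return false;
    cursor++;
    return true;
}

/* factor = number | '(' expr ')' ; */
static bool factor(long *out)
{
    if (isdigit((unsigned char)*cursor)) {
        long v = 0;
        while (isdigit((unsigned char)*cursor))
            v = v * 10 + (*cursor++ - '0');
        *out = v;
        return true;
    }
    if (!match('(')) return false;
    if (!expr(out)) return false;
    return match(')');
}

/* term = factor { ('*' | '/') factor } ; */
static bool term(long *out)
{
    long rhs;
    if (!factor(out)) return false;
    for (;;) {
        if (match('*')) {
            if (!factor(&rhs)) return false;
            *out *= rhs;
        } else if (match('/')) {
            if (!factor(&rhs) || rhs == 0) return false;  /* reject x/0 */
            *out /= rhs;
        } else {
            return true;
        }
    }
}

/* expr = term { ('+' | '-') term } ; */
static bool expr(long *out)
{
    long rhs;
    if (!term(out)) return false;
    for (;;) {
        if (match('+')) {
            if (!term(&rhs)) return false;
            *out += rhs;
        } else if (match('-')) {
            if (!term(&rhs)) return false;
            *out -= rhs;
        } else {
            return true;
        }
    }
}

/* Evaluate a whole line; trailing junk counts as a failure. */
bool evaluate(const char *line, long *out)
{
    cursor = line;
    return expr(out) && *cursor == '\0';
}
```

With this sketch, `evaluate("2+3*(4-1)", &v)` succeeds and leaves 11 in `v`; the semantic action is simply the arithmetic done where a tree node would otherwise be built.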

October 2021:

It's worth noting that this kind of parser works when your language doesn't have complications that recursive descent doesn't handle well. I offer two kinds of complications: a) genuinely ambiguous parses (e.g., more than one way to parse a phrase) and b) arbitrarily long lookahead (e.g., not bounded by a constant). In these cases recursive descent turns into recursive descent into hell, and it's time to get a parser generator that can handle them. See my bio for a system that uses GLR parser generators to handle over 50 different languages, including all these complications even to the point of ridiculousness.

Solution 3

You can use flex/bison on Linux with its native gcc to generate the code that you will then cross-compile with your AVR gcc for the embedded target.

Solution 4

GCC can cross-compile to a variety of platforms, but you run flex and bison on the platform you're running the compiler on. They just spit out C code that the compiler then builds. Test it to see how big the resulting executable really is. Note that they have run time libraries (libfl.a etc.) that you will also have to cross compile to your target.

Johan

Johan Van den Brande is an electronics engineer (MSc) who thought it would be better to pursue a career in software instead of hardware.

Updated on December 16, 2021

Comments

Johan over 2 years

I'm writing a small interpreter for a simple BASIC like language as an exercise on an AVR microcontroller in C using the avr-gcc toolchain.

If I were writing this to run on my Linux box, I could use flex/bison. Now that I restricted myself to an 8-bit platform, how would I code the parser?
- Steve S over 14 years
  
  Is there a specific chip you intend to use? How much ROM/RAM does it have?
- pgvoorhees over 8 years
  
  Update to @mre 's link. embedded.com has trashed their URLs. (embedded.com/design/prototyping-and-development/4024523/…)
- Jacek Cz about 8 years
  
  Seems only stack languages (forth & Co) have chance on 2KB RAM, with kernel flashed
Johan over 14 years

I still have to investigate the size of those libraries and that is why I asked the question in the first place. I want something specifically targeted towards small MCUs.
Steve S over 14 years

Yeah, it isn't too hard to hand roll a recursive descent parser for a simple language. Remember to optimize tail calls when you can -- stack space matters a lot when you've only got a couple kilobytes of RAM.
Steve S over 14 years

@Ira -- you might want to add something about implementing a lexer/scanner from scratch also (since the OP asked about flex). It isn't quite as easy/elegant to do by hand as a parser, but I think it deserves mention.
Ira Baxter over 14 years

@Steve: noted that the terminal scanning === lexer part.
Ira Baxter over 14 years

All: yes, you can do tail call optimization. This won't matter unless you expect nesting in your parsed code to get really deep; for a BASIC code line it's pretty hard to find expressions much more than 10 parentheses deep, and you can always put in a depth limit count to boot. It is true that embedded systems tend to have less stack space, so at least pay attention to your choice here.
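
A depth limit count could be as simple as the following C sketch, shown on a nested-parentheses rule. The `MAX_DEPTH` value, the `depth` counter, and the grammar itself are assumptions picked for illustration.

```c
#include <stdbool.h>

#define MAX_DEPTH 10                    /* hypothetical nesting limit */

static const char *cursor;              /* hypothetical scan pointer */
static unsigned char depth;             /* current nesting level */

/* T = '(' T ')' | epsilon ;   with a guard against deep recursion */
bool T(void)
{
    bool ok;

    if (*cursor != '(') return true;        /* epsilon alternative */
    if (depth >= MAX_DEPTH) return false;   /* refuse to recurse deeper */
    cursor++;
    depth++;
    ok = T();
    depth--;                                /* unwind even on failure */
    if (!ok) return false;
    if (*cursor != ')') return false;
    cursor++;
    return true;
}
```

The worst-case stack use is then MAX_DEPTH times the frame size of T, which can be budgeted up front.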
Steve S over 14 years

Ah, I skimmed too quickly and missed that. I might have to write an answer of my own that describes another way to implement it. Nice answer, though!
mpen about 12 years

@IraBaxter: The year is 2012 ;)
Ira Baxter about 12 years

@Mark: and you can code parsers by hand if you insist (still makes sense if they are not complicated) or you can get really powerful parser generators. Your choice. See my bio if you want to fall off the cliff of powerful.
Ira Baxter about 12 years

@Mark: and it may be 2012, but the 1965 technical paper I reference is just as good now as it was then, and it's pretty good, especially if you don't know it.
mpen about 12 years

@IraBaxter: I wasn't implying your answer was outdated, I was pointing out that you made a typo. You wrote "EDIT MARCH 16, 2011".
Ira Baxter about 12 years

@Mark, ah, ok, thanks! Looks like the date mysteriously got fixed. Thanks, Time Lord.
Krupip over 6 years

Thank you for your answer, this is exactly what I was looking for, the top marked answer is not appropriate for quick easy languages, your other resources are also extremely mind opening and interesting as well!
Dante over 6 years

How can I handle empty strings?
Ira Baxter over 6 years

By empty string, I think you are saying you have a grammar rule like X -> Y | epsilon. In this case you write a subroutine for X that calls Y; if it finds Y, it returns success. If it doesn't find Y, it returns true anyway.
Dante over 6 years

@IraBaxter, thanks. There is another problem. For the grammar T = '(' T ')' | epsilon how can I report syntax error for token streams such as (())()? The grammar stops and returns true when it consumes the 2nd ) and it never sees ().
Ira Baxter over 6 years

@Dante: Well, it should stop after the 2nd ). After all, that's what your grammar rule says: you can have nested parentheses with nothing in them, and nothing else. (()) () isn't a valid string according to your one grammar rule. So, either it is invalid input, or you have other grammar rules you didn't tell us about.
Ira Baxter over 6 years

@Dante: ... oh wait, maybe you are complaining that it accepts the (()) part, but does not complain about the additional ()? That's because your goal rule needs to look like GOAL -> NONTERMINAL <EOF> where EOF is an end of input indicator. You can invent a special token for EOF, or you might use the ASCII <CR> character if reading lines. In practice, EOF isn't actually a token; it is exhaustion of the input string, and you have some kind of predicate that can check for this. You modify your GOAL rule to check for EOF once it recognizes everything else and complain if it doesn't see EOF.
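
A C sketch of that goal rule for the grammar under discussion, treating the terminating NUL of the input string as EOF. The names `cursor` and `goal` are assumptions.

```c
#include <stdbool.h>

static const char *cursor;              /* hypothetical scan pointer */

/* T = '(' T ')' | epsilon ; */
static bool T(void)
{
    if (*cursor != '(') return true;    /* epsilon: succeed, consume nothing */
    cursor++;
    if (!T()) return false;
    if (*cursor != ')') return false;
    cursor++;
    return true;
}

/* GOAL = T <EOF> ;   EOF here is exhaustion of the input string */
bool goal(const char *input)
{
    cursor = input;
    if (!T()) return false;
    return *cursor == '\0';             /* complain about leftover input */
}
```

With this check, `goal("(())")` succeeds while `goal("(())()")` fails because the trailing `()` is never consumed.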
jchook almost 5 years

This answer finally got me rolling on writing my own LL(1) parser. However, chapters 4 through 6 of the Crafting Interpreters book really put the pedal to the metal for me.
Prof. Falken over 2 years

Minor nitpick, but the C language can handle AVR and Harvard architecture just fine. Rather, the gcc compiler was not designed to handle Harvard architecture. When the AVR instruction set was created, the hardware designer consulted a prominent compiler vendor: web.archive.org/web/20060529115932/https://…
Steve S over 2 years

I honestly haven't kept up with the details of the latest C standards, but my understanding was that C99 specified a single address space for data, so putting constants in program memory on a Harvard architecture would require something non-standard. The "Embedded C" extension to the standard does provide a mechanism for dealing with data in multiple distinct address spaces. open-std.org/JTC1/SC22/WG14/www/docs/n1169.pdf (page 37)
Lundin over 2 years

You can't reliably use recursion on an AVR. We are talking about a stack space of a few hundred bytes, not kb.
Ira Baxter over 2 years

You can't build a parser for a language with nested structures without using recursion in some form; whether you use the "stack" or build your own stack in memory doesn't change the nature of the problem. And limited memory will limit the size of the stack either way. The good news is that the depth of recursion needed to parse simple expressions is typically only a few levels; parsing complex programs might require a hundred. Both of these are small enough to squish inside something like an AVR (fair point: I haven't actually tried on an AVR).
Shlomo Gottlieb about 2 years

What if I have a rule like P = Q R | S and my input is QS? If I check alternatives like you suggest, it will match Q, then fail on R, so it will check the alternative S and succeed, because when the parser identified Q it advanced the input stream cursor past that token. Do I have to backtrack the cursor or can I avoid it?
Ira Baxter about 2 years

A sequence is treated as if only the first sequence token is an optional alternative; once committed to the first token of a sequence, the rest of the sequence elements are REQUIRED. Look at the example subroutine X. For your example, Q is matched, but there isn't a following R so parsing Q R | ... fails.
Ira Baxter about 2 years

... for the straightforward recursive descent parsers described here, you don't backtrack; FAIL produces a syntax error but does not pass control to the next alternative. If you enhance the parser with backtracking (simply save the input stream pointer at each decision point), then you can build considerably more powerful parsers than simple RDP. You can allow rules like "X = A B C | A Q B". Backtracking restores the input stream to that at the start of the sequence A B C and passes control to next sequence A Q B. But with erroneous input, backtracking may try everything and then fail.
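
A minimal C sketch of that backtracking enhancement for the rule X = A B C | A Q B, using single characters as stand-ins for the sub-rules. The `cursor` and `match` names are assumptions; any semantic actions would also need to be undone or deferred, which this sketch ignores.

```c
#include <stdbool.h>

static const char *cursor;              /* hypothetical scan pointer */

static bool match(char ch)
{
    if (*cursor != ch) return false;
    cursor++;
    return true;
}

/* X = A B C | A Q B ;   with backtracking between alternatives */
bool X(void)
{
    const char *saved = cursor;         /* decision point: remember position */

    if (match('a') && match('b') && match('c')) return true;

    cursor = saved;                     /* backtrack to the decision point */
    if (match('a') && match('q') && match('b')) return true;

    cursor = saved;                     /* leave input untouched on failure */
    return false;
}
```

The cost is one saved pointer per active decision point, plus the time spent re-scanning input that an earlier alternative already consumed.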