Tutorial/resource for implementing VM


Solution 1

I assume you want a virtual machine rather than a mere interpreter. I think they are two points on a continuum. An interpreter works on something close to the original representation of the program. A VM works on more primitive (and self-contained) instructions. This means you need a compilation stage to translate the one to the other. I don't know if you want to work on that first or if you even have an input syntax in mind yet.

For a dynamic language, you want somewhere that stores data (as key/value pairs) and some operations that act on it. The VM maintains the store. The program running on it is a sequence of instructions (including control flow). You need to define the set of instructions. I'd suggest a simple set to start with, like:

  • basic arithmetic operations, including arithmetic comparisons, accessing the store
  • basic control flow
  • built-in print
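As a rough illustration of the store described above, here is a minimal sketch in C. The names `Store`, `store_slot`, `store_set` and `store_get`, the fixed capacity, and the integer-only values are all assumptions made for this example; a real dynamic language would likely use a hash table and tagged values.

```c
#include <assert.h>
#include <string.h>

#define MAX_VARS 64
#define MAX_NAME 32

/* One key/value pair: a variable name and its integer value. */
typedef struct {
    char name[MAX_NAME];
    int  value;
} Var;

/* The store: a flat array of pairs, searched linearly (fine for a toy VM). */
typedef struct {
    Var vars[MAX_VARS];
    int count;
} Store;

/* Return a pointer to the value for `name`, creating it (as 0) if absent. */
int *store_slot(Store *s, const char *name)
{
    for (int i = 0; i < s->count; i++)
        if (strcmp(s->vars[i].name, name) == 0)
            return &s->vars[i].value;

    /* Sketch only: no growth and no error reporting, just hard limits. */
    assert(s->count < MAX_VARS);
    assert(strlen(name) < MAX_NAME);
    strcpy(s->vars[s->count].name, name);
    s->vars[s->count].value = 0;
    return &s->vars[s->count++].value;
}

void store_set(Store *s, const char *name, int value)
{
    *store_slot(s, name) = value;
}

int store_get(Store *s, const char *name)
{
    return *store_slot(s, name);
}
```

Because names are ordinary strings looked up at runtime, a name built with string operations can be passed to `store_get` or `store_set` just like a literal one. A zero-initialized `Store` (for example `Store s = {0};`) is an empty store.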

You may want to use a stack-based approach to arithmetic, as many VMs do. There isn't much that is dynamic in the above yet. To get there we want two things: the ability to compute the names of variables at runtime (this just means string operations), and some treatment of code as data. This might be as simple as allowing function references.
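To make the stack-based approach to arithmetic concrete, here is a small sketch in C. The opcodes (`OP_PUSH`, `OP_ADD`, `OP_MUL`, `OP_LT`, `OP_HALT`) and the `eval` function are hypothetical and unrelated to the bytecode format used later in this answer; operands are pushed, and each operation pops its inputs and pushes its result.

```c
#include <assert.h>

#define STACK_MAX 256

typedef struct {
    int data[STACK_MAX];
    int top;                 /* number of values currently on the stack */
} Stack;

void push(Stack *s, int v) { assert(s->top < STACK_MAX); s->data[s->top++] = v; }
int  pop(Stack *s)         { assert(s->top > 0); return s->data[--s->top]; }

/* Hypothetical opcodes, for this sketch only. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_LT, OP_HALT };

/* Run a stack program and return the value left on top at OP_HALT. */
int eval(const int *code)
{
    Stack s = { .top = 0 };
    for (int ip = 0; ; ) {
        int a, b;
        switch (code[ip++]) {
        case OP_PUSH:                    /* operand follows the opcode */
            push(&s, code[ip++]);
            break;
        case OP_ADD:
            b = pop(&s); a = pop(&s);
            push(&s, a + b);
            break;
        case OP_MUL:
            b = pop(&s); a = pop(&s);
            push(&s, a * b);
            break;
        case OP_LT:                      /* comparison pushes 1 or 0 */
            b = pop(&s); a = pop(&s);
            push(&s, a < b);
            break;
        case OP_HALT:
            return pop(&s);
        default:
            assert(!"unknown opcode");
            return 0;
        }
    }
}
```

With this encoding, `(2 + 3) * 4` becomes `PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT`. Instructions need no operand names because the operands are implicit in the stack order.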

Input to the VM would ideally be in bytecode. If you haven't got a compiler yet this could be generated from a basic assembler (which could be part of the VM).

The VM itself consists of the loop:

1. Look at the bytecode instruction pointed to by the instruction pointer.
2. Execute the instruction:
   * If it's an arithmetic instruction, update the store accordingly.
   * If it's control flow, perform the test (if there is one) and set the instruction pointer.
   * If it's print, print a value from the store.
3. Advance the instruction pointer to the next instruction, unless a control-flow instruction has just set it.
4. Repeat from 1.

Dealing with computed variable names might be tricky: an instruction needs to specify which variables the computed names are in. This could be done by allowing instructions to refer to a pool of string constants provided in the input.

An example program (in assembly and bytecode):

offset  bytecode (hex)   source
 0      01 05 0E         //      LOAD 5, .x
 3      01 03 10         // .l1: LOAD 3, .y
 6      02 0E 10 0E      //      ADD .x, .y, .x
10      03 0E            //      PRINT .x
12      04 03            //      GOTO .l1
14      78 00            //      .x: "x"
16      79 00            //      .y: "y"

The instruction codes implied are:

"LOAD x, k" (01 x k) Load single byte x as an integer into variable named by string constant at offset k.
"ADD k1, k2, k3" (02 k1 k2 k3) Add two variables named by string constants k1 and k2 and put the sum in variable named by string constant k3.
"PRINT k" (03 k) Print variable named by string constant k.
"GOTO a" (04 a) Go to offset given by byte a.
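
Putting the loop and these four instruction codes together, a minimal interpreter might look like the sketch below. The names `run`, `slot`, `name_at` and `example_program`, the tiny fixed-size store, and the `max_steps` limit are assumptions for this example; the limit is there only because the example program loops forever. Each case advances the instruction pointer by its own instruction length, and GOTO sets it directly instead.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { OP_LOAD = 0x01, OP_ADD = 0x02, OP_PRINT = 0x03, OP_GOTO = 0x04 };

#define MAX_VARS 16

typedef struct { const char *name; int value; } Var;
typedef struct { Var vars[MAX_VARS]; int count; } Store;

/* The example program from the listing above, string constants included. */
const unsigned char example_program[] = {
    0x01, 0x05, 0x0E,          /*  0:      LOAD 5, .x     */
    0x01, 0x03, 0x10,          /*  3: .l1: LOAD 3, .y     */
    0x02, 0x0E, 0x10, 0x0E,    /*  6:      ADD .x, .y, .x */
    0x03, 0x0E,                /* 10:      PRINT .x       */
    0x04, 0x03,                /* 12:      GOTO .l1       */
    0x78, 0x00,                /* 14: .x: "x"             */
    0x79, 0x00,                /* 16: .y: "y"             */
};

/* Find the variable called `name`, creating it with value 0 if needed. */
int *slot(Store *s, const char *name)
{
    for (int i = 0; i < s->count; i++)
        if (strcmp(s->vars[i].name, name) == 0)
            return &s->vars[i].value;
    assert(s->count < MAX_VARS);
    s->vars[s->count].name = name;     /* points into the constant pool */
    s->vars[s->count].value = 0;
    return &s->vars[s->count++].value;
}

/* The string constant at offset k is the NUL-terminated text at code + k. */
const char *name_at(const unsigned char *code, int k)
{
    return (const char *)(code + k);
}

/* Execute at most max_steps instructions; return the last value printed. */
int run(const unsigned char *code, int max_steps)
{
    Store store = { .count = 0 };
    int ip = 0, last = 0;

    for (int step = 0; step < max_steps; step++) {
        const unsigned char *in = code + ip;   /* current instruction */
        switch (in[0]) {
        case OP_LOAD:                          /* 01 x k */
            *slot(&store, name_at(code, in[2])) = in[1];
            ip += 3;
            break;
        case OP_ADD: {                         /* 02 k1 k2 k3 */
            int a = *slot(&store, name_at(code, in[1]));
            int b = *slot(&store, name_at(code, in[2]));
            *slot(&store, name_at(code, in[3])) = a + b;
            ip += 4;
            break;
        }
        case OP_PRINT:                         /* 03 k */
            last = *slot(&store, name_at(code, in[1]));
            printf("%d\n", last);
            ip += 2;
            break;
        case OP_GOTO:                          /* 04 a */
            ip = in[1];                        /* jump: do not advance */
            break;
        default:
            fprintf(stderr, "bad opcode %02X at offset %d\n", in[0], ip);
            return last;
        }
    }
    return last;
}
```

Calling `run(example_program, 12)` executes twelve instructions of the example, which reaches PRINT three times. There are no bounds checks on the bytecode here; a real VM would validate offsets before using them.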

You need variants for when variables are named by other variables, etc. (and the levels of indirection get tricky to reason about). The assembler looks at the arguments like "ADD .x, .y, .x" and generates the correct bytecode for adding from string constants (and not computed variables).
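
As a sketch of what such an assembler does for the four instructions above, the C code below resolves label operands such as `.x` and `.l1` to byte offsets in two passes. To keep it short it takes already-split lines instead of parsing text; the `Line` struct, the `STR` pseudo-op for string constants, and the function names are all assumptions for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* STR is a pseudo-op of this sketch: it emits a NUL-terminated string constant. */
typedef enum { STR = 0, LOAD = 1, ADD = 2, PRINT = 3, GOTO = 4 } Op;

typedef struct {
    const char *label;      /* label defined on this line, or NULL */
    Op          op;
    const char *args[3];    /* operands as text: "5", ".x"; for STR, the text */
} Line;

/* The example program from earlier, as pre-split assembly lines. */
const Line example_source[] = {
    { NULL,  LOAD,  { "5", ".x" } },
    { ".l1", LOAD,  { "3", ".y" } },
    { NULL,  ADD,   { ".x", ".y", ".x" } },
    { NULL,  PRINT, { ".x" } },
    { NULL,  GOTO,  { ".l1" } },
    { ".x",  STR,   { "x" } },
    { ".y",  STR,   { "y" } },
};

/* Number of operand bytes that follow each opcode. */
int operand_count(Op op)
{
    switch (op) {
    case LOAD: return 2;
    case ADD:  return 3;
    default:   return 1;    /* PRINT, GOTO */
    }
}

/* Encoded size of a line in bytes. */
int line_size(const Line *l)
{
    if (l->op == STR)
        return (int)strlen(l->args[0]) + 1;
    return 1 + operand_count(l->op);
}

/* Turn operand text into a byte: a label becomes its offset, else a number. */
int operand_value(const Line *lines, int n, const int *offsets, const char *text)
{
    if (text[0] != '.')
        return atoi(text);
    for (int i = 0; i < n; i++)
        if (lines[i].label && strcmp(lines[i].label, text) == 0)
            return offsets[i];
    assert(!"undefined label");
    return 0;
}

/* Assemble n lines into out; return the number of bytes written. */
int assemble(const Line *lines, int n, unsigned char *out, int cap)
{
    int offsets[64], size = 0, pos = 0;
    assert(n <= 64);

    /* Pass 1: record the offset of every line so labels can be resolved. */
    for (int i = 0; i < n; i++) {
        offsets[i] = size;
        size += line_size(&lines[i]);
    }
    assert(size <= cap && size <= 256);   /* offsets must fit in one byte */

    /* Pass 2: emit bytes, replacing label operands by their offsets. */
    for (int i = 0; i < n; i++) {
        const Line *l = &lines[i];
        if (l->op == STR) {
            memcpy(out + pos, l->args[0], strlen(l->args[0]) + 1);
            pos += line_size(l);
            continue;
        }
        out[pos++] = (unsigned char)l->op;
        for (int a = 0; a < operand_count(l->op); a++)
            out[pos++] = (unsigned char)operand_value(lines, n, offsets, l->args[a]);
    }
    return pos;
}
```

Assembling `example_source` yields the same 18 bytes as the listing earlier in this answer. The first pass is needed because `.x` and `.y` are used before the lines that define them.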

Solution 2

Well, it's not about implementing a VM in C, but since it was the last tab I had open before I saw this question, I feel I need to point out an article about implementing a QBASIC bytecode compiler and virtual machine in JavaScript, using the <canvas> tag for display. It includes all of the source code needed to implement enough of QBASIC to run the "nibbles" game, and is the first in a series of articles on the compiler and bytecode interpreter. This one describes the VM, and the author is promising future articles describing the compiler as well.

By the way, I didn't vote to close your question, but the close vote you got was as a duplicate of a question from last year on how to learn about implementing a virtual machine. I think this question (about a tutorial or something relatively simple) is different enough from that one that it should remain open, but you might want to refer to that one for some more advice.

Solution 3

Another resource to look at is the implementation of the Lua language. It is a register-based VM that has a good reputation for performance. The source code is in ANSI C89, and is generally very readable.

As with most high performance scripting languages, the end user sees a readable, high level dynamic language (with features like closures, tail calls, immutable strings, numbers and hash tables as the primary data types, functions as first class values, and more). Source text is compiled to the VM's bytecode for execution by a VM implementation whose outline is pretty much as described by Edmund's answer.

A great deal of effort has gone into keeping the implementation of the VM itself both portable and efficient. If even more performance is needed, a just-in-time compiler from VM bytecode to native instructions exists for 32-bit x86, and is in beta release for 64-bit.

Solution 4

As a starting point (even though it is C++ rather than C), you could take a look at muParser.

It's a math expression parser that uses a simple virtual machine to execute operations. You will still need some time to understand everything, but the code is simpler than a complete VM able to run a full program. (By the way, I'm designing a similar library in C#. It is in its early stages, but later versions will allow compilation to .NET/VM IL or maybe to a new simple VM like muParser's.)

Another interesting project is NekoVM (it executes .n bytecode files). It's an open source project written in C, and its main language (.neko) is intended to be generated by source-to-source compilers. In the same spirit, see Haxe from the same author (also open source).

Solution 5

Like you I have also been studying virtual machines and compilers and one good book I can recommend is Compiler Design: Virtual Machines. It describes virtual machines for imperative, functional, logic, and object-oriented languages by giving the instruction set for each VM along with a tutorial on how to compile a higher-level language to that VM. I've only implemented the VM for the imperative language and already it has been a very useful exercise.

If you're just starting out then another resource I can recommend is PL101. It is an interactive set of lessons in JavaScript that guides you through the process of implementing parsers and interpreters for various languages.


Author: zaharpopov

Updated on August 27, 2020

Comments

  • zaharpopov
    zaharpopov almost 4 years

    For self-education, I want to implement a simple virtual machine for a dynamic language, preferably in C. Something like the Lua VM, or Parrot, or the Python VM, but simpler. Are there any good resources/tutorials on achieving this, apart from looking at the code and design documentation of the existing VMs?

    Edit: why the close vote? I don't understand. Is this not programming? Please comment if there is a specific problem with my question.

    • tekknolagi
      tekknolagi almost 10 years
      If you are still interested, I wrote a really really simple VM in C. Take a look: github.com/tekknolagi/carp
  • zaharpopov
    zaharpopov over 14 years
    nice. any idea for resource to go from here?
  • Alexander H
    Alexander H over 14 years
    @zaharpopv: I'm not too sure about implementing the dynamic functionality of your language, but a simple VM design like the above is easy enough that once you've done it you will learn how suitable it is and can afford to change it to support more interesting features. Also, looking at the set of instructions for the Python interpreter might give you a few ideas on how to support dynamism.