What do C and Assembler actually compile to?


Solution 1

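C typically compiles to assembler, just because that makes life easy for the poor compiler writer. (GCC works this way; some other compilers emit object code directly and skip the textual assembly step.)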

Assembly code normally assembles (not "compiles") to relocatable object code. You can think of this as binary machine code and binary data, but with lots of decoration and metadata. The key parts are:

  • Code and data appear in named "sections".

  • Relocatable object files may include definitions of labels, which refer to locations within the sections.

  • Relocatable object files may include "holes" that are to be filled with the values of labels defined elsewhere. The official name for such a hole is a relocation entry.

For example, if you compile and assemble (but don't link) this program

#include <stdio.h>
int main(void) { printf("Hello, world\n"); return 0; }

you are likely to wind up with a relocatable object file with

  • A text section containing the machine code for main

  • A label definition for main which points to the beginning of the text section

  • A rodata (read-only data) section containing the bytes of the string literal "Hello, world\n"

  • A relocation entry that depends on printf and that points to a "hole" in a call instruction in the middle of a text section.
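To make the idea of a "hole" concrete, here is a minimal sketch in C of a toy object-file model. It is not the ELF or COFF layout: every type, field, and function name (ToySection, ToySymbol, ToyReloc, toy_lookup, toy_patch) is made up for illustration, and the hole is assumed to be a plain 4-byte little-endian absolute address, which is simpler than what real relocation types do.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model of relocatable object code. Hypothetical layout, not ELF/COFF. */
typedef struct {
    const char *name;       /* section name, e.g. "text" or "rodata" */
    uint8_t     bytes[16];  /* section contents */
    size_t      size;       /* number of bytes in use */
} ToySection;

typedef struct {
    const char *name;       /* label name, e.g. "main" */
    int         section;    /* index of the section the label points into */
    uint32_t    offset;     /* offset of the label within that section */
} ToySymbol;

typedef struct {
    int         section;    /* section that contains the hole */
    uint32_t    offset;     /* where in that section the hole starts */
    const char *symbol;     /* label whose value must fill the hole */
} ToyReloc;

/* Look a label up among the definitions collected from all object files.
   section_base[i] is the address chosen for section i.
   Returns 1 and stores the label's address, or 0 if the label is undefined. */
int toy_lookup(const ToySymbol *defs, size_t ndefs,
               const uint32_t *section_base,
               const char *name, uint32_t *address_out)
{
    for (size_t i = 0; i < ndefs; i++) {
        if (strcmp(defs[i].name, name) == 0) {
            *address_out = section_base[defs[i].section] + defs[i].offset;
            return 1;
        }
    }
    return 0;
}

/* Fill a 4-byte hole with an address, least significant byte first.
   Returns 0 if the hole would run past the end of the section. */
int toy_patch(ToySection *sec, const ToyReloc *r, uint32_t address)
{
    if (r->offset > sec->size || sec->size - r->offset < 4)
        return 0;
    for (int i = 0; i < 4; i++)
        sec->bytes[r->offset + i] = (uint8_t)(address >> (8 * i));
    return 1;
}
```

A linker built on this model would call toy_lookup once per relocation entry and report an "undefined symbol" error whenever it returns 0. That is roughly what happens to printf if you forget to link the C library.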

On a Unix system, a relocatable object file is generally called a .o file, as in hello.o. You can explore the label definitions and uses with a simple tool called nm, and you can get more detailed information from a somewhat more complicated tool called objdump.

I teach a class that covers these topics, and I have students write an assembler and linker, which takes a couple of weeks, but when they've done that most of them have a pretty good handle on relocatable object code. It's not such an easy thing.

Solution 2

Let's take a C program.

When you run gcc, clang, or cl on the C program, it will go through these stages:

  1. Preprocessor (#include, #ifdef, trigraph analysis, encoding translations, comment management, macros...) including lexing into preprocessor tokens and eventually resulting in flat text for input to the compiler proper.
  2. Lexical analysis (producing tokens and lexical errors).
  3. Syntactical analysis (producing a parse tree and syntactical errors).
  4. Semantic analysis (producing a symbol table, scoping information, and scoping/typing errors). This stage also does data-flow analysis and transforms the program logic into an "intermediate representation" that the optimizer can work with, often in SSA form. clang/LLVM uses LLVM-IR; gcc uses GIMPLE, then RTL.
  5. Optimization of the program logic, including constant propagation, inlining, hoisting invariants out of loops, auto-vectorization, and many, many other things. (Most of the code for a widely-used modern compiler is optimization passes.) Transforming through intermediate representations is just part of how some compilers work, which makes it impossible or meaningless to "disable all optimizations".
  6. Outputting assembly source (or another intermediate format, like .NET IL bytecode).
  7. Assembling the assembly source into some binary object format.
  8. Linking the object code with whatever static libraries are needed, as well as relocating it if needed.
  9. Output of the final executable in ELF, PE/COFF, 64-bit Mach-O, or whatever other format.

In practice, some of these steps may be done at the same time, but this is the logical order. Most compilers have options to stop after any given step (e.g. preprocess or asm), and open-source compilers like GCC can also dump the internal representation between optimization passes (-fdump-tree-...).
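As a rough illustration of the lexical analysis stage, here is a minimal sketch of a tokenizer in C. It is nowhere near a real C lexer (no strings, comments, multi-character operators, or floating-point literals), and the names ToyKind, ToyToken, and toy_next_token are invented for this example.

```c
#include <ctype.h>
#include <stddef.h>

typedef enum { TOK_END, TOK_IDENT, TOK_NUMBER, TOK_PUNCT } ToyKind;

typedef struct {
    ToyKind     kind;
    const char *start;   /* first character of the token in the source text */
    size_t      length;  /* number of characters in the token */
} ToyToken;

/* Read one token starting at *cursor and advance *cursor past it. */
ToyToken toy_next_token(const char **cursor)
{
    const char *p = *cursor;

    /* Whitespace separates tokens but is not a token itself. */
    while (isspace((unsigned char)*p))
        p++;

    ToyToken tok = { TOK_END, p, 0 };
    if (*p == '\0') {
        /* End of input: leave kind as TOK_END with length 0. */
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        tok.kind = TOK_IDENT;
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
    } else if (isdigit((unsigned char)*p)) {
        tok.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*p))
            p++;
    } else {
        /* Anything else is treated as a one-character punctuator. */
        tok.kind = TOK_PUNCT;
        p++;
    }

    tok.length = (size_t)(p - tok.start);
    *cursor = p;
    return tok;
}
```

Called repeatedly on the text "x = 42;", this yields an identifier, a punctuator, a number, a punctuator, and then the end marker. The parser in the next stage consumes that token stream rather than raw characters.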

Note that there's a "container" in ELF or COFF format around the actual executable binary, unless it's a DOS .com executable.
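One way to see the container is that most of these formats announce themselves in their first few bytes, while a .com file does not. The sketch below guesses the format from those leading bytes. The function name toy_guess_format and the enum are hypothetical, and a real tool would go on to parse the rest of the header instead of trusting the magic number alone.

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    FMT_UNKNOWN,        /* no recognised header, e.g. a flat .com image */
    FMT_ELF,
    FMT_PE_OR_DOS_EXE,  /* PE files start with a DOS "MZ" stub */
    FMT_MACHO64
} ToyFormat;

/* Guess the container format from the first bytes of a file. */
ToyFormat toy_guess_format(const unsigned char *buf, size_t n)
{
    /* ELF files begin with 0x7F followed by the letters "ELF". */
    static const unsigned char elf[4] = { 0x7F, 'E', 'L', 'F' };
    /* 64-bit Mach-O magic 0xFEEDFACF as written by a little-endian machine. */
    static const unsigned char macho64[4] = { 0xCF, 0xFA, 0xED, 0xFE };

    if (n >= 4 && memcmp(buf, elf, 4) == 0)
        return FMT_ELF;
    if (n >= 4 && memcmp(buf, macho64, 4) == 0)
        return FMT_MACHO64;
    if (n >= 2 && buf[0] == 'M' && buf[1] == 'Z')
        return FMT_PE_OR_DOS_EXE;
    return FMT_UNKNOWN;
}
```

A .com file comes back as FMT_UNKNOWN because its first byte is already the first instruction. There is nothing for a loader to parse.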

You will find that a book on compilers (I recommend the Dragon book, the standard introductory book in the field) will have all the information you need and more.

As Marco commented, linking and loading is a large area and the Dragon book more or less stops at the output of the executable binary. To actually go from there to running on an operating system is a decently complex process, which Levine in Linkers and Loaders covers.

I've wiki'd this answer to let people tweak any errors/add information.

Solution 3

There are different phases in translating C++ into a binary executable. The language specification describes translation only abstractly and does not require any particular set of tools or intermediate files, so I will describe the phases that are common in practice.

Source C++ To Assembly or Intermediate Language

Some compilers actually translate the C++ code into an assembly language or an intermediate language. This is not a required phase, but helpful in debugging and optimizations.

Assembly To Object Code

The next common step is to translate the assembly language into object code. The object code contains machine code with relative addresses and open references to external subroutines (methods or functions). In general, the translator puts as much information into an object file as it can; everything else is left unresolved.

Linking Object Code(s)

The linking phase combines one or more object files, resolves references, and eliminates duplicate subroutines. The final output is an executable file. This file contains information for the operating system and relative addresses.

Executing Binary Files

The Operating System loads the executable file, usually from a hard drive, and places it into memory. The OS may convert relative addresses into physical locations. The OS may also prepare resources (such as DLLs and GUI widgets) that are required by the executable (which may be stated in the Executable file).

Compiling Directly To Binary

Some compilers, such as the ones used in embedded systems, can compile from C++ directly to executable binary code. This code has physical addresses instead of relative addresses and does not require an OS to load it.

Advantages

One of the advantages of these phases is that C++ programs can be broken into pieces, compiled individually, and linked at a later time. They can even be linked with pieces from other developers (a.k.a. libraries). This allows developers to compile only the pieces in development and link in pieces that are already validated. In general, the translation from C++ to object code is the time-consuming part of the process. Also, a person doesn't want to wait for all the phases to complete when there is an error in the source code.

Keep an open mind and always expect the Third Alternative (Option).

Solution 4

To answer your questions: please note that this is subjective, as there are different processors, platforms, assemblers, and C compilers. In this case, I will talk about the Intel x86 platform.

  1. Assemblers do not usually assemble to pure, flat binary (raw machine code). Instead they usually produce a file divided into segments such as data, text, and bss, to name but a few; this is called an object file. The linker then steps in and adjusts the segments to make the file executable, that is, ready to run. Incidentally, the default output file when you run GNU as on foo.s is a.out, which is shorthand for "assembler output". (The same filename is the gcc default for linker output, with the assembler output being only a temporary file.)
  2. Boot loaders use a special directive. Back in the days of DOS, it was common to find a directive such as .Org 100h, which marks the assembler code as being of the old .COM variety, from before .EXE took over in popularity. You did not even need a separate assembler to produce a .COM file: the old debug.exe that came with MS-DOS did the trick for small, simple programs. The .COM files did not need a linker and were in a straight, ready-to-run binary format. Here's a simple session using DEBUG.
a 0100
 mov AH,07
 int 21
 cmp AL,00
 jnz 010c
 mov AH,07
 int 21
 mov AH,4C
 int 21

r CX
10
n respond.com
w
q

This produces a ready-to-run .COM program called 'respond.com' that waits for a keystroke and does not echo it to the screen. Notice the 'a 0100' at the beginning, which shows that the instruction pointer starts at 100h; that is the hallmark of a .COM file. This old script was mainly used in batch files that needed to wait for a response without echoing it. The original script can be found here.
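To show how literally "flat binary" should be taken, here is the same program written out as a C byte array. I encoded the bytes by hand from the standard 8086 opcodes, on the assumption that DEBUG picks the usual short forms. The total of 16 bytes matches the 10 (hex) stored in CX in the session, but I have not run DEBUG to confirm it. The names respond_com and toy_jump_target are made up for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hand-encoded image of respond.com. The file has no header at all:
   DOS loads these bytes at offset 0100h and starts executing the first one. */
static const uint8_t respond_com[16] = {
    0xB4, 0x07,   /* 0100: mov ah,07h   read a key without echo          */
    0xCD, 0x21,   /* 0102: int 21h                                       */
    0x3C, 0x00,   /* 0104: cmp al,00h   zero means an extended key       */
    0x75, 0x04,   /* 0106: jnz 010Ch    displacement is relative to 0108 */
    0xB4, 0x07,   /* 0108: mov ah,07h   read the second byte of the key  */
    0xCD, 0x21,   /* 010A: int 21h                                       */
    0xB4, 0x4C,   /* 010C: mov ah,4Ch   exit, AL is the return code      */
    0xCD, 0x21    /* 010E: int 21h                                       */
};

/* Address reached by the 2-byte short jump at image offset `at`,
   assuming the image is loaded at 0100h as .COM files are. */
unsigned toy_jump_target(const uint8_t *image, size_t at)
{
    /* The signed 8-bit displacement counts from the end of the instruction. */
    return 0x100u + (unsigned)at + 2u + (unsigned)(int8_t)image[at + 1];
}
```

The jnz stores only the distance 04, not the address 010C. DEBUG has no labels, so in the session you work out the target address yourself and the assembler converts it to a displacement. Compare this with an object file, where such references can be left as holes for the linker.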

Again, in the case of boot loaders, the code is converted to a binary format. There was a program that used to come with DOS, called EXE2BIN, whose job was to convert the raw object code into a format that could be copied onto a bootable disk for booting. Remember that no linker is run against the assembled code, as the linker is for the runtime environment and sets up the code to make it runnable and executable.

When booting, the BIOS expects the boot code to be at segment:offset 0000:7C00 (address 0x7c00), if my memory serves me correctly, and the code (after being EXE2BIN'd) starts executing there. The bootloader then relocates itself lower down in memory and continues loading by issuing int 0x13 to read from the disk. It switches on the A20 gate, enables the DMA, and switches to protected mode, since the BIOS runs in 16-bit mode. The data read from the disk is loaded into memory, and the bootloader issues a far jump into that code (likely to be written in C). That is in essence how the system boots.
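For readers who have not met segment:offset addressing, the arithmetic in 16-bit real mode is physical address = segment * 16 + offset. A minimal sketch follows; the function name toy_real_mode_address is made up for this example.

```c
#include <stdint.h>

/* Real-mode x86 address arithmetic: the 16-bit segment is shifted left by
   four bits (multiplied by 16) and the 16-bit offset is added to it. */
uint32_t toy_real_mode_address(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

Both 0000:7C00 and 07C0:0000 come out as 0x7C00, so the same byte can be named by many segment:offset pairs. The largest pair, FFFF:FFFF, comes out as 0x10FFEF, slightly above 1 MiB. That overshoot is the reason the A20 gate mentioned above exists.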

OK, the previous paragraph sounds abstract and simple, and I may have left something out, but that is how it is in a nutshell.

Solution 5

They compile to a file in a specific format (COFF for Windows, etc.), composed of headers and segments, some of which contain "plain binary" opcodes. Assemblers and compilers (such as C compilers) create the same sort of output. Some formats, such as the old *.COM files, had no headers but still carried certain assumptions (such as where in memory the file would be loaded or how big it could be).

On Windows machines, the OS's bootstrapper is in a disk sector loaded by the BIOS, and both of these are "plain" binary. Once the OS has loaded its loader, it can read files that have headers and segments.

Does that help?

Author by

lamas


Updated on July 09, 2022

Comments

  • lamas
    lamas almost 2 years

    So I found out that C(++) programs actually don't compile to plain "binary" (I may have gotten some things wrong here, in that case I'm sorry :D) but to a range of things (symbol table, os-related stuff,...) but...

    • Does assembler "compile" to pure binary? That means no extra stuff besides resources like predefined strings, etc.

    • If C compiles to something else than plain binary, how can that small assembler bootloader just copy the instructions from the HDD to memory and execute them? I mean if the OS kernel, which is probably written in C, compiles to something different than plain binary - how does the bootloader handle it?

    edit: I know that assembler doesn't "compile" because it only has your machine's instruction set - I didn't find a good word for what assembler "assembles" to. If you have one, leave it here as comment and I'll change it.

  • Paul Nathan
    Paul Nathan over 14 years
    That's true in most cases. Some assembly languages have pseudo-operations which are sorta macros.
  • Admin
    Admin over 14 years
    Every "fact" in this answer is wrong.
  • Steven Sudit
    Steven Sudit over 14 years
    Assembler is as optimized as you make it. C++, managed or otherwise, normally compiles into complex executables with headers and segments, not plain binary. The BIOS and the early parts of the OS are plain binary.
  • mr-sk
    mr-sk over 14 years
    Neil - why not correct it then?
  • mmx
    mmx over 14 years
    "microcode" is a completely misleading word to use when you're referring to "intermediate code" -- and intermediate code is actually considered "binary" (probably not native binary).
  • MBO
    MBO over 14 years
    Assembler !== binary. In assembler you can use symbolic names, labels and so on, which has no direct representation in binary, they need to be replaced by actual numbers. If you add some code before label, then that label should be moved to some other address. Assembler is simple programming language, which translates directly to binary, but is not binary itself.
  • Steven Sudit
    Steven Sudit over 14 years
    Almost directly. The same opcode compiles to different binary depending on details such as how the data is addressed. Likewise, even an assembler will sneak in prefix operators as needed. So while there is a very, very close relationship, they're not quite 1:1.
  • nobody
    nobody over 14 years
    True, but not an answer to the asked question.
  • Admin
    Admin over 14 years
    I think you are a bit confused about registers. You are correct that there isn't a one-to-one correspondence between an assembler opcode and a machine code instruction, however.
  • Steven Sudit
    Steven Sudit over 14 years
    @Paul Nathan: Good point. Macro-assemblers are a step closer to compilers.
  • Steven Sudit
    Steven Sudit over 14 years
    @Neil: You're right to point out that registers, by definition, don't have addresses, as they're not in memory. However, on architectures with a large number of general-purpose registers (many RISC CPU's), we can be forgiven for thinking of the register number as an address "of sorts".
  • Laizer
    Laizer over 14 years
    I was aiming for the 'does machine code compile to binary' side of the question. Tried to paint the relationship, rather than just saying 'not really'.
  • jsoverson
    jsoverson over 14 years
    Plain binary? Everything stored on a hard drive is binary, that statement is meaningless.
  • ThePosey
    ThePosey over 14 years
    "Assembler compiles to pure binary, but, as strange as it gets, it is less optimized than C(++)" What is that even supposed to mean? There are misleading issues with this accepted answer.
  • wich
    wich over 14 years
    It depends a bit on what assembler you use, though most assemblers these days are macro assemblers, offering a bit more.
  • wich
    wich over 14 years
    @Neil, that would be between an assembly mnemonic and a cpu opcode, or machine instruction.
  • Steven Sudit
    Steven Sudit over 14 years
    The thing to remember is that the CIL is contained inside a COFF executable.
  • Steven Sudit
    Steven Sudit over 14 years
    jsoverson: In this context, "plain binary" refers to opcodes without the headers and segments.
  • Pete Kirkham
    Pete Kirkham over 14 years
    Assembler (a human readable macro language which is translated to machine code) != Assembly (the binary file generated by common language infrastructure compilers, where each operation has a binary string). I think you may have a misunderstanding.
  • Steven Sudit
    Steven Sudit over 14 years
    ThePosey: My guess is that they're trying to say that assemblers don't optimize code, whereas compilers typically do (when not in debug mode). Not claiming their answer was clear or correct, just that they might have been thinking of the right thing.
  • Marco van de Voort
    Marco van de Voort over 14 years
    Most C compilers compile directly to relocatable machine code. It is faster to skip the slow textual step. Some (like 16-bit compilers capable of .COM files) can generate non-relocatable code directly. One could argue though that in directly machinecode generating compilers, the assembler is a relative separate standing part.
  • Marco van de Voort
    Marco van de Voort over 14 years
    Hmm, the Dragon book is mostly about parsing. I'd recommend "Linkers and Loaders" by Levine, iecc.com/linker which is also available on the web.
  • Paul Nathan
    Paul Nathan over 14 years
    Linkers and loaders is also a good book.
  • Thomas Pornin
    Thomas Pornin over 14 years
    Actually, in the "logical" order, lexical analysis occurs before preprocessing, because the preprocessor operates on a stream of tokens. That's how it is defined in the C standard, and that is also how it happens in modern versions of gcc (when the preprocessor was rewritten and turned into a lexing library).
  • Paul Nathan
    Paul Nathan over 14 years
    Thomas: Interesting! I am out of date
  • Paul Nathan
    Paul Nathan over 14 years
    C standard, 5.1.1.2 suggests that traditional lexing is logically separate from preprocessor lexing.
  • Potatoswatter
    Potatoswatter over 14 years
    Relocatable code is not a requirement of C, and many platforms don't use it.
  • Marco van de Voort
    Marco van de Voort about 14 years
    Which was really interesting when we had 100kwords memory, but is it nowadays still an advantage or more an artefact? A compilation granularity that would utilize available memory better (e.g. to avoid repeated header reparsing, relative slow disk I/O or even just binary startup time) would be more in line with modern requirements?
  • Lothar
    Lothar over 13 years
    Is there any script for your course available online?
  • Norman Ramsey
    Norman Ramsey over 13 years
    @Lothar my course is online at cs.tufts.edu/comp/40. For past years, see my home page. For obvious reasons the answers are not online.
  • Peter Cordes
    Peter Cordes over 3 years
    As discussed under your answer on a duplicate (Do programming language compilers first translate to assembly or directly to machine code?) mainstream C++ compilers with large development teams like MSVC, ICC, and clang/LLVM (but still not GCC), all output relocatable .o / .obj files directly by default, with machine-code generation and object file format handling as a library (in LLVM's case) not a separate program. See also Does a compiler always produce an assembly code?
  • Peter Cordes
    Peter Cordes almost 3 years
    debug.exe is an assembler. (A bad one by modern standards, e.g. no labels so you have to calculate branch target addresses by hand.) Also, raw machine code is not an object file; if it was literally raw (like nasm -f bin output, e.g. a .com file), there's no section metadata, or any other metadata. I made an edit to that paragraph.
  • t0mm13b
    t0mm13b almost 3 years
    @PeterCordes True, but it would be unfair to compare it to modern standards as this was part of the MSDOS install base back in the 80's and 90s, this was long before Linux / open source, appeared on the scene which opened up the corridors of the general awareness of standards. :)
  • Peter Cordes
    Peter Cordes almost 3 years
    It's 100% fair if people are proposing still using it today! Apparently some poor unfortunate folks get homework that requires them to write 16-bit x86 DOS code for debug.exe, leading to questions on SO about it. That's what I meant by saying "by modern standards". Also, it was better than nothing at the time, but even then I assume you'd want TASM, MASM, or AS86 if you could get them, for anything more than small toy stuff.