How does the compilation/linking process work?

Solution 1

The compilation of a C++ program involves three steps:

  1. Preprocessing: the preprocessor takes a C++ source code file and deals with the #includes, #defines and other preprocessor directives. The output of this step is a "pure" C++ file without pre-processor directives.

  2. Compilation: the compiler takes the pre-processor's output and produces an object file from it.

  3. Linking: the linker takes the object files produced by the compiler and produces either a library or an executable file.

Preprocessing

The preprocessor handles the preprocessor directives, like #include and #define. It is agnostic of the syntax of C++, which is why it must be used with care.
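
As a small illustration of why that care is needed, here is a minimal sketch using two hypothetical macros, BAD_SQUARE and SQUARE (the names are made up for this example). Because the preprocessor substitutes tokens without knowing anything about C++ operator precedence, the unparenthesized version expands to a different expression than one might expect.

```cpp
#include <cassert>

// Hypothetical macros, for illustration only.
#define BAD_SQUARE(x) x * x      // plain token substitution, no parentheses
#define SQUARE(x) ((x) * (x))    // parenthesized so the expansion keeps its meaning

// BAD_SQUARE(1 + 2) expands to 1 + 2 * 1 + 2, which C++ parses as 1 + (2 * 1) + 2.
int bad_result() { return BAD_SQUARE(1 + 2); }

// SQUARE(1 + 2) expands to ((1 + 2) * (1 + 2)).
int good_result() { return SQUARE(1 + 2); }
```

Calling bad_result() yields 5 rather than 9, while good_result() yields 9.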

It works on one C++ source file at a time by replacing #include directives with the content of the respective files (which is usually just declarations), replacing macros (#define), and selecting different portions of text depending on #if, #ifdef and #ifndef directives.

The preprocessor works on a stream of preprocessing tokens. Macro substitution is defined as replacing tokens with other tokens (the operator ## enables merging two tokens when it makes sense).

After all this, the preprocessor produces a single output that is a stream of tokens resulting from the transformations described above. It also adds some special markers that tell the compiler where each line came from so that it can use those to produce sensible error messages.

Some errors can be produced at this stage with clever use of the #if and #error directives.
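
The directives mentioned in this section can be sketched in a few lines. This is only an illustrative example: MAKE_NAME, USE_FAST_PATH, counter_1 and selected_branch are names invented here, not anything from a real library.

```cpp
#include <cassert>

// #error aborts preprocessing with a message; here it fires only if the
// file is not being processed as C++.
#ifndef __cplusplus
#error "This example must be compiled as C++"
#endif

// Hypothetical macro: ## merges two tokens into a single token.
#define MAKE_NAME(prefix, n) prefix##n

// After preprocessing this line reads: int counter_1 = 42;
int MAKE_NAME(counter_, 1) = 42;

// Hypothetical configuration switch for conditional inclusion.
#define USE_FAST_PATH 1

// Only one of the two definitions below survives preprocessing;
// the compiler never sees the other one.
#if USE_FAST_PATH
int selected_branch() { return 1; }
#else
int selected_branch() { return 2; }
#endif
```

With GCC or Clang, running only the preprocessor (the -E option) on such a file shows the text the compiler actually receives.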

Compilation

The compilation step is performed on each output of the preprocessor. The compiler parses the pure C++ source code (now without any preprocessor directives) and converts it into assembly code. It then invokes the underlying back end (the assembler in the toolchain), which assembles that code into machine code and produces an actual binary file in some format (ELF, COFF, a.out, ...). This object file contains the compiled code (in binary form) of the symbols defined in the input. Symbols in object files are referred to by name.

Object files can refer to symbols that are not defined. This is the case when you use a declaration, and don't provide a definition for it. The compiler doesn't mind this, and will happily produce the object file as long as the source code is well-formed.
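
A minimal sketch of this, assuming a hypothetical function triple whose definition would normally live in another source file (it is placed in the same block here only so the example is complete):

```cpp
#include <cassert>

// Declaration only: enough for the compiler to type-check calls to triple().
int triple(int x);

// This function compiles even when triple() is not defined in the same file;
// the object file then simply records `triple` as an undefined symbol.
int triple_plus_one(int x) { return triple(x) + 1; }

// The definition. In a real project this would typically sit in a separate
// source file (say, triple.cpp) and be matched to the call by the linker.
int triple(int x) { return 3 * x; }
```

If the definition were removed, compiling this file on its own would still succeed; the problem would only surface when linking.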

Compilers usually let you stop compilation at this point. This is very useful because with it you can compile each source code file separately. The advantage this provides is that you don't need to recompile everything if you only change a single file.

The produced object files can be put in special archives called static libraries, for easier reusing later on.

It's at this stage that "regular" compiler errors, like syntax errors or failed overload resolution errors, are reported.

Linking

The linker is what produces the final compilation output from the object files the compiler produced. This output can be either a shared (or dynamic) library or an executable. Despite the similar name, shared libraries haven't got much in common with the static libraries mentioned earlier.

It links all the object files by replacing the references to undefined symbols with the correct addresses. Each of these symbols can be defined in other object files or in libraries. If they are defined in libraries other than the standard library, you need to tell the linker about them.

At this stage the most common errors are missing definitions or duplicate definitions. The former means that either the definitions don't exist (i.e. they are not written), or that the object files or libraries where they reside were not given to the linker. The latter is obvious: the same symbol was defined in two different object files or libraries.
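
To make the two kinds of error concrete, here is a toy model of the bookkeeping involved. It is a sketch under strong simplifying assumptions, not how a real linker is implemented: ObjectFile, LinkResult and check_symbols are hypothetical names, and an "object file" is reduced to the set of names it defines and the set of names it refers to.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy object file: only the symbol names it defines and the ones it references.
struct ObjectFile {
    std::string name;
    std::set<std::string> defined;
    std::set<std::string> referenced;
};

struct LinkResult {
    std::set<std::string> missing;    // referenced somewhere, defined nowhere
    std::set<std::string> duplicate;  // defined in more than one object file
};

LinkResult check_symbols(const std::vector<ObjectFile>& objects) {
    // Count how many object files define each symbol.
    std::map<std::string, int> definition_count;
    for (const ObjectFile& obj : objects)
        for (const std::string& sym : obj.defined)
            ++definition_count[sym];

    LinkResult result;
    for (const auto& entry : definition_count)
        if (entry.second > 1) result.duplicate.insert(entry.first);

    // Any reference with no definition at all is a "missing definition".
    for (const ObjectFile& obj : objects)
        for (const std::string& sym : obj.referenced)
            if (definition_count.find(sym) == definition_count.end())
                result.missing.insert(sym);
    return result;
}
```

Feeding it a main.o that references helper and log_message, plus two object files that both define helper, reports log_message as missing and helper as duplicated.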

Solution 2

This topic is discussed at CProgramming.com:
https://www.cprogramming.com/compilingandlinking.html

Here is what the author there wrote:

Compiling isn't quite the same as creating an executable file! Instead, creating an executable is a multistage process divided into two components: compilation and linking. In reality, even if a program "compiles fine" it might not actually work because of errors during the linking phase. The total process of going from source code files to an executable might better be referred to as a build.

Compilation

Compilation refers to the processing of source code files (.c, .cc, or .cpp) and the creation of an 'object' file. This step doesn't create anything the user can actually run. Instead, the compiler merely produces the machine language instructions that correspond to the source code file that was compiled. For instance, if you compile (but don't link) three separate files, you will have three object files created as output, each with the name .o or .obj (the extension will depend on your compiler). Each of these files contains a translation of your source code file into a machine language file -- but you can't run them yet! You need to turn them into executables your operating system can use. That's where the linker comes in.

Linking

Linking refers to the creation of a single executable file from multiple object files. In this step, it is common that the linker will complain about undefined functions (commonly, main itself). During compilation, if the compiler could not find the definition for a particular function, it would just assume that the function was defined in another file. If this isn't the case, there's no way the compiler would know -- it doesn't look at the contents of more than one file at a time. The linker, on the other hand, may look at multiple files and try to find references for the functions that weren't mentioned.

You might ask why there are separate compilation and linking steps. First, it's probably easier to implement things that way. The compiler does its thing, and the linker does its thing -- by keeping the functions separate, the complexity of the program is reduced. Another (more obvious) advantage is that this allows the creation of large programs without having to redo the compilation step every time a file is changed. Instead, using so called "conditional compilation", it is necessary to compile only those source files that have changed; for the rest, the object files are sufficient input for the linker. Finally, this makes it simple to implement libraries of pre-compiled code: just create object files and link them just like any other object file. (The fact that each file is compiled separately from information contained in other files, incidentally, is called the "separate compilation model".)

To get the full benefits of conditional compilation, it's probably easier to get a program to help you than to try and remember which files you've changed since you last compiled. (You could, of course, just recompile every file that has a timestamp greater than the timestamp of the corresponding object file.) If you're working with an integrated development environment (IDE) it may already take care of this for you. If you're using command line tools, there's a nifty utility called make that comes with most *nix distributions. Along with conditional compilation, it has several other nice features for programming, such as allowing different compilations of your program -- for instance, if you have a version producing verbose output for debugging.

Knowing the difference between the compilation phase and the link phase can make it easier to hunt for bugs. Compiler errors are usually syntactic in nature -- a missing semicolon, an extra parenthesis. Linking errors usually have to do with missing or multiple definitions. If you get an error that a function or variable is defined multiple times from the linker, that's a good indication that the error is that two of your source code files have the same function or variable.
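
The timestamp rule mentioned in the quoted text (recompile a file when it is newer than its object file) can be sketched as below. FileStamp and needs_rebuild are hypothetical names for this example, and the timestamps are plain numbers; a real tool would read them from the filesystem, for instance with std::filesystem::last_write_time in C++17.

```cpp
#include <cassert>

// Hypothetical summary of a file's state, for illustration only.
struct FileStamp {
    bool exists;
    long long modified;  // modification time, e.g. seconds since some epoch
};

// Recompile when there is no object file yet, or when the source file
// was modified after the object file was produced.
bool needs_rebuild(const FileStamp& source, const FileStamp& object) {
    if (!object.exists) return true;
    return source.modified > object.modified;
}
```

This is roughly the decision a build tool makes per source file; tools such as make also track other prerequisites, such as headers, which this sketch ignores.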

Solution 3

GCC compiles a C/C++ program into an executable in four steps.

For example, gcc -o hello hello.c is carried out as follows:

1. Pre-processing

Preprocessing via the GNU C Preprocessor (cpp.exe), which includes the headers (#include) and expands the macros (#define).

cpp hello.c > hello.i

The resultant intermediate file "hello.i" contains the expanded source code.

2. Compilation

The compiler compiles the pre-processed source code into assembly code for a specific processor.

gcc -S hello.i

The -S option specifies to produce assembly code, instead of object code. The resultant assembly file is "hello.s".

3. Assembly

The assembler (as.exe) converts the assembly code into machine code in the object file "hello.o".

as -o hello.o hello.s

4. Linker

Finally, the linker (ld.exe) links the object code with the library code to produce an executable file "hello".

ld -o hello hello.o ...libraries...

Solution 4

On the standard front:

  • a translation unit is the combination of a source file and the headers and source files it includes, less any source lines skipped by conditional inclusion preprocessor directives.

  • the standard defines 9 phases in the translation. The first four correspond to preprocessing, the next three are the compilation, the next one is the instantiation of templates (producing instantiation units) and the last one is the linking.

In practice the eighth phase (the instantiation of templates) is often done during the compilation process, but some compilers delay it to the linking phase and some spread it across the two.
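
For readers unsure what "instantiation" refers to here, a minimal sketch (the function names add and add_ints are made up for this example):

```cpp
#include <cassert>

// A function template: by itself it generates no code. Code is produced
// only for the specializations that get instantiated.
template <typename T>
T add(T a, T b) { return a + b; }

// Explicit instantiation definition: requests that add<double> be generated
// for this translation unit, whether or not it is called here.
template double add<double>(double, double);

// Implicit instantiation: calling add with int arguments requires add<int>.
int add_ints() { return add(2, 3); }
```

Which tool actually generates the code for add<int> and add<double>, and when, is the implementation detail the paragraph above describes.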

Solution 5

The skinny is that a CPU loads data from memory addresses, stores data to memory addresses, and executes instructions sequentially out of memory addresses, with some conditional jumps in the sequence of instructions processed. Each of these three categories of instructions involves computing an address to a memory cell to be used in the machine instruction. Because machine instructions are of a variable length depending on the particular instruction involved, and because we string a variable length of them together as we build our machine code, there is a two step process involved in calculating and building any addresses.

First we lay out the allocation of memory as best we can before we can know what exactly goes in each cell. We figure out the bytes, or words, or whatever that form the instructions and literals and any data. We just start allocating memory and building the values that will create the program as we go, and note down any place we need to go back and fix an address. In that place we put a dummy to just pad the location so we can continue to calculate memory size. For example our first machine code might take one cell. The next machine code might take 3 cells, involving one machine code cell and two address cells. Now our address pointer is 4. We know what goes in the machine cell, which is the op code, but we have to wait to calculate what goes in the address cells till we know where that data will be located, i.e. what will be the machine address of that data.
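
The "put a dummy and fix it later" idea can be sketched with a toy assembler. Everything here is hypothetical and simplified: ToyAssembler is an invented class, memory is a vector of int cells, an address takes a single cell (not two as in the example above), and the opcode values are arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy assembler: memory is a vector of cells; addresses are indices into it.
class ToyAssembler {
public:
    // Emit a one-cell instruction.
    void emit(int opcode) { cells_.push_back(opcode); }

    // Emit an opcode followed by one address cell. The address is not known
    // yet, so a dummy 0 is stored and the cell's location is noted for later.
    void emit_with_address(int opcode, const std::string& label) {
        cells_.push_back(opcode);
        fixups_.push_back({cells_.size(), label});
        cells_.push_back(0);  // placeholder that keeps the layout size correct
    }

    // Record that `label` names the next cell to be emitted.
    void define_label(const std::string& label) { labels_[label] = cells_.size(); }

    // Second step: go back and fill in every noted address cell.
    // labels_.at() throws std::out_of_range if a label was never defined.
    std::vector<int> finish() {
        for (const auto& fixup : fixups_)
            cells_[fixup.first] = static_cast<int>(labels_.at(fixup.second));
        return cells_;
    }

private:
    std::vector<int> cells_;
    std::map<std::string, std::size_t> labels_;
    std::vector<std::pair<std::size_t, std::string>> fixups_;
};
```

The placeholder is what lets layout continue before the target address is known; finish() is the "go back and fix" step.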

If there were just one source file, a compiler could theoretically produce fully executable machine code without a linker. In a two pass process it could calculate all of the actual addresses to all of the data cells referenced by any machine load or store instructions. And it could calculate all of the absolute addresses referenced by any absolute jump instructions. This is how simpler compilers, like the one in Forth, work, with no linker.

A linker is something that allows blocks of code to be compiled separately. This can speed up the overall process of building code, and allows some flexibility in how the blocks are later used. In other words, they can be relocated in memory, for example by adding 1000 to every address to scoot the block up by 1000 address cells.

So what the compiler outputs is rough machine code that is not yet fully built, but is laid out so we know the size of everything, in other words so we can start to calculate where all of the absolute addresses will be located. The compiler also outputs a list of symbols, which are name/address pairs. The symbols relate a memory offset in the machine code in the module with a name, the offset being the absolute distance to the memory location of the symbol in the module.

That's where we get to the linker. The linker first slaps all of these blocks of machine code together end to end and notes down where each one starts. Then it calculates the addresses to be fixed by adding together the relative offset within a module and the absolute position of the module in the bigger layout.
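
Here is a toy version of those two linker steps, under the same simplifications as the rest of this answer. Module and link_modules are hypothetical names, cells are plain ints, and each address occupies one cell; real object formats and linkers are considerably more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy compiled module (all names here are invented for the example).
struct Module {
    std::vector<int> cells;                      // machine code with dummy address cells
    std::map<std::string, std::size_t> symbols;  // name -> offset within this module
    std::map<std::size_t, std::string> fixups;   // offset of address cell -> symbol it needs
};

std::vector<int> link_modules(const std::vector<Module>& modules) {
    // Step 1: lay the modules end to end and note where each one starts.
    std::vector<std::size_t> base;
    std::map<std::string, std::size_t> absolute;  // name -> address in the final image
    std::size_t next = 0;
    for (const Module& m : modules) {
        base.push_back(next);
        for (const auto& sym : m.symbols)
            absolute[sym.first] = next + sym.second;  // module start + offset in module
        next += m.cells.size();
    }

    // Step 2: copy the code and patch every address cell that needs fixing.
    // absolute.at() throws std::out_of_range for a symbol no module defines.
    std::vector<int> image;
    for (std::size_t i = 0; i < modules.size(); ++i) {
        const Module& m = modules[i];
        image.insert(image.end(), m.cells.begin(), m.cells.end());
        for (const auto& fix : m.fixups)
            image[base[i] + fix.first] = static_cast<int>(absolute.at(fix.second));
    }
    return image;
}
```

For example, if the first module has three cells and the second defines a symbol at offset 1, that symbol ends up at address 3 + 1 = 4 in the combined image, and every cell that referred to it is patched with 4.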

Obviously I've oversimplified this so you can try to grasp it, and I have deliberately not used the jargon of object files, symbol tables, etc. which to me is part of the confusion.

Author: Tony The Lion

Updated on July 12, 2022

Comments

  • Tony The Lion
    Tony The Lion almost 2 years

    How does the compilation and linking process work?

    (Note: This is meant to be an entry to Stack Overflow's C++ FAQ. If you want to critique the idea of providing an FAQ in this form, then the posting on meta that started all this would be the place to do that. Answers to that question are monitored in the C++ chatroom, where the FAQ idea started out in the first place, so your answer is very likely to get read by those who came up with the idea.)

  • josesuero
    josesuero almost 13 years
    Could you list all 9 phases? That'd be a nice addition to the answer, I think. :)
  • sbi
    sbi almost 13 years
  • AProgrammer
    AProgrammer almost 13 years
    @jalf, just add the template instantiation just before the last phase in the answer pointed to by @sbi. IIRC there are subtle differences in the precise wording in the handling of wide characters, but I don't think they surface in the diagram labels.
  • josesuero
    josesuero almost 13 years
    @sbi yeah, but this is supposed to be the FAQ question, isn't it? So shouldn't this information be available here? ;)
  • AProgrammer
    AProgrammer almost 13 years
    @jalf when I've time I'll try to put up a similar diagram (which would better hilight the multi compilation unit case and the handling of templates). But don't hold your breath, being clear in ASCII art is an art I don't really master.
  • sbi
    sbi almost 13 years
    @jalf: I agree. Why don't you post an answer explaining this in detail?
  • josesuero
    josesuero almost 13 years
    @AProgrammmer: simply listing them by name would be helpful. Then people know what to search for if they want more detail. Anyway, +1'ed your answer in any case :)
  • manav m-n
    manav m-n over 11 years
    The compilation stage also calls assembler before converting to object file.
  • binarysmacker
    binarysmacker over 10 years
    What I'm not understanding is that if the preprocessor manages things such as #includes to create one super file then surely there's nothing to link after that?
  • Elliptical view
    Elliptical view over 10 years
    @binarysmacer See if what I wrote below makes any sense to you. I tried to describe the problem from the inside out.
  • Bart van Heukelom
    Bart van Heukelom almost 10 years
    Where are optimizations applied? On first glance it seems like it would be done in the compilation step, but on the other hand I can imagine that proper optimization can only be done after linking.
  • R. Martinho Fernandes
    R. Martinho Fernandes almost 10 years
    @BartvanHeukelom traditionally it was done during compilation, but modern compilers support the so-called "link-time optimisation" which has the advantage of being able to optimise across translation units.
  • R. Martinho Fernandes
    R. Martinho Fernandes almost 10 years
    To be clear, link-time optimisation doesn't prevent optimisation from being done in the compilation pass. What it does is take advantage of the additional information at link-time to perform more powerful optimisations.
  • Kevin Zhu
    Kevin Zhu almost 10 years
    Does C have same steps?
  • Dan Carter
    Dan Carter over 9 years
    If the linker converts symbols referring to classes/method in libraries into addresses, does that mean that library binaries are stored in memory addresses that the OS keeps constant? I'm just confused as to how the linker would know the exact address of, say, the stdio binary for all target systems. The file path would always be the same, but the exact address can change, right?
  • asgs
    asgs over 8 years
    @DanCarter I'm wondering the same too. Hopefully, those are not runtime memory addresses, unless someone else clarifies it.
  • Karan Joisher
    Karan Joisher almost 8 years
    @binarysmacker It's too late to comment on this, but others might find this useful. youtu.be/D0TazQIkc8Q Basically you include header files, and these header files generally contain only the declarations of variables/functions and not their definitions; the definitions might be present in a separate source file. So the preprocessor is only including declarations and not definitions, and this is where the linker helps. You link the source file that uses the variable/function with the source file that defines them.
  • Second Person Shooter
    Second Person Shooter about 6 years
    The process of compiling followed by linking is called building.
  • uliwitness
    uliwitness over 5 years
    Yes @KevinZhu these steps are the same for C.
  • uliwitness
    uliwitness over 5 years
    @DanCarter It depends on the platform and linker, but in general, a linker only generates relative addresses. That means it might put main() at 0, myFunction() at 100. Then when the operating system actually loads the executable for running, it will load the code at a certain address and then all the addresses are offset by whatever address the executable's code was loaded at. (It just adds a number to it)
  • uliwitness
    uliwitness over 5 years
    @DanCarter However, some platforms have the compiler pre-determine addresses. E.g. on many Unixes, each application has "virtual addresses", that means there is a special part of the CPU, the Memory Management Unit (MMU), that translates all addresses from the "fake" addresses of the executable to the actual addresses in real memory (this is also used for swapping). In that case you usually have a reserved range, e.g. 0...10000, reserved for the operating system, and then your application at addresses 10000+.
  • Second Person Shooter
    Second Person Shooter over 5 years
    From the second paragraph to the third one in section "Linking", does the term "libraries" refer to "static libraries"? I ask this because it is a bit confusing to me as a newbie. Please poke me if you answer my question.
  • Second Person Shooter
    Second Person Shooter almost 3 years
    Sorry for interrupting: "The total process of going from source code files to an executable might better be referred to as a build.", how about the case in which the final output is either a static library or a dynamic library rather than an executable file? Is the term "build" still appropriate?
  • Amal lal T L
    Amal lal T L almost 3 years
    ld: warning: cannot find entry symbol main; defaulting to 0000000000400040 - Error using ld. My code is a helloworld. The process is done in Ubuntu.