Does compiling a program twice produce a bit-for-bit identical binary?

Solution 1

  1. Compile same program with same settings on same machine:

    Although the definitive answer is "it depends", it is reasonable to expect that most compilers will be deterministic most of the time, and that the binaries produced should be identical. Indeed, some version control systems depend on this. Still, there are always exceptions; it is quite possible that some compiler somewhere will decide to insert a timestamp or some such (iirc, Delphi does, for example). Or the build process itself might do that; I've seen makefiles for C programs which set a preprocessor macro to the current timestamp. (I guess that would count as being a different compiler setting, though.)

    Also, be aware that if you statically link the binary, then you are effectively incorporating the state of all relevant libraries on your machine, and any change in any one of those will also affect your binary. So it is not just compiler settings which are relevant.

  2. Compile same program on a different machine with a different CPU.

    Here, all bets are off. Most modern compilers are capable of doing target-specific optimizations; if this option is enabled, then the binaries are likely to differ unless the CPUs are similar (and even then, it's possible). Also, see the above note about static linking: the configuration environment goes far beyond the compiler settings. Unless you have very strict configuration control, it's extremely likely that something differs between the two machines.
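To check either case in practice, one option is to hash both build outputs and, when the hashes differ, look at where the bytes diverge. The following is only a rough sketch: the helper names `sha256_of` and `differing_offsets` are made up for illustration, and it assumes the two binaries are already on disk.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_offsets(path_a, path_b, limit=10):
    """Return up to `limit` byte offsets at which the two files differ.

    If the files have different lengths, the first offset past the end of
    the shorter file is reported as a difference as well.
    """
    data_a = Path(path_a).read_bytes()
    data_b = Path(path_b).read_bytes()
    offsets = [i for i, (x, y) in enumerate(zip(data_a, data_b)) if x != y]
    if len(data_a) != len(data_b):
        offsets.append(min(len(data_a), len(data_b)))
    return offsets[:limit]
```

If `sha256_of` gives the same digest for both builds, they are bit-for-bit identical. If not, the offsets from `differing_offsets` often hint at the cause: a difference confined to a header field suggests a timestamp, while differences scattered through the code section suggest different optimization or library inputs.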

Solution 2

  • -frandom-seed=123 controls some GCC internal randomness. man gcc says:

    This option provides a seed that GCC uses in place of random numbers in generating certain symbol names that have to be different in every compiled file. It is also used to place unique stamps in coverage data files and the object files that produce them. You can use the -frandom-seed option to produce reproducibly identical object files.

  • __FILE__: put the source in a fixed folder (e.g. /tmp/build)

  • for __DATE__, __TIME__, __TIMESTAMP__:
    • libfaketime : https://github.com/wolfcw/libfaketime
    • override those macros with -D
    • -Wdate-time or -Werror=date-time: warn or fail if any of __TIME__, __DATE__ or __TIMESTAMP__ is used. The Linux kernel 4.4 uses it by default.
  • use the D flag with ar, or use https://github.com/nh2/ar-timestamp-wiper/tree/master to wipe stamps
  • -fno-guess-branch-probability: older manual versions say it is a source of non-determinism, but not anymore. Not sure if this is covered by -frandom-seed or not.
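As a rough illustration of how some of the flags above could be combined, here is a small sketch that only builds the command lines and does not run anything. The function names are hypothetical, and using the source file name as the seed is an assumption on my part (any fixed string should do).

```python
def reproducible_gcc_argv(source, output, seed=None):
    """Build a gcc argument list using flags from the list above.

    Nothing is executed here; the list can be handed to subprocess.run().
    """
    if seed is None:
        seed = source  # assumption: a fixed, per-file string is a reasonable seed
    return [
        "gcc",
        "-frandom-seed=" + seed,  # fixed seed instead of GCC's internal random numbers
        "-Werror=date-time",      # fail if __DATE__, __TIME__ or __TIMESTAMP__ is used
        "-c", source,
        "-o", output,
    ]


def deterministic_ar_argv(archive, objects):
    """Build an ar argument list; the D modifier requests deterministic mode."""
    return ["ar", "rcD", archive] + list(objects)
```

For example, `reproducible_gcc_argv("foo.c", "foo.o")` yields a command that compiles `foo.c` with `-frandom-seed=foo.c`. The `__FILE__` point still has to be handled separately, by building from a fixed directory.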

The Debian Reproducible builds project attempts to standardize Debian packages byte-by-byte, and recently got a Linux Foundation grant. That includes more than just compilation, but it should be of interest.

Buildroot has a BR2_REPRODUCIBLE option which may give some ideas on the package level, but it is far from complete at this point.

Solution 3

What you are asking is whether the output is deterministic. If you compiled the program once and immediately compiled it again, you would probably end up with the same output file. However, if anything changed, even slightly, and especially in a component the compiled program uses, then the output of the compiler might also change.

Solution 4

Does recompiling a program produce a bit-for-bit identical binary?

For all compilers? No. The C# compiler, at least, is not allowed to.

Eric Lippert has a very thorough breakdown on why the output of the compiler is not deterministic.

[T]he C# compiler by design never produces the same binary twice. The C# compiler embeds a freshly generated GUID in every assembly, every time you run it, thereby ensuring that no two assemblies are ever bit-for-bit identical. To quote from the CLI specification:

The Mvid column shall index a unique GUID [...] that identifies this instance of the module. [...] The Mvid should be newly generated for every module [...] While the [runtime] itself makes no use of the Mvid, other tools (such as debuggers [...]) rely on the fact that the Mvid almost always differs from one module to another.

Although it's specific to a version of the C# compiler, many points in the article can be applied to any compiler.

First off, we are assuming that we always get the same list of files every time, in the same order. But that's in some cases up to the operating system. When you say "csc *.cs", the order in which the operating system proffers up the list of matching files is an implementation detail of the operating system; the compiler does not sort that list into a canonical order.
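Both sources of variation mentioned here (file order and the freshly generated GUID) can be removed by a build wrapper, as one of the comments on this page suggests. The sketch below is not how any C# compiler actually does it; `canonical_sources` and `content_derived_id` are hypothetical helpers, and the resulting identifier does not set the usual UUID version bits.

```python
import glob
import hashlib
import os
import uuid


def canonical_sources(pattern, directory="."):
    """Return the matching paths sorted, regardless of the order the OS lists them in."""
    return sorted(glob.glob(os.path.join(directory, pattern)))


def content_derived_id(payload):
    """Derive a 128-bit identifier from the bytes it is meant to identify.

    The same payload always gives the same identifier, unlike a freshly
    generated GUID.
    """
    return uuid.UUID(bytes=hashlib.sha256(payload).digest()[:16])
```

Passing `canonical_sources("*.cs")` to the compiler instead of a raw wildcard fixes the input order. An identifier from `content_derived_id` still differs between modules with different contents, which is the property the quoted specification says tools rely on.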

Solution 5

I'd say NO, it is not 100% deterministic. I previously worked with a version of GCC which generates target binaries for the Hitachi H8 processor.

It is not a problem with the time stamp. Even if the time stamp issue is ignored, the specific processor architecture may allow the same instruction to be encoded in two slightly different ways, where some bits can be either 1 or 0. In my experience the generated binaries were the same most of the time, but occasionally GCC would generate binaries of identical size in which some bytes differed by only one bit, e.g. 0xE0 becomes 0xE1.
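To tell this kind of single-bit encoding difference apart from a larger change, one could count how many bits differ between two same-sized outputs. This is just a minimal sketch with a made-up helper name, `differing_bits`, operating on the raw bytes of the two binaries.

```python
def differing_bits(blob_a, blob_b):
    """Count the bits that differ between two equal-length byte strings."""
    if len(blob_a) != len(blob_b):
        raise ValueError("inputs must have the same length")
    # XOR leaves a 1 wherever the two bytes disagree; count those 1s.
    return sum(bin(x ^ y).count("1") for x, y in zip(blob_a, blob_b))
```

For the example above, `differing_bits(b"\xE0", b"\xE1")` returns 1, since only the lowest bit differs.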

David
Author by

David

America.

Updated on September 18, 2022

Comments

  • David
    David over 1 year

    If I were to compile a program into a single binary, make a checksum, and then recompile it on the same machine with the same compiler and compiler settings and checksum the recompiled program, would the checksum fail?

    If so, why is this? If not, would having a different CPU result in a non-identical binary?

    • Admin
      Admin over 10 years
      It depends on the compiler. Some of them embed time stamps, so the answer is "no" for those.
    • Admin
      Admin over 10 years
      Actually it depends on the executable format, not the compiler. Some executable formats, like Windows' PE format, include a timestamp which is set to the compile time and date, while other formats, like Linux's ELF format, do not. Either way, this question hinges on the definition of "identical binary". The image itself will/should be bitwise identical if the same source file is compiled with the same compiler, libraries, switches and everything else, but the header and other metadata can vary.
  • CodesInChaos
    CodesInChaos over 10 years
    It shouldn't be hard to make the build reproducible (apart from a few easily discarded fields like compilation time and the assembly GUID). For example, sorting input files into a canonical order is a one-liner. Even that GUID could be a hash of the remainder of the assembly instead of newly generated.
  • David
    David over 10 years
    Say I was using GCC without the -march option (the option that optimizes the binary for a specific family of CPUs), and I compiled a binary on one CPU and then on another CPU: would there be a difference?
  • rici
    rici over 10 years
    @David: It still depends. First, the libraries you're linking to may have architecture-specific builds. So the output of gcc -c may well be identical, but the linked versions differ. Also, it's not just -march; there is also -mtune/-mcpu and -mfpmath (and possibly others). Some of these may have different defaults on different installations, so you may need to force the worst-possible case for your machines explicitly; doing so might significantly reduce performance, particularly if you revert to i386 without SSE. And, of course, if one of your CPUs is an ARM and the other an i686...
  • David
    David over 10 years
    I assume you mean the Microsoft C# compiler, or is it a requirement of the specification?
  • David
    David over 10 years
    Also, is GCC one of the compilers in question that add a timestamp to binaries?
  • rici
    rici over 10 years
    @david: afaik, no.
  • ta.speot.is
    ta.speot.is over 10 years
    @David The CLI spec requires it. Mono's C# compiler would have to do the same. Ditto for any VB .NET compiler.
  • ack
    ack over 9 years
    Very good point indeed. This article has some very interesting observations. In particular, compilation with GCC may not be deterministic with regards to inputs in certain cases, for instance in how it mangles functions in anonymous namespaces, for which it uses a random number generator internally. To get determinism in this particular case, supply an initial random seed by specifying the option -frandom-seed=string.
  • Shiv
    Shiv about 9 years
    The ECMA standard does not have to have timestamps or MVID differences. Without those, it is at least possible for identical binaries in C#. Thus the main reason is a questionable design decision and not a real technical constraint.
  • Florian Straub
    Florian Straub about 5 years
    And did that lead to different behavior or "serious problems"?
  • Ciro Santilli Путлер Капут 六四事
    Ciro Santilli Путлер Капут 六四事 almost 5 years
    Cool links. I'm a Buildroot fanboy, but if someone gives me a Nix ARM cross arch setup that boots on QEMU, I'll be happy :-)
  • Daniel Labonté
    Daniel Labonté almost 5 years
    I didn't mention Guix because I don't know where to find their numbers, but they were before NixOS on the reproducibility train with verification tooling and such, so I'm sure they're on equal footing or better.