How does Git create unique commit hashes, mainly the first few characters?

Solution 1

Git computes a commit's SHA-1 from the following information:

  • The source tree of the commit (which in turn covers all of its subtrees and blobs)
  • The SHA-1 of each parent commit
  • The author info (with timestamp)
  • The committer info (yes, author and committer are distinct; also with timestamp)
  • The commit message

(For the complete explanation, look here.)
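
The inputs listed above can be sketched in Python. This is a minimal illustration, assuming the usual object layout (a "commit <size>" header, a NUL byte, then the commit text) and ignoring optional headers such as signatures. The names `git_object_id` and `commit_id` and all sample values are made up for this example; they are not Git APIs.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """Hash '<kind> <size>', a NUL byte, then the body."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def commit_id(tree, parents, author, committer, message):
    """Assemble the text of a commit object from its parts and hash it."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # zero, one, or several parents
    lines.append(f"author {author}")           # "Name <email> <unix time> <tz>"
    lines.append(f"committer {committer}")
    body = "\n".join(lines) + "\n\n" + message
    return git_object_id("commit", body.encode("utf-8"))


# Hypothetical sample values, for illustration only.
EMPTY_TREE = git_object_id("tree", b"")
who = "A U Thor <author@example.com> 1500000000 +0000"
first = commit_id(EMPTY_TREE, [], who, who, "Initial commit\n")
reworded = commit_id(EMPTY_TREE, [], who, who, "Initial commit \n")
print(first[:7], reworded[:7])  # the two abbreviations differ
```

Every part of the hash comes out of the same computation, so the leading characters get no special treatment. Changing any input, even by a single space or a one-second timestamp difference, produces an unrelated ID.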

Git does NOT guarantee that the first 4 characters will be unique. In chapter 7 of the Pro Git Book it is written:

Git can figure out a short, unique abbreviation for your SHA-1 values. If you pass --abbrev-commit to the git log command, the output will use shorter values but keep them unique; it defaults to using seven characters but makes them longer if necessary to keep the SHA-1 unambiguous:

So Git just makes the abbreviation as long as necessary to remain unique. They even note that:

Generally, eight to ten characters are more than enough to be unique within a project.

As an example, the Linux kernel, which is a pretty large project with over 450k commits and 3.6 million objects, has no two objects whose SHA-1s overlap more than the first 11 characters.
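
The "make it longer if necessary" behaviour can be sketched as follows. This is only an illustration of the idea, not Git's actual implementation; `shortest_unique_abbrev` and the sample IDs are hypothetical.

```python
def shortest_unique_abbrev(target: str, all_ids, min_len: int = 7) -> str:
    """Return the shortest prefix of target (at least min_len characters)
    that no other ID in all_ids starts with."""
    others = [oid for oid in all_ids if oid != target]
    for n in range(min_len, len(target) + 1):
        prefix = target[:n]
        if not any(other.startswith(prefix) for other in others):
            return prefix
    return target  # only reached if min_len exceeds the ID length


# Made-up, shortened IDs for illustration.
ids = ["abcdef1234", "abcdef1299", "ffff000000"]
print(shortest_unique_abbrev("abcdef1234", ids, min_len=4))
print(shortest_unique_abbrev("ffff000000", ids, min_len=4))
```

The first ID shares eight leading characters with its neighbour, so its abbreviation has to grow to nine; the third is already unambiguous at the minimum length.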

So in fact Git simply relies on how improbable it is for two objects to share the exact same SHA-1, or the same first X characters of one.
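
How improbable this is can be estimated with the birthday approximation, p ≈ 1 - exp(-n(n-1) / (2 * 16^k)), for n objects and k hex characters. The sketch below assumes hash values behave as if they were uniformly random; the function name is illustrative.

```python
import math


def prefix_collision_probability(n_objects: int, hex_chars: int) -> float:
    """Approximate chance that at least two of n_objects share the same
    first hex_chars hex digits, assuming uniformly random hashes."""
    space = 16 ** hex_chars                  # number of possible prefixes
    pairs = n_objects * (n_objects - 1) / 2  # number of object pairs
    return -math.expm1(-pairs / space)       # 1 - exp(-pairs / space)


for chars in (4, 7, 11, 40):
    p = prefix_collision_probability(3_600_000, chars)
    print(f"{chars:>2} hex chars, 3.6 million objects: {p:.3g}")
```

With only 4 characters there are 65536 possible prefixes, so a clash somewhere in a repository of even a thousand objects is almost certain; at the full 40 characters the estimate is vanishingly small.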

Solution 2

Apr. 2017: Beware that after the whole shattered.io episode (where Google achieved a SHA1 collision), the 20-byte format won't be around forever.

A first step toward that is to replace unsigned char sha1[20], which is hard-coded all over the Git codebase, with a generic object whose definition might change in the future (SHA2?, Blake2, ...).

See commit e86ab2c (21 Feb 2017) by brian m. carlson (bk2204).

Convert the remaining uses of unsigned char [20] to struct object_id.

That is an example of an ongoing effort started with commit 5f7817c (13 Mar 2015) by brian m. carlson (bk2204), for v2.5.0-rc0, in cache.h:

/* The length in bytes and in hex digits of an object name (SHA-1 value). */
#define GIT_SHA1_RAWSZ 20
#define GIT_SHA1_HEXSZ (2 * GIT_SHA1_RAWSZ)

struct object_id {
    unsigned char hash[GIT_SHA1_RAWSZ];
};

And don't forget that, even with SHA1, the first 4 characters are no longer enough to guarantee uniqueness, as I explain in "How much of a git sha is generally considered necessary to uniquely identify a change in a given codebase?".


Update Dec. 2017 with Git 2.16 (Q1 2018): this effort to support an alternative SHA is underway: see "Why doesn't Git use more modern SHA?".

You will be able to use another hash: SHA1 is no longer the only one for Git.

Update 2018-2019: the choice has been made in Git 2.19+: SHA-256.
See "hash-function-transition".

This is not yet active (meaning Git 2.21 still uses SHA1), but the code is being written to support SHA-256 in the future.
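
To see what a different hash changes in practice, here is a minimal Python sketch. It assumes the object name is computed over the same header-plus-content bytes whichever algorithm is selected; `object_id` and its `algo` parameter are illustrative names, not Git APIs.

```python
import hashlib


def object_id(kind: str, body: bytes, algo: str = "sha1") -> str:
    """Hash '<kind> <size>', a NUL byte, then the body, with the chosen algorithm."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.new(algo, header + body).hexdigest()


for algo in ("sha1", "sha256"):
    oid = object_id("blob", b"hello\n", algo)
    rawsz = hashlib.new(algo).digest_size  # raw size in bytes
    print(f"{algo}: {rawsz} raw bytes, {len(oid)} hex digits")
```

The raw size grows from 20 to 32 bytes, which is why a fixed unsigned char [20] could not simply stay in place.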


With Git 2.26 (Q1 2020), the work goes on, replacing uses of "char *sha1" with "struct object_id".

See commit 2fecc48, commit 6ac9760, commit b99b6bc, commit 63f4a7f, commit e31c710, commit 500e4f2, commit f66d4e0, commit a93c141, commit 3f83fd5, commit 0763671 (24 Feb 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit e8e7184, 05 Mar 2020)

packfile: drop nth_packed_object_sha1()

Signed-off-by: Jeff King

Once upon a time, nth_packed_object_sha1() was the primary way to get the oid of a packfile's index position.
But these days we have the more type-safe nth_packed_object_id() wrapper, and all callers have been converted.

Let's drop the "sha1" version (turning the safer wrapper into a single function) so that nobody is tempted to introduce new callers.


With Git 2.29 (Q4 2020), the "sha1 to oid" rename continues.

See commit a46d1f7, commit fb07bd4, commit cfaf9f0, commit ef2d554, commit 962dd7e, commit 8f7e3de, commit b1f1ade (27 Sep 2020) by Martin Ågren (none).
(Merged by Junio C Hamano -- gitster -- in commit 07601b5, 05 Oct 2020)

wt-status: replace sha1 mentions with oid

Signed-off-by: Martin Ågren

abbrev_sha1_in_line() uses a struct object_id oid and should be fully prepared to handle non-SHA1 object ids. Rename it to abbrev_oid_in_line().

A few comments in wt_status_get_detached_from() mention "sha1". The variable they refer to was renamed in e86ab2c1cd ("wt-status: convert to struct object_id", 2017-02-21, Git v2.13.0-rc0). Update the comments to reference "oid" instead.


Author: Ben

Updated on July 09, 2022

Comments

  • Ben
    Ben almost 2 years

    I find it hard to wrap my head around how Git creates fully unique hashes that aren't allowed to be the same even in the first 4 characters. I'm able to call commits in Git Bash using only the first four characters. Is it specifically decided in the algorithm that the first characters are "ultra"-unique and will not ever conflict with other similar hashes, or does the algorithm generate every part of the hash in the same way?

    • Maroun
      Maroun over 8 years
      [a-z], [A-Z] and [0-9] are possible values for characters. You have 72 * 72 * 72 *72 unique options for 4 characters. If you have more than 26873856 commits, maybe you should think about your project again (if you'll be alive).
    • Ben
      Ben over 8 years
      But how does the algorithm make sure that there will never be a commit hash with the first 5 characters the same as another?
    • axiac
      axiac over 8 years
      @MarounMaroun 26+26+10 != 72 :-) but your point is right.
    • axiac
      axiac over 8 years
      @Ben there is no guarantee that the first 4 or 5 or however many characters you prefer produce a unique sequence. It is just an observation that, for most repositories, 5 or 6 characters are enough to uniquely identify an object in the repository (be it a commit or another internal object). Larger repositories need 7 characters (but they contain millions of objects).
    • axiac
      axiac over 8 years
      It happens that an older similar question is popular at this moment. Read it and its answers for explanation.
    • jub0bs
      jub0bs over 8 years
      This may be of interest: stackoverflow.com/questions/32405922/…
    • VonC
      VonC over 6 years
      As I mention in my answer below and in more detail with "Why doesn't Git use more modern SHA?", SHA1 will soon be only one of the possible hashes to use with Git.
    • Magnus
      Magnus over 5 years
      @Maroun Git hashes are all lowercase and only use [a-f] and [0-9], and the OP spoke of 4 characters, not 5, so it's really more like 16 * 16 * 16 * 16, which is 65536. Collisions in the first 4 chars are uncommon but certainly possible, in which case you just specify more chars.
  • Ben
    Ben over 8 years
    So in theory: If there aren't enough characters to make the non-abbreviated hash unique, it makes every hash longer? I assume the hash is stored under a unique ID and the hash is only used for calling convenience.
  • Chris Maes
    Chris Maes over 8 years
    The real SHA is 20 bytes long. git log --abbrev-commit will just abbreviate this SHA to its first X characters. So in many cases the first 4 characters will suffice to identify the correct commit, but if not, Git will show the first 5, 6, 7, ... characters of the SHA.
  • Raymond Chen
    Raymond Chen over 8 years
    @Ben If two commits have the same hash, then Git breaks down. Git relies on the collision resistance of SHA-1. In practice, if you encounter such a collision (which I believe has never occurred outside of intentional attempts to create collisions), you can just add a space to the end of the commit message, which will generate a new hash.
  • Chris Maes
    Chris Maes over 8 years
    @RaymondChen Maybe adding a space isn't even necessary, since the timestamp of the commit is also used in the generation of the hash. Just creating the same commit again, but a little later, will suffice.
  • PHcoDer
    PHcoDer about 2 years
    @ChrisMaes, since the commit message is also used to generate the commit hash, what will happen to the commit id when someone edits the commit message (.git/COMMIT_EDITMSG) before pushing to remote?
  • PHcoDer
    PHcoDer about 2 years
    @ChrisMaes, and the same question as above when we amend (--amend) the commit.
  • Chris Maes
    Chris Maes about 2 years
    @PHcoDer as you might expect from my answer: the commit hash will change in both cases.
  • PHcoDer
    PHcoDer about 2 years
    @ChrisMaes, yes it should. But I don't see that. What will trigger that change? I edited the file .git/COMMIT_EDITMSG saved it and the hash didn't change. So, it should be changing at the time of push? Or next amend to the commit?
  • Chris Maes
    Chris Maes about 2 years
    @PHcoDer I don't see in what case you should edit the .git/COMMIT_EDITMSG file. If you run git commit --amend and you edit the message, then you will see that the hash changes.