What does depth for git clone mean?

Solution 1

As Jonathon Reinhart commented, you're seeing the effect of merges.

The --depth parameter refers to how deep Git goes on a "walk" from each starting point. As the documentation you quoted mentions, it also implies --single-branch, which simplifies talking about this. The important point here is that the walk visits all parents of each commit, which—for each depth level—is more than one commit if the commit itself is a merge.

Suppose we have a commit graph that looks like this:

$ git log --graph --oneline master
* cf68824 profile: fix PATH with GOPATH
* 7c2376b profile: add Ruby gem support
* 95c8270 profile: set GOPATH
* 26a9cc3 vimrc: fiddle with netrw directory display
* 80b88a5 add ruby gems directory to path
[snip]

Here, each commit has just one parent. If we use --depth 3 we'll pick up the tip commit cf68824, its parent 7c2376b at depth 2, and finally 95c8270 at depth 3—and then we stop, with three commits.

With the Git repository for Git, however:

$ git log --graph --oneline master
*   965798d1f2 Merge branch 'es/format-patch-range-diff-fix-fix'
|\  
| * ac0edf1f46 range-diff: always pass at least minimal diff options
* |   5335669531 Merge branch 'en/rebase-consistency'
|\ \  
| * | 6fcbad87d4 rebase docs: fix incorrect format of the section Behavioral Differences
* | | 7e75a63d74 RelNotes 2.20: drop spurious double quote
* | | 7a49e44465 RelNotes 2.20: clarify sentence
[snip]

With --depth 3, we start with 965798d1f2, then—for depth 2—pick up both parents, ac0edf1f46 and 5335669531. To add the depth-3 commits, we pick up all the parents of those two commits. The (lone) parent of ac0edf1f46 is not visible here, while the two parents of 5335669531 are (namely 6fcbad87d4 and 7e75a63d74). To get the hash IDs of the parents of ac0edf1f46 we can use:

$ git rev-parse ac0edf1f46^@
d8981c3f885ceaddfec0e545b0f995b96e5ec58f

so that gives us our six commits: the tip of master (which is currently a merge commit), the two parents of that commit, the one parent of one of those parents, and the two parents of the other.

Depending on precisely when you ran the clone of Git, the tip of master may or may not be a merge. When it is a merge, as above, --depth 2 gets you 3 commits, and --depth 3 typically gets at least 5, depending on whether the two parents of the tip are themselves merges. When the tip is an ordinary commit whose immediate parent is a merge, which is also common, --depth 2 gets 2 commits and --depth 3 gets 4.

(Compare the above git rev-parse output with:

$ git rev-parse 965798d1f2^@
5335669531d83d7d6c905bcfca9b5f8e182dc4d4
ac0edf1f46fcf9b9f6f1156e555bdf740cd56c5f

for instance. The ^@ suffix means all parents of the commit, but not the commit itself.)
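The walk described above can be reproduced on a throwaway repository. The following is only a sketch under stated assumptions: the branch names trunk and side, the demo identity, and the helper count_at_depth are all made up for illustration, and the clone goes through a file:// URL because Git ignores --depth for plain local-path clones.

```shell
#!/bin/sh
# Sketch: build a tiny repository whose tip is a merge, then count what
# --depth 1, 2 and 3 actually fetch. All names here are hypothetical.
set -eu

work=$(mktemp -d)
src="$work/src"

git init -q "$src"
cd "$src"
git config user.name "Demo"
git config user.email "demo@example.invalid"
# Name the initial branch explicitly so the demo does not depend on
# the configured default branch name.
git symbolic-ref HEAD refs/heads/trunk

# History:  A --- B --- M   (trunk, M is a merge)
#            \         /
#             C ------      (side)
git commit -q --allow-empty -m "A: root"
git checkout -q -b side
git commit -q --allow-empty -m "C: work on side"
git checkout -q trunk
git commit -q --allow-empty -m "B: work on trunk"
git merge -q --no-ff -m "M: merge side" side

# Clone at the given depth and print how many commits are reachable.
count_at_depth() {
    dest="$work/clone-$1"
    git clone -q --depth "$1" "file://$src" "$dest"
    git -C "$dest" rev-list --count HEAD
}

n1=$(count_at_depth 1)
n2=$(count_at_depth 2)
n3=$(count_at_depth 3)
printf 'depth 1: %s commits\n' "$n1"
printf 'depth 2: %s commits\n' "$n2"
printf 'depth 3: %s commits\n' "$n3"

# Clean up the scratch directory.
cd /
rm -rf "$work"
```

With this history, depth 1 yields 1 commit (M), depth 2 yields 3 (M plus both parents B and C), and depth 3 yields 4 (adding the shared root A), so the commit count tracks the shape of the graph rather than the depth value itself.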

Solution 2

--depth limits how many commits back from the branch tip Git fetches when you clone.

By default, Git downloads the entire history of all branches, so your copy contains every commit and you can switch to (check out) any commit you wish.

Adding --depth limits the size of your clone: only the last X commits are fetched and available to check out.

# Clone a single branch and limit its history to the last X commits
git clone --branch <...> --depth=<X> <repository-url>

"How does the value for depth correspond to the actual amount of data downloaded?"

With --depth, Git downloads only the content belonging to the commits in the given range, so the size of the repository grows as the value gets larger.

"This would indicate that <depth> will equal the number of commits that will be fetched during the clone"

Not always. If any of those commits is a merge (for example, one made without fast-forward), you will get more than X commits.


How to clean out your binaries:

"Rewriting git's history just to get rid of them seems like too much trouble"

This tool can do it for you:

https://rtyley.github.io/bfg-repo-cleaner

BFG Repo-Cleaner: an alternative to git-filter-branch

The BFG is a simpler, faster alternative to git-filter-branch for cleansing bad data out of your Git repository history:

  • Removing Crazy Big Files
  • Removing Passwords, Credentials & other Private data

Examples (from the official site). In all these examples, bfg is an alias for java -jar bfg.jar.

# Delete all files named 'id_rsa' or 'id_dsa' :
bfg --delete-files id_{dsa,rsa}  my-repo.git
Author: Machta

Updated on August 03, 2022

Comments

  • Machta
    Machta over 1 year

    We tried to speed up the CI build of one of our software projects at work. Somebody committed some huge (by git's standards) binaries early in the project's life. Rewriting git's history just to get rid of them seems like too much trouble, so we figured doing a shallow clone that avoided those big early commits would be good enough.

    I did some experiments with --depth parameter for clone and encountered some weird behavior. This is what help for git clone says about it:

    --depth <depth>
               Create a shallow clone with a history truncated to the specified number of commits. Implies
               --single-branch unless --no-single-branch is given to fetch the histories near the tips of all
               branches. If you want to clone submodules shallowly, also pass --shallow-submodules.
    

    This would indicate that <depth> will equal the number of commits that will be fetched during the clone, but it's not the case. This is what I got when I tried different values for depth:

    | depth   | commit count linux repo | commit count git repo |
    |---------|-------------------------|-----------------------|
    | 1       | 1                       | 1                     |
    | 5       | 15                      | 13                    |
    | 10      | 80                      | 46                    |
    | 100     | 93133                   | 39552                 |
    | 1000    | 788718                  | 53880                 |
    

    For cloning I used commands such as git clone --depth 10 https://github.com/torvalds/linux.git and git clone --depth 100 https://github.com/git/git.git, and for counting the commits I used git log --oneline | wc -l. (At work I observed the same thing with a GitLab server, so it can't be an artifact of how GitHub works.)

    Does anybody know what is going on? How does the value for depth correspond to the actual amount of data downloaded? Do I understand the documentation wrongly, or is there a bug?

    EDIT: I added results for a second repo

    • Seng Cheong
      Seng Cheong over 5 years
      I imagine that merge commits could affect what you're seeing.
  • Charlie Fish
    Charlie Fish over 5 years
    Not sure this really answers the OP’s question. The OP gave examples of how the depth is not what it’s expected to be. The examples the OP gave don’t match what you said about the depth command.
  • Machta
    Machta over 5 years
    That's nice, but it doesn't answer the question. I know there are tools for this (I even used this tool before). But unless there is a way to do this without rewriting the history, it's a no go for me.
  • Machta
    Machta over 5 years
    BTW this might be the single worst thing about git. Once you push, it's there forever. Unless you don't want people screaming at you that is... (Yes, it has happened before. OK, it was a little less dramatic :) This is a huge gotcha for git noobs. I would like a warning when you git add -A a binary that's bigger than say 100kB.
  • CodeWizard
    CodeWizard over 5 years
    You can use Git hooks to do it. GitHub has been warning you about this for the past few years.
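As a rough illustration of the hook idea from the last comment, a pre-commit hook along these lines could reject large staged files. This is a sketch, not a tested production hook: the function name oversized_staged_files and the limit variable are made up here, the 100 kB threshold is the figure suggested in the comment above, and file names with unusual characters (which Git quotes in its output) are not handled.

```shell
#!/bin/sh
# Hypothetical pre-commit hook: save as .git/hooks/pre-commit and make it
# executable. Blocks the commit when a staged file exceeds the size limit.
limit=102400   # 100 kB, an assumed threshold; adjust to taste

# Print "name (N bytes)" for each staged (added or modified) file whose
# staged content is larger than $1 bytes.
oversized_staged_files() {
    git diff --cached --name-only --diff-filter=AM |
    while IFS= read -r path; do
        # ":path" names the version of the file stored in the index.
        size=$(git cat-file -s ":$path")
        if [ "$size" -gt "$1" ]; then
            printf '%s (%s bytes)\n' "$path" "$size"
        fi
    done
}

big=$(oversized_staged_files "$limit")
if [ -n "$big" ]; then
    echo "Refusing to commit: staged files larger than $limit bytes:" >&2
    echo "$big" >&2
    exit 1
fi
```

Because hooks are local to each clone and are not copied by git clone, every contributor would have to install this themselves. It can also be bypassed with git commit --no-verify, so it works as a warning, not an enforcement mechanism.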