How and/or why is merging in Git better than in SVN?

Solution 1

The claim that merging is better in a DVCS than in Subversion was largely based on how branching and merging worked in Subversion a while ago. Subversion prior to 1.5.0 didn't store any information about when branches were merged, so whenever you wanted to merge you had to specify which range of revisions had to be merged.

So why did Subversion merges suck?

Ponder this example:

      1   2   4     6     8
trunk o-->o-->o---->o---->o
       \
        \   3     5     7
b1       +->o---->o---->o

When we want to merge b1's changes into the trunk we'd issue the following command from a folder that has trunk checked out:

svn merge -r 2:7 {link to branch b1}

… which will attempt to merge the changes from b1 into your local working directory. You then commit the changes after you have resolved any conflicts and tested the result. After the commit, the revision tree looks like this:

      1   2   4     6     8   9
trunk o-->o-->o---->o---->o-->o      "the merge commit is at r9"
       \
        \   3     5     7
b1       +->o---->o---->o

However, this way of specifying revision ranges quickly gets out of hand as the version tree grows, because Subversion didn't have any metadata on when and which revisions got merged together. Ponder what happens later:

           12        14
trunk  …-->o-------->o
                                     "Okay, so when did we merge last time?"
              13        15
b1     …----->o-------->o

This is largely an issue with Subversion's repository design: in order to create a branch you need to create a new virtual directory in the repository, which will house a copy of the trunk, but the repository doesn't store any information about when and what got merged back in. That leads to nasty merge conflicts at times. What was even worse is that Subversion used two-way merging by default, which has some crippling limitations for automatic merging because the two branch heads are not compared with their common ancestor.

To mitigate this, Subversion now stores metadata for branches and merges. That would solve all problems, right?

And oh, by the way, Subversion still sucks…

On a centralized system like Subversion, virtual directories suck. Why? Because everyone has access to view them… even the garbage experimental ones. Branching is good if you want to experiment, but you don't want to see everyone's and their aunt's experiments. This is serious cognitive noise. The more branches you add, the more crap you'll get to see.

The more public branches you have in a repository, the harder it is to keep track of them all. So the question you'll have is whether a branch is still in development or really dead, which is hard to tell in any centralized version control system.

Most of the time, from what I've seen, an organization will default to using one big branch anyway. That's a shame, because it in turn makes it difficult to keep track of testing and release versions, and whatever else good comes from branching.

So why are DVCS, such as Git, Mercurial and Bazaar, better than Subversion at branching and merging?

There is a very simple reason why: branching is a first-class concept. There are no virtual directories by design, and branches are hard objects in a DVCS, which they need to be in order for synchronization between repositories (i.e. push and pull) to work simply.

The first thing you do when you work with a DVCS is to clone repositories (git's clone, hg's clone and bzr's branch). Cloning is conceptually the same thing as creating a branch in version control. Some call this forking or branching (although the latter is often also used to refer to co-located branches), but it's just the same thing. Every user runs their own repository, which means you have per-user branching going on.

The version structure is not a tree but a graph. More specifically, it is a directed acyclic graph (DAG, meaning a graph that doesn't have any cycles). You really don't need to delve into the specifics of a DAG other than that each commit has one or more parent references (which are what the commit was based on). Because of this, the following graphs show the arrows between revisions in reverse.
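
If it helps to see the parent-reference idea as code, here is a minimal Python sketch. It is only an illustration: the `Commit` class, the `ancestors` helper and the one-letter ids are made up for this answer, and real DVCSs use content hashes and far more compact storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """One node in the history graph; `parents` lists the ids it was based on."""
    id: str
    parents: tuple = ()


def ancestors(repo, commit_id):
    """Return every commit id reachable from `commit_id` via parent links (itself included)."""
    seen = set()
    stack = [commit_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(repo[current].parents)
    return seen


# The three-revision history used in the diagrams: a <- b <- c
repo = {
    "a": Commit("a"),
    "b": Commit("b", ("a",)),
    "c": Commit("c", ("b",)),
}
history_of_c = ancestors(repo, "c")
```

Note that the links only point backwards (child to parent), which is why the arrows in the diagrams are reversed.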

A very simple example of merging would be this; imagine a central repository called origin and a user, Alice, cloning the repository to her machine.

         a…   b…   c…
origin   o<---o<---o
                   ^master
         |
         | clone
         v

         a…   b…   c…
alice    o<---o<---o
                   ^master
                   ^origin/master

What happens during a clone is that every revision is copied to Alice exactly as it was (which is validated by the uniquely identifiable hash IDs), and her repository marks where the origin's branches are.

Alice then works on her repo, committing in her own repository, and decides to push her changes:

         a…   b…   c…
origin   o<---o<---o
                   ^ master

              "what'll happen after a push?"


         a…   b…   c…   d…   e…
alice    o<---o<---o<---o<---o
                             ^master
                   ^origin/master

The solution is rather simple: the only thing the origin repository needs to do is take in all the new revisions and move its branch to the newest revision (which git calls a "fast-forward"):

         a…   b…   c…   d…   e…
origin   o<---o<---o<---o<---o
                             ^ master

         a…   b…   c…   d…   e…
alice    o<---o<---o<---o<---o
                             ^master
                             ^origin/master

The use case illustrated above doesn't even need to merge anything. So the issue really isn't with merge algorithms, since the three-way merge algorithm is pretty much the same across all version control systems. The issue is more about structure than anything.

So how about you show me an example that has a real merge?

Admittedly the above example is a very simple use case, so let's do a much more twisted one, albeit a more common one. Remember that origin started out with three revisions? Well, the guy who did them, let's call him Bob, has been working on his own and made a commit in his own repository:

         a…   b…   c…   f…
bob      o<---o<---o<---o
                        ^ master
                   ^ origin/master

                   "can Bob push his changes?" 

         a…   b…   c…   d…   e…
origin   o<---o<---o<---o<---o
                             ^ master

Now Bob can't push his changes directly to the origin repository. The system detects this by checking whether Bob's revisions descend directly from origin's, which in this case they don't. Any attempt to push will result in the system saying something akin to "Uh... I'm afraid I can't let you do that, Bob."

So Bob has to pull in and then merge the changes (with git's pull; or hg's pull and merge; or bzr's merge). This is a two-step process. First Bob has to fetch the new revisions, which copies them as they are from the origin repository. We can now see that the graph diverges:
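
The "does it descend from origin's tip?" check is easy to sketch. The Python below is a rough illustration under my own assumptions (history as a plain dict of commit id to parent ids, and made-up helper names `is_ancestor` and `can_fast_forward`); it is not how any particular tool implements it.

```python
def is_ancestor(parents, ancestor, descendant):
    """True if `ancestor` can be reached from `descendant` by following parent links."""
    stack = [descendant]
    seen = set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return False


def can_fast_forward(parents, remote_tip, pushed_tip):
    """A push is a plain fast-forward when the remote tip is already part of the pushed history."""
    return is_ancestor(parents, remote_tip, pushed_tip)


# Commit ids follow the diagrams: Alice added d and e on top of c, Bob added f on top of c.
parents = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"], "e": ["d"], "f": ["c"]}

alice_ok = can_fast_forward(parents, remote_tip="c", pushed_tip="e")  # origin was at c
bob_ok = can_fast_forward(parents, remote_tip="e", pushed_tip="f")    # origin is now at e
```

Alice's push passes the check because c is in the history of e; Bob's is refused because e is not in the history of f.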

                        v master
         a…   b…   c…   f…
bob      o<---o<---o<---o
                   ^
                   |    d…   e…
                   +----o<---o
                             ^ origin/master

         a…   b…   c…   d…   e…
origin   o<---o<---o<---o<---o
                             ^ master

The second step of the pull process is to merge the diverging tips and make a commit of the result:

                                 v master
         a…   b…   c…   f…       1…
bob      o<---o<---o<---o<-------o
                   ^             |
                   |    d…   e…  |
                   +----o<---o<--+
                             ^ origin/master

Hopefully the merge won't run into conflicts (if you anticipate them you can do the two steps manually in git with fetch and merge). What needs to be done next is to push those changes to origin again, which will result in a fast-forward since the merge commit is a direct descendant of the latest revision in the origin repository:

                                 v origin/master
                                 v master
         a…   b…   c…   f…       1…
bob      o<---o<---o<---o<-------o
                   ^             |
                   |    d…   e…  |
                   +----o<---o<--+

                                 v master
         a…   b…   c…   f…       1…
origin   o<---o<---o<---o<-------o
                   ^             |
                   |    d…   e…  |
                   +----o<---o<--+
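
The whole Bob scenario can be replayed in a few lines of Python. Again this is just a sketch with assumptions of mine: history is a dict of commit id to parent ids, `ancestors` and `merge_bases` are hypothetical helpers, and the merge commit drawn as "1…" is called "m" here.

```python
def ancestors(parents, commit):
    """All commits reachable from `commit` via parent links, itself included."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def merge_bases(parents, left, right):
    """Common ancestors of both tips that are not themselves ancestors of another common ancestor."""
    common = ancestors(parents, left) & ancestors(parents, right)
    return {
        c for c in common
        if not any(c != other and c in ancestors(parents, other) for other in common)
    }


# Bob's repository after fetching: f and e both descend from c.
parents = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"], "e": ["d"], "f": ["c"]}
base = merge_bases(parents, "f", "e")

# The merge commit records BOTH tips as its parents.
parents["m"] = ["f", "e"]

# origin's master is at e, and e is in the history of m, so pushing m is a fast-forward.
push_is_fast_forward = "e" in ancestors(parents, "m")
```

Because the merge commit keeps both parents, the common ancestor for any later merge can be computed from the graph alone, with no revision ranges typed in by hand.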

There is another option besides merging in git and hg, called rebase, which moves Bob's changes to after the newest changes. Since I don't want this answer to be any more verbose, I'll let you read the git, mercurial or bazaar docs about that instead.

As an exercise for the reader, try drawing out how it works with another user involved. It is done in much the same way as the example above with Bob. Merging between repositories is easier than you'd think because all the revisions/commits are uniquely identifiable.

There is also the issue of sending patches between developers, which was a huge problem in Subversion and is mitigated in git, hg and bzr by uniquely identifiable revisions. Once someone has merged their changes (i.e. made a merge commit) and sent the result for everyone else on the team to consume, either by pushing to a central repository or by sending patches, the others don't have to worry about the merge, because it has already happened. Martin Fowler calls this way of working promiscuous integration.

Because the structure is different from Subversion's, employing a DAG instead, branching and merging can be done in an easier manner, not only for the system but for the user as well.

Solution 2

Historically, Subversion has only been able to perform a straight two-way merge because it didn't store any merge information. This involves taking a set of changes and applying them to a tree. Even with merge information, this is still the most commonly used merge strategy.

Git uses a 3-way merge algorithm by default, which involves finding a common ancestor to the heads being merged and making use of the knowledge that exists on both sides of the merge. This allows Git to be more intelligent in avoiding conflicts.
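
To show why the common ancestor matters, here is a deliberately simplified Python sketch of the three-way decision rule. It assumes all three versions have the same number of lines and that edits only change lines in place; real merge tools first align the files with a diff. The function name `three_way_merge` and the sample lines are made up for illustration.

```python
def three_way_merge(base, ours, theirs):
    """Merge line by line, using `base` to tell which side actually changed a line.

    Returns (merged_lines, conflict_indexes); conflicting lines are stored as None.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")
    merged, conflicts = [], []
    for index, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:        # both sides agree (changed identically or not at all)
            merged.append(o)
        elif o == b:      # only their side changed this line
            merged.append(t)
        elif t == b:      # only our side changed this line
            merged.append(o)
        else:             # both sides changed it differently
            merged.append(None)
            conflicts.append(index)
    return merged, conflicts


base = ["x = 1", "y = 2", "z = 3"]
ours = ["x = 10", "y = 2", "z = 3"]    # we changed the first line
theirs = ["x = 1", "y = 2", "z = 30"]  # they changed the last line

merged, conflicts = three_way_merge(base, ours, theirs)
```

With only `ours` and `theirs` to compare, the first and last lines simply differ and there is no way to tell which side made the change; with `base` available both edits can be applied without a conflict.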

Git also has some sophisticated rename finding code, which also helps. It doesn't store changesets or store any tracking information -- it just stores the state of the files at each commit and uses heuristics to locate renames and code movements as required (the on-disk storage is more complicated than this, but the interface it presents to the logic layer exposes no tracking).

Solution 3

Put simply, the merge implementation is done better in Git than in SVN. Before 1.5, SVN did not record a merge action, so it was incapable of doing future merges without help from the user, who needed to provide the information that SVN did not record. With 1.5 it got better, and indeed the SVN storage model is slightly more capable than Git's DAG. But SVN stored the merge information in a rather convoluted form that makes merges take massively more time than in Git; I've observed factors of 300 in execution time.

Also, SVN claims to track renames to aid merges of moved files. But actually it still stores them as a copy and a separate delete action, and the merge algorithm still stumbles over them in modify/rename situations, that is, where a file is modified on one branch and renamed on the other, and those branches are to be merged. Such situations will still produce spurious merge conflicts, and in the case of directory renames it even leads to silent loss of modifications. (The SVN people then tend to point out that the modifications are still in the history, but that doesn't help much when they aren't in a merge result where they should appear.)

Git, on the other hand, does not even track renames but figures them out after the fact (at merge time), and does so pretty magically.
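
A toy version of "figure out renames after the fact" could look like the Python sketch below. This is not Git's actual algorithm: the `detect_renames` function, the 0.5 threshold and the use of `difflib` similarity are assumptions I picked for illustration. The idea is simply to compare two snapshots and pair each vanished path with the most similar new path.

```python
from difflib import SequenceMatcher


def detect_renames(old_tree, new_tree, threshold=0.5):
    """Guess renames between two snapshots given as {path: file_content} dicts."""
    deleted = [path for path in old_tree if path not in new_tree]
    added = [path for path in new_tree if path not in old_tree]
    renames = {}
    for old_path in deleted:
        best_path, best_score = None, threshold
        for new_path in added:
            score = SequenceMatcher(None, old_tree[old_path], new_tree[new_path]).ratio()
            if score >= best_score:
                best_path, best_score = new_path, score
        if best_path is not None:
            renames[old_path] = best_path
            added.remove(best_path)  # each new path can explain only one rename
    return renames


old_tree = {
    "util.py": "def add(a, b):\n    return a + b\n",
    "README": "docs\n",
}
new_tree = {
    "helpers.py": "def add(a, b):\n    return a + b  # sum\n",
    "README": "docs\n",
    "notes.txt": "todo\n",
}
renames = detect_renames(old_tree, new_tree)
```

Nothing about the rename was recorded at commit time; it is inferred purely from the two snapshots, which is why a file that was both moved and slightly edited can still be matched up.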

The SVN merge representation also has issues; in 1.5/1.6 you could merge from trunk to branch as often as you liked, automatically, but a merge in the other direction needed to be announced (--reintegrate), and left the branch in an unusable state. Much later they found out that this actually isn't the case, and that a) the --reintegrate can be figured out automatically, and b) repeated merges in both directions are possible.

But after all this (which IMHO shows a lack of understanding of what they are doing), I'd be (OK, I am) very cautious about using SVN in any nontrivial branching scenario, and would ideally try to see what Git thinks of the merge result.

Other points made in the answers, such as the forced global visibility of branches in SVN, aren't relevant to merge capabilities (only to usability). Also, the 'Git stores changes while SVN stores (something different)' claims are mostly beside the point. Git conceptually stores each commit as a separate tree (like a tar file), and then uses quite a few heuristics to store that efficiently. Computing the changes between two commits is separate from the storage implementation. What is true is that Git stores the history DAG in a much more straightforward form than SVN does its mergeinfo. Anyone trying to understand the latter will know what I mean.

In a nutshell: Git uses a much simpler data model to store revisions than SVN, and thus it could put a lot of energy into the actual merge algorithms rather than trying to cope with the representation => practically better merging.

Solution 4

One thing that hasn't been mentioned in the other answers, and that really is a big advantage of a DVCS, is that you can commit locally before you push your changes. In SVN, when I had some change I wanted to check in, and someone had already done a commit on the same branch in the meantime, I had to do an svn update before I could commit. This means that my changes and the changes from the other person are now mixed together, and there is no way to abort the merge (like with git reset or hg update -C), because there is no commit to go back to. If the merge is non-trivial, this means that you can't continue to work on your feature before you have cleaned up the merge result.

But then, maybe that is only an advantage for people who are too dumb to use separate branches (if I remember correctly, we had only one branch that was used for development back in the company where I used SVN).

Solution 5

EDIT: This is primarily addressing this part of the question:
Is this actually due to inherent differences in how the two systems work, or do specific DVCS implementations like Git/Mercurial just have cleverer merging algorithms than SVN?
TL;DR - Those specific tools have better algorithms. Being distributed has some workflow benefits, but is orthogonal to the merging advantages.
END EDIT

I read the accepted answer. It's just plain wrong.

SVN merging can be a pain, and it can also be cumbersome. But, ignore how it actually works for a minute. There is no information that Git keeps or can derive that SVN doesn't also keep or can derive. More importantly, there is no reason why keeping separate (sometimes partial) copies of the version control system will provide you with more actual information. The two structures are completely equivalent.

Assume you want to do "some clever thing" Git is "better at". And your thing is checked into SVN.

Convert your SVN into the equivalent Git form, do it in Git, and then check the result in, perhaps using multiple commits, some extra branches. If you can imagine an automated way to turn an SVN problem into a Git problem, then Git has no fundamental advantage.

At the end of the day, any version control system will let me

1. Generate a set of objects at a given branch/revision.
2. Provide the difference between a parent and a child branch/revision.

Additionally, for merging it's also useful (or critical) to know

3. The set of changes that have been merged into a given branch/revision.

Mercurial, Git and Subversion (now natively, previously using svnmerge.py) can all provide all three pieces of information. In order to demonstrate something fundamentally better with DVC, please point out some fourth piece of information which is available in Git/Mercurial/DVC but not available in SVN / centralized VC.

That's not to say they're not better tools!

Author: Mr. Boy

Updated on August 03, 2022

Comments

  • Mr. Boy
    Mr. Boy almost 2 years

    I've heard in a few places that one of the main reasons why distributed version control systems shine, is much better merging than in traditional tools like SVN. Is this actually due to inherent differences in how the two systems work, or do specific DVCS implementations like Git/Mercurial just have cleverer merging algorithms than SVN?

  • Mr. Boy
    Mr. Boy about 14 years
    That was one of the articles i was thinking about before posting here. But "thinks in terms of changes" is a very vague marketing-sounding term (remember Joel's company sells DVCS now)
  • Mr. Boy
    Mr. Boy about 14 years
    I don't agree with your branches==noise argument. Lots of branches doesn't confuse people because the lead dev should tell people which branch to use for big features... so two devs might work on branch X to add "flying dinosaurs", 3 might work on Y to "let you throw cars at people"
  • Troj
    Troj about 14 years
    John: Yes, for a small number of branches there is little noise and it is manageable. But come back after you've witnessed 50+ branches and tags or so in subversion or clear case where most of them you can't tell if they're active or not. Usability issue from the tools aside; why have all that litter around in your repository? At least in p4 (since a user's "workspace" is essentially a per-user branch), git or hg you've got the option to not let everyone know about the changes you do until you push them upstream, which is a safe-guard for when the changes are relevant to others.
  • Mr. Boy
    Mr. Boy about 14 years
    Well I think personally I'd consider a branch for deletion every time it is merged back to trunk, although of course in an iterative build process it might happen many times before a feature is marked done. Perhaps feature-branches are better for a waterfall model, where you can deliver a new feature and close the branch.
  • John Smithers
    John Smithers about 14 years
    I don't get your "too many experimental branches are noise" argument either, @Spoike. We have a "Users" folder where every user has his own folder. There he can branch as often as he wishes. Branches are inexpensive in Subversion and if you ignore the folders of the other users (why should you care about them anyway), then you don't see noise. But for me merging in SVN does not suck (and I do it often, and no, it's not a small project). So maybe I do something wrong ;) Nevertheless the merging of Git and Mercurial is superior and you pointed it out nicely.
  • Troj
    Troj about 14 years
    @John Smithers: I do admit that I might be a bit inflammatory with some of my claims; but that's pretty much the nature of the criticism that SVN gets. I don't really hate SVN, but I have seen one project where all branches are in one virtual directory and reintegrating branches with trunk and back again makes people pull their hair out. So the whole branches==noise argument is not valid for well organized projects however subversion does not enforce well organized projects by convention at all.
  • John Smithers
    John Smithers about 14 years
    @Spoike: Maybe they should have read the documentation first. Saves a lot of hair pulling ;)
  • Troj
    Troj about 14 years
    @John Smithers: Well, the svn book is quite verbose and apologetic about branching. The whole chapter about branching alone made me realize very early (coming from a CVS background) that branching feels "tacked on" in SVN; as if you don't need it and the authors sometimes feel like they're sorry that SVN has virtual directories and cheap copies. In git and hg, you can't even get code without branching (by cloning the repository); it has to be explained from start.
  • Troj
    Troj about 14 years
    I thought that was vague as well... I always thought changesets was an integral part to versions (or revisions rather), which surprises me that some programmers don't think in terms of changes.
  • Ken Liu
    Ken Liu about 14 years
    In svn it's easy to kill inactive branches, you just delete them. The fact that people don't remove unused branches therefore creating clutter is just a matter of housekeeping. You could just as easily wind up with lots of temporary branches in Git as well. In my workplace we use a "temp-branches" top-level directory in addition to the standard ones - personal branches and experimental branches go in there instead of cluttering the branches directory where "official" lines of code are kept (we don't use feature branches).
  • Avi
    Avi over 13 years
    I don't think this answer answers the question. The biggest difference between Git/Mercurial and Subversion in how they merge is that the DVCSs track revision graphs, while Subversion history is a tree. But is that inherently so? Is there any reason Subversion couldn't model history as a tree, while remaining a centralized VCS?
  • Troj
    Troj over 13 years
    Avi: I think you're asking a whole other question. Practically a SVN repository models the history as a sequence of commits, touching relevant files under the same directory. It is still a centralized VCS. Subversion wasn't designed to be a distributed VCS.
  • RaviG
    RaviG over 13 years
    Does this mean then, that from v1.5 subversion can at least merge as well as git can?
  • Peter
    Peter about 13 years
    Yeah, I answered the question in the details, not the headline. svn and git have access to the same information (actually typically svn has more), so svn could do whatever git does. But, they made different design decisions, and so it actually doesn't. The proof on the DVC / centralized is you can run git as a centralized VC (perhaps with some rules imposed) and you can run svn distributed (but it totally sucks). However, this is all too academic for most people - git and hg do branching and merging better than svn. That's really what matters when choosing a tool :-).
  • ripper234
    ripper234 almost 13 years
    So ... bottom line, is there a simple merge scenario which would be easier on git than on modern svn? This is a great answer, but I'd like to know the TL;DR answer (although I did read your answer, and it is great!)
  • Troj
    Troj almost 13 years
    @ripper234: There is no easy TL;DR answer on this topic without it defaulting to it be completely unmotivated nerd rage, tech hate or flame bait.
  • Troj
    Troj almost 13 years
    @ripper234: Also, yes: the "fast-forward" example in the answer is so much more easier to do than in svn because the merge command does not need any arguments to do it right.
  • teambob
    teambob about 12 years
    This answer probably needs to be updated now that subversion supports merge tracking: svn.apache.org/repos/asf/subversion/trunk/notes/merge-tracking/…
  • Troj
    Troj about 12 years
    @teambob: Nah. I already mentioned this in the second sentence that SVN didn't store merge info prior to 1.5.0 with the exact same link.
  • Max
    Max about 12 years
    For a system that really "thinks in terms of changes", check out Darcs
  • Anonigan
    Anonigan about 12 years
    Up till version 1.5 Subversion didn't store all necessary information. Even with post-1.5 SVN the information stored is different: Git stores all parents of a merge commit, while Subversion stores what revisions were already merged into a branch.
  • Paul Mendoza
    Paul Mendoza about 12 years
    SVN actually works like Git if you do the steps right for SVN branches. When you're ready to merge a branch back into trunk, you should merge the most recent changes from trunk into the branch (pull) and you'll get your merge conflicts at this point. Then when you go to merge(push) your changes into trunk it will work perfectly.
  • user276641
    user276641 about 12 years
    A tool that is hard to re-implement on an svn repository is git merge-base. With git, you can say "branches a and b split at revision x". But svn stores "files were copied from foo to bar", so you need to use heuristics to work out that the copy to bar was creating a new branch instead of copying files within a project. The trick is that a revision in svn is defined by revision number and the base path. Even though it is possible to assume "trunk" most of the time, it bites if there actually are branches.
  • tripleee
    tripleee over 11 years
    @Max: sure, but when push comes to shove, Git delivers where Darcs is basically just as painful as Subversion when it comes to actually merging.
  • Ingo Blackman
    Ingo Blackman almost 11 years
    "3. The set of changes have been merged into a given branch/revision." this is exactly what svn does not provide; at least not completely. Since svn 1.5 the "svn:mergeinfo" tries to record some of it, but still not all; which is exactly the reason why even centralized some things do not work with svn, which do work with for example git. See example here: stackoverflow.com/a/13964697/1917520
  • Richard Corfield
    Richard Corfield almost 11 years
    Re: "There is no information that git keeps or can derive that svn doesn't also keep or can derive." - I found that SVN didn't remember when things had been merged. If you like to pull work from trunk into your branch and go back and forth then merging can become hard. In Git each node in its revision graph knows where it came from. It has up to two parents and some local changes. I'd trust Git to be able to merge more than SVN. If you merge in SVN and delete the branch then the branch history is lost. If you merge in GIT and delete the branch the graph remains, and with it the "blame" plugin.
  • Rolf
    Rolf over 10 years
    That's what I read too, and that's what I was counting on, but it's not working, in practice.
  • locka
    locka almost 10 years
    Git is very good at merges providing you use commit merges to join branches. It stores these joins as directed acyclic graphs so if I merge B to A it can figure out their common ancestor without me screwing around trying to figure it out. If I do another merge some time later, the new common ancestor is that first merge and so on. This not the case with a squash merge which is basically a diff slammed into one branch with no join. I've seen people get in a lot of trouble with squash merges where the branches (as far as git knows) seem to wildly conflict with each other over time.
  • locka
    locka almost 10 years
    The three disadvantages of Git are a) it's not so good for binaries like document management where it is very unlikely people will want to branch and merge b) it assumes you want to clone EVERYTHING c) it stores the history of everything in the clone even for frequently changing binaries causing clone bloat. I think a centralized VCS is far better for those use cases. Git is far better for regular development particularly for merging and branching.
  • SantiBailors
    SantiBailors over 8 years
    @PaulMendoza That doesn't match many people's experience including mine, unless "do the steps right for SVN branches" means something so exotic that many people don't do it (and BTW branching should be trivial). What you are talking about is the theory. In practice even when a team simply branches from trunk, commits to the branch, periodically merges the trunk down into the branch, and finally merges the branch back into trunk, a lot of [tree] conflict nonsense can happen. Even on identical files that nobody has touched either on the branch or in trunk. Hundreds of such files, and often.
  • Warren Dew
    Warren Dew almost 8 years
    Isn't it the case that git and mercurial have all the necessary information locally, though, while svn needs to look at both local and central data to derive the information?
  • Gqqnbig
    Gqqnbig almost 7 years
    Do you have an example that svn has merge conflict but git doesn't?
  • Ferrybig
    Ferrybig over 6 years
    Git tracks the content of files, it only shows the content as changes
  • Voo
    Voo about 5 years
    This is basically saying "SVN is just as good as git because it lets you use git to do the actual work!", which is rather.. well not helpful. It's also not true that SVN keeps the same amount of data as Git. What is true is that the limited amount of data kept by SVN can be transformed into a Git tree and from there worked on - but that's something rather different.
  • Peter
    Peter about 5 years
    @Voo - I've edited the answer to provide some clarity, which perhaps you missed in my earlier comment. But, no, I never said SVN is as good as Git, I said being centralized or decentralized was not the root of the differences. If there is some information in Git, but not in SVN, then someone should be able to point out what that is, but it's hard to imagine what that might be. Hint - if you can convert to Git from SVN, and you have everything you would have had if you'd done it in Git to start with, then it contains all the information that Git has.
  • Erik Aronesty
    Erik Aronesty almost 5 years
    i agree with your analysis, but in my personal experience, git is generally worse at the actual mechanisms of merging. in other words... when there is a conflict subversion is better at figuring out what region of the code conflicts, which lines to show on one side and the other. and when there's a conflict that can be auto-merged, subversion seems to make fewer mistakes. git seems to make more language-specific insertion errors, etc. worse, if there are no conflicts, git will also sometimes find them, because it doesnt know the difference between a merge commit and a regular commit.