Version control for large binary files and >1TB repositories?


Solution 1

Version control systems are designed for source code, not binary builds. For binaries you are better off with standard network file-server backups - even though that's largely unnecessary once you have source control, since you can rebuild any version of any binary at any time. Trying to put binaries into source control is a mistake.

What you are really talking about is a process known as configuration management. If you have thousands of unique software packages, your business should have a configuration manager (a person, not software ;-) ) who manages all of the configurations (a.k.a. builds) for development, testing, release, release-per-customer, etc.

Solution 2

Take a look at Boar, "Simple version control and backup for photos, videos and other binary files". It can easily handle huge files and huge repositories.
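As a rough sketch of the workflow (the repository path and session name are placeholders; verify the exact command syntax against Boar's documentation):

```
# Create a Boar repository and a session to hold the binaries
boar mkrepo /srv/boar-repo
boar --repo=/srv/boar-repo mksession Packages

# Import a directory of large binaries into that session
boar --repo=/srv/boar-repo import ./packages Packages

# Later, check out the latest state of the session
boar --repo=/srv/boar-repo co Packages
```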

Solution 3

Old question, but perhaps worth pointing out that Perforce is in use at lots of large companies, and in particular at games development companies, where multi-terabyte repositories with many large binary files are common.

(Disclaimer: I work at Perforce)
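For a sense of the day-to-day workflow, here is a minimal sketch; the server address, depot path, and file names are placeholders, and it assumes a client workspace is already configured:

```
# Point at the Perforce server (address is a placeholder)
export P4PORT=perforce.example.com:1666

# Inside the client workspace: open a large binary for add and submit it
p4 add installers/setup.msi
p4 submit -d "Add 1.2 GB installer"

# Sync only the paths you need instead of the whole multi-TB depot
p4 sync //depot/installers/...
```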

Solution 4

Update May 2017:

Git, with the addition of GVFS (Git Virtual File System), can support virtually any number of files of any size, starting with the Windows repository itself: "The largest Git repo on the planet" (3.5M files, 320GB).
That is not yet >1TB, but it shows Git can scale to that size.

The work done on GVFS is gradually being proposed upstream (that is, to Git itself), but it is still a work in progress.
GVFS is currently implemented on Windows only, but ports to Mac (because the team developing Office for Mac demands it) and to Linux are planned.
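As a rough illustration (a sketch only: the gvfs client is Windows-only at the time of writing, and the repository URL below is a placeholder):

```
# GVFS downloads commit and tree metadata up front, but fetches
# file contents lazily, on first access
gvfs clone https://example.visualstudio.com/_git/huge-repo
cd huge-repo/src
git status   # ordinary git commands work inside the virtualized tree
```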


April 2015

Git can now be considered a viable VCS for large data as well, thanks to Git Large File Storage (LFS), announced by GitHub in April 2015.

git-lfs (see git-lfs.github.com) can be tested with a server that supports it, such as lfs-test-server, or directly against github.com itself.
Only small pointer files are stored in the Git repository; the large file contents live on the LFS server.
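A minimal sketch of the LFS workflow (file patterns, file name, and branch name are examples):

```
# One-time setup: configure the Git LFS filters for this user
git lfs install

# Track large binary types; this records the patterns in .gitattributes
git lfs track "*.msi" "*.cab"
git add .gitattributes

# Matching files are now committed as small pointer files, while the
# real content is uploaded to the LFS server on push
git add big-installer.msi
git commit -m "Add installer via LFS"
git push origin main
```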

(Animated demo of the git-lfs workflow: https://cloud.githubusercontent.com/assets/1319791/7051226/c4570828-ddf4-11e4-87eb-8fc165e5ece4.gif)

Solution 5

The question asks for a system that can:

  • store large binary files (>1GB)
  • support a repository that's >1TB (yes, that's TB)

Yep, that is one of the cases Apache Subversion should fully support.

From the question: "So far I've got some experience with SVN and CVS, however I'm not quite satisfied with the performance of both with large binary files (a few MSI or CAB files will be >1GB). Also, I'm not sure if they scale well with the amount of data we're expecting in the next 2-5 years (like I said, estimated >1TB)."

Up-to-date Apache Subversion servers and clients should have no problem managing that amount of data, and they scale well. Moreover, there are various repository replication approaches that can improve performance when you have multiple sites with developers working on the same projects.
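For instance, a minimal setup might look like this (server paths and the URL are placeholders):

```
# On the server: create the repository
svnadmin create /srv/svn/packages

# Import an initial tree containing the large MSI/CAB files
svn import ./packages file:///srv/svn/packages/trunk \
    -m "Initial import of repackaged applications"

# On a client: check out only the subtree you need,
# not the entire multi-terabyte repository
svn checkout https://svn.example.com/svn/packages/trunk/pkg-foo
```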

From the question: "I'm currently also looking into SVN Externals as well as Git Submodules, though that would mean several individual repositories for each software package, and I'm not sure that's what we want..."

svn:externals have nothing to do with support for large binaries or multi-terabyte projects. Subversion scales well and supports very large data and code bases in a single repository. Git, however, does not: with Git you have to divide and split projects into multiple small repositories, which leads to a lot of drawbacks and a constant PITA. That's why Git has a lot of add-ons, such as git-lfs, that try to make the problem less painful.
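For reference, if you did want to stitch separate packages together, an svn:externals definition looks like this (the repository URL, revision, and directory names are hypothetical):

```
# Declare pkg-foo as an external, pinned to a revision for
# reproducible checkouts (new-style format: [-r REV] URL local-dir)
svn propset svn:externals \
    "-r 42 https://svn.example.com/svn/pkg-foo/trunk pkg-foo" .
svn commit -m "Add pkg-foo as a pinned external"
svn update   # fetches the external into pkg-foo/
```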

Comments

  • Christoph Voigt, about 4 years ago:

    Sorry to bring up this topic again, as there are so many other questions already related - but none that covers my problem directly.

    What I'm searching for is a good version control system that can handle just two simple requirements:

    1. store large binary files (>1GB)
    2. support a repository that's >1TB (yes, that's TB)

    Why? We're in the process of repackaging a few thousand software applications for our next big OS deployment and we want those packages to follow version control.

    So far I've got some experience with SVN and CVS, however I'm not quite satisfied with the performance of both with large binary files (a few MSI or CAB files will be >1GB). Also, I'm not sure if they scale well with the amount of data we're expecting in the next 2-5 years (like I said, estimated >1TB).

    So, do you have any recommendations? I'm currently also looking into SVN Externals as well as Git Submodules, though that would mean several individual repositories for each software package, and I'm not sure that's what we want...