Memory problems when compressing and transferring a large number of small files (1TB in total)


Solution 1

Additional information provided in the comments reveals that the OP is using a GUI method to create the .tar.gz file.

GUI software is often a lot more bloated than the equivalent command-line software, or performs additional unnecessary tasks for the sake of some "extra" feature such as a progress bar. It wouldn't surprise me if the GUI software is trying to collect a list of all the filenames in memory. That isn't necessary in order to create an archive. The dedicated tools tar and gzip are definitely designed to work with streaming input and output, which means they can deal with input and output far larger than memory.

If you avoid the GUI program, you can most likely generate this archive using a completely normal everyday tar invocation like this:

tar czf foo.tar.gz foo

where foo is the directory that contains all your 5 million files.

The other answers to this question give you a couple of alternative tar commands to try in case you want to split the result into multiple pieces, and so on; one simple way to do that is sketched below.
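If splitting is all you need, one possibility (a sketch rather than part of the original answer; the 4G chunk size and the part- prefix are arbitrary choices) is to stream tar's output through split, so that nothing ever has to be held in memory:

tar czf - foo | split -b 4G - foo.tar.gz.part-

On the receiving end the pieces can be reassembled and extracted in one pass:

cat foo.tar.gz.part-* | tar xzf -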

Solution 2

"five million" files, and 1TB in total? Your files must be very small, then. I'd simply try rsync:

rsync -alPEmivvz /source/dir remote.host.tld:/base/dir

If you don't have that, or your use case doesn't allow for using rsync, I'd at least check whether 7z works with your data. It might not, but I think it's still worth a try:

7z a archive.7z /source/dir

Or, if you don't feel comfortable with 7z, at least try making a .tar.xz archive:

tar cJvf archive.tar.xz /source/dir

(It should be noted that older versions of tar create .tar.lzma archives rather than .tar.xz archives when given the J switch, and even older versions don't support the J flag at all.)
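If you are not sure which behaviour your tar has, you can sidestep the J switch entirely and invoke the compressor yourself (a sketch, not part of the original answer; it assumes xz is installed and simply pipes an uncompressed tar stream through it):

tar -cf - /source/dir | xz > archive.tar.xz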


Since you're using a GUI program to create that archive, I'm assuming you're feeling a bit uncomfortable with a command-line interface.

To make creating, managing and extracting archives from the command line easier, there's a small utility called atool. It is available for practically every common distro I've seen, and works with pretty much every archive format I've stumbled upon, except the hopelessly obscure ones.

Check whether your distro has atool in its repos, or ask your admin to install it if you're in a workplace environment.

atool installs a bunch of symlinks to itself, so packing and unpacking becomes a breeze:

apack archive.tar.xz <files and/or directories>

Creates an archive.

aunpack archive.7z

Expands the archive.

als archive.rar

Lists file contents.

atool works out which kind of archive to create from the filename extension you give it on the command line.

Solution 3

Unless you can do better than 25:1 compression, you are unlikely to gain anything from compressing this before snail-mailing it, unless you have some hardware tape format that you can exchange with the third party.

The largest common optical storage is Blu-ray, which gets you roughly 40 GB. You would need roughly 25:1 compression to fit your data on that; if your third party only has DVD, you need roughly 125:1.

If you cannot match those compression ratios, just copy the data to an ordinary disk drive and snail-mail that to the third party. At that point, trying to ship something smaller than a 1 TB drive by relying on compression is madness.

Compare that with using ssh -C (its built-in compression) or, preferably, rsync with compression to copy the files over the network; there is no need to compress and tar everything up front. 1 TB is not impossible to move over the net, but it is going to take a while.
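For the network route, a minimal sketch (the host name and paths are placeholders) that compresses on the wire and can resume after an interruption:

rsync -az --partial --progress /source/dir/ user@remote.host.tld:/dest/dir/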

Solution 4

Did you consider torrent? Peer-to-Peer might be your best option for an over-the-internet transfer:

  • At least as fast as other internet transfers: your upload speed will determine the transfer speed
  • No data corruption
  • Choose which files to transfer first
  • No extra local/cloud storage space needed
  • Free

You didn't say which OS you are using, but since you mention tar.gz compression, I'll assume some GNU/Linux-like OS. For that I suggest Transmission, an open-source torrent client that runs on Mac and Linux. I like it because the developers put effort into making the GUI clients native on every platform they support, rather than using a cross-platform toolkit.
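If you later want to script the process, Transmission also ships command-line helpers for creating torrents (a sketch; the tracker URL and output name are placeholders, and it assumes the transmission-cli package is installed):

transmission-create -o files.torrent -t udp://tracker.example.com:80 /source/dir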

You could combine this method with compression; however, you would lose the ability to prioritize which parts of the transfer come first.

Solution 5

7z would be my choice. It allows auto-splitting of archives and supports multi-threaded compression. No, xz doesn't, despite what the help message says. Try with:

7za a -v100m -m0=lzma2 -mx=9 -ms=on -mmt=$THREADS archive.7z directory/

The output is split into 100 MB volumes (change the size with the -v switch).

The only real downside is that 7z does not retain unix metadata (e.g. permissions and owner). If you need that, pipe tar output into 7za instead (see man 7za for some examples).
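A minimal sketch of that pipeline (the names are placeholders; it assumes a 7za build that accepts stdin via -si and writes to stdout via -so, which the versions I've used do):

tar cf - directory/ | 7za a -si -v100m archive.tar.7z

To unpack, point 7za at the first volume and reverse the pipe:

7za x -so archive.tar.7z.001 | tar xf -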



Author: oshirowanen

Updated on September 18, 2022

Comments

  • oshirowanen
    oshirowanen almost 2 years

    I have 5 million files which take up about 1TB of storage space. I need to transfer these files to a third party.

    What's the best way to do this? I have tried reducing the size using .tar.gz, but even though my computer has 8GB RAM, I get an "out of system memory" error.

    Is the best solution to snail-mail the files over?

    • Admin
      Admin about 9 years
Have you tried the "-L" option to tar, which allows you to split your output into multiple pieces?
    • Admin
      Admin about 9 years
      Are you having problems CREATING a .tar.gz or COPYING the resulting compressed file? Either way, something is weird, because neither operation should consume more memory just because the files are big. That is, both operations should be streaming. Please include more information about exactly what commands are failing.
    • Admin
      Admin about 9 years
How much bandwidth have you and the third party to spare? A naive rsync might save you on postage. But I don't know how "five million" files will work for you, because rsync will try to build the file list in memory and could fail if the list of 5e6 files exceeds 8 GB. And of course it will be slow.
    • Admin
      Admin about 9 years
      @Kwaio, I have not tried the L option.
    • Admin
      Admin about 9 years
      @Celada, I'm having problems creating the .tar.gz file. I think it consumes 8GB figuring out which files will be compressed, so it doesn't even begin compressing.
    • Admin
      Admin about 9 years
      @oshirowanen I don't think it should consume a bunch of memory computing the file list because I'm pretty sure tar should just archive files incrementally as it lists them, never building up a list in memory. But again, please show the exact command you are using. Also, are all the files in the same directory or is the directory structure very deep?
    • Admin
      Admin about 9 years
      @Celada, I'm not sure what command is being used. I right clicked the folder and clicked "create archive" and selected the .tar.gz option. The directory structure is deep, over 500,000 directories.
    • Admin
      Admin about 9 years
      Ah yes, well GUI programs are often built without giving much importance to such goals as scalability and robustness. It wouldn't surprise me if it's the fault of the GUI wrapper/frontend. Create the file using the command line and I think you will find that it works just fine.
    • Admin
      Admin about 9 years
Both tar and gzip use very small buffers, and hitting 8 GB without doing anything else seems absurd. If that's really the case, it must be a leak in one of the two. Do you mind providing the versions of tar and gzip you're using?
    • Admin
      Admin about 9 years
      You might consider using (on the command line) afio since it is able to compress individual files before archival
    • Admin
      Admin about 9 years
      1 TB of data will take at least 22 hours to transfer on a 100 Mbit/s broadband connection. So depending on how much compression you expect to achieve, snail mail might actually be the faster option.
    • Admin
      Admin about 9 years
      @Celada tar preserves hardlinks. In doing so it would have to maintain a list of the files in memory. But it would be a pretty obvious optimization to only keep that information in memory for files with link count greater than 1.
    • Admin
      Admin about 9 years
      Very good point, @kasperd, and a quick examination of the source code shows that GNU tar does indeed apply that optimization.
  • mveroone
    mveroone about 9 years
    Nice little snippet. Although I think his need here is the compression feature mostly, since the purpose is to "transfer to a friend"
  • Tapas Mondal
    Tapas Mondal about 9 years
    The only real downside but what a downside!
  • user3850506
    user3850506 about 9 years
    @njzk2 actually it depends on the scenario. For instance, if you are sending backup images or database dumps you probably don't care much about permissions.
  • Jonas Schäfer
    Jonas Schäfer about 9 years
    Not fully creating the archive will hurt when the connection interrupts, which is not entirely unlikely while transferring 1 TB, either due to network outage (there are still ISPs which disconnect you every 24 hours) or other reasons.
  • roaima
    roaima about 9 years
    The advantage here of using rsync is that if (when) the connection breaks, rsync can pick up where it left off.
  • Olivier Dulac
    Olivier Dulac about 9 years
    +1: "never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway" (Andrew S. Tanenbaum). see en.wikipedia.org/wiki/Sneakernet
  • Anthon
    Anthon about 9 years
@OlivierDulac I have seen similar constructs with a Boeing 747 and boxes full of CD-ROMs; it is amazing what kind of throughput you can get with that.
  • Tapas Mondal
    Tapas Mondal about 9 years
I don't quite see the point of using 7z for splitting, when you can use split on a .tar.gz file and get to keep the metadata.
  • Olivier Dulac
    Olivier Dulac about 9 years
I love that a pigeon beat an ISP by a long shot; see the Wikipedia page's examples ^^
  • user3850506
    user3850506 about 9 years
@njzk2 it also splits. Primarily, it has multi-threaded compression with LZMA2; no other unix utility I am aware of supports that. 7z also has a non-solid compression mode, which is a great step forward compared to the tar approach when only a specific file has to be accessed.
  • Tapas Mondal
    Tapas Mondal about 9 years
  • user3850506
    user3850506 about 9 years
    @njzk2 that's BZip2, not LZMA2.
  • Tapas Mondal
    Tapas Mondal about 9 years
    ctrl-f + "LZMA"
  • Nate Eldredge
    Nate Eldredge about 9 years
    The files would be an average of 200 KB. That isn't all that small.
  • PythonNut
    PythonNut about 9 years
    @NateEldredge I usually think of big as meaning >1GB. Small is usually <1MB. So pretty small.
  • PythonNut
    PythonNut about 9 years
  • mikeserv
    mikeserv about 9 years
For multithreaded lzma compression see squashfs, which not only handles that format (and others); the archive it creates is a mountable Linux read-only filesystem. It can additionally capture a stream produced by some other process into a member file within the archive while compressing. 7z is ok, but sfs is way better.
  • user3850506
    user3850506 about 9 years
@njzk2 I admit I never heard of pxz; the other tool mentioned is 7z itself. Again, the latter is the only one to combine multithreaded LZMA2 with an optional non-solid mode and variable block size in solid mode (frankly, tar+compressor sucks when it comes to random access) with autosplit. Definitely my choice, unless I need to preserve unix metadata (for backups there are more appropriate tools anyway, like duplicity). On the other hand, squashfs is a great idea!
  • AKS
    AKS about 9 years
Torrent software probably has the same problems the compressing GUI software has: storing file names in memory, etc. Also, torrent files have to store the metadata of the files; 5 million file names would have to be packed into the torrent file.
  • LaX
    LaX about 9 years
@AyeshK True, this will impact performance when adding/creating the torrent or checking checksums. Still, I believe this is the most stable solution for transferring a large amount of data.
  • AKS
    AKS about 9 years
According to TorrentFreak, the largest torrent ever shared is ~800 GB. The single torrent with the most files contained about 33K files. But 5 million files... I'm not sure.