Archive big data into multiple parts


If you have enough space to store the compressed archive, then the archive could be created and split in one go (assuming GNU split):

tar -c -vz -f - directory | split --additional-suffix=.gz.part -b 1G

This would create files called xaa.gz.part, xab.gz.part, etc., each being a 1G piece of the compressed tar archive.
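If numbered part names are preferred, GNU split can also do that (a sketch only; bigdir and the archive.tgz. prefix are placeholder names):

tar -c -vz -f - bigdir | split -d --additional-suffix=.gz.part -b 1G - archive.tgz.

This would produce archive.tgz.00.gz.part, archive.tgz.01.gz.part, and so on.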

To extract the archive:

cat x*.gz.part | tar -x -vz -f -
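Before committing to a full extraction, the concatenated stream can also just be listed, as a quick check that all parts are present and in order (a sketch, assuming GNU tar):

cat x*.gz.part | tar -t -z -f - > /dev/null && echo 'archive reads back cleanly'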

If the filesystem cannot store the compressed archive, the archive parts need to be written to another filesystem, or to some remote location.

Run from the machine that will store the parts, for example:

ssh user@serverwithfiles tar -c -vz -f - directory | split --additional-suffix=.gz.part -b 1G

This would transfer the compressed archive over ssh from the machine with the big directory to the local machine and split it.
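The transfer can also go the other way: if the big directory is local and the parts should land on a remote machine with more space, the split can run on the remote side. A sketch only, assuming GNU split on the remote machine; user@storagehost and /backup/parts are placeholders:

tar -c -vz -f - directory | ssh user@storagehost 'split --additional-suffix=.gz.part -b 1G - /backup/parts/x'

This compresses locally, streams the archive over ssh, and writes /backup/parts/xaa.gz.part, /backup/parts/xab.gz.part, etc. on the remote machine.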

Comments

  • JoshThunar over 1 year

    I'm working on big data and I need to archive a directory that is larger than 64 terabytes. I cannot create such a large file (archive) on my file system. Unfortunately, all proposed solutions for creating a multi-part archive on Linux suggest creating the archive first and then splitting it into smaller files with the split command.

    I know that it is possible with, e.g., 7zip, but unfortunately I'm forced to use the tools built into RedHat 6: tar, gzip, bzip2...

    I was wondering about creating a script that would ask the user for the maximum volume size. It would archive every single file with gzip, split the files that are too big, and then manually merge them into many tars of the chosen size. Is that a good idea?

    Is there any other way to split a big archive into parts using basic Linux commands?

    UPDATE:

    I've tested the solution on a filesystem with a restricted maximum file size and it worked. The pipe that redirects the tar output directly into the split command works as intended:

    tar -czf - HugeDirectory | split --bytes=100GB - MyArchive.tgz.

    The created files are already small, and when merging them back together no oversized files are created:

    cat MyArchive.tgz* | tar -xzf -

    • ajeh almost 6 years
      I am confused: are you trying to compress a single 64+ TB file into .tar.gz? Why do you feel you need .tar in the picture then? .gz should be perfectly fine, and then you can man split if you need multiple files (see the sketch after this comment thread).
  • Kusalananda almost 6 years
    @JoshThunar Your solution would then have to involve writing the parts to another filesystem, or to some remote location. The alternative, to delete the original while creating the archive, would be unsafe.
  • schily almost 6 years
    So why not use the method I mentioned in my answer? It permits you to capture the parts in single small files that you may compress while star is either writing the next one or waiting for the media change confirmation from the user.
  • Kusalananda almost 6 years
    @schily The reason I don't use star is twofold: 1) You already mentioned it in your answer. 2) I'm unfamiliar with it.
  • JoshThunar almost 6 years
    Thank you for your answer. However, I'm providing this solution for my client's machine, which is why installing anything extra is very constrained.
  • schily almost 6 years
    RedHat has provided star packages for approximately 20 years, and the documentation is better than the documentation for gtar: schilytools.sourceforge.net/man/man1/star.1.html BTW: Never try to use the multi-volume features of gtar, since they create files that cannot be read back by gtar with a probability of approximately 5%.
  • schily almost 6 years
    star is the solution of choice if you would like a reliable, feature-enhanced tar implementation. It has fewer deviations from tar than gtar and it is easier to learn. I've got feedback from various sysadmins. They told me that they needed a day to understand how to use it, but then would never use anything else.
  • JoshThunar almost 6 years
    @Kusalananda I like the SSH solution since it is really creative, but I'm afraid I won't be able to use it. It would require me to obtain an even more powerful machine just for the compression operation.
  • Kusalananda almost 6 years
    @JoshThunar The compression would be done by tar on the machine where the data is.
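As ajeh's comment suggests, a single large file does not need tar at all; gzip can write to a pipe and split can cut the stream into parts. A minimal sketch, assuming GNU split, with hugefile as a placeholder name:

gzip -c hugefile | split -b 1G - hugefile.gz.part-
cat hugefile.gz.part-* | gunzip -c > hugefile.restored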