Archive big data into multiple parts
If you have enough space to store the compressed archive, then the archive can be created and split in one go (assuming GNU split):
tar -c -vz -f - directory | split --additional-suffix=.gz.part -b 1G
This would create files called xaa.gz.part, xab.gz.part, etc., each file being a 1G compressed piece of the tar archive.
To extract the archive:
cat x*.gz.part | tar -x -vz -f -
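The create/split and extract commands above can be sanity-checked at small scale. The following is a sketch only: it uses 1M parts instead of 1G, and a scratch directory filled with about 3 MB of random (hence incompressible) test data so that split actually produces several parts.

```shell
# Small-scale round-trip demo of the create/split and extract commands above.
# All names here (workdir, data.bin) are illustrative.
set -e
workdir=$(mktemp -d)
cd "$workdir"
mkdir directory
head -c 3000000 /dev/urandom > directory/data.bin   # ~3 MB of random data

# Create and split in one go (GNU split assumed, as above):
tar -c -z -f - directory | split --additional-suffix=.gz.part -b 1M

# Extract into a fresh location and compare with the original:
mv directory directory.orig
cat x*.gz.part | tar -x -z -f -
cmp directory/data.bin directory.orig/data.bin && echo "round trip OK"
```

The `cat x*.gz.part` step relies on the shell expanding the glob in sorted order, which matches the order in which split names its output files.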
If the filesystem cannot store the compressed archive, the archive parts need to be written to another filesystem, or alternatively to some remote location. From the machine that is to store the parts, for example:
ssh user@serverwithfiles tar -c -vz -f - directory | split --additional-suffix=.gz.part -b 1G
This would transfer the compressed archive over ssh from the machine with the big directory to the local machine and split it there.
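The pipe can also run in the opposite direction if the parts should end up on a remote machine rather than the local one. This is a sketch only; "user@backuphost" and the "/backup/archive." destination prefix are placeholder names, not taken from the question.

```shell
# Reverse direction (sketch): archive a local directory and store the 1G
# parts remotely. "user@backuphost" and "/backup/archive." are placeholders.
tar -c -v -z -f - directory \
  | ssh user@backuphost "split --additional-suffix=.gz.part -b 1G - /backup/archive."
```

Here split reads from its standard input (the `-` argument) on the remote side and writes parts named /backup/archive.aa.gz.part, /backup/archive.ab.gz.part, and so on.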
JoshThunar
Updated on September 18, 2022

Comments
-
JoshThunar over 1 year: I'm working on big data and I need to archive a directory that is larger than 64 terabytes. I cannot create such a large file (archive) on my filesystem. Unluckily, all proposed solutions for creating a multi-part archive on Linux suggest creating an archive first and then splitting it into smaller files with the split command. I know that it is possible with e.g. 7zip, but unluckily I'm forced to use the tools built into RedHat 6: tar, gzip, bzip2... I was wondering about creating a script that would ask the user for the maximum volume size. It would archive every single file with gzip, split those files that are too big, and then manually merge them into many tars of the chosen size. Is that a good idea? Is there any other way to achieve big archive division with basic Linux commands?
UPDATE: I've tested the solution on a filesystem with a restricted maximum file size and it worked. The pipe that redirects the tar output directly into the split command worked as intended: tar -czf - HugeDirectory | split --bytes=100GB - MyArchive.tgz. The created files are already small, and when merging them back together no oversized files are created: cat MyArchive.tgz* | tar -xzf -
-
ajeh almost 6 years: I am confused: are you trying to compress a single 64+ TB file into .tar.gz? Why do you feel you need .tar in the picture then? .gz should be perfectly fine, and then you can man split if you need multiple files.
-
Kusalananda almost 6 years: @JoshThunar Your solution would then have to involve writing the parts to another filesystem, or to some remote location. The alternative, deleting the original while creating the archive, would be unsafe.
-
schily almost 6 years: So why not use the method I mentioned in my answer? It permits you to capture the parts in single small files that you may compress while star is either writing the next one or waiting for the media change confirmation from the user.
-
Kusalananda almost 6 years: @schily The reason I don't use star is twofold: 1) the one you already mentioned in your answer; 2) I'm unfamiliar with it.
-
JoshThunar almost 6 years: Thank you for your answer. However, I'm providing the solution for my client's machine, so installing anything is very constrained.
-
schily almost 6 years: RedHat has provided star packages for approximately 20 years, and the documentation is better than the documentation for gtar: schilytools.sourceforge.net/man/man1/star.1.html BTW: Never try to use the multi-volume features of gtar, since they create files that cannot be read back by gtar with a probability of approximately 5%.
-
schily almost 6 years: star is the solution of your choice if you would like a reliable, feature-enhanced tar implementation. It has fewer deviations from tar than gtar, and it is easier to learn. I've got feedback from various sysadmins: they told me that they needed a day to understand how to use it, but then they would never use anything else.
-
JoshThunar almost 6 years: @Kusalananda I like the SSH solution since it is really creative, but I'm afraid I won't be able to use it. It would require me to obtain an even more powerful machine just for the compression operation.
-
Kusalananda almost 6 years: @JoshThunar The compression would be done by tar on the machine where the data is.
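The pipeline from the question's UPDATE can be checked at small scale. This is a sketch under assumed substitutions: 1M volumes instead of 100GB, and HugeDirectory simulated with a few megabytes of random data, so the whole thing runs in seconds.

```shell
# Small-scale check of the UPDATE's pipeline. Names and sizes are scaled-down
# stand-ins for the real 100GB volumes.
set -e
tmp=$(mktemp -d)
cd "$tmp"
mkdir HugeDirectory
head -c 2500000 /dev/urandom > HugeDirectory/blob   # ~2.5 MB of random data

# Same pipe as in the UPDATE, with a smaller volume size:
tar -czf - HugeDirectory | split --bytes=1M - MyArchive.tgz.

# No part may exceed the chosen volume size (1M = 1048576 bytes):
for f in MyArchive.tgz.*; do
  [ "$(stat -c %s "$f")" -le 1048576 ]
done

# The concatenated parts form a readable archive:
cat MyArchive.tgz.* | tar -tzf - > /dev/null && echo "archive OK"
```

The per-part size check uses GNU stat (-c %s), which is available on RedHat; the listing step (tar -t) confirms the reassembled stream is a valid gzip'd tar without extracting it.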