Why are tar.xz files 15x smaller when using Python's tar library compared to macOS tar?
Solution 1
Short answer: yes, it is safe to use Python tarlib
to compress the data, nothing is lost compared to BSD tar
.
Underlying issue: sorting
I think the underlying issue is that BSD tar
and GNU tar
without any sort options put the files in the archive in an undefined order.
GNU tar
has a --sort
option:
sort directory entries according to
ORDER
, which is one ofnone
,name
, orinode
.
The default is--sort=none
, which stores archive members in the same order as returned by the operating system.
Testing GNU tar
To test this I installed GNU tar
on my Mac with:
brew install gnu-tar
And then tarred the same folder, but with the --sort
option:
gtar --sort='name' -cJf zsh-archive-sorted.tar.xz /Users/user/Desktop/temp/tar/2021-03-11
The zsh-archive-sorted.tar.xz
archive is 1.5 MB, equal to the size of the archive created by the Python library.
Concatenating in sorted order
The effect sorting has on the final archive size is further demonstrated by first concatenating all the JSON files sorted by name (which has the creation unixtime at the beginning of it) and then tarring with BSD tar
:
cat *.json > all.txt
tar cJf zsh-cat-archive.tar.xz all.txt
The zsh-cat-archive.tar.xz
archive is also 1.5 MB.
Python tarfile
sorting
Finally, the documentation of the Python TarFile.add
function confirms that Python tarfile
sorts by default:
Directories are added recursively by default. This can be avoided by setting recursive to False. Recursion adds entries in sorted order.
Why sorting matters
I think the reason sorting has such an impact in my case is as follows:
My JSON files contain locations of hundreds of vehicles. Every minute I read out all the locations, but only a few of these locations have a different value from minute to minute.
By sorting the files by name, two subsequent files have little different characters between them.
Apparently this is very favourable for the compression efficiency.
Solution 2
Try setting the compression levels in the macOS command line.
I know you are asking about xz
but explained in this answer here, on older versions of GZip you can set the compression level with an environment variable like this:
GZIP=-9 tar cf zsh-archive.tar.xz folderpath
That said, that only seems to work with GZip 1.8 and is depreciated on later versions. So use the -I
/--use-compress-program=COMMAND
option for tar instead; note this option might not work on macOS but placing here anyway just in case. So the command would then change to:
tar -I 'gzip -9' -cf zsh-archive.tar.xz folderpath
And yes, these examples would be compressing the archive Gzip instead of xz
, but you can easily change the command to this to use xz
like this:
tar -I 'xz -9' -cf zsh-archive.tar.xz folderpath
The xz
compression level ranges from -0
to -9
with the default being -6
; so -9
is the highest compression level.
Just note that xz
is not installed on macOS by default. To install it on macOS you must first install Homebrew and then install xz
via Homebrew like this:
brew install xz
Solution 3
Makes me wonder what Python is using for compression
It's probably using the function calls in liblzma. Tar is probably piping through the xz shell command.
A quick comment on --sort=name
:
The sort option is a relatively recent enhancement to GNU tar and was introduced in tar version 1.28.
It may never be implemented in BSD tar.
Related videos on Youtube
Saaru Lindestøkke
Updated on September 18, 2022Comments
-
Saaru Lindestøkke over 1 year
Context
I'm compressing ~1.3 GB folders each filled with 1440 JSON files and find that there's a 15-fold difference between using the
tar
command and Python's built-intarfile
library on macOS or Raspbian 10 (Buster)Minimal working example
This script compares both methods:
#!/usr/bin/env python3 from pathlib import Path from subprocess import call import tarfile fullpath = Path("/Users/user/Desktop/temp/tar/2021-03-11") zsh_out = Path(fullpath.parent, "zsh-archive.tar.xz") py_out = Path(fullpath.parent, "py-archive.tar.xz") # tar using terminal # tar cJf zsh-archive.tar.xz folderpath call(["tar", "cJf", zsh_out, fullpath]) # tar using tarfile library with tarfile.open(py_out, "w:xz") as tar: tar.add(fullpath, arcname=fullpath.stem) # Print filesizes print(f"zsh tar filesize: {round(Path(zsh_out).stat().st_size/(1024*1024), 2)} MB") print(f"py tar filesize: {round(Path(py_out).stat().st_size/(1024*1024), 2)} MB")
The output is:
zsh tar filesize: 23.7 MB py tar filesize: 1.49 MB
The versions I use are as follows:
-
tar
on macOS:bsdtar 3.3.2 - libarchive 3.3.2 zlib/1.2.11 liblzma/5.0.5 bz2lib/1.0.6
-
tar
on Raspbian 10:xz (XZ Utils) 5.2.4 liblzma 5.2.4
-
tarfile
Python library:0.9.0
Things I've tried
After compression, I've extracted both archives and compared the resulting folder with:
diff -r py-archive-expanded zsh-archive-expanded
There was no difference.
If I compare the two tar archives directly, they seem different:
➜ diff zsh-archive.tar.xz py-archive.tar.xz Binary files zsh-archive.tar.xz and py-archive.tar.xz differ
If I inspect the archives with Quicklook (and the Betterzip plugin) I see that the files in the archive are ordered in a different way:
Left is
zsh-archive.tar.xz
, right ispy-archive.tar.xz
:The zsh archive uses an unknown order, and the Python archive orders the file by modification date. I am not sure if that matters.
Question
What is going on? Am I losing something by using the Python library to compress my data? Is the 15-fold difference in size an indicator of some issue? Or can I safely go ahead and use the efficient Python implementation?
-
Admin about 3 yearsDid you make sure the result of
tar cJf
is actuallyxz
-compressed?xz
also uses LZMA but it is a distinct format from, say, 7-zip. Tryfile the-archive.tar.xz
. -
Admin about 3 years
file zsh-archive.tar.xz
giveszsh-archive.tar.xz: XZ compressed data
-
Admin about 3 yearsDid you actually tar up the same directory tree in both cases? Just making sure ;-)
-
Admin about 3 yearsHm, okay. Please verify whether the uncompressed
.tar
files are the same. Files may have been added in a different order, which creates a different compression result. -
Admin about 3 years@tink, yes I do. I've added a testscript in my question that shows the same directory being compressed generating the wildly different filesize.
-
Admin about 3 years@DanielB
.tar
files seem not the same, and when I quicklook them (with the Betterzip plugin) I do see a difference in file-order: the py-archive has the files in order of creation, while the zsh-archive has an unclear ordering. -
Admin about 3 yearsIt is not used here, but 7-Zip is extremely efficient if there are repeated sequences in different files (that is, it doesn't just compress each file separately). I have experienced compression ratios of 1:50 on some data. That may very well be the case with the kind of JSON data you have.
-
Admin about 3 years@PeterMortensen, not sure what you mean by your comment. Do you mean that 7-zip could do even better? In my situation xz seems to compress repeated sequences in different files with a ratio of 1:800, right?
-
Admin about 3 yearsDo you get the same result if you tar compress a single large file (thus removing issues of order from consideration)?
-
Admin about 3 years@JacobLee yes, I've updated my answer to include this.
-
-
Saaru Lindestøkke about 3 yearsI tried the command
tar -I 'xz -9' -cf zsh-archive.tar.xz folderpath
, but I get the following error:tar: Couldn't open xz -9: No such file or directory
-
Giacomo1968 about 3 yearsIn macOS? I busted checked and it seems to be provided on my system by Homebrew. So I would recommend installing Homebrew and then running:
brew install xz
-
Saaru Lindestøkke about 3 yearsYes, on macOS.
man tar
shows the-I
option is a synonym for the-T
option, which is the--files-from
option. I've tried it with the longhand option--use-compress-program
which resulted in a 10 MB file, instead of the regular 23 MB, but it's still not near the 1.5 MB from Python. -
Saaru Lindestøkke about 3 yearsNote that I've tried this in the raspbian terminal as well, with similar results to what I get on macOS.
-
Giacomo1968 about 3 yearsAll fair. Makes me wonder what Python is using for compression then?
-
RonJohn about 3 yearsCompression programs operate on blocks of text controlled by a single dictionary; by sorting the input, you've put similar bits near each other, allowing
xz
to compress lots of similar data in one dictionary. Compression and decompression was probably also faster. -
justhalf about 3 yearsWow, another case where sorting makes things much faster.
-
Saaru Lindestøkke about 3 yearsI don't really understand yet why the OS returns the files in "unsorted" order with the sort=none option. I mean, there's always some sort order, right? If anyone knows what order the OS uses feel free to add.
-
Bakuriu about 3 years@SaaruLindestøkke The order in which the OS returns the files in a directory depends on the filesystem used (assuming the same OS is used, obviously you can easily patch linux so that it will return files in some order you want by default or that it will randomize the order by default). So there is no single sort order used by default by any OS, as such we do not provide guarantees and we say "do not assume any specific sort order", this does not mean that filesystems actively randomize the results before returning them, it just means if the user changes fs the results will likely change
-
Peter Cordes about 3 yearsSome older filesystems didn't keep on-disk directory metadata sorted or even indexed, and just had to linear search when looking up a name in a directory. Modern systems typically now use B-trees or have some kind of index. If it's just an index separate from how the name->inode mapping itself is stored, then the on-disk order might not be particularly sorted. (Reading directory entries from the OS with a Linux
getdents(2)
system call, or whatever underlying mechanism thatreaddir(3)
uses on other OSes, will typically get them in disk order, or whatever order they're in in FS cache) -
Peter Cordes about 3 yearsTL:DR: "unsorted" means use dir entries in the order we get them from the OS's system call, which you can see with
ls -U
. -
jiwopene about 3 years@SaaruLindestøkke, the OS and the underlying filesystem usually returns directory entries in the most efficient way. There is not usually significant difference between retrieving sorted and unsorted directory entry lists but when you have directory with large number of files (thousands, millions, …), it makes great difference. Try creating directory with 10⁶ files and retrieving its contents using
ls -f
(list, unsorted), then deleting and recreating it (to clear cache, if there is one), and listing its entries with sort usingls
without the-f
option. Usetime
to view processing time. -
Z4-tier about 3 years@justhalf sorting is making the archive smaller, but seems unlikely to speed up the process.
-
Giacomo1968 about 3 yearsWowow! You know, this makes so much sense on such a basic level I commend you for discovering this.The idea that sorting text files in some way would improve compression seems so damned obvious when stated, but not obvious if one has not had experience with it. Excellent answer!
-
Giacomo1968 about 3 yearsThis answer is interesting. But it is weird that the focus of this answer is not the question, but rather a comment I made to my own answer just musing about why this happened.
-
Michael J. Evans about 3 yearsbsdtar can be provided a sort list of files to include to work around this issue. As can every other tar program I've ever used (though the syntax might differ).
-
Nemo about 3 yearsGood point on recency. 7z (p7zip) has for many years sorted files (by type) to improve compression; it's nice to see some basic version of this in tar, which is much more compatible with many other things.
-
Peter - Reinstate Monica about 3 years@Giacomo1968 It doesn't make immediate sense to me: The proximity of similar patterns is not per se a requirement for knowing them: It is probably a function of the dictionary size, see e.g. superuser.com/questions/616785/…
-
Romuald Brunet about 3 yearsNote that the python is actually sorting those files on purpose (so that 2 runs with the same directory contents will always output the same tar file)
-
Olivier Dulac about 3 years@MichaelJ.Evans : and when one use (even a very old non gnu) : tar ... *something # the shell expands *something and it result in an ordered list of files that tar will include sequentially in that (sorted by the shell) order. It should uses LC_Collate order of the shell in which you launch that tar command, by default
-
Peter Cordes about 3 years@Peter-ReinstateMonica: My understanding is that the compression dictionary can adapt can use even smaller symbols when the similar stuff is all together, because it doesn't have to distinguish it from the other possibilities. Or encode larger offsets to distant similar pieces. But yeah, it could just be a limit on the size of the compression dictionary; the OP mentions the total size is 1.3GB, so unless you set xz to make archives that would require the decompressor to use nearly that much memory, it couldn't see the whole thing at once.
-
Peter - Reinstate Monica about 3 years@PeterCordes Yeah, I'd be curious what an experiment showed but I'm too busy right now to do it myself.
-
Saaru Lindestøkke about 3 years@Giacomo1968 I just now realized that the
-I
option is GNUtar
only. The fact that it was missing on my BSDtar
should've been an indicator that something was off. -
Giacomo1968 about 3 years@SaaruLindestøkke Well, you did great detective work otherwise. So hey!
-
Anand Bhararia about 3 yearsinb4 pngcrush et. al. are adapted for this, and we have to wait a week for optimally sized archives to be produced.
-
phuclv about 3 years