Python 2.7: Compressing data with the XZ format using the "lzma" module
I would not be concerned about the differences in the compressed files: depending on the container format and the checksum type used in the .xz
file, the compressed bytes can vary without affecting the decompressed contents.
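As a minimal sketch of that point, assuming Python 3's built-in lzma module (the sample data and variable names are made up for illustration), the same input compressed with two different integrity checks gives different .xz bytes but identical decompressed data:

```python
import lzma

# Illustrative payload; any bytes would do.
data = b"The quick brown fox jumps over the lazy dog.\n" * 100

# Same payload and default preset, but a different integrity check
# stored in the .xz container.
xz_crc32 = lzma.compress(data, format=lzma.FORMAT_XZ, check=lzma.CHECK_CRC32)
xz_crc64 = lzma.compress(data, format=lzma.FORMAT_XZ, check=lzma.CHECK_CRC64)

# The compressed byte strings are not equal...
print(xz_crc32 == xz_crc64)  # False

# ...but both decompress to the original data.
print(lzma.decompress(xz_crc32) == data)  # True
print(lzma.decompress(xz_crc64) == data)  # True
```

So a hash of the .xz file itself only tells you whether two compressors made the same choices, not whether the contents match.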
EDIT: I've been looking into this further and wrote this script to test the PyLZMA module for Python 2.x and the built-in lzma module for Python 3.x:
from __future__ import print_function
try:
    import lzma as xz
except ImportError:
    import pylzma as xz
import os
# compress with xz command line util
os.system('xz -zkf test.txt')
# now compress with lib
with open('test.txt', 'rb') as f, open('test.txt.xzpy', 'wb') as out:
    out.write(xz.compress(bytes(f.read())))
# compare the two files
from hashlib import md5
with open('test.txt.xz', 'rb') as f1, open('test.txt.xzpy', 'rb') as f2:
    hash1 = md5(f1.read()).hexdigest()
    hash2 = md5(f2.read()).hexdigest()
    print(hash1, hash2)
    assert hash1 == hash2
This compresses a file test.txt with the xz command line utility and with the Python module, and compares the results. Under Python 3, lzma produces the same result as xz; under Python 2, however, PyLZMA produces a different result that cannot be extracted using the xz command line util.
What module are you using that is called "lzma" in Python2 and what command did you use to compress the data?
EDIT 2: Okay, I found the pyliblzma module for Python 2. However, it seems to use CRC32 as the default checksum algorithm (others use CRC64), and there is a bug that prevents changing the checksum algorithm: https://bugs.launchpad.net/pyliblzma/+bug/1243344
You could possibly try compressing using xz -C crc32 to compare the results, but I'm still not having success making a valid compressed file using the Python 2 libraries.
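To see which checksum a given .xz file actually uses, the check type can be read from the stream header. This is only a sketch, assuming Python 3 and the .xz container layout (a 6-byte magic followed by two stream-flag bytes, with the check ID in the low four bits of the second flag byte); the helper name xz_check_type is hypothetical:

```python
import lzma

XZ_MAGIC = b"\xfd7zXZ\x00"

# Check IDs as defined by the .xz file format.
CHECK_NAMES = {0x00: "None", 0x01: "CRC32", 0x04: "CRC64", 0x0A: "SHA-256"}


def xz_check_type(header):
    """Return the name of the integrity check declared in an .xz stream header.

    `header` must contain at least the first 8 bytes of the file.
    (Hypothetical helper, not part of any library.)
    """
    if len(header) < 8 or header[:6] != XZ_MAGIC:
        raise ValueError("not an .xz stream")
    check_id = header[7] & 0x0F  # low 4 bits of the second stream-flag byte
    return CHECK_NAMES.get(check_id, "unknown (%d)" % check_id)


# Demonstration on in-memory data compressed with an explicit check type.
print(xz_check_type(lzma.compress(b"hello", check=lzma.CHECK_CRC32)))  # CRC32
print(xz_check_type(lzma.compress(b"hello", check=lzma.CHECK_CRC64)))  # CRC64
```

For a file on disk, passing the first 12 bytes (for example `open('test.txt.xz', 'rb').read(12)`) should be enough. A header that fails the magic test is not an .xz container at all, which may be why xz refuses some files.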
ki2ne
Updated on June 04, 2022
Comments
-
ki2ne almost 2 years
I'm experimenting with the lzma module in Python 2.7.6 to see if I could create compressed files using the XZ format for a future project that will make use of it. My code used during the experiment was:
import lzma as xz

in_file = open('/home/ki2ne/Desktop/song.wav', 'rb')
input_data = in_file.read()
compressed_data = xz.compress(input_data)
out_file = open('/home/ki2ne/Desktop/song.wav.xz', 'wb')
out_file.write(compressed_data)  # write the compressed bytes to the output file
in_file.close()
out_file.close()
and I noticed there were two different checksums (MD5 and SHA256) from the resulting file compared to when I used the plain xz (although I could decompress fine with either method - the checksums of the decompressed versions of both files were the same). Would this be a problem?
UPDATE: I found a fix for it by installing the backport (from Python 3.3) via peterjc's Git repository (link here), and now it's showing identical checksums. Not sure if it helps, but I made sure the LZMA Python module in my repository wasn't installed to avoid possible name conflicts.
Here's my test code to confirm this:
# I have created two identical text files with some random phrases
from subprocess import call
from hashlib import sha256
from backports import lzma as xz

f2 = open("test2.txt", 'rb')
f2_buf = buffer(f2.read())
call(["xz", "test1.txt"])
f2_xzbuf = buffer(xz.compress(f2_buf))
f1 = open("test1.txt.xz", 'rb')
f1_xzbuf = buffer(f1.read())
f1.close(); f2.close()
f1sum = sha256(); f2sum = sha256()
f1sum.update(f1_xzbuf); f2sum.update(f2_xzbuf)
if f1sum.hexdigest() == f2sum.hexdigest():
    print "Checksums OK"
else:
    print "Checksum Error"
I've also verified it with the regular sha256sum (after writing the data to a file).
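When the compressed files do differ, comparing the decompressed contents is the more meaningful check. A rough sketch, assuming Python 3's lzma module (the backport should expose the same lzma.open call); the helper name xz_payload_sha256 is hypothetical:

```python
import hashlib
import lzma


def xz_payload_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the decompressed contents of an .xz file.

    Reads in chunks so large files need not fit in memory.
    (Hypothetical helper, not part of any library.)
    """
    digest = hashlib.sha256()
    with lzma.open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two archives hold the same data when `xz_payload_sha256("test1.txt.xz") == xz_payload_sha256("test2.txt.xz")`, regardless of which tool or checksum type produced them (the file names here are just the ones from the test above).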
-
ki2ne about 10 years
Thanks for clarifying. I was wondering whether, if I made .tar.xz archives with Python and distributed them, someone might notice the difference and worry that they had been tampered with.
-
ki2ne about 10 years
I'm using the python-lzma package from my distro for the module that I'm using in the script. I've tested compressing the data using the script from my example and by using the xz command from the xz-utils package (using a terminal session from outside the script). I've checksummed files from both methods using md5sum and sha256sum.