How can I create a zip / tgz in Linux such that Windows has proper filenames?
Solution 1
> Currently, tar encodes filenames in UTF
Actually, tar doesn't encode or decode filenames at all; it simply copies them out of the filesystem as-is. If your locale is UTF-8-based (as in many modern Linux distros), that'll be UTF-8. Unfortunately the system codepage of a Windows box is never UTF-8, so the names will always be mangled, except in tools such as WinRAR that allow the charset used to be changed.
So it is impossible to create a ZIP file with non-ASCII filenames that work across different countries' releases of Windows and their built-in compressed folder support.
It is a shortcoming of the tar and zip formats that there is no fixed or supplied encoding information, so non-ASCII characters will always be non-portable. If you need a non-ASCII archive format you'll have to use one of the newer formats, such as recent 7z or rar. Unfortunately these are still wonky; in 7zip you need the -mcu switch, and rar still won't use UTF-8 unless it detects characters not in the codepage.
Basically it's a horrible mess and if you can avoid distributing archives containing filenames with non-ASCII characters you'll be much better off.
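To see what the mangling looks like, here is a small Python sketch. It assumes the name was stored as UTF-8 bytes and that the extracting tool decodes those bytes with a legacy single-byte codepage. cp1252 is only an illustrative choice (the codepage actually used depends on the Windows release and the tool), and the helper name mangle is made up for this example.

```python
def mangle(name, codepage="cp1252"):
    """Return `name` as seen by a tool that reads its UTF-8 bytes as `codepage`."""
    # Bytes as they would be copied from a UTF-8 Linux filesystem.
    raw = name.encode("utf-8")
    # errors="replace" keeps the sketch from failing on bytes the
    # codepage leaves undefined.
    return raw.decode(codepage, errors="replace")
```

With these assumptions, mangle("é.txt") returns "Ã©.txt", while a pure-ASCII name such as "abc.txt" comes back unchanged, which is why only non-ASCII names are affected.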
Solution 2
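Here is a simple Python 2 script that I've written to unpack tar files from UNIX on Windows:
import tarfile

archive_name = "archive_name.tar"

def recover(name):
    # Member names arrive as raw bytes; decode them as UTF-8.
    # unicode() is a Python 2 builtin, so this script needs Python 2.
    return unicode(name, 'utf-8')

tar = tarfile.open(name=archive_name, mode='r', bufsize=16*1024)
updated = []
for m in tar.getmembers():
    # Replace each byte-string name with its decoded form.
    m.name = recover(m.name)
    updated.append(m)
# Extract into the current directory using the corrected names.
tar.extractall(members=updated)
tar.close()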
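On Python 3 the unicode() builtin no longer exists, but tarfile.open() accepts an encoding argument that decodes member names for you. The following is a minimal sketch of the same idea under that assumption; the function name extract_tar and its parameters are hypothetical, not part of the script above.

```python
import tarfile

def extract_tar(archive_path, dest_dir, name_encoding="utf-8"):
    """Extract archive_path into dest_dir, decoding member names with name_encoding.

    Returns the decoded member names so the caller can inspect them.
    """
    # `encoding` tells tarfile how to turn the stored name bytes into str.
    with tarfile.open(archive_path, mode="r", encoding=name_encoding) as tar:
        names = tar.getnames()
        tar.extractall(path=dest_dir)
    return names
```

For an archive whose names were written in another charset, pass that charset instead, for example name_encoding="euc_jp".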
Solution 3
The problem, when using the default tar on Linux (GNU tar), is solved by adding the --format=posix parameter when creating the file.
For example:
tar --format=posix -cf
In Windows, to extract the files, I use bsdtar.
In https://lists.gnu.org/archive/html/bug-tar/2005-02/msg00018.html it is written (since 2005!!):
> I read something in the ChangeLog about UTF-8 being supported. What does
> this mean?
> I found no way to create an archive that would be interchangeable
> between different locales.
When creating archives in POSIX.1-2001 format (tar --format=posix or --format=pax), tar converts file names from the current locales to UTF-8 and then stores them in archive. When extracting, the reverse operation is performed.
P.S. Instead of typing --format=posix you can type -H pax, which is shorter.
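If you build the archive from a script rather than from the shell, the same pax (POSIX.1-2001) format can be requested from Python's tarfile module, which stores metadata in that format as UTF-8. This is only a sketch of that alternative, not a replacement for the tar command above; the function name create_pax_tgz and the (source_path, name_in_archive) pair layout are assumptions made for the example.

```python
import tarfile

def create_pax_tgz(archive_path, files):
    """Write a gzip-compressed tar in pax (POSIX.1-2001) format.

    files: iterable of (source_path, name_in_archive) pairs.
    """
    # PAX_FORMAT puts non-ASCII names into extended headers encoded as UTF-8.
    with tarfile.open(archive_path, mode="w:gz", format=tarfile.PAX_FORMAT) as tar:
        for source_path, name_in_archive in files:
            tar.add(source_path, arcname=name_in_archive)
```

Whether the names then show up correctly on Windows still depends on the extracting tool understanding pax headers.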
Solution 4
I believe you're running into problems with the Zip container format itself. Tar may be suffering from the same problem.
Use the 7zip (.7z) or RAR (.rar) archive formats instead. Both are available for Windows and Linux; the p7zip software handles both formats.
I just tested creating .7z, .rar, .zip, and .tar files on both WinXP and Debian 5, and the .7z and .rar files store/restore filenames correctly while the .zip and .tar files don't. It doesn't matter which system is used to create the test archive.
Solution 5
POSIX.1-2001 specified how TAR uses UTF-8.
As of 2007, changelog version 6.3.0 in the PKZIP APPNOTE.TXT (http://www.pkware.com/documents/casestudies/APPNOTE.TXT) specified how ZIP uses UTF-8.
Which tools support these standards properly remains an open question.
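One way to check the ZIP side for a given archive is to look at general purpose bit 11, which APPNOTE 6.3.0 defines as the flag marking file names as UTF-8. Below is a rough sketch using Python's zipfile module, which, as far as I know, sets that bit on its own when a name is not plain ASCII. The helper names build_zip and utf8_flags and the constant UTF8_NAME_FLAG are mine, chosen for illustration.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # general purpose bit 11: name and comment are UTF-8

def build_zip(names):
    """Return the bytes of a ZIP archive holding one empty entry per name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    return buf.getvalue()

def utf8_flags(zip_bytes):
    """Map each entry name to whether its UTF-8 flag bit is set."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {info.filename: bool(info.flag_bits & UTF8_NAME_FLAG)
                for info in zf.infolist()}
```

Running utf8_flags() on an archive produced by another tool shows whether that tool marks its non-ASCII names as UTF-8; it says nothing about whether a given extractor honors the flag.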
Murshid Ahmed
Updated on September 17, 2022
Comments
-
Murshid Ahmed over 1 year
Currently, tar -zcf arch.tgz files/* encodes filenames in UTF, so Windows users see all non-English characters in filenames spoiled and can do nothing about it. zip -qq -r arch.zip files/* has the same behavior.
How can I create a zip / tgz archive so that when Windows users extract it, all filenames are encoded properly?
-
Murshid Ahmed over 14 years
Great, thanks! Unfortunately, most users know nothing about 7z, and rar is proprietary :(
-
KiiroSora09 over 14 years
Yeah, it's a problem. ZIP is by far the most usable solution for users, as all modern OSes have nice native UI support for it. Unfortunately the charset problem is not really solvable today in ZIP (and even in other archive formats it's still troublesome).
-
New2AS3 over 13 years
Awesome! This script helped me convert an EUC-JP encoded tar file that was created on an old Solaris server.
-
user1576772 almost 9 years
Sir, you saved my life. God bless you :)
-
beroal almost 7 years
Thank you for your programs! Regretfully, the Zip program does not work under Python 3, but it works under Python 2.
-
dmitry_romanov over 6 years
@beroal, I updated the script. Now it uses the engine developed by Mozilla for Firefox to autodetect the encoding.