How can I create a zip / tgz in Linux such that Windows has proper filenames?

Solution 1

> Currently, tar encodes filenames in UTF

Actually, tar doesn't encode or decode filenames at all; it simply copies them out of the filesystem as-is. If your locale is UTF-8-based (as in many modern Linux distros), that'll be UTF-8. Unfortunately the system codepage of a Windows box is never UTF-8, so the names will always be mangled, except in tools such as WinRAR that allow the charset used to be changed.

So it is impossible to create a ZIP file with non-ASCII filenames that work across different countries' releases of Windows and their built-in compressed folder support.

It is a shortcoming of the tar and zip formats that there is no fixed or supplied encoding information, so non-ASCII characters will always be non-portable. If you need a non-ASCII archive format you'll have to use one of the newer formats, such as recent 7z or rar. Unfortunately these are still wonky; in 7zip you need the -mcu switch, and rar still won't use UTF-8 unless it detects characters not in the codepage.

Basically it's a horrible mess and if you can avoid distributing archives containing filenames with non-ASCII characters you'll be much better off.
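
To make the mangling concrete, here is a minimal Python sketch of what happens when UTF-8 filename bytes are read back under a Windows codepage. The function names, the sample name café.txt, and the choice of cp1252 are assumptions for illustration only; a real Windows box may use a different codepage.

```python
def as_seen_with_codepage(name: str, codepage: str = "cp1252") -> str:
    """Interpret the UTF-8 bytes of `name` as if they were in `codepage`.

    This mimics a tool that assumes the system codepage (cp1252 is an
    assumption here). It can raise UnicodeDecodeError when a byte has no
    mapping in the chosen codepage.
    """
    return name.encode("utf-8").decode(codepage)


def undo_mangling(mangled: str, codepage: str = "cp1252") -> str:
    """Reverse the step above, valid only if no bytes were lost on the way."""
    return mangled.encode(codepage).decode("utf-8")


mangled = as_seen_with_codepage("café.txt")
print(mangled)                 # the two UTF-8 bytes of "é" show up as two characters
print(undo_mangling(mangled))  # round-trips back to the original name
```

The repair only works when the tool kept every byte, which is why fixing names after extraction is unreliable in general.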

Solution 2

Here is a simple Python script that I've written to unpack tar files from UNIX on Windows:

import tarfile

archive_name = "archive_name.tar"
# Encoding the file names were stored in on the UNIX side.
name_encoding = "utf-8"

# Python 3: tarfile decodes member names itself, so the Python 2
# unicode() conversion is replaced by the encoding argument.
with tarfile.open(name=archive_name, mode="r", bufsize=16 * 1024,
                  encoding=name_encoding) as tar:
    tar.extractall()

Solution 3

With the default tar on Linux (GNU tar), the problem is solved by adding the --format=posix parameter when creating the archive.

For example (archive and directory names taken from the question):
tar --format=posix -zcf arch.tgz files/*

In Windows, to extract the files, I use bsdtar.

In https://lists.gnu.org/archive/html/bug-tar/2005-02/msg00018.html it is written (since 2005!!):

> I read something in the ChangeLog about UTF-8 being supported. What does
> this mean?
> I found no way to create an archive that would be interchangeable
> between different locales.

When creating archives in POSIX.1-2001 format (tar --format=posix or --format=pax), tar converts file names from the current locales to UTF-8 and then stores them in archive. When extracting, the reverse operation is performed.

P.S. Instead of typing --format=posix you can type -H pax, which is shorter.
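
This is not GNU tar itself, but the same pax behaviour can be observed with Python's standard tarfile module, which also writes non-ASCII names as UTF-8 in a pax extended header. A minimal sketch, with hypothetical helper names and a made-up member name:

```python
import io
import tarfile


def make_pax_archive(member_name: str, payload: bytes) -> bytes:
    """Build an in-memory tar archive in POSIX.1-2001 (pax) format."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_names(raw: bytes) -> list:
    """Read the archive back and return the member names as str."""
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r") as tar:
        return tar.getnames()


raw = make_pax_archive("файл.txt", b"hello")
print(list_names(raw))
# The name is stored as UTF-8 bytes in the pax header, independent of locale.
print("файл.txt".encode("utf-8") in raw)
```

Whether the extracting side honours the pax header is still up to the tool, which is why bsdtar is used on Windows above.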

Solution 4

I believe you're running into problems with the Zip container format itself. Tar may be suffering from the same problem.

Use the 7zip (.7z) or RAR (.rar) archive formats instead. Both are available for Windows and Linux; the p7zip software handles both formats.

I just tested creating .7z, .rar, .zip, and .tar files on both WinXP and Debian 5, and the .7z and .rar files store/restore filenames correctly while the .zip and .tar files don't. It doesn't matter which system is used to create the test archive.

Solution 5

POSIX.1-2001 specified how TAR uses UTF-8.

As of 2007, changelog version 6.3.0 in the PKZIP APPNOTE.TXT (http://www.pkware.com/documents/casestudies/APPNOTE.TXT) specified how ZIP uses UTF-8.

Which tools support these standards properly remains an open question.
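
As one example of a tool that follows the ZIP side of this, Python's standard zipfile module sets general purpose bit 11 (the UTF-8 "language encoding flag" described in APPNOTE.TXT) when a name cannot be stored as ASCII. A minimal sketch to check this; the helper name and sample file names are hypothetical:

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: file name is stored as UTF-8


def name_is_flagged_utf8(member_name: str) -> bool:
    """Write one member to an in-memory ZIP and report whether bit 11 is set."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, mode="w") as zf:
        zf.writestr(member_name, b"example payload")
    buf.seek(0)
    with zipfile.ZipFile(buf, mode="r") as zf:
        info = zf.infolist()[0]
        return bool(info.flag_bits & UTF8_FLAG)


print(name_is_flagged_utf8("résumé.txt"))  # non-ASCII name
print(name_is_flagged_utf8("plain.txt"))   # ASCII-only name
```

An extractor that ignores the flag will still fall back to its own codepage, so the flag helps only with tools that implement this part of the specification.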

Author: Murshid Ahmed

Updated on September 17, 2022

Comments

  • Murshid Ahmed over 1 year

    Currently, tar -zcf arch.tgz files/* encodes filenames in UTF, so Windows users see all characters spoiled in filenames which are not english, and can do nothing with it.

    zip -qq -r arch.zip files/* has the same behavior.

    How can I create a zip / tgz archive so when Windows users extract it will have all filenames encoded properly?

  • Murshid Ahmed over 14 years
    Great, thanks! Unfortunately, most users know nothing about 7z, and rar is proprietary :(
  • KiiroSora09 over 14 years
    Yeah, it's a problem. ZIP is by far the most usable solution for users, as all modern OSes have nice native UI support for it. Unfortunately the charset problem is not really solvable today in ZIP (and even in other archive formats it's still troublesome).
  • New2AS3 over 13 years
    Awesome! this script helped me convert a EUC-JP encoded tar file that was created on an old Solaris server.
  • user1576772 almost 9 years
    Sir, you saved my life. God bless you :)
  • beroal almost 7 years
    Thank you for your programs! Regretfully, the Zip program does not work under Python 3, but it works under Python 2.
  • dmitry_romanov over 6 years
    @beroal, I updated the script. Now it uses the engine developed by Mozilla for Firefox to autodetect the encoding.