How to specify character encoding for 7z?


Solution 1

Depending on the encoding used to create the zip file, you might be able to prevent unwanted translations by temporarily setting the locale to "C":

LC_ALL=C 7z x "$archive"

(This helped for a zip created by IZArc on Win7, using two of your example filenames.)

However, for the archive in the question, the "filename" field contains the CP866 (DOS Cyrillic) encoding of "ДКП.doc" (84 8a 8f 2e 64 6f 63). The "extra" field uses an Info-ZIP extension (see section 4.6.9 of the Zip Specification v6.3.4) to store the UTF-8 filename. unzip knows about this extra field and uses the UTF-8 name, ignoring the CP866 one.

7z doesn't do anything with this "extra" field and uses only the CP866 name. Depending on the current locale, it might create the file using that exact name (the raw bytes 84 8a 8f), or worse, treat the bytes as Unicode code points and expand them to UTF-8 first (c2 84 c2 8a c2 8f).
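
For illustration, the UTF-8 name can be pulled out of the extra field with standard tools. This is only a rough sketch, assuming a POSIX shell with od, awk, dd and tr. The helper names le16 and unicode_path are made up here, only the first local file header is inspected (the central directory is ignored), and the CRC stored in the field is not checked.

```shell
#!/bin/sh
# Hypothetical helpers, not part of unzip or 7z.

# Read a little-endian 16-bit integer at byte offset $2 of file $1.
le16() {
    od -An -tu1 -j"$2" -N2 "$1" | awk '{ print $1 + 256 * $2 }'
}

# Print the UTF-8 name from the Info-ZIP Unicode Path extra field
# (header ID 0x7075) of the first local file header in archive $1.
unicode_path() {
    zip=$1
    name_len=$(le16 "$zip" 26)      # file name length field
    extra_len=$(le16 "$zip" 28)     # extra field length field
    pos=$((30 + name_len))          # extra data starts after the name
    end=$((pos + extra_len))
    while [ $((pos + 4)) -le "$end" ]; do
        id=$(le16 "$zip" "$pos")
        size=$(le16 "$zip" $((pos + 2)))
        if [ "$id" -eq 28789 ]; then    # 28789 = 0x7075 ("up")
            # Skip ID and size (4 bytes), version (1) and name CRC-32 (4).
            # The dump in the question shows NUL padding after the name,
            # so NUL bytes are stripped.
            dd if="$zip" bs=1 skip=$((pos + 9)) count=$((size - 5)) 2>/dev/null |
                tr -d '\000'
            echo
            return 0
        fi
        pos=$((pos + 4 + size))
    done
    return 1
}
```

Judging by the hexdump in the question, running unicode_path on that archive should print ДКП.doc.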
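
Both outcomes can be reproduced, and reversed, with iconv. The following is a minimal sketch, assuming an iconv build that knows CP866 and ISO-8859-1. The function names are invented for illustration, and the bytes are written as octal escapes so that it also works in shells whose printf lacks \x.

```shell
#!/bin/sh
# Hypothetical helpers, not part of 7z or unzip.

# Case 1: the name was written as raw CP866 bytes; decode them to UTF-8.
decode_raw() {
    iconv -f CP866 -t UTF-8
}

# Case 2: each byte was treated as a code point and expanded to UTF-8.
# Converting to ISO-8859-1 restores the original bytes, which are then
# decoded as CP866.
decode_expanded() {
    iconv -f UTF-8 -t ISO-8859-1 | iconv -f CP866 -t UTF-8
}

# 84 8a 8f 2e 64 6f 63
printf '\204\212\217.doc' | decode_raw; echo
# c2 84 c2 8a c2 8f 2e 64 6f 63
printf '\302\204\302\212\302\217.doc' | decode_expanded; echo
```

Both pipelines should print ДКП.doc.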

One option is to use external utilities to change the zip first:

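#!/bin/bash

# Work on a copy so the original archive stays untouched.
cp orig.zip renamed.zip

index=0
# IFS= and -r keep whitespace and backslashes in the names intact.
zipinfo -1 orig.zip | while IFS= read -r name; do
        ziptool renamed.zip rename "$index" "$name"
        index=$((index+1))
done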

ziptool is from libzip. zipinfo is distributed with Info-ZIP's UnZip, so you might as well have just used unzip.
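
If the files have already been extracted under their raw names, renaming them afterwards is another possibility. This is a minimal sketch, assuming the names really are CP866 bytes and that iconv supports CP866. fix_names is a hypothetical helper, it only handles entries directly inside the given directory, and running it on names that are already UTF-8 would mangle them.

```shell
#!/bin/sh
# Hypothetical helper: rename every entry directly inside directory $1
# from CP866 bytes to UTF-8. Pure ASCII names come out unchanged and
# are therefore skipped.
fix_names() {
    dir=$1
    for path in "$dir"/*; do
        [ -e "$path" ] || continue          # empty directory: glob did not match
        old=${path##*/}
        new=$(printf '%s' "$old" | iconv -f CP866 -t UTF-8) || continue
        [ "$old" = "$new" ] || mv -- "$path" "$dir/$new"
    done
}
```

A call such as fix_names ./extracted would be made once, right after extraction, on a directory that contains nothing else.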

Solution 2

Checked with p7zip 16.02 [64-bit] (Xubuntu/Ubuntu 18.04.2 LTS).

Cyrillic characters are still a problem:

  • Unpacking works and the content is OK, but the file names are corrupted after unpacking.
  • Cannot easily work with the files, when they are not yet archived.


Solution 3

OEM code page selection is broken in both vanilla unzip and vanilla p7zip. I made patches fixing this issue, and there is a PPA for Ubuntu with p7zip built with the patch applied. The patches also support setting the code page manually with the OEMCP environment variable.


Author: pusheax

Updated on September 18, 2022

Comments

  • pusheax almost 2 years

    Doing 7z x on an archive gives me

    '20 ª.1 ¯® '$'\302\212''¨à®¢®£à ¤áª ï ã«.rtf'  IMG_6527.JPG
    ''$'\302\212''¨à®¢®£à ¤áª ï, ¨áâ.doc'          IMG_6532.JPG
    ''$'\302\204''®¯  ᮣ« è¥­¨¥(3).doc'           IMG_6542.JPG
    ''$'\302\204\302\212\302\217''.doc'        IMG_6543.JPG IMG_6526.JPG
    

    Clearly the file names were stored in a different encoding, and 7z does not convert them to UTF-8 by default. How can I tell 7z to do the conversion?

    The only options I found for charset:

    -scc{UTF-8|WIN|DOS}: set charset for for console input/output
    -scs{UTF-8|UTF-16LE|UTF-16BE|WIN|DOS|{id}}: set charset for list files

    WIN, DOS, UTF-8 do not work. When trying to guess charset via

    7z -scsCP1251 l 26-08-2016_10-18-14.zip
    

    7z gives a warning:

    Unsupported charset: cp1251
    

    unzip does this right (Cyrillic characters are converted):

    '20 к.1 по Кировоградская ул.rtf'  IMG_6532.JPG  'Доп  соглашение(3).doc'
    26-08-2016_10-18-14.zip        IMG_6542.JPG  'Кировоградская, ист.doc'
    IMG_6526.JPG               IMG_6543.JPG
    IMG_6527.JPG               ДКП.doc
    

    Supplementary information

    • p7zip Version:
      15.14.1 (locale=ru_RU.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs AMD Phenom(tm) II X4 960T Processor (100FA0),ASM)
      
    • hexdump of start of archive (od -tx1z -Ax):
      000000 50 4b 03 04 14 00 00 00 00 00 81 54 1a 49 7e 35  >PK.........T.I~5<
      000010 fa 34 00 ec 00 00 00 ec 00 00 07 00 17 00 84 8a  >.4..............<
      000020 8f 2e 64 6f 63 75 70 13 00 01 19 fd 45 54 d0 94  >..docup.....ET..<
      000030 d0 9a d0 9f 2e 64 6f 63 00 00 00 00 d0 cf 11 e0  >.....doc........<
      000040 a1 b1 1a e1 00 00 00 00 00 00 00 00 00 00 00 00  >................<
      000050 00 00 00 00 3e 00 03 00 fe ff 09 00 06 00 00 00  >....>...........<
      000060 00 00 00 00 00 00 00 00 01 00 00 00 71 00 00 00  >............q...<
      000070 00 00 00 00 00 10 00 00 73 00 00 00 01 00 00 00  >........s.......<
      000080 fe ff ff ff 00 00 00 00 70 00 00 00 ff ff ff ff  >........p.......<
      000090 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  >................<
      *
      000230 ff ff ff ff ff ff ff ff ff ff ff ff ec a5 c1 00  >................<
      000240 07 80 19 04 00 00 f0 12 bf 00 00 00 00 00 00 10  >................<
      000250 00 00 00 00 00 08 00 00 72 7b 00 00 0e 00 62 6a  >........r{....bj<
      000260 62 6a 2a 16 2a 16 00 00 00 00 00 00 00 00 00 00  >bj*.*...........<
      000270 00 00 00 00 00 00 00 00 19 04 16 00 34 8e 00 00  >............4...<
      000280 48 7c 00 00 48 7c 00 00 4b 2c 00 00 00 00 00 00  >H|..H|..K,......<
      000290 19 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  >................<
      0002a0 00 00 00 00 00 00 00 00 ff ff 0f 00 00 00 00 00  >................<
      0002b0 00 00 00 00 ff ff 0f 00 00 00 00 00 00 00 00 00  >................<
      0002c0 ff ff 0f 00 00 00 00 00 00 00 00 00 00 00 00 00  >................<
      0002d0 00 00 00 00 b7 00 00 00 00 00 3e 0e 00 00 00 00  >..........>.....<
      0002e0 00 00 3e 0e 00 00 a0 1b 00 00 00 00 00 00 a0 1b  >..>.............<
      0002f0 00 00 00 00 00 00 a0 1b 00 00 00 00 00 00 a0 1b  >................<
      000300 00 00 00 00 00 00 a0 1b 00 00 14 00 00 00 00 00  >................<
      000310 00 00 00 00 00 00 ff ff ff ff 00 00 00 00 b4 1b  >................<
      000320 00 00 00 00 00 00 b4 1b 00 00 00 00 00 00 b4 1b  >................<
      000330 00 00 38 00 00 00 ec 1b 00 00 84 00 00 00 70 1c  >..8...........p.<
      000340 00 00 34 00 00 00 b4 1b 00 00 00 00 00 00 b8 28  >..4............(<
      000350 00 00 e6 01 00 00 a4 1c 00 00 00 00 00 00 a4 1c  >................<
      000360 00 00 00 00 00 00 a4 1c 00 00 00 00 00 00 a4 1c  >................<
      000370 00 00 00 00 00 00 a4 1c 00 00 00 00 00 00 d8 1d  >................<
      000380 00 00 00 00 00 00 d8 1d 00 00 00 00 00 00 d8 1d  >................<
      000390 00 00 00 00 00 00 43 28 00 00 02 00 00 00 45 28  >......C(......E(<
      0003a0 00 00 00 00 00 00 45 28 00 00 00 00 00 00 45 28  >......E(......E(<
      *
      0003c0 00 00 00 00 00 00 45 28 00 00 00 00 00 00 9e 2a  >......E(.......*<
      0003d0 00 00 a2 02 00 00 40 2d 00 00 da 00 00 00 45 28  >......@-......E(<
      0003e0 00 00 2d 00 00 00 00 00 00 00 00 00 00 00 00 00  >..-.............<
      0003f0 00 00 00 00 00 00 a0 1b 00 00 00 00 00 00 d8 1d  >................<
      000400 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  >................<
      000410 00 00 00 00 00 00 d8 1d 00 00 00 00 00 00 d8 1d  >................<
      000420
      
    • Admin almost 8 years
      It's a zip archive created on Windows 7 by a different person. He sent it to me for printing.
  • pusheax almost 8 years
    This does not work. Gives me the same result. I suspect file paths were stored with CP1251 charset. But you're using UTF-8.
  • VocalFan almost 8 years
    This answer works for zips that store only UTF-8; the hexdump shows that the OP's zip contains the CP866 encoding, followed by the UTF-8 encoding using Info-ZIP extensions (see pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT 4.6.9).
  • pusheax almost 8 years
    So then the question is how to make 7z use the UTF-8 version of the filename? Is it not capable?
  • VocalFan almost 8 years
    @pusheax Not with current 7z, no. Added a possible workaround, although by that point, it's simpler to stick with unzip.
  • pusheax almost 8 years
    I think you're right. That is unfortunate. Always thought 7z is superior.
  • VocalFan almost 8 years
    @pusheax The 7z format has its advantages, and the 7z tool is well-suited for handling that. But it's not the best for other formats and their vendor-specific extensions.