German letters and encoding in C#


Solution 1

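Try code page 850 (it has worked for me):

// On .NET Core / .NET 5+, call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first.
using (ZipArchive archive = System.IO.Compression.ZipFile.Open(ZipFile, ZipArchiveMode.Read, System.Text.Encoding.GetEncoding(850)))
{
    // ....
}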

The following comment, from an ancient version of SharpZipLib, pointed me in the right direction:

    /* Using the codepage 1252 doesn't solve the 8bit ASCII problem :/
       any help would be appreciated.

      // get encoding for latin characters (like ö, ü, ß or ô)
      static Encoding ecp1252 = Encoding.GetEncoding(1252);
    */

    // private static Encoding _encoding = System.Text.ASCIIEncoding;
    private static Encoding _encoding = System.Text.Encoding.GetEncoding(850);

The last line is my change, made so that it correctly reads zip files with special characters.
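
To see why the code page matters, here is a minimal sketch (not part of SharpZipLib or the original answer) that encodes a name with one code page and decodes the bytes with another. The class and method names are hypothetical, and the sketch assumes the name was written with code page 850, as the question's garbled output suggests.

```csharp
using System;
using System.Text;

public static class CodePageDemo
{
    // Encodes a name with one code page and decodes the bytes with another,
    // which is what happens when a zip reader guesses the wrong code page.
    public static string Reinterpret(string name, int writtenWith, int readWith)
    {
        // Needed on .NET Core / .NET 5+ before asking for code pages such as 850.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] raw = Encoding.GetEncoding(writtenWith).GetBytes(name);
        return Encoding.GetEncoding(readWith).GetString(raw);
    }
}
```

Calling `CodePageDemo.Reinterpret("äÄéöÖüß.txt", 850, 1252)` should give a garbled name much like the one in the question, while reading with 850 again returns the original name.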

Solution 2

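First of all, the ZIP format historically did not allow Unicode file names: entries were expected to use the original DOS code page (437), and a flag for UTF-8 names was only added in later versions of the specification, which not every tool honors.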

That said, many tools and libraries let you use a different encoding, but it may fail (for example, if you force UTF-8/UTF-32 or whatever when decoding a file that was encoded with another encoding).

If a file name is encoded in ASCII, it'll use the code page of your system:

For entry names that contain only ASCII characters, the language encoding flag is set, and the current system default code page is used to encode the entry names.

You don't have much control over this with the .NET classes. If you don't specify an encoding, you get the default behavior (UTF-8 for names with characters outside ASCII, the current code page for ASCII-only names). Most of the time it works (provided encoding and decoding were both done with the same code page).
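
As a rough illustration of that default behavior, the sketch below writes one empty entry into an in-memory zip and reads its name back without passing an encoding on either side. The class and method names are made up for this example; it only covers the case where the same framework both writes and reads the archive.

```csharp
using System.IO;
using System.IO.Compression;

public static class DefaultEncodingDemo
{
    // Writes one empty entry into an in-memory zip and reads its name back,
    // passing no entryNameEncoding on either side.
    public static string RoundTrip(string entryName)
    {
        using (var buffer = new MemoryStream())
        {
            // leaveOpen: true keeps the MemoryStream usable after the writer is disposed.
            using (var writer = new ZipArchive(buffer, ZipArchiveMode.Create, leaveOpen: true))
            {
                writer.CreateEntry(entryName);
            }

            buffer.Position = 0;
            using (var reader = new ZipArchive(buffer, ZipArchiveMode.Read))
            {
                return reader.Entries[0].FullName;
            }
        }
    }
}
```

A name such as "äöü.txt" should come back unchanged here, because the writer marks it as UTF-8 and the reader honors that mark. Archives made by other tools may not set it, which is where the trouble starts.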

How to avoid this? It's not easy (because we lack a standard), but to summarize:

  • Do not force an encoding (unless you're consuming a zip file that you created yourself with a known encoding).
  • The default behavior is pretty good in most cases.
  • For ASCII-encoded ZIPs with extended characters, rely on the system code page (it must be the same on both systems).
  • Give the user a way to change the encoding (you can't detect which encoding the zip utility used, and there is no standard for this). That means changing not only the encoding (UTF-8, UTF-16 or whatever) but the code page too (in case they don't match). The GetEncoding function will give you the right encoder for the code page you specify.
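
One way to follow these points could look like the sketch below. `ZipOpener`, `ResolveEncoding` and the `codePage` parameter are hypothetical names for this example; the idea is simply that `null` keeps the framework default and any other value comes from a user setting.

```csharp
using System.IO.Compression;
using System.Text;

public static class ZipOpener
{
    // codePage is a hypothetical user setting: null means "use the framework default".
    public static Encoding ResolveEncoding(int? codePage)
    {
        if (codePage == null)
        {
            return null;
        }

        // Needed on .NET Core / .NET 5+ for code pages such as 437, 850 or 1252.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage.Value);
    }

    // Opens the archive for reading; a null encoding leaves the default behavior in place.
    public static ZipArchive Open(string zipPath, int? codePage)
    {
        return ZipFile.Open(zipPath, ZipArchiveMode.Read, ResolveEncoding(codePage));
    }
}
```

With this shape the default path stays untouched, and a user who receives archives from a particular tool can pick 437, 850, 857 or whatever that tool happens to use.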

My best hint? Rely on the default behavior (it's pretty common), but give your users a way to change it if you need to be compatible with most of the ZIPs out there (each one may be implemented differently), not only for the encoding but for the code page too. In particular, do not force a German-specific code page from code, because it'll break with the first Spanish/French/Italian/Dutch file you handle (and there is no common code page for them).

BTW, be ready to handle various exceptions if you open a file with the wrong encoding (not code page).

Edit for future readers (from the comments): CP 850 covers most of the common Western European characters, but it's not The Code Page for Europe. Compare it, for example, with Eastern European languages or with Norwegian. It doesn't match them (and in those languages characters outside the 33-127 range are pretty common, because they're not box-drawing characters). Some characters from CP 850 (Ê Ë ı for example) are not available in, say, CP 865 (for Norwegian).

Let me explain with an example. You have a file (from Turkey) with this name: "Garip Dosya Adı.txt". The ı has code 141 in CP 857 (for Turkish). If you're using CP 850 you'll get ì instead of ı, because in the original CP 850 ı has code 213. I won't even mention Far East languages (a fixed code page makes a mess even if you're limited to Europe). This is the reason you can't set a fixed code page unless you're writing a small utility for your own use.
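
The byte values in this example can be checked with a small sketch like the following. `DotlessIDemo` and `DecodeByte` are hypothetical names, and the code assumes the framework's tables for code pages 850 and 857 match the original IBM ones.

```csharp
using System.Text;

public static class DotlessIDemo
{
    // Decodes a single byte with the given code page.
    public static string DecodeByte(byte value, int codePage)
    {
        // Needed on .NET Core / .NET 5+ before asking for DOS code pages.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage).GetString(new[] { value });
    }
}
```

Here `DecodeByte(141, 857)` is expected to give ı, while `DecodeByte(141, 850)` gives ì: the same byte, two different letters.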

Solution 3

I used the following namespaces:

using System.IO;
using System.Text;

with Encoding.Latin1 in the following method:

File.ReadAllLinesAsync(filePath, Encoding.Latin1, cancellationToken);

which worked in my case.

Author by

eMizo

Updated on June 04, 2022

Comments

  • eMizo
    eMizo almost 2 years

    I have an unzipping function, and I am using System.Text.Encoding to make sure that the extracted files keep the same names after extraction, because the files that I am unzipping usually contain German letters.
    I tried different things like Encoding.Default or Encoding.UTF8, but nothing works: äÄéöÖüß.txt gets converted to „Ž‚”™á.txt, or in the case of Default it comes out as black boxes :/

    any suggestions?

    using (ZipArchive archive = System.IO.Compression.ZipFile.Open(ZipFile, ZipArchiveMode.Read, System.Text.Encoding.Default))
    {
    
        foreach (ZipArchiveEntry entry in archive.Entries)
        {
            string fullPath = Path.Combine(appPath, entry.FullName);
            if (String.IsNullOrEmpty(entry.Name))
            {
                Directory.CreateDirectory(fullPath);
            }
            else
            {
                if (!entry.Name.Equals("Updater.exe"))
                {
                    entry.ExtractToFile(fullPath,true);
    
                }
            }
        }
    }
    
  • eMizo
    eMizo over 10 years
    Thanks for your explanation :) really useful especially that I have so little knowledge regarding this topic :) thanks again !
  • eMizo
    eMizo over 10 years
    One last question: in my case code page 850 worked, but the default didn't give me the solution. Would it be so bad to keep using 850?
  • Adriano Repetti
    Adriano Repetti over 10 years
    @eMizo of course it is REALLY bad (unless you're writing a small utility for your personal use). Page 850 contains most of the common characters used in Western languages, but it's not the default 437 DOS code page (to which the ZIP format in theory adheres). It means that you may open most files with German characters, but it'll fail with perfectly valid ZIP files, and it's a completely different thing from the Windows 1252 code page (many, many commonly used characters don't match).
  • Adriano Repetti
    Adriano Repetti over 10 years
    Just a few notes: pages 850 and 1252 are completely different (and absolutely not interchangeable). Forcing a code page will break compatibility with existing ZIP files (made God knows in which country) and even with perfectly valid ZIP files (made using the default 437 code page). This may help him to open files made on his machine and with a specific zip utility, but it'll fail with 99% of the other ZIPs out there (encoded with 437, 1252 or UTF-8).
  • Adriano Repetti
    Adriano Repetti over 10 years
    @eMizo moreover, it works only with ZIP files from your machine with your specific zip utility. Another utility may always encode as UTF-8 (for example) and it'll fail. Another one may use the 1252 code page. Another one may rely on the system default and UTF-8 (the default framework behavior) and so on... If you force it, you limit your utility to handling only a very restricted set of them...
  • user1149201
    user1149201 over 10 years
    The 1252 was mentioned in the SharpZipLib, but never used. I've used 850, and never got any problem reading zip files from other applications. I removed the reference to 1252 from the first sentence in my answer.
  • user1149201
    user1149201 over 10 years
    Codepage 850 works perfectly on my machine with .zip files received from other machines in other companies in other (European) countries. I haven't tested 437, but that's mainly because I don't expect many box-drawing characters to appear in filenames.
  • Adriano Repetti
    Adriano Repetti over 10 years
    @GvS of course, because cp 850 covers most of the common Western European characters. Compare it, for example, with Eastern European languages or with Norwegian. It doesn't match them (and in those languages characters outside the 33-127 range are pretty common, because they're not box-drawing characters...). Some characters from cp 850 (ÊËı for example) are not available in, say, cp 865 (for Norwegian).
  • Adriano Repetti
    Adriano Repetti over 10 years
    Let me explain. You have a file (from Turkey) with this name: "Garip Dosya Adı.txt". The ı has code 141 in CP 857 (for Turkish). If you're using CP 850 you'll get ì instead of ı, because in the original CP 850 ı has code 213. I won't even mention Far East languages (a fixed code page makes a mess even if you're limited to Europe). CP tables are consistent only...as much as possible.
  • Max
    Max over 4 years
    CodePage 850 worked for me with WinZip 22.0 64 bit. Thank you.
    using (ZipArchive archive = ZipFile.Open(Target.FullName, ZipArchiveMode.Update, Encoding.GetEncoding(850)))