Read txt files (in unicode and utf8) by means of C#


Solution 1

Edited: UTF-8 should support these characters. It seems that you're writing to a console, or to another destination, whose encoding hasn't been set. If so, you need to set it. For the console you can do this:

string allText = File.ReadAllText(unicodeFileFullName, Encoding.UTF8);
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine(allText);
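
As a rough illustration of where the question marks come from, the sketch below round-trips the text from the question through two encodings, which is roughly what happens when the console converts a string to bytes for output. The class name ConsoleEncodingDemo and the helper RoundTrip are made up for this example, and ASCII is only a stand-in for a console code page that has no Cyrillic letters.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of the original answer.
public static class ConsoleEncodingDemo
{
    // Encodes the text to bytes and decodes it again with the same encoding.
    // Characters the encoding cannot represent are replaced during encoding.
    public static string RoundTrip(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }

    public static void Main()
    {
        const string text = "thank you - спасибо";

        // ASCII has no Cyrillic letters, so each one becomes '?'.
        Console.WriteLine(RoundTrip(text, Encoding.ASCII));

        // UTF-8 can represent every Unicode character, so the text survives.
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(RoundTrip(text, Encoding.UTF8));
    }
}
```

The first line printed is the same "thank you - ???????" that the question reports. Whether the second line renders correctly still depends on the console font.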

Solution 2

Use Encoding.Default as the encoding:

File.ReadAllText(unicodeFileFullName, Encoding.Default);

This fixes the ???? characters when the file was saved in the system's default encoding.
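
Whether this works depends on the file having been written in the same encoding that Encoding.Default returns, and that value varies: on .NET Framework it is the system's ANSI code page, while on .NET Core and later it is UTF-8. The sketch below shows what a mismatch does to the text; MismatchDemo and Transcode are hypothetical names, and ISO-8859-1 is chosen only as an example of a wrong single-byte encoding.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of the original answer.
public static class MismatchDemo
{
    // Encodes text with one encoding and decodes the bytes with another,
    // simulating a file written in one encoding and read in a different one.
    public static string Transcode(string text, Encoding writtenWith, Encoding readWith)
    {
        return readWith.GetString(writtenWith.GetBytes(text));
    }

    public static void Main()
    {
        const string text = "спасибо";
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        Console.OutputEncoding = Encoding.UTF8;

        // Shows which encoding Encoding.Default resolves to on this machine and runtime.
        Console.WriteLine("Encoding.Default here: " + Encoding.Default.WebName);

        // Matching encodings: the text comes back unchanged.
        Console.WriteLine(Transcode(text, Encoding.UTF8, Encoding.UTF8));

        // Mismatched encodings: every UTF-8 byte becomes its own character (mojibake).
        Console.WriteLine(Transcode(text, Encoding.UTF8, latin1));
    }
}
```

Passing the encoding the file was actually saved in is more predictable than relying on Encoding.Default.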

Solution 3

When writing Unicode or UTF-8 encoded multi-byte characters to the console, you need to set the output encoding and also make sure the console uses a font that contains the glyphs for those characters. With your existing code, the text would display correctly in a MessageBox.Show(content) call or on a Windows or Web Form.

Have a look at http://msdn.microsoft.com/en-us/library/system.console.aspx for an explanation on setting fonts within the console window.

"Support for Unicode characters requires the encoder to recognize a particular Unicode character, and also requires a font that has the glyphs needed to render that character. To successfully display Unicode characters to the console, the console font must be set to a non-raster or TrueType font such as Consolas or Lucida Console."

As a side note, you can use the FileStream class to read the first three bytes of the file and look for the byte order mark indicator to automatically set the encoding when reading the file. For example, if byte[0] == 0xEF && byte[1] == 0xBB && byte[2] == 0xBF then you have a UTF-8 encoded file. Refer to http://en.wikipedia.org/wiki/Byte_order_mark for more information.
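
A minimal sketch of that byte order mark check follows. The class BomSniffer and its methods are hypothetical names for this example. Instead of hard-coding the marker bytes, it compares against the value each encoding reports from Encoding.GetPreamble(), and it reads four bytes rather than three so that UTF-32 can be told apart from UTF-16.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class BomSniffer
{
    // Candidates in priority order: UTF-32 LE must come before UTF-16 LE,
    // because its BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
    private static readonly Encoding[] Candidates =
    {
        Encoding.UTF8, Encoding.UTF32, Encoding.Unicode, Encoding.BigEndianUnicode
    };

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (prefix.Length == 0 || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    // Returns the encoding whose BOM prefixes the data, or null if none matches.
    public static Encoding DetectFromBom(byte[] data)
    {
        foreach (Encoding candidate in Candidates)
        {
            if (StartsWith(data, candidate.GetPreamble())) return candidate;
        }
        return null;
    }

    // Reads up to the first four bytes of the file and checks them for a BOM.
    public static Encoding DetectFromFile(string path)
    {
        byte[] head = new byte[4];
        using (FileStream stream = File.OpenRead(path))
        {
            int read = stream.Read(head, 0, head.Length);
            Array.Resize(ref head, read); // the file may be shorter than four bytes
        }
        return DetectFromBom(head);
    }

    public static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("Pass a file path as the first argument.");
            return;
        }
        Encoding detected = DetectFromFile(args[0]);
        Console.WriteLine(detected == null ? "No BOM found" : detected.WebName);
    }
}
```

A missing BOM does not mean the file is not UTF-8, since the mark is optional there, so a null result still needs a fallback encoding chosen by the caller.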


Author: mtkachenko

Updated on July 09, 2022

Comments

  • mtkachenko
    mtkachenko almost 2 years

    I created two txt files in Windows Notepad with the same content, "thank you - спасибо", and saved one as UTF-8 and the other as Unicode. In Notepad they look fine. Then I tried to read them using .NET:

    ...File.ReadAllText(utf8FileFullName, Encoding.UTF8);
    

    and

    ...File.ReadAllText(unicodeFileFullName, Encoding.Unicode);
    

    But in both cases I got "thank you - ???????". What's wrong?

    Update: code for UTF-8

    static void Main(string[] args)
    {
        var encoding = Encoding.UTF8;
        var file = new FileInfo(@"D:\encodes\enc.txt");
        Console.OutputEncoding = encoding;
        var content = File.ReadAllText(file.FullName, encoding);
        Console.WriteLine("encoding: " + encoding);
        Console.WriteLine("content: " + content);
        Console.ReadLine();
    }
    

    Result: thanks ÑпаÑибо

    • user1703401
      user1703401 over 10 years
      The encoding used by Notepad by default is Encoding.Default. Incompatible with your choices. Windows appcompat is legendary, but does get in the way of modern practices. Don't hesitate to whack Notepad over the head by changing the Encoding selection in the combobox. Or use a better text editor that writes a BOM.
  • Darius
    Darius over 10 years
    Please don't use magic numbers in code, extract it as a constant :P
  • mtkachenko
    mtkachenko over 10 years
    But I saved the file as UTF-8 (in Notepad it looks normal), so why can't I read it with Encoding.UTF8?
  • keyboardP
    keyboardP over 10 years
    @oblomov - Are you outputting to the console (which is then showing the ???????)? (updated answer)
  • mtkachenko
    mtkachenko over 10 years
    @keyboardP - Yes, to console.
  • keyboardP
    keyboardP over 10 years
    @oblomov - Try setting the OutputEncoding property of the console as shown above.
  • mtkachenko
    mtkachenko over 10 years
    @keyboardP - I did it but instead of 'спасибо' I got some unreadable symbols.
  • keyboardP
    keyboardP over 10 years
    @oblomov - That's strange. I copy/pasted your text into a new Notepad file and saved it as UTF-8. Tried the code above and it works fine. Are you sure you're not accidentally loading the version you saved as Unicode in Notepad?
  • mtkachenko
    mtkachenko over 10 years
    @keyboardP - I checked: the file is saved as UTF-8. My code is in the first post.
  • Richard Robertson
    Richard Robertson almost 9 years
    The Wiki' article simply points out that Microsoft's software is stupid. They've caused a lot of programmers to think that BOM is part of UTF8 - it's not. I don't mind a down vote for this response since I'm just griping about the extra work I have to do to parse a text file because MS doesn't follow standards. You might be better off providing the link to the WHOLE article: en.wikipedia.org/wiki/UTF-8
  • alireza amini
    alireza amini almost 8 years
    Basically, it uses your current system encoding, so it returns the text the way your system displays it.
  • mtkachenko
    mtkachenko almost 8 years
    Different servers can have different Encoding.Default values, so it's not safe.