Read txt files (in unicode and utf8) by means of C#
Solution 1
Edited: UTF-8 should support the characters. It seems that you're outputting to a console or another location whose encoding hasn't been set. If so, you need to set the encoding. For the console you can do this:
string allText = File.ReadAllText(unicodeFileFullName, Encoding.UTF8);
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine(allText);
Solution 2
Use the Encoding type Default
File.ReadAllText(unicodeFileFullName, Encoding.Default);
It will fix the ???? characters.
Solution 3
When outputting Unicode or UTF-8 encoded multi-byte characters to the console you will need to set the encoding as well as ensure that the console has a font set that supports the multi-byte character in order to display the corresponding glyph. With your existing code a MessageBox.Show(content) or display on a Windows or Web Form would appear correctly.
Have a look at http://msdn.microsoft.com/en-us/library/system.console.aspx for an explanation on setting fonts within the console window.
"Support for Unicode characters requires the encoder to recognize a particular Unicode character, and also requires a font that has the glyphs needed to render that character. To successfully display Unicode characters to the console, the console font must be set to a non-raster or TrueType font such as Consolas or Lucida Console."
As a side note, you can use the FileStream class to read the first three bytes of the file and look for the byte order mark indicator to automatically set the encoding when reading the file. For example, if byte[0] == 0xEF && byte[1] == 0xBB && byte[2] == 0xBF then you have a UTF-8 encoded file. Refer to http://en.wikipedia.org/wiki/Byte_order_mark for more information.
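The byte order mark check described above could be sketched roughly as follows. This is only an illustration, not part of the original answer: the names BomSniffer, DetectFromBom and ReadAllTextByBom are made up for this example, and the caller-supplied fallback encoding is an assumption for files that have no BOM at all.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, named for illustration only.
static class BomSniffer
{
    // Returns the encoding suggested by a byte order mark, or null if none is recognized.
    // Limitation: with only the first bytes inspected, a UTF-32 little-endian BOM
    // (FF FE 00 00) would be reported as UTF-16 little-endian.
    public static Encoding DetectFromBom(byte[] head)
    {
        // UTF-8 BOM: EF BB BF
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return Encoding.UTF8;

        // UTF-16 little-endian BOM: FF FE
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return Encoding.Unicode;

        // UTF-16 big-endian BOM: FE FF
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return Encoding.BigEndianUnicode;

        return null;
    }

    // Reads the first three bytes with FileStream, then reads the whole file
    // with the detected encoding (or the fallback when no BOM is present).
    public static string ReadAllTextByBom(string path, Encoding fallback)
    {
        byte[] head = new byte[3];
        int read;
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            read = stream.Read(head, 0, head.Length);
        }

        // Files shorter than three bytes leave the buffer partly unused; trim it.
        Array.Resize(ref head, read);

        Encoding encoding = DetectFromBom(head) ?? fallback;
        return File.ReadAllText(path, encoding);
    }
}
```

A call such as BomSniffer.ReadAllTextByBom(unicodeFileFullName, Encoding.UTF8) would then pick UTF-16 for a file saved as "Unicode" with a BOM, and fall back to UTF-8 when no BOM is present. A missing BOM does not prove anything about the encoding, so the fallback remains a guess.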
Comments
-
mtkachenko almost 2 years
I created two txt files (windows notepad) with the same content "thank you - спасибо" and saved them in utf8 and unicode. In notepad they look fine. Then I tried to read them using .Net:
...File.ReadAllText(utf8FileFullName, Encoding.UTF8);
and
...File.ReadAllText(unicodeFileFullName, Encoding.Unicode);
But in both cases I got this "thank you - ???????". What's wrong?
Upd: code for utf8
static void Main(string[] args)
{
    var encoding = Encoding.UTF8;
    var file = new FileInfo(@"D:\encodes\enc.txt");
    Console.OutputEncoding = encoding;
    var content = File.ReadAllText(file.FullName, encoding);
    Console.WriteLine("encoding: " + encoding);
    Console.WriteLine("content: " + content);
    Console.ReadLine();
}
Result: thanks ÑпаÑибо
-
user1703401 over 10 years
The encoding used by Notepad by default is Encoding.Default. Incompatible with your choices. Windows appcompat is legendary, but does get in the way of modern practices. Don't hesitate to whack Notepad over the head by changing the Encoding selection in the combobox. Or use a better text editor that writes a BOM.
-
Darius over 10 years
Please don't use magic numbers in code, extract them as constants :P
-
mtkachenko over 10 years
But I saved the file in UTF-8 (it looks normal in Notepad), so why can't I read it with Encoding.UTF8?
-
keyboardP over 10 years
@oblomov - Are you outputting to the console (which is then showing the ???????)? (updated answer)
-
mtkachenko over 10 years
@keyboardP - Yes, to console.
-
keyboardP over 10 years
@oblomov - Try setting the OutputEncoding property of the console as shown above.
-
mtkachenko over 10 years
@keyboardP - I did it, but instead of 'спасибо' I got some unreadable symbols.
-
keyboardP over 10 years
@oblomov - That's strange. I copy/pasted your text into a new Notepad file and saved it as UTF-8. Tried the code above and it works fine. Are you sure you're not accidentally loading the version you saved as Unicode in Notepad?
-
mtkachenko over 10 years
@keyboardP - I checked: the file is saved in UTF-8. My code is in the first post.
-
Richard Robertson almost 9 years
The Wiki article simply points out that Microsoft's software is stupid. They've caused a lot of programmers to think that the BOM is part of UTF-8 - it's not. I don't mind a down vote for this response since I'm just griping about the extra work I have to do to parse a text file because MS doesn't follow standards. You might be better off providing the link to the WHOLE article: en.wikipedia.org/wiki/UTF-8
-
alireza amini almost 8 years
Basically it uses your current system encoding and returns the text as it appears on your system.
-
mtkachenko almost 8 years
Different servers can have different Encoding.Default values, so it's not safe.