Invalid characters in File.ReadAllText
Solution 1
This is likely due to a mismatch in the Encoding. Use the ReadAllText overload which allows you to specify the proper Encoding to use when reading the file.
The default overload assumes UTF-8 unless it detects a byte-order mark for another Unicode encoding such as UTF-32. A file in any other encoding (for example, a legacy ANSI code page) will come through incorrectly.
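To make the mismatch concrete, here is a minimal sketch that decodes the same single byte two ways. It assumes the file stores ® as the single byte 0xAE, as ISO-8859-1 does; the class and method names are hypothetical and only there for illustration.

```csharp
using System;
using System.Text;

static class EncodingMismatchDemo
{
    // Decodes the bytes with the given encoding and returns the
    // numeric value of the first character of the resulting string.
    public static int FirstCharCode(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes)[0];
    }

    static void Main()
    {
        // Assumption: the file stores the registered sign as one byte, 0xAE.
        byte[] registeredSign = { 0xAE };

        // A lone 0xAE is not valid UTF-8, so the decoder substitutes U+FFFD.
        Console.WriteLine(FirstCharCode(registeredSign, Encoding.UTF8)); // 65533

        // ISO-8859-1 maps byte 0xAE directly to U+00AE.
        Console.WriteLine(FirstCharCode(registeredSign, Encoding.GetEncoding(28591))); // 174
    }
}
```

The bytes on disk are the same in both cases; only the encoding used to interpret them differs.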
Solution 2
Most likely the file uses a different encoding than the default. If you know it, you can specify it using the File.ReadAllText(String, Encoding) overload.
Code sample:
string readText = File.ReadAllText(path, Encoding.Default); // <-- change the encoding to whatever the encoding really is
If you DON'T know the encoding, see this previous SO question: How to use ReadAllText when file encoding unknown
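If the encoding really is unknown, one common heuristic (not taken from the linked question, so treat it as an assumption) is to try strict UTF-8 first and fall back to a legacy encoding only when the bytes are not valid UTF-8. The sketch below uses hypothetical names (EncodingFallbackReader, ReadWithFallback), and the fallback encoding is still your best guess. It does not look for UTF-16 or UTF-32 byte-order marks.

```csharp
using System.IO;
using System.Text;

static class EncodingFallbackReader
{
    // Hypothetical helper: reads the file as strict UTF-8 and, if the bytes
    // are not valid UTF-8, decodes them with the caller's fallback encoding.
    public static string ReadWithFallback(string path, Encoding fallback)
    {
        byte[] bytes = File.ReadAllBytes(path);

        // throwOnInvalidBytes: true makes invalid sequences raise an
        // exception instead of silently becoming U+FFFD.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);

        try
        {
            string text = strictUtf8.GetString(bytes);

            // GetString keeps a leading byte-order mark as U+FEFF; drop it.
            if (text.Length > 0 && text[0] == '\uFEFF')
            {
                text = text.Substring(1);
            }
            return text;
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(bytes);
        }
    }
}
```

A call might look like `EncodingFallbackReader.ReadWithFallback(path, Encoding.GetEncoding(28591))`. Text in a legacy encoding that happens to form valid UTF-8 (pure ASCII, for instance) will be read as UTF-8, so this is a guess rather than a guarantee.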
Solution 3
You need to specify the encoding when you call File.ReadAllText, unless the file is actually in UTF-8, which it sounds like it's not. (Basically the one-parameter overload is equivalent to passing in UTF-8 as the second argument. It will also detect UTF-32 with an appropriate byte-order mark, I believe.)
The first thing is to work out which encoding it is in (e.g. ISO-8859-1 - but you need to check this) and then pass that as a second argument.
For example:
Encoding isoLatin1 = Encoding.GetEncoding(28591);
string text = File.ReadAllText(path, isoLatin1);
It's always important that you know what encoding binary data is using before you try to read it as text. That's true for files, network streams, anything.
mrK
Updated on March 19, 2020
Comments
-
mrK about 4 years
I'm calling File.ReadAllText() in a program designed to format some files that I have. Some of these files contain the ® (174) symbol. However, when the text is being read, the returned string contains � (65533) symbols where the ® (174) should be. What would cause this and how can I fix it?
-
mrK about 11 years
Is there a reason that the method doesn't use Encoding.Default as its default encoding?
-
Reed Copsey about 11 years
@mrK Not sure why it's that way, but that's what the framework designers chose to use. It's documented, but I agree, an odd choice.
-
Martin Liversage about 11 years
One explanation could be that Encoding.Default uses the so-called current ANSI code page of the system, which varies from system to system. Using UTF-8 avoids the encoding errors you get from encoding and decoding on systems with different current ANSI code pages. Furthermore, UTF-8 can encode all of Unicode.
Stefan Steiger almost 4 years
Encoding.Default no longer works in .NET Core, always returns UTF-8...
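On .NET Core and .NET 5+, the usual workaround is to register the code-pages provider and ask for the legacy code page by number. The sketch below assumes the file is Windows-1252, which the question does not confirm, and uses hypothetical names (LegacyCodePageReader, ReadWindows1252). On .NET Framework, CodePagesEncodingProvider may need the System.Text.Encoding.CodePages package.

```csharp
using System.IO;
using System.Text;

static class LegacyCodePageReader
{
    // Hypothetical helper: reads a file that is assumed to be Windows-1252.
    public static string ReadWindows1252(string path)
    {
        // Makes code-page encodings such as 1252 available on
        // .NET Core / .NET 5+, where they are not registered by default.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding windows1252 = Encoding.GetEncoding(1252);
        return File.ReadAllText(path, windows1252);
    }
}
```

Naming the code page explicitly, rather than relying on Encoding.Default, keeps the result the same across machines and runtimes.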