Invalid characters in File.ReadAllText

c# text character-encoding special-characters symbols

22,767

Solution 1

This is likely due to a mismatch in the Encoding. Use the ReadAllText overload which allows you to specify the proper Encoding to use when reading the file.

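The single-argument overload assumes UTF-8 unless the file starts with a byte-order mark for another Unicode encoding such as UTF-16 or UTF-32. A file in any other encoding, for example a Windows ANSI code page, will come through incorrectly.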
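
To make the mismatch concrete, here is a minimal sketch. It assumes the file was saved in a single-byte encoding such as ISO-8859-1, where ® is stored as the single byte 174 (0xAE). The class and method names are made up for illustration.

```csharp
using System.Text;

public static class EncodingMismatchDemo
{
    // Decodes the bytes with the given encoding and returns the numeric
    // value of the first character, matching the numbers in the question.
    public static int FirstCodePoint(byte[] bytes, Encoding encoding)
    {
        string text = encoding.GetString(bytes);
        return text[0];
    }
}
```

Calling FirstCodePoint(new byte[] { 0xAE }, Encoding.UTF8) gives 65533, because a lone 0xAE byte is not valid UTF-8 and the decoder substitutes the replacement character. The same byte decoded with Encoding.GetEncoding(28591) gives 174.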

Solution 2

Most likely the file uses a different encoding than the default. If you know which one, you can specify it using the File.ReadAllText(String, Encoding) overload.

Code sample:

string readText = File.ReadAllText(path, Encoding.Default);  // <-- change the encoding to whatever the encoding really is

If you DON'T know the encoding, see this previous SO question: How to use ReadAllText when file encoding unknown
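
If you cannot find out the encoding, one common heuristic is to try strict UTF-8 first and fall back to a legacy encoding only when the bytes are not valid UTF-8. The sketch below is one possible approach, not a reliable detector: TextFileReader and its methods are hypothetical names, and the caller still has to guess the fallback encoding.

```csharp
using System.IO;
using System.Text;

public static class TextFileReader
{
    // Hypothetical helper: reads the raw bytes, then decodes them.
    public static string ReadWithFallback(string path, Encoding fallback)
    {
        return DecodeWithFallback(File.ReadAllBytes(path), fallback);
    }

    public static string DecodeWithFallback(byte[] bytes, Encoding fallback)
    {
        // throwOnInvalidBytes: true makes the decoder throw on invalid
        // input instead of silently emitting U+FFFD.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            string text = strictUtf8.GetString(bytes);
            // GetString keeps a leading byte-order mark as U+FEFF, so drop it.
            return text.Length > 0 && text[0] == '\uFEFF' ? text.Substring(1) : text;
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: decode with the caller's best guess instead.
            return fallback.GetString(bytes);
        }
    }
}
```

For example, TextFileReader.ReadWithFallback(path, Encoding.GetEncoding(28591)) returns correct text for both UTF-8 and ISO-8859-1 files. It will still misread a file in some third encoding, since any byte sequence decodes "successfully" as ISO-8859-1.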

Solution 3

You need to specify the encoding when you call File.ReadAllText, unless the file is actually in UTF-8, which it sounds like it's not. (Basically the one-parameter overload is equivalent to passing in UTF-8 as the second argument. It will also detect UTF-32 with an appropriate byte-order mark, I believe.)

The first thing is to work out which encoding it is in (e.g. ISO-8859-1 - but you need to check this) and then pass that as a second argument.

For example:

Encoding isoLatin1 = Encoding.GetEncoding(28591);
string text = File.ReadAllText(path, isoLatin1);

It's always important that you know what encoding binary data is using before you try to read it as text. That's true for files, network streams, anything.
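
The same idea applies to any stream, not just files. As a rough sketch (StreamTextReader is a made-up name), you can wrap the stream in a StreamReader and pass the encoding explicitly:

```csharp
using System.IO;
using System.Text;

public static class StreamTextReader
{
    // Reads all text from any stream (file, network, memory) using an
    // explicitly chosen encoding rather than the UTF-8 default.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Note that this StreamReader constructor still honours a byte-order mark if the data starts with one, so the encoding you pass acts as the choice for data without a BOM.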

Author by

mrK

Updated on March 19, 2020

Comments

mrK about 4 years

I'm calling File.ReadAllText() in a program designed to format some files that I have.

Some of these files contain the ® (174) symbol. However, when the text is being read, the returned string contains � (65533) symbols where the ® (174) should be.

What would cause this and how can I fix it?
mrK about 11 years

Is there a reason that the method doesn't use Encoding.Default as its default encoding?
Reed Copsey about 11 years

@mrK Not sure why it's that way, but that's what the framework designers chose to use. It's documented, but I agree, it's an odd choice.
Martin Liversage about 11 years

One explanation could be that Encoding.Default uses the so-called current ANSI code page of the system, which varies from system to system. Using UTF-8 avoids the errors you get when text is encoded on one system and decoded on another with a different ANSI code page. Furthermore, UTF-8 can encode all of Unicode.
Stefan Steiger almost 4 years

Encoding.Default no longer gives you the system ANSI code page in .NET Core; it always returns UTF-8...
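
On .NET Core, a Windows code page such as 1252 has to be requested by number after registering the code-pages provider. A minimal sketch follows, assuming .NET Core 3.0 or later, where I believe CodePagesEncodingProvider is available without extra packages (older targets need the System.Text.Encoding.CodePages NuGet package). LegacyEncodings is a made-up name, and 1252 is only an example; use whichever code page the file was really written in.

```csharp
using System.Text;

public static class LegacyEncodings
{
    // Registers the code-pages provider, then looks up Windows-1252.
    // Registering the same provider more than once is harmless.
    public static Encoding GetWindows1252()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }
}
```

You would then call File.ReadAllText(path, LegacyEncodings.GetWindows1252()) instead of relying on Encoding.Default.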