Invalid characters in File.ReadAllText

Solution 1

This is likely due to a mismatch in the Encoding. Use the ReadAllText overload which allows you to specify the proper Encoding to use when reading the file.

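The one-parameter overload assumes UTF-8 unless it finds a byte-order mark that identifies another Unicode encoding (the documentation mentions UTF-32). A file in any other encoding, such as a single-byte ANSI code page, will come through incorrectly.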
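To illustrate the mismatch, here is a minimal sketch that decodes the single byte 0xAE (the ® symbol in ISO-8859-1) first as UTF-8 and then as ISO-8859-1. It assumes the file is stored in a single-byte Latin encoding, which the question does not confirm. The class and method names are made up for the example.

```csharp
using System;
using System.Text;

public static class MismatchDemo
{
    // Hypothetical helper: decodes the bytes with the given encoding
    // and returns the code point of the first character.
    public static int FirstCodePoint(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes)[0];
    }

    public static void Main()
    {
        // Assumption: the file stores "®" as the single byte 0xAE (ISO-8859-1).
        byte[] registeredSign = { 0xAE };

        // 0xAE on its own is not valid UTF-8, so the decoder substitutes U+FFFD.
        Console.WriteLine(FirstCodePoint(registeredSign, Encoding.UTF8));               // 65533

        // Decoded with the matching encoding, the byte maps to U+00AE.
        Console.WriteLine(FirstCodePoint(registeredSign, Encoding.GetEncoding(28591))); // 174
    }
}
```

The 65533 in the output is the Unicode replacement character, which is what a decoder emits when it meets bytes that are invalid in the encoding it was told to use.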

Solution 2

Most likely the file uses a different encoding than the default. If you know which one, you can specify it using the File.ReadAllText Method (String, Encoding) overload.

Code sample:

string readText = File.ReadAllText(path, Encoding.Default);  // <-- change the encoding to whatever the encoding really is

If you DON'T know the encoding, see this previous SO question: How to use ReadAllText when file encoding unknown
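If the encoding is unknown, one possible heuristic (not taken from the linked question) is to try a strict UTF-8 decode first and fall back to a guessed legacy encoding when the bytes are not valid UTF-8. The sketch below works on in-memory bytes to stay self-contained. The helper name is hypothetical, and the fallback to ISO-8859-1 is an assumption you would replace with whatever encoding your files are likely to use.

```csharp
using System;
using System.Text;

public static class EncodingFallbackDemo
{
    // Hypothetical helper: returns the UTF-8 decoding if the bytes are valid UTF-8,
    // otherwise decodes them with the supplied fallback encoding.
    public static string DecodeUtf8OrFallback(byte[] bytes, Encoding fallback)
    {
        // throwOnInvalidBytes: true makes the decoder throw instead of emitting U+FFFD.
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                          throwOnInvalidBytes: true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(bytes);
        }
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding(28591); // assumed fallback encoding

        byte[] latin1Bytes = { 0x41, 0xAE };       // "A®" in ISO-8859-1
        byte[] utf8Bytes   = { 0x41, 0xC2, 0xAE }; // "A®" in UTF-8

        // Both inputs should decode to the same two characters.
        Console.WriteLine(DecodeUtf8OrFallback(latin1Bytes, latin1) == "A\u00AE");
        Console.WriteLine(DecodeUtf8OrFallback(utf8Bytes, latin1) == "A\u00AE");
    }
}
```

For a real file you would pass File.ReadAllBytes(path) to the helper. Bear in mind this is only a heuristic: it can tell valid UTF-8 from everything else, but it cannot tell one single-byte code page from another.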

Solution 3

You need to specify the encoding when you call File.ReadAllText, unless the file is actually in UTF-8, which it sounds like it's not. (Basically the one-parameter overload is equivalent to passing in UTF-8 as the second argument. It will also detect UTF-32 with an appropriate byte-order mark, I believe.)

The first thing is to work out which encoding it is in (e.g. ISO-8859-1 - but you need to check this) and then pass that as a second argument.

For example:

Encoding isoLatin1 = Encoding.GetEncoding(28591);
string text = File.ReadAllText(path, isoLatin1);

It's always important that you know what encoding binary data is using before you try to read it as text. That's true for files, network streams, anything.

Author by

mrK

Updated on March 19, 2020

Comments

  • mrK about 4 years

    I'm calling File.ReadAllText() in a program designed to format some files that I have.

    Some of these files contain the ® (174) symbol. However, when the text is read, the returned string contains the Unicode replacement character (65533, U+FFFD) where the ® (174) should be.

    What would cause this and how can I fix it?

  • mrK about 11 years
    Is there a reason that the method doesn't use Encoding.Default as its default encoding?
  • Reed Copsey about 11 years
    @mrK Not sure why it's that way, but that's what the framework designers chose to use. It's documented, but I agree, it's an odd choice.
  • Martin Liversage about 11 years
    One explanation could be that Encoding.Default uses the so-called current ANSI code page of the system, which varies from system to system. Using UTF-8 avoids the errors you get when text is encoded on one system and decoded on another with a different ANSI code page. Furthermore, UTF-8 can encode all of Unicode.
  • Stefan Steiger almost 4 years
    Encoding.Default no longer gives you the system ANSI code page in .NET Core; it always returns UTF-8.
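
Following up on the last comment: on .NET Core and later, legacy code pages such as Windows-1252 have to be requested explicitly. The sketch below assumes the file is in Windows-1252, which nothing in the question confirms, and that CodePagesEncodingProvider is available (built in on recent .NET versions, supplied by the System.Text.Encoding.CodePages package on older ones). The class and method names are made up for the example.

```csharp
using System;
using System.Text;

public static class CodePageDemo
{
    // Hypothetical helper: decodes bytes as Windows-1252 after making
    // the legacy code pages available to Encoding.GetEncoding.
    public static string DecodeWindows1252(byte[] bytes)
    {
        // Registering the same provider more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);
        return windows1252.GetString(bytes);
    }

    public static void Main()
    {
        // 0xAE is "®" in Windows-1252 as well as in ISO-8859-1.
        string text = DecodeWindows1252(new byte[] { 0xAE });
        Console.WriteLine((int)text[0]); // 174
    }
}
```

The same Encoding object can then be passed as the second argument to File.ReadAllText.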