C# - Detecting encoding in a file, write change to file using the found encoding

Solution 1

Unfortunately, encoding is one of those subjects where there is not always a definitive answer. In many cases it is much closer to guessing the encoding than detecting it. Raymond Chen wrote an excellent blog post on this subject that is worth a read.

The gist of the article is:

  • If the BOM (byte order mark) exists, then you're golden
  • Otherwise it's guesswork and heuristics

However, I still think the best approach is the one Darin mentioned in the question you linked: let StreamReader guess for you rather than reinventing the wheel. It only requires a very slight modification to your sample.

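String f1;
Encoding encoding;
using (var reader = new StreamReader(fileList[i])) {
  // ToLower() is kept from the question; note that it lowercases the whole file content.
  f1 = reader.ReadToEnd().ToLower();
  // Encoding detection happens on the first read, so query CurrentEncoding only after reading.
  encoding = reader.CurrentEncoding;
}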

if (f1.Contains(oPath))
{
  f1 = f1.Replace(oPath, nPath);
  File.WriteAllText(fileList[i], f1, encoding);
}
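
One thing to watch in the sample above: ToLower() is applied to the whole file content, so everything written back is lowercased, not just the matched path. If only the match should be case-insensitive, a minimal sketch along these lines may help. ReplaceIgnoreCase and PathReplacer are hypothetical names, not part of the original answer; the sketch relies on Regex.Escape and RegexOptions.IgnoreCase from System.Text.RegularExpressions.

```csharp
using System;
using System.Text.RegularExpressions;

static class PathReplacer
{
    // Hypothetical helper: replaces every case-insensitive occurrence of oldValue
    // and leaves the rest of the text, including its casing, untouched.
    public static string ReplaceIgnoreCase(string text, string oldValue, string newValue)
    {
        // Escape the search text so backslashes and dots in a path are matched literally.
        string pattern = Regex.Escape(oldValue);
        // "$" is special in .NET replacement strings; "$$" produces a literal "$".
        string replacement = newValue.Replace("$", "$$");
        return Regex.Replace(text, pattern, replacement, RegexOptions.IgnoreCase);
    }
}
```

With this, the read step could drop ToLower() and the replace step could become f1 = PathReplacer.ReplaceIgnoreCase(f1, oPath, nPath), assuming that a case-insensitive match is what the original ToLower() call was for.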

Solution 2

By default, .NET uses UTF-8. Character encoding is hard to detect because most of the time .NET will read the file as UTF-8. I always have problems with ANSI.

My trick is to read the file as a stream, force it to be decoded as UTF-8, and look for the usual characters that should appear in the text. If they are found, the file is UTF-8; otherwise it is ANSI. I also tell the user that only two encodings are supported, ANSI or UTF-8. Auto-detection does not work well for my language. :p
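
A related variation on this trick, sketched here under stated assumptions, is to let a strict UTF-8 decoder do the check instead of searching for expected characters: if the bytes are not valid UTF-8, the decoder throws and the file is treated as ANSI. Utf8OrAnsi and Guess are hypothetical names, and the caller supplies the ANSI fallback because the right code page depends on the machine and the language.

```csharp
using System;
using System.Text;

static class Utf8OrAnsi
{
    // Hypothetical helper: returns a strict UTF-8 encoding if the bytes decode
    // cleanly as UTF-8, otherwise the fallback encoding supplied by the caller.
    public static Encoding Guess(byte[] bytes, Encoding ansiFallback)
    {
        // First argument: do not emit a BOM when writing.
        // Second argument: throw on invalid bytes instead of substituting U+FFFD.
        Encoding strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetString(bytes);
            return strictUtf8;
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8, so assume the caller's ANSI code page.
            return ansiFallback;
        }
    }
}
```

A call might look like Utf8OrAnsi.Guess(File.ReadAllBytes(path), Encoding.Default), assuming Encoding.Default is the ANSI code page on the target runtime. Pure ASCII files are also valid UTF-8, so they are reported as UTF-8; this remains a heuristic, not detection.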

Solution 3

Probably a bit late, but I encountered the same problem myself. Using the previous answers, I found a solution that works for me. It reads the text with a StreamReader created with Encoding.Default, takes the encoding the reader reports for that file, and uses a StreamWriter to write the text back with the changes in that encoding. It also removes and re-adds the ReadOnly flag.

        string file = "File to open";
        string text;
        Encoding encoding;
        string oldValue = "string to be replaced";
        string replacementValue = "New string";

        // Remember the original attributes and clear ReadOnly so the file can be rewritten.
        var attributes = File.GetAttributes(file);
        File.SetAttributes(file, attributes & ~FileAttributes.ReadOnly);

        // Encoding.Default is the fallback; a BOM, if present, overrides it on the first read.
        using (StreamReader reader = new StreamReader(file, Encoding.Default))
        {
            text = reader.ReadToEnd();
            encoding = reader.CurrentEncoding;
        }

        if (text.Contains(oldValue))
        {
            text = text.Replace(oldValue, replacementValue);

            // The using block disposes the writer, so no explicit Close() is needed.
            using (StreamWriter writer = new StreamWriter(file, false, encoding))
            {
                writer.Write(text);
            }
        }

        // Restore the original attributes, including ReadOnly only if it was set before.
        File.SetAttributes(file, attributes);

Solution 4

I am afraid you will have to know the encoding. For UTF-based encodings, though, you can use StreamReader's built-in functionality.

Taken from here.

With regard to encodings - you will need to have identified the encoding in order to use the StreamReader.

However, the StreamReader itself can help if you create it with one of the constructor overloads that allows you to supply the flag detectEncodingFromByteOrderMarks as true (or you can use Encoding.GetPreamble and look at the byte preamble yourself).

Both these methods will only help auto-detect UTF based encodings though - so any ANSI encodings with a specified codepage will probably not be parsed correctly.
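
As a rough illustration of the Encoding.GetPreamble route, the sketch below compares the first bytes of a file against the preambles of the common UTF encodings. BomSniffer and DetectFromBom are hypothetical names, and the candidate list is an assumption about which encodings matter; a null result means no BOM was found, not that the file is ANSI.

```csharp
using System;
using System.Text;

static class BomSniffer
{
    // Candidates are ordered so that UTF-32 LE (FF FE 00 00) is tested
    // before UTF-16 LE (FF FE), whose BOM is a prefix of it.
    static readonly Encoding[] Candidates =
    {
        new UTF32Encoding(false, true),    // UTF-32 LE, BOM FF FE 00 00
        new UTF32Encoding(true, true),     // UTF-32 BE, BOM 00 00 FE FF
        new UTF8Encoding(true),            // UTF-8, BOM EF BB BF
        new UnicodeEncoding(false, true),  // UTF-16 LE, BOM FF FE
        new UnicodeEncoding(true, true),   // UTF-16 BE, BOM FE FF
    };

    // Hypothetical helper: returns the encoding whose preamble matches the
    // start of the buffer, or null when no known BOM is present.
    public static Encoding DetectFromBom(byte[] head)
    {
        foreach (Encoding candidate in Candidates)
        {
            byte[] bom = candidate.GetPreamble();
            if (head.Length < bom.Length) continue;

            bool match = true;
            for (int i = 0; i < bom.Length; i++)
            {
                if (head[i] != bom[i]) { match = false; break; }
            }
            if (match) return candidate;
        }
        return null;
    }
}
```

For a quick check, BomSniffer.DetectFromBom(File.ReadAllBytes(path)) would work, although reading only the first four bytes is enough for large files. The returned encodings are configured to emit a BOM, so writing with them keeps the marker in place.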

Author by

cc0

Updated on June 11, 2022

Comments

  • cc0
    cc0 almost 2 years

    I wrote a small program that iterates through a lot of files and applies some changes where a certain string match is found. The problem is that different files have different encodings, so I would like to check the encoding and then overwrite the file in its original encoding.

    What would be the prettiest way of doing that in C# .NET 2.0?

    My code looks very simple as of now:

    String f1 = File.ReadAllText(fileList[i]).ToLower();
    
    if (f1.Contains(oPath))
    {
        f1 = f1.Replace(oPath, nPath);
        File.WriteAllText(fileList[i], f1, Encoding.Unicode);
    }
    

    I took a look at Auto encoding detect in C# which made me realize how I could detect encoding, but I am not sure how I could use that information to write in the same encoding.

    Would greatly appreciate any help here.