Determine TextFile Encoding?

The first step is to load the file as a byte array instead of as a string. Strings are always stored in memory with UTF-16 encoding, so once it's loaded into a string, the original encoding is lost. Here's a simple example of one way to load a file into a byte array:

Dim data() As Byte = File.ReadAllBytes("test.txt")

Automatically determining the correct encoding for a given byte array is notoriously difficult. Sometimes, to be helpful, the author of the data will insert something called a BOM (Byte Order Mark) at the beginning of the data. If a BOM is present, that makes detecting the encoding painless, since each encoding uses a different BOM.

The easiest way to automatically detect the encoding from the BOM is to let the StreamReader do it for you. In the constructor of the StreamReader, you can pass True for the detectEncodingFromByteOrderMarks argument. Then you can get the encoding of the stream from its CurrentEncoding property. However, CurrentEncoding is not updated until after the StreamReader has read the BOM, so you first have to read past the BOM before you can get the encoding. For instance:

Public Function GetFileEncoding(filePath As String) As Encoding
    Using sr As New StreamReader(filePath, True)
        sr.Read()
        Return sr.CurrentEncoding
    End Using
End Function

However, the problem with this approach is that the MSDN documentation seems to imply that the StreamReader can only detect certain encodings:

The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. See the Encoding.GetPreamble method for more information.

Also, if the StreamReader is unable to determine the encoding from the BOM, or if the BOM isn't there, it will just default to UTF-8 encoding, without giving you any indication that it failed. If you need more granular control than that, you can pretty easily read the BOM and interpret it yourself. All you have to do is compare the first few bytes in the byte array with some known, expected BOMs to see if they match. Here is a list of some common BOMs:

  • UTF-8: EF BB BF
  • UTF-16 big endian byte order: FE FF
  • UTF-16 little endian byte order: FF FE
  • UTF-32 big endian byte order: 00 00 FE FF
  • UTF-32 little endian byte order: FF FE 00 00

So, for instance, to see if a UTF-16 (little endian) BOM exists at the beginning of the byte array, you could simply do something like this:

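' Check the length first, and use AndAlso so the indexing is skipped for short arrays
If data.Length >= 2 AndAlso data(0) = &HFF AndAlso data(1) = &HFE Then
    ' Data starts with UTF-16 (little endian) BOM
    ' (note that the UTF-32 little endian BOM begins with the same two bytes)
End If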
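To extend that single check to the whole list of BOMs above, one option is a small lookup table. The following is only a sketch: the module name BomTable and the function IdentifyBom are made up for illustration, and the table covers just the five BOMs listed in this answer. The entries are ordered longest first, because the UTF-16 little endian BOM (FF FE) is a prefix of the UTF-32 little endian BOM (FF FE 00 00).

```vbnet
Imports System
Imports System.Linq

Public Module BomTable
    ' Hypothetical helper table: the five BOMs listed above, longest first,
    ' so that UTF-32 LE (FF FE 00 00) is tested before UTF-16 LE (FF FE).
    Private ReadOnly KnownBoms As Tuple(Of String, Byte())() = {
        Tuple.Create("UTF-32 BE", New Byte() {&H0, &H0, &HFE, &HFF}),
        Tuple.Create("UTF-32 LE", New Byte() {&HFF, &HFE, &H0, &H0}),
        Tuple.Create("UTF-8", New Byte() {&HEF, &HBB, &HBF}),
        Tuple.Create("UTF-16 BE", New Byte() {&HFE, &HFF}),
        Tuple.Create("UTF-16 LE", New Byte() {&HFF, &HFE})
    }

    ' Returns the name of the first matching BOM, or Nothing if none matches.
    Public Function IdentifyBom(data() As Byte) As String
        For Each entry In KnownBoms
            Dim bom As Byte() = entry.Item2
            ' Data shorter than the BOM can never match it.
            If data.Length >= bom.Length AndAlso
               data.Take(bom.Length).SequenceEqual(bom) Then
                Return entry.Item1
            End If
        Next
        Return Nothing
    End Function
End Module
```

One caveat with this ordering: a UTF-16 little endian file whose first character is NUL also starts with FF FE 00 00 and would be reported as UTF-32 LE. That should be rare in real text files, but it is a reminder that even BOM detection is a best guess.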

Conveniently, the Encoding class in .NET contains a method called GetPreamble which returns the BOM used by the encoding, so you don't even need to remember what they all are. So, to check if a byte-array starts with the BOM for Unicode (UTF-16, little-endian), you could just do this:

Function IsUtf16LittleEndian(data() As Byte) As Boolean
    ' Encoding.Unicode is UTF-16 little endian; its preamble is FF FE
    Dim bom() As Byte = Encoding.Unicode.GetPreamble()
    If (data(0) = bom(0)) AndAlso (data(1) = bom(1)) Then
        Return True
    Else
        Return False
    End If
End Function

Of course, the above function assumes that the data is at least two bytes long and that the BOM is exactly two bytes. So, while it illustrates the idea as clearly as possible, it's not the safest way to do it. To make it tolerant of different array lengths, especially since BOM lengths vary from one encoding to the next, it would be safer to do something like this:

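Function IsUtf16LittleEndian(data() As Byte) As Boolean
    Dim bom() As Byte = Encoding.Unicode.GetPreamble()
    ' Zip stops at the shorter sequence, so check the length first;
    ' otherwise data shorter than the BOM (even empty data) would match.
    Return data.Length >= bom.Length AndAlso
        data.Zip(bom, Function(x, y) x = y).All(Function(x) x)
End Function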

So, the problem then becomes: how do you get a list of all the encodings? It just so happens that the .NET Encoding class also provides a shared (static) method called GetEncodings, which returns an EncodingInfo object for each supported encoding. Therefore, you could create a method which loops through them, gets the BOM of each encoding, and compares it to the byte array until you find one that matches. For instance:

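Public Function DetectEncodingFromBom(data() As Byte) As Encoding
    ' Test longer BOMs first: the UTF-16 LE BOM (FF FE) is a prefix of the
    ' UTF-32 LE BOM (FF FE 00 00), so the order of the checks matters.
    Return Encoding.GetEncodings().
        Select(Function(info) info.GetEncoding()).
        OrderByDescending(Function(enc) enc.GetPreamble().Length).
        FirstOrDefault(Function(enc) DataStartsWithBom(data, enc))
End Function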

Private Function DataStartsWithBom(data() As Byte, enc As Encoding) As Boolean
    Dim bom() As Byte = enc.GetPreamble()
    ' Encodings without a BOM never match; neither does data shorter than the BOM
    If bom.Length <> 0 AndAlso data.Length >= bom.Length Then
        Return data.
            Zip(bom, Function(x, y) x = y).
            All(Function(x) x)
    Else
        Return False
    End If
End Function

Once you make a function like that, then you could detect the encoding of a file like this:

Dim data() As Byte = File.ReadAllBytes("test.txt")
Dim detectedEncoding As Encoding = DetectEncodingFromBom(data)
If detectedEncoding Is Nothing Then
    Console.WriteLine("Unable to detect encoding")
Else
    Console.WriteLine(detectedEncoding.EncodingName)
End If
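Once the encoding is known, the bytes still have to be turned into text. Encoding.GetString does not strip the BOM, so the decoded string would otherwise begin with a U+FEFF character. Here is a minimal sketch of one way to skip it; the module name BomDecoding and the function DecodeSkippingBom are hypothetical names chosen for this example:

```vbnet
Imports System.Linq
Imports System.Text

Public Module BomDecoding
    ' Decodes data with the given encoding, skipping the BOM if the data starts with it.
    Public Function DecodeSkippingBom(data() As Byte, enc As Encoding) As String
        Dim bom() As Byte = enc.GetPreamble()
        Dim offset As Integer = 0
        If bom.Length > 0 AndAlso data.Length >= bom.Length AndAlso
           data.Take(bom.Length).SequenceEqual(bom) Then
            offset = bom.Length
        End If
        ' Decode everything after the BOM (or the whole array if there is none)
        Return enc.GetString(data, offset, data.Length - offset)
    End Function
End Module
```

For example, DecodeSkippingBom(data, detectedEncoding) could be called in the Else branch above, where detectedEncoding is known not to be Nothing.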

However, the problem remains: how do you automatically detect the correct encoding when there is no BOM? The Unicode standard neither requires nor recommends a BOM for UTF-8, and there is no BOM defined for any of the ANSI code pages, so it's quite common for a text file to have no BOM at all. If all the files that you deal with are in English, it's probably safe to assume UTF-8 when no BOM is present, because plain ASCII characters are encoded identically in UTF-8 and in the ANSI code pages. However, if any of the files contain non-ASCII characters in some other encoding, without a BOM, then that assumption won't work.

As you correctly observed, there are applications that still automatically detect the encoding even when no BOM is present, but they do it through heuristics (i.e. educated guessing) and sometimes they are not accurate. Basically they load the data using each encoding and then see if the data "looks" intelligible. This page offers some interesting insights on the problems inside the Notepad auto-detection algorithm. This page shows how you can tap into the COM-based auto-detection algorithm which Internet Explorer uses (in C#). Here is a list of some C# libraries that people have written which attempt to auto-detect the encoding of a byte array, which you may find helpful:

Even though this question was for C#, you may also find the answers to it useful.
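As a very small example of such a heuristic, you can test whether BOM-less data is valid UTF-8 and fall back to the ANSI code page if it isn't. This is only a sketch of one possible guess, not a reliable detector: the names Utf8Heuristic, LooksLikeUtf8 and GuessEncodingWithoutBom are made up for this example, and it assumes that the only two candidates are UTF-8 and the system's ANSI code page. It also assumes the .NET Framework, where Encoding.Default is the ANSI code page (on .NET Core and later, Encoding.Default is UTF-8).

```vbnet
Imports System.Text

Public Module Utf8Heuristic
    ' Returns True if data decodes as UTF-8 without any invalid byte sequences.
    ' Pure ASCII data also passes, since ASCII is a subset of UTF-8.
    Public Function LooksLikeUtf8(data() As Byte) As Boolean
        ' Arguments: do not emit a BOM, throw on invalid bytes
        Dim strictUtf8 As New UTF8Encoding(False, True)
        Try
            strictUtf8.GetString(data)
            Return True
        Catch ex As DecoderFallbackException
            Return False
        End Try
    End Function

    ' Guess for BOM-less data: UTF-8 if the bytes are valid UTF-8,
    ' otherwise Encoding.Default (assumed here to be the ANSI code page).
    Public Function GuessEncodingWithoutBom(data() As Byte) As Encoding
        If LooksLikeUtf8(data) Then
            Return New UTF8Encoding(False)
        End If
        Return Encoding.Default
    End Function
End Module
```

This works reasonably well in practice because text in an ANSI code page that contains accented characters is rarely a valid UTF-8 byte sequence, but it cannot tell one ANSI code page from another, and false positives remain possible.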

Author by

ElektroStudios

Updated on July 09, 2022

Comments

  • ElektroStudios
    ElektroStudios almost 2 years

    I need to determine if a text file's content is equal to one of these text encodings:

    System.Text.Encoding.ASCII
    System.Text.Encoding.BigEndianUnicode ' UTF-16 big endian
    System.Text.Encoding.Default ' ANSI
    System.Text.Encoding.Unicode ' UTF-16 little endian
    System.Text.Encoding.UTF32
    System.Text.Encoding.UTF7
    System.Text.Encoding.UTF8
    

    I don't know how to read the byte order marks of the files. I've seen snippets that do this, but they can only determine whether a file is ASCII or Unicode, so I need something more universal.

    • Daniel Hilgarth
      Daniel Hilgarth over 10 years
      You can't reliably do that.
    • adripanico
      adripanico over 10 years
    • ElektroStudios
      ElektroStudios over 10 years
      @adripanico Please see the comments under that answer. I've tested it too, but it returns the encoding of VS, not the encoding of the file: it returns "UTF8" when the file is in ANSI encoding.
    • ElektroStudios
      ElektroStudios over 10 years
      @Daniel Hilgarth Could you please say why you think that? I'm not an expert, but if it could not be done reliably, then notepad.exe could not reliably know which encoding a file uses, yet Notepad always knows and always displays the exact encoding when you press the "save" button.
    • Daniel Hilgarth
      Daniel Hilgarth over 10 years
      @ElektroHacker: There are certain indicators from which you can try to infer it, but you can get false positives with this approach. For example, UTF8 can be used without a BOM, and as such it looks quite similar to an ASCII file. Oh, and there is no encryption involved. It is encoding.
    • ElektroStudios
      ElektroStudios over 10 years
      @Daniel Hilgarth I know, the word "encryption" was a Google Translate mistranslation. Thanks for the comment.
    • ElektroStudios
      ElektroStudios over 10 years
      Then, since I need to work with text files in an application, what is the best I can do? Tell the user to specify the text encoding himself when opening a file?
    • Steven Doggart
      Steven Doggart over 10 years
      Are you only interested in knowing how to read the BOM, or are you also interested in determining the encoding even when a BOM is not present?
    • ElektroStudios
      ElektroStudios over 10 years
      @Steven Doggart I didn't know that files may or may not contain a BOM, so I don't really know what I need. I don't know whether the most common text file encodings have a byte order mark or not. I just need to know the encoding of the text file, but this seems really, really hard...
    • ElektroStudios
      ElektroStudios over 10 years
      Isn't there any generic way to do this? How can Notepad++ do it with such good precision?
    • ElektroStudios
      ElektroStudios over 10 years
      I also don't understand why a moderator marked this answer when I'm asking for a VB.NET solution and that answer is for C#; besides, the supposed solution doesn't work...
  • ElektroStudios
    ElektroStudios over 10 years
    If I could mark my favorite answers with a button, this would be one of them. Thank you so much!
  • Steven Doggart
    Steven Doggart over 10 years
    Thanks! I added info about how to do it with the StreamReader which is an important point to include. I think the reason why it was failing for the people, in that other C# answer that was marked as the duplicate, is because they didn't do a Read on the stream first before getting the encoding. Until it reads past the BOM, it will just return the default UTF-8 encoding.
  • ElektroStudios
    ElektroStudios over 10 years
    The StreamReader method returns UTF8 for ANSI files. I still prefer the first method you wrote, because it detects UTF8 files nicely, and also, if no encoding is detected, I can return ANSI as the "most probable" encoding. That worked really well for me to detect ANSI and UTF files, but I don't think the same can be done with the sr method in a few lines. Thanks again!
  • Steven Doggart
    Steven Doggart over 10 years
    Correct, since ANSI encoded files never have a BOM, the StreamReader will always assume the default UTF-8. I still don't understand why everyone voted to close this as a duplicate. The other answer is incorrect and in C#. Strange. I voted to reopen it. We'll see if that goes anywhere. In any case, glad I could help.
  • Mad Dog Tannen
    Mad Dog Tannen about 10 years
    Outstanding answer! Thanks it really saved my day!