Determine TextFile Encoding?
The first step is to load the file as a byte array instead of as a string. Strings are always stored in memory with UTF-16 encoding, so once it's loaded into a string, the original encoding is lost. Here's a simple example of one way to load a file into a byte array:
Dim data() As Byte = File.ReadAllBytes("test.txt")
Automatically determining the correct encoding for a given byte array is notoriously difficult. Sometimes, to be helpful, the author of the data will insert something called a BOM (Byte Order Mark) at the beginning of the data. If a BOM is present, that makes detecting the encoding painless, since each encoding uses a different BOM.
The easiest way to automatically detect the encoding from the BOM is to let the StreamReader do it for you. In the constructor of the StreamReader, you can pass True for the detectEncodingFromByteOrderMarks argument. Then you can get the encoding of the stream by accessing its CurrentEncoding property. However, the CurrentEncoding property won't work until after the StreamReader has read the BOM. So you first have to read past the BOM before you can get the encoding, for instance:
Public Function GetFileEncoding(filePath As String) As Encoding
    Using sr As New StreamReader(filePath, True)
        ' Read once so the reader consumes the BOM and updates CurrentEncoding
        sr.Read()
        Return sr.CurrentEncoding
    End Using
End Function
However, the problem with this approach is that the MSDN documentation seems to imply that the StreamReader may only detect certain kinds of encodings:
The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. See the Encoding.GetPreamble method for more information.
Also, if the StreamReader is incapable of determining the encoding from the BOM, or if the BOM isn't there, it will just default to UTF-8 encoding, without giving you any indication that it failed. If you need more granular control than that, you can pretty easily read the BOM and interpret it yourself. All you have to do is compare the first few bytes in the byte array with some known, expected BOMs to see if they match. Here is a list of some common BOMs:
- UTF-8: EF BB BF
- UTF-16 big endian byte order: FE FF
- UTF-16 little endian byte order: FF FE
- UTF-32 big endian byte order: 00 00 FE FF
- UTF-32 little endian byte order: FF FE 00 00
So, for instance, to see if a UTF-16 (little endian) BOM exists at the beginning of the byte array, you could simply do something like this:
If data.Length >= 2 AndAlso data(0) = &HFF AndAlso data(1) = &HFE Then
    ' Data starts with UTF-16 (little endian) BOM
    ' (note that the UTF-32 little endian BOM begins with the same two bytes)
End If
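To extend that check to every BOM in the list above, one option is a small lookup table that is tested longest-first. The following is only a sketch under that assumption; the module name BomTableSketch, the KnownBoms table, and the DescribeBom function are hypothetical names made up for this illustration, not part of .NET:

```vbnet
Imports System
Imports System.Linq

Module BomTableSketch
    ' BOMs from the list above, ordered longest first so that
    ' FF FE 00 00 (UTF-32 LE) is tested before FF FE (UTF-16 LE).
    Private ReadOnly KnownBoms As Tuple(Of String, Byte())() = {
        Tuple.Create("UTF-32 (big endian)", New Byte() {&H0, &H0, &HFE, &HFF}),
        Tuple.Create("UTF-32 (little endian)", New Byte() {&HFF, &HFE, &H0, &H0}),
        Tuple.Create("UTF-8", New Byte() {&HEF, &HBB, &HBF}),
        Tuple.Create("UTF-16 (big endian)", New Byte() {&HFE, &HFF}),
        Tuple.Create("UTF-16 (little endian)", New Byte() {&HFF, &HFE})
    }

    ' Returns the name of the first matching BOM, or Nothing if none matches.
    Public Function DescribeBom(data() As Byte) As String
        For Each entry In KnownBoms
            Dim bom = entry.Item2
            ' Only compare when the data is at least as long as the BOM
            If data.Length >= bom.Length AndAlso data.Take(bom.Length).SequenceEqual(bom) Then
                Return entry.Item1
            End If
        Next
        Return Nothing
    End Function
End Module
```

Testing the four-byte BOMs first matters because FF FE alone would otherwise match UTF-32 (little endian) data. The reverse ambiguity remains: UTF-16 (little endian) text whose first character is NUL also begins with FF FE 00 00, so even a BOM check is not completely unambiguous.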
Conveniently, the Encoding class in .NET contains a method called GetPreamble which returns the BOM used by the encoding, so you don't even need to remember what they all are. So, to check if a byte array starts with the BOM for Unicode (UTF-16, little-endian), you could just do this:
Function IsUtf16LittleEndian(data() As Byte) As Boolean
    Dim bom() As Byte = Encoding.Unicode.GetPreamble()
    If (data(0) = bom(0)) And (data(1) = bom(1)) Then
        Return True
    Else
        Return False
    End If
End Function
Of course, the above function assumes that the data is at least two bytes long and that the BOM is exactly two bytes. So, while it illustrates how to do it as clearly as possible, it's not the safest way to do it. To make it tolerant of different array lengths, especially since the BOM lengths themselves can vary from one encoding to the next, it would be safer to do something like this:
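Function IsUtf16LittleEndian(data() As Byte) As Boolean
    Dim bom() As Byte = Encoding.Unicode.GetPreamble()
    ' Zip stops at the shorter sequence, so make sure the data is long enough first
    Return data.Length >= bom.Length AndAlso data.Zip(bom, Function(x, y) x = y).All(Function(x) x)
End Function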
So, the problem then becomes, how do you get a list of all the encodings? Well, it just so happens that the .NET Encoding class also provides a shared (static) method called GetEncodings which returns a list of all of the supported encoding objects. Therefore, you could create a method which loops through all of the encoding objects, gets the BOM of each one and compares it to the byte array until you find one that matches. For instance:
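Public Function DetectEncodingFromBom(data() As Byte) As Encoding
    ' Try the longest BOMs first: the UTF-16 LE BOM (FF FE) is a prefix of the UTF-32 LE BOM (FF FE 00 00)
    Return Encoding.GetEncodings().
        Select(Function(info) info.GetEncoding()).
        OrderByDescending(Function(enc) enc.GetPreamble().Length).
        FirstOrDefault(Function(enc) DataStartsWithBom(data, enc))
End Function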
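Private Function DataStartsWithBom(data() As Byte, enc As Encoding) As Boolean
    Dim bom() As Byte = enc.GetPreamble()
    ' Encodings without a BOM never match; Zip stops at the shorter
    ' sequence, so the data must be at least as long as the BOM
    If bom.Length <> 0 AndAlso data.Length >= bom.Length Then
        Return data.
            Zip(bom, Function(x, y) x = y).
            All(Function(x) x)
    Else
        Return False
    End If
End Function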
Once you make a function like that, then you could detect the encoding of a file like this:
Dim data() As Byte = File.ReadAllBytes("test.txt")
Dim detectedEncoding As Encoding = DetectEncodingFromBom(data)
If detectedEncoding Is Nothing Then
    Console.WriteLine("Unable to detect encoding")
Else
    Console.WriteLine(detectedEncoding.EncodingName)
End If
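Once an encoding has been detected, you will usually want to turn the bytes into a string. Encoding.GetString does not strip the BOM for you, so it can be skipped by hand. The following is a minimal sketch of that step; the module name DecodeSketch and the function name DecodeWithoutBom are hypothetical names chosen for this illustration:

```vbnet
Imports System.Linq
Imports System.Text

Module DecodeSketch
    ' Decodes data with the given encoding, skipping the BOM if the data starts with it.
    Public Function DecodeWithoutBom(data() As Byte, enc As Encoding) As String
        Dim bom() As Byte = enc.GetPreamble()
        Dim offset As Integer = 0
        If bom.Length > 0 AndAlso data.Length >= bom.Length AndAlso
           data.Take(bom.Length).SequenceEqual(bom) Then
            offset = bom.Length
        End If
        ' Decode everything after the BOM (or the whole array if there was none)
        Return enc.GetString(data, offset, data.Length - offset)
    End Function
End Module
```

For example, DecodeWithoutBom(data, detectedEncoding) could be called in the Else branch above, after checking that detectedEncoding is not Nothing.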
However, the problem remains: how do you automatically detect the correct encoding when there is no BOM? Technically it's recommended that you don't place a BOM at the beginning of your data when using UTF-8, and there is no BOM defined for any of the ANSI code pages. So it's certainly not out of the realm of possibility that a text file may not have a BOM. If all the files that you deal with are in English, it's probably safe to assume that if no BOM is present, then UTF-8 will suffice, since plain ASCII text is encoded identically in UTF-8 and in the ANSI code pages. However, if any of the files contain non-ASCII characters in some other encoding, without a BOM, then that won't work.
As you correctly observed, there are applications that still automatically detect the encoding even when no BOM is present, but they do it through heuristics (i.e. educated guessing) and sometimes they are not accurate. Basically they load the data using each encoding and then see if the data "looks" intelligible. This page offers some interesting insights on the problems inside the Notepad auto-detection algorithm. This page shows how you can tap into the COM-based auto-detection algorithm which Internet Explorer uses (in C#). Here is a list of some C# libraries that people have written which attempt to auto-detect the encoding of a byte array, which you may find helpful:
Even though this question was for C#, you may also find the answers to it useful.
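As a small illustration of the kind of heuristic described above, one inexpensive test when no BOM is present is to check whether the bytes form valid UTF-8. This is only a sketch and still an educated guess, not a reliable detector; the module name Utf8HeuristicSketch and the function name LooksLikeUtf8 are hypothetical. It relies on the UTF8Encoding constructor whose second argument (throwOnInvalidBytes) makes decoding throw on malformed input:

```vbnet
Imports System.Text

Module Utf8HeuristicSketch
    ' Returns True if data decodes as UTF-8 without any invalid byte sequences.
    Public Function LooksLikeUtf8(data() As Byte) As Boolean
        ' First argument: do not emit a BOM when encoding (irrelevant for decoding).
        ' Second argument: throw on invalid bytes instead of silently substituting them.
        Dim strictUtf8 As New UTF8Encoding(False, True)
        Try
            strictUtf8.GetString(data)
            Return True
        Catch ex As DecoderFallbackException
            Return False
        End Try
    End Function
End Module
```

Pure ASCII data also passes this test, because ASCII is a subset of UTF-8. ANSI text that contains accented characters will usually fail it, which makes "not valid UTF-8, so probably ANSI" a workable fallback, though never a certain one.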
ElektroStudios
Updated on July 09, 2022
Comments
-
ElektroStudios almost 2 years
I need to determine whether a text file's content is encoded with one of these text encodings:
System.Text.Encoding.ASCII
System.Text.Encoding.BigEndianUnicode ' UTF-L 16
System.Text.Encoding.Default ' ANSI
System.Text.Encoding.Unicode ' UTF16
System.Text.Encoding.UTF32
System.Text.Encoding.UTF7
System.Text.Encoding.UTF8
I don't know how to read the byte order marks of the files. I've seen snippets that do this, but they can only determine whether a file is ASCII or Unicode, so I need something more universal.
-
Daniel Hilgarth over 10 years
You can't reliably do that.
-
adripanico over 10 years
-
ElektroStudios over 10 years
@adripanico please see the comments under that answer. I've also tested it, but it returns the encoding of VS, not the encoding of the file: it returns "UTF8" when the file is in ANSI encoding.
-
ElektroStudios over 10 years
@Daniel Hilgarth can you please say why you think that? I'm not an expert, but I think that if it could not be done reliably, then notepad.exe could not reliably know which encoding a file uses, yet Notepad always knows and always displays the exact encoding when you press the "Save" button.
-
Daniel Hilgarth over 10 years
@ElektroHacker: There are certain indicators from which you can try to infer it, but you can get false positives with this approach. For example, UTF8 can be used without a BOM, and as such it looks quite similar to an ASCII file. Oh, and there is no encryption involved. It is encoding.
-
ElektroStudios over 10 years
@Daniel Hilgarth I know, the word "encryption" was a Google Translate mistranslation. Thanks for the comment.
-
ElektroStudios over 10 years
Then, since I need to work with text files in an application, what is the best I can do? Tell the user to specify the text encoding himself when opening a file?
-
Steven Doggart over 10 years
Are you only interested in knowing how to read the BOM, or are you also interested in determining the encoding even when a BOM is not present?
-
ElektroStudios over 10 years
@Steven Doggart I didn't know that files may or may not contain a BOM, so I don't really know what I need. I don't know whether the most commonly used text file encodings have a byte order mark or not. I just need to know the encoding of the text file, but this seems really, really hard...
-
ElektroStudios over 10 years
Isn't there any generic way to do this? How can Notepad++ do it with such good precision?
-
ElektroStudios over 10 years
I also don't know why a moderator marked that answer, when I'm asking for a VB.NET solution, that answer is for C#, and the supposed solution doesn't work either...
-
-
ElektroStudios over 10 years
If I could mark my favorite answers with a button, this would be one of them. Thank you so much!
-
Steven Doggart over 10 years
Thanks! I added info about how to do it with the StreamReader, which is an important point to include. I think the reason why it was failing for the people in that other C# answer that was marked as the duplicate is that they didn't do a Read on the stream first before getting the encoding. Until it reads past the BOM, it will just return the default UTF-8 encoding.
-
ElektroStudios over 10 years
The StreamReader method returns UTF8 for ANSI files. I still prefer the first method you wrote, because it detects UTF8 files nicely, and if no encoding is detected I can return ANSI as the "most probable" encoding. That worked really well for me to detect ANSI files and UTF files, but I don't think the same can be done with the sr method in a few lines. Thanks again!
-
Steven Doggart over 10 years
Correct, since ANSI encoded files never have a BOM, the StreamReader will always assume the default UTF-8. I still don't understand why everyone voted to close this as a duplicate. The other answer is incorrect and in C#. Strange. I voted to reopen it. We'll see if that goes anywhere. In any case, glad I could help.
-
Mad Dog Tannen about 10 years
Outstanding answer! Thanks, it really saved my day!