How to detect the character encoding of a text file?

124,008

Solution 1

You can't depend on the file having a BOM. UTF-8 doesn't require it. And non-Unicode encodings don't even have a BOM. There are, however, other ways to detect the encoding.

UTF-32

BOM is 00 00 FE FF (for BE) or FF FE 00 00 (for LE).

But UTF-32 is easy to detect even without a BOM. This is because the Unicode code point range is restricted to U+10FFFF, and thus UTF-32 units always have the pattern 00 {00-10} xx xx (for BE) or xx xx {00-10} 00 (for LE). If the data has a length that's a multiple of 4, and follows one of these patterns, you can safely assume it's UTF-32. False positives are nearly impossible due to the rarity of 00 bytes in byte-oriented encodings.

US-ASCII

No BOM, but you don't need one. ASCII can be easily identified by the lack of bytes in the 80-FF range.

UTF-8

BOM is EF BB BF. But you can't rely on this. Lots of UTF-8 files don't have a BOM, especially if they originated on non-Windows systems.

But you can safely assume that if a file validates as UTF-8, it is UTF-8. False positives are rare.

Specifically, given that the data is not ASCII, the false positive rate for a 2-byte sequence is only 3.9% (1920/49152). For a 7-byte sequence, it's less than 1%. For a 12-byte sequence, it's less than 0.1%. For a 24-byte sequence, it's less than 1 in a million.

UTF-16

BOM is FE FF (for BE) or FF FE (for LE). Note that the UTF-16LE BOM is found at the start of the UTF-32LE BOM, so check UTF-32 first.

If you happen to have a file that consists mainly of ISO-8859-1 characters, having half of the file's bytes be 00 would also be a strong indicator of UTF-16.

Otherwise, the only reliable way to recognize UTF-16 without a BOM is to look for surrogate pairs (D[8-B]xx D[C-F]xx), but non-BMP characters are too rarely-used to make this approach practical.

XML

If your file starts with the bytes 3C 3F 78 6D 6C (i.e., the ASCII characters "<?xml"), then look for an encoding= declaration. If present, then use that encoding. If absent, then assume UTF-8, which is the default XML encoding.

If you need to support EBCDIC, also look for the equivalent sequence 4C 6F A7 94 93.

In general, if you have a file format that contains an encoding declaration, then look for that declaration rather than trying to guess the encoding.

None of the above

There are hundreds of other encodings, which require more effort to detect. I recommend trying Mozilla's charset detector or a .NET port of it.

A reasonable default

If you've ruled out the UTF encodings, and don't have an encoding declaration or statistical detection that points to a different encoding, assume ISO-8859-1 or the closely related Windows-1252. (Note that the latest HTML standard requires a “ISO-8859-1” declaration to be interpreted as Windows-1252.) Being Windows' default code page for English (and other popular languages like Spanish, Portuguese, German, and French), it's the most commonly encountered encoding other than UTF-8.

Solution 2

If you want to pursue a "simple" solution, you might find this class I put together useful:

http://www.architectshack.com/TextFileEncodingDetector.ashx

It does the BOM detection automatically first, and then tries to differentiate between Unicode encodings without BOM, vs some other default encoding (generally Windows-1252, incorrectly labelled as Encoding.ASCII in .Net).

As noted above, a "heavier" solution involving NCharDet or MLang may be more appropriate, and as I note on the overview page of this class, the best is to provide some form of interactivity with the user if at all possible, because there simply is no 100% detection rate possible!

Snippet in case the site is offline:

using System;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;

namespace KlerksSoft
{
    public static class TextFileEncodingDetector
    {
        /*
         * Simple class to handle text file encoding woes (in a primarily English-speaking tech 
         *      world).
         * 
         *  - This code is fully managed, no shady calls to MLang (the unmanaged codepage
         *      detection library originally developed for Internet Explorer).
         * 
         *  - This class does NOT try to detect arbitrary codepages/charsets, it really only
         *      aims to differentiate between some of the most common variants of Unicode 
         *      encoding, and a "default" (western / ascii-based) encoding alternative provided
         *      by the caller.
         *      
         *  - As there is no "Reliable" way to distinguish between UTF-8 (without BOM) and 
         *      Windows-1252 (in .Net, also incorrectly called "ASCII") encodings, we use a 
         *      heuristic - so the more of the file we can sample the better the guess. If you 
         *      are going to read the whole file into memory at some point, then best to pass 
         *      in the whole byte byte array directly. Otherwise, decide how to trade off 
         *      reliability against performance / memory usage.
         *      
         *  - The UTF-8 detection heuristic only works for western text, as it relies on 
         *      the presence of UTF-8 encoded accented and other characters found in the upper 
         *      ranges of the Latin-1 and (particularly) Windows-1252 codepages.
         *  
         *  - For more general detection routines, see existing projects / resources:
         *    - MLang - Microsoft library originally for IE6, available in Windows XP and later APIs now (I think?)
         *      - MLang .Net bindings: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
         *    - CharDet - Mozilla browser's detection routines
         *      - Ported to Java then .Net: http://www.conceptdevelopment.net/Localization/NCharDet/
         *      - Ported straight to .Net: http://code.google.com/p/chardetsharp/source/browse
         *  
         * Copyright Tao Klerks, 2010-2012, [email protected]
         * Licensed under the modified BSD license:
         * 
Redistribution and use in source and binary forms, with or without modification, are 
permitted provided that the following conditions are met:
 - Redistributions of source code must retain the above copyright notice, this list of 
conditions and the following disclaimer.
 - Redistributions in binary form must reproduce the above copyright notice, this list 
of conditions and the following disclaimer in the documentation and/or other materials
provided with the distribution.
 - The name of the author may not be used to endorse or promote products derived from 
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, 
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY 
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, 
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY 
OF SUCH DAMAGE.
         * 
         * CHANGELOG:
         *  - 2012-02-03: 
         *    - Simpler methods, removing the silly "DefaultEncoding" parameter (with "??" operator, saves no typing)
         *    - More complete methods
         *      - Optionally return indication of whether BOM was found in "Detect" methods
         *      - Provide straight-to-string method for byte arrays (GetStringFromByteArray)
         */

        const long _defaultHeuristicSampleSize = 0x10000; //completely arbitrary - inappropriate for high numbers of files / high speed requirements

        public static Encoding DetectTextFileEncoding(string InputFilename)
        {
            using (FileStream textfileStream = File.OpenRead(InputFilename))
            {
                return DetectTextFileEncoding(textfileStream, _defaultHeuristicSampleSize);
            }
        }

        public static Encoding DetectTextFileEncoding(FileStream InputFileStream, long HeuristicSampleSize)
        {
            bool uselessBool = false;
            return DetectTextFileEncoding(InputFileStream, _defaultHeuristicSampleSize, out uselessBool);
        }

        public static Encoding DetectTextFileEncoding(FileStream InputFileStream, long HeuristicSampleSize, out bool HasBOM)
        {
            if (InputFileStream == null)
                throw new ArgumentNullException("Must provide a valid Filestream!", "InputFileStream");

            if (!InputFileStream.CanRead)
                throw new ArgumentException("Provided file stream is not readable!", "InputFileStream");

            if (!InputFileStream.CanSeek)
                throw new ArgumentException("Provided file stream cannot seek!", "InputFileStream");

            Encoding encodingFound = null;

            long originalPos = InputFileStream.Position;

            InputFileStream.Position = 0;


            //First read only what we need for BOM detection
            byte[] bomBytes = new byte[InputFileStream.Length > 4 ? 4 : InputFileStream.Length];
            InputFileStream.Read(bomBytes, 0, bomBytes.Length);

            encodingFound = DetectBOMBytes(bomBytes);

            if (encodingFound != null)
            {
                InputFileStream.Position = originalPos;
                HasBOM = true;
                return encodingFound;
            }


            //BOM Detection failed, going for heuristics now.
            //  create sample byte array and populate it
            byte[] sampleBytes = new byte[HeuristicSampleSize > InputFileStream.Length ? InputFileStream.Length : HeuristicSampleSize];
            Array.Copy(bomBytes, sampleBytes, bomBytes.Length);
            if (InputFileStream.Length > bomBytes.Length)
                InputFileStream.Read(sampleBytes, bomBytes.Length, sampleBytes.Length - bomBytes.Length);
            InputFileStream.Position = originalPos;

            //test byte array content
            encodingFound = DetectUnicodeInByteSampleByHeuristics(sampleBytes);

            HasBOM = false;
            return encodingFound;
        }

        public static Encoding DetectTextByteArrayEncoding(byte[] TextData)
        {
            bool uselessBool = false;
            return DetectTextByteArrayEncoding(TextData, out uselessBool);
        }

        public static Encoding DetectTextByteArrayEncoding(byte[] TextData, out bool HasBOM)
        {
            if (TextData == null)
                throw new ArgumentNullException("Must provide a valid text data byte array!", "TextData");

            Encoding encodingFound = null;

            encodingFound = DetectBOMBytes(TextData);

            if (encodingFound != null)
            {
                HasBOM = true;
                return encodingFound;
            }
            else
            {
                //test byte array content
                encodingFound = DetectUnicodeInByteSampleByHeuristics(TextData);

                HasBOM = false;
                return encodingFound;
            }
        }

        public static string GetStringFromByteArray(byte[] TextData, Encoding DefaultEncoding)
        {
            return GetStringFromByteArray(TextData, DefaultEncoding, _defaultHeuristicSampleSize);
        }

        public static string GetStringFromByteArray(byte[] TextData, Encoding DefaultEncoding, long MaxHeuristicSampleSize)
        {
            if (TextData == null)
                throw new ArgumentNullException("Must provide a valid text data byte array!", "TextData");

            Encoding encodingFound = null;

            encodingFound = DetectBOMBytes(TextData);

            if (encodingFound != null)
            {
                //For some reason, the default encodings don't detect/swallow their own preambles!!
                return encodingFound.GetString(TextData, encodingFound.GetPreamble().Length, TextData.Length - encodingFound.GetPreamble().Length);
            }
            else
            {
                byte[] heuristicSample = null;
                if (TextData.Length > MaxHeuristicSampleSize)
                {
                    heuristicSample = new byte[MaxHeuristicSampleSize];
                    Array.Copy(TextData, heuristicSample, MaxHeuristicSampleSize);
                }
                else
                {
                    heuristicSample = TextData;
                }

                encodingFound = DetectUnicodeInByteSampleByHeuristics(TextData) ?? DefaultEncoding;
                return encodingFound.GetString(TextData);
            }
        }


        public static Encoding DetectBOMBytes(byte[] BOMBytes)
        {
            if (BOMBytes == null)
                throw new ArgumentNullException("Must provide a valid BOM byte array!", "BOMBytes");

            if (BOMBytes.Length < 2)
                return null;

            if (BOMBytes[0] == 0xff 
                && BOMBytes[1] == 0xfe 
                && (BOMBytes.Length < 4 
                    || BOMBytes[2] != 0 
                    || BOMBytes[3] != 0
                    )
                )
                return Encoding.Unicode;

            if (BOMBytes[0] == 0xfe 
                && BOMBytes[1] == 0xff
                )
                return Encoding.BigEndianUnicode;

            if (BOMBytes.Length < 3)
                return null;

            if (BOMBytes[0] == 0xef && BOMBytes[1] == 0xbb && BOMBytes[2] == 0xbf)
                return Encoding.UTF8;

            if (BOMBytes[0] == 0x2b && BOMBytes[1] == 0x2f && BOMBytes[2] == 0x76)
                return Encoding.UTF7;

            if (BOMBytes.Length < 4)
                return null;

            if (BOMBytes[0] == 0xff && BOMBytes[1] == 0xfe && BOMBytes[2] == 0 && BOMBytes[3] == 0)
                return Encoding.UTF32;

            if (BOMBytes[0] == 0 && BOMBytes[1] == 0 && BOMBytes[2] == 0xfe && BOMBytes[3] == 0xff)
                return Encoding.GetEncoding(12001);

            return null;
        }

        public static Encoding DetectUnicodeInByteSampleByHeuristics(byte[] SampleBytes)
        {
            long oddBinaryNullsInSample = 0;
            long evenBinaryNullsInSample = 0;
            long suspiciousUTF8SequenceCount = 0;
            long suspiciousUTF8BytesTotal = 0;
            long likelyUSASCIIBytesInSample = 0;

            //Cycle through, keeping count of binary null positions, possible UTF-8 
            //  sequences from upper ranges of Windows-1252, and probable US-ASCII 
            //  character counts.

            long currentPos = 0;
            int skipUTF8Bytes = 0;

            while (currentPos < SampleBytes.Length)
            {
                //binary null distribution
                if (SampleBytes[currentPos] == 0)
                {
                    if (currentPos % 2 == 0)
                        evenBinaryNullsInSample++;
                    else
                        oddBinaryNullsInSample++;
                }

                //likely US-ASCII characters
                if (IsCommonUSASCIIByte(SampleBytes[currentPos]))
                    likelyUSASCIIBytesInSample++;

                //suspicious sequences (look like UTF-8)
                if (skipUTF8Bytes == 0)
                {
                    int lengthFound = DetectSuspiciousUTF8SequenceLength(SampleBytes, currentPos);

                    if (lengthFound > 0)
                    {
                        suspiciousUTF8SequenceCount++;
                        suspiciousUTF8BytesTotal += lengthFound;
                        skipUTF8Bytes = lengthFound - 1;
                    }
                }
                else
                {
                    skipUTF8Bytes--;
                }

                currentPos++;
            }

            //1: UTF-16 LE - in english / european environments, this is usually characterized by a 
            //  high proportion of odd binary nulls (starting at 0), with (as this is text) a low 
            //  proportion of even binary nulls.
            //  The thresholds here used (less than 20% nulls where you expect non-nulls, and more than
            //  60% nulls where you do expect nulls) are completely arbitrary.

            if (((evenBinaryNullsInSample * 2.0) / SampleBytes.Length) < 0.2 
                && ((oddBinaryNullsInSample * 2.0) / SampleBytes.Length) > 0.6
                )
                return Encoding.Unicode;


            //2: UTF-16 BE - in english / european environments, this is usually characterized by a 
            //  high proportion of even binary nulls (starting at 0), with (as this is text) a low 
            //  proportion of odd binary nulls.
            //  The thresholds here used (less than 20% nulls where you expect non-nulls, and more than
            //  60% nulls where you do expect nulls) are completely arbitrary.

            if (((oddBinaryNullsInSample * 2.0) / SampleBytes.Length) < 0.2 
                && ((evenBinaryNullsInSample * 2.0) / SampleBytes.Length) > 0.6
                )
                return Encoding.BigEndianUnicode;


            //3: UTF-8 - Martin Dürst outlines a method for detecting whether something CAN be UTF-8 content 
            //  using regexp, in his w3c.org unicode FAQ entry: 
            //  http://www.w3.org/International/questions/qa-forms-utf-8
            //  adapted here for C#.
            string potentiallyMangledString = Encoding.ASCII.GetString(SampleBytes);
            Regex UTF8Validator = new Regex(@"\A(" 
                + @"[\x09\x0A\x0D\x20-\x7E]"
                + @"|[\xC2-\xDF][\x80-\xBF]"
                + @"|\xE0[\xA0-\xBF][\x80-\xBF]"
                + @"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"
                + @"|\xED[\x80-\x9F][\x80-\xBF]"
                + @"|\xF0[\x90-\xBF][\x80-\xBF]{2}"
                + @"|[\xF1-\xF3][\x80-\xBF]{3}"
                + @"|\xF4[\x80-\x8F][\x80-\xBF]{2}"
                + @")*\z");
            if (UTF8Validator.IsMatch(potentiallyMangledString))
            {
                //Unfortunately, just the fact that it CAN be UTF-8 doesn't tell you much about probabilities.
                //If all the characters are in the 0-127 range, no harm done, most western charsets are same as UTF-8 in these ranges.
                //If some of the characters were in the upper range (western accented characters), however, they would likely be mangled to 2-byte by the UTF-8 encoding process.
                // So, we need to play stats.

                // The "Random" likelihood of any pair of randomly generated characters being one 
                //   of these "suspicious" character sequences is:
                //     128 / (256 * 256) = 0.2%.
                //
                // In western text data, that is SIGNIFICANTLY reduced - most text data stays in the <127 
                //   character range, so we assume that more than 1 in 500,000 of these character 
                //   sequences indicates UTF-8. The number 500,000 is completely arbitrary - so sue me.
                //
                // We can only assume these character sequences will be rare if we ALSO assume that this
                //   IS in fact western text - in which case the bulk of the UTF-8 encoded data (that is 
                //   not already suspicious sequences) should be plain US-ASCII bytes. This, I 
                //   arbitrarily decided, should be 80% (a random distribution, eg binary data, would yield 
                //   approx 40%, so the chances of hitting this threshold by accident in random data are 
                //   VERY low). 

                if ((suspiciousUTF8SequenceCount * 500000.0 / SampleBytes.Length >= 1) //suspicious sequences
                    && (
                           //all suspicious, so cannot evaluate proportion of US-Ascii
                           SampleBytes.Length - suspiciousUTF8BytesTotal == 0 
                           ||
                           likelyUSASCIIBytesInSample * 1.0 / (SampleBytes.Length - suspiciousUTF8BytesTotal) >= 0.8
                       )
                    )
                    return Encoding.UTF8;
            }

            return null;
        }

        private static bool IsCommonUSASCIIByte(byte testByte)
        {
            if (testByte == 0x0A //lf
                || testByte == 0x0D //cr
                || testByte == 0x09 //tab
                || (testByte >= 0x20 && testByte <= 0x2F) //common punctuation
                || (testByte >= 0x30 && testByte <= 0x39) //digits
                || (testByte >= 0x3A && testByte <= 0x40) //common punctuation
                || (testByte >= 0x41 && testByte <= 0x5A) //capital letters
                || (testByte >= 0x5B && testByte <= 0x60) //common punctuation
                || (testByte >= 0x61 && testByte <= 0x7A) //lowercase letters
                || (testByte >= 0x7B && testByte <= 0x7E) //common punctuation
                )
                return true;
            else
                return false;
        }

        private static int DetectSuspiciousUTF8SequenceLength(byte[] SampleBytes, long currentPos)
        {
            int lengthFound = 0;

            if (SampleBytes.Length >= currentPos + 1 
                && SampleBytes[currentPos] == 0xC2
                )
            {
                if (SampleBytes[currentPos + 1] == 0x81 
                    || SampleBytes[currentPos + 1] == 0x8D 
                    || SampleBytes[currentPos + 1] == 0x8F
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] == 0x90 
                    || SampleBytes[currentPos + 1] == 0x9D
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] >= 0xA0 
                    && SampleBytes[currentPos + 1] <= 0xBF
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length >= currentPos + 1 
                && SampleBytes[currentPos] == 0xC3
                )
            {
                if (SampleBytes[currentPos + 1] >= 0x80 
                    && SampleBytes[currentPos + 1] <= 0xBF
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length >= currentPos + 1 
                && SampleBytes[currentPos] == 0xC5
                )
            {
                if (SampleBytes[currentPos + 1] == 0x92 
                    || SampleBytes[currentPos + 1] == 0x93
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] == 0xA0 
                    || SampleBytes[currentPos + 1] == 0xA1
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] == 0xB8 
                    || SampleBytes[currentPos + 1] == 0xBD 
                    || SampleBytes[currentPos + 1] == 0xBE
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length >= currentPos + 1 
                && SampleBytes[currentPos] == 0xC6
                )
            {
                if (SampleBytes[currentPos + 1] == 0x92)
                    lengthFound = 2;
            }
            else if (SampleBytes.Length >= currentPos + 1 
                && SampleBytes[currentPos] == 0xCB
                )
            {
                if (SampleBytes[currentPos + 1] == 0x86 
                    || SampleBytes[currentPos + 1] == 0x9C
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length >= currentPos + 2 
                && SampleBytes[currentPos] == 0xE2
                )
            {
                if (SampleBytes[currentPos + 1] == 0x80)
                {
                    if (SampleBytes[currentPos + 2] == 0x93 
                        || SampleBytes[currentPos + 2] == 0x94
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0x98 
                        || SampleBytes[currentPos + 2] == 0x99 
                        || SampleBytes[currentPos + 2] == 0x9A
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0x9C 
                        || SampleBytes[currentPos + 2] == 0x9D 
                        || SampleBytes[currentPos + 2] == 0x9E
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xA0 
                        || SampleBytes[currentPos + 2] == 0xA1 
                        || SampleBytes[currentPos + 2] == 0xA2
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xA6)
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xB0)
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xB9 
                        || SampleBytes[currentPos + 2] == 0xBA
                        )
                        lengthFound = 3;
                }
                else if (SampleBytes[currentPos + 1] == 0x82 
                    && SampleBytes[currentPos + 2] == 0xAC
                    )
                    lengthFound = 3;
                else if (SampleBytes[currentPos + 1] == 0x84 
                    && SampleBytes[currentPos + 2] == 0xA2
                    )
                    lengthFound = 3;
            }

            return lengthFound;
        }

    }
}

Solution 3

Use StreamReader and direct it to detect the encoding for you:

using (var reader = new System.IO.StreamReader(path, true))
{
    var currentEncoding = reader.CurrentEncoding;
}

And use Code Page Identifiers https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx in order to switch logic depending on it.

Solution 4

Several answers are here but nobody has posted usefull code.

Here is my code that detects all encodings that Microsoft detects in Framework 4 in the StreamReader class.

Obviously you must call this function immediately after opening the stream before reading anything else from the stream because the BOM are the first bytes in the stream.

This function requires a Stream that can seek (for example a FileStream). If you have a Stream that cannot seek you must write a more complicated code that returns a Byte buffer with the bytes that have already been read but that are not BOM.

public static Encoding DetectEncoding(String s_Path)
{
    using (FileStream i_Stream = new FileStream(s_Path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    {
        return DetectEncoding(i_Stream);
    }
}

/// <summary>
/// UTF8    : EF BB BF
/// UTF16 BE: FE FF
/// UTF16 LE: FF FE
/// UTF32 BE: 00 00 FE FF
/// UTF32 LE: FF FE 00 00
/// </summary>
public static Encoding DetectEncoding(Stream i_Stream)
{
    if (!i_Stream.CanSeek || !i_Stream.CanRead)
        throw new Exception("DetectEncoding() requires a seekable and readable Stream");

    // Try to read 4 bytes. If the stream is shorter, less bytes will be read.
    Byte[] u8_Buf = new Byte[4];
    int s32_Count = i_Stream.Read(u8_Buf, 0, 4);
    if (s32_Count >= 2)
    {
        if (u8_Buf[0] == 0xFE && u8_Buf[1] == 0xFF)
        {
            i_Stream.Position = 2;
            return new UnicodeEncoding(true, true);
        }

        if (u8_Buf[0] == 0xFF && u8_Buf[1] == 0xFE)
        {
            if (s32_Count >= 4 && u8_Buf[2] == 0 && u8_Buf[3] == 0)
            {
                i_Stream.Position = 4;
                return new UTF32Encoding(false, true);
            }
            else
            {
                i_Stream.Position = 2;
                return new UnicodeEncoding(false, true);
            }
        }

        if (s32_Count >= 3 && u8_Buf[0] == 0xEF && u8_Buf[1] == 0xBB && u8_Buf[2] == 0xBF)
        {
            i_Stream.Position = 3;
            return Encoding.UTF8;
        }

        if (s32_Count >= 4 && u8_Buf[0] == 0 && u8_Buf[1] == 0 && u8_Buf[2] == 0xFE && u8_Buf[3] == 0xFF)
        {
            i_Stream.Position = 4;
            return new UTF32Encoding(true, true);
        }
    }

    i_Stream.Position = 0;
    return Encoding.Default;
}

Solution 5

I use Ude that is a C# port of Mozilla Universal Charset Detector. It is easy to use and gives some really good results.

Share:
124,008

Related videos on Youtube

Cédric Boivin
Author by

Cédric Boivin

Updated on July 05, 2022

Comments

  • Cédric Boivin
    Cédric Boivin almost 2 years

    I try to detect which character encoding is used in my file.

    I try with this code to get the standard encoding

    public static Encoding GetFileEncoding(string srcFile)
        {
          // *** Use Default of Encoding.Default (Ansi CodePage)
          Encoding enc = Encoding.Default;
    
          // *** Detect byte order mark if any - otherwise assume default
          byte[] buffer = new byte[5];
          FileStream file = new FileStream(srcFile, FileMode.Open);
          file.Read(buffer, 0, 5);
          file.Close();
    
          if (buffer[0] == 0xef && buffer[1] == 0xbb && buffer[2] == 0xbf)
            enc = Encoding.UTF8;
          else if (buffer[0] == 0xfe && buffer[1] == 0xff)
            enc = Encoding.Unicode;
          else if (buffer[0] == 0 && buffer[1] == 0 && buffer[2] == 0xfe && buffer[3] == 0xff)
            enc = Encoding.UTF32;
          else if (buffer[0] == 0x2b && buffer[1] == 0x2f && buffer[2] == 0x76)
            enc = Encoding.UTF7;
          else if (buffer[0] == 0xFE && buffer[1] == 0xFF)      
            // 1201 unicodeFFFE Unicode (Big-Endian)
            enc = Encoding.GetEncoding(1201);      
          else if (buffer[0] == 0xFF && buffer[1] == 0xFE)      
            // 1200 utf-16 Unicode
            enc = Encoding.GetEncoding(1200);
    
    
          return enc;
        }
    

    My five first byte are 60, 118, 56, 46 and 49.

    Is there a chart that shows which encoding matches those five first bytes?

    • Mark Byers
      Mark Byers over 13 years
      The byte order mark should not be used to detect encodings. There are cases when it is ambiguous which encoding is used: UTF-16 LE, and UTF-32 LE both start with the same two bytes. The BOM should only be used to detect byte order (hence its name). Also UTF-8 strictly speaking should not even have a byte order mark and adding one can interfere with some software that doesn't expect it.
    • Cédric Boivin
      Cédric Boivin over 13 years
      @Mark Bayers, so it's there a way i can detech witch encoding are use in my file ?
    • Dan W
      Dan W over 11 years
      @Mark Byers: UTF-32 LE starts with the same 2 bytes as UTF-16 LE. However, it also follows with bytes 00 00 which is (I think very) unlikely in UTF-16 LE. Also, the BOM in theory should indicate as you say, but in practice, it acts as a signature to show what encoding it. See: unicode.org/faq/utf_bom.html#bom4
    • Nyerguds
      Nyerguds about 8 years
      Is the UTF7 BOM actually a real thing? I tried making a UTF7Encoding object and perform GetPreamble() on it, and it returned an empty array. And unlike utf8 it doesn't have a constructor parameter for it.
    • Elmue
      Elmue over 7 years
      Mark Beyers: Your comment is COMPLETELY wrong. The BOM is a bullet proof way to detect encoding. UTF16 BE and UTF32 BE are not ambiguous. You should study the topic before writing wrong comments. If a software does not handle UTF8 BOM then this software is either fom the 1980's or badly programmed. Today every software should handle and recognize BOM's.
    • TarmoPikaro
      TarmoPikaro over 7 years
    • jstine
      jstine over 6 years
      Elmue has clearly never used batch filtering, concatenation, and pipe redirection of streams of plaintext files. It is unrealistic to handle/support BOMs in such scenarios.
  • Cédric Boivin
    Cédric Boivin over 13 years
    Not work, the StreamReader suppose that your file is in UTF-8
  • Phil Hunt
    Phil Hunt over 13 years
    @Cedric: Check MSDN for this constructor. Do you have evidence that the constructor doesn't work consistently with the documentation? Granted, that is possible in Microsoft's docs :-)
  • Cédric Boivin
    Cédric Boivin over 13 years
    Hummmm ... so i need to test all ?
  • Cédric Boivin
    Cédric Boivin over 13 years
    sorry, your rigth. But it's dont work :-( the encoding is not good
  • Ira Baxter
    Ira Baxter almost 12 years
    Can you clarify your analysis of UTF-8 above? I think you are saying that if you have a random [flat] distribution of characters from which the file is made, you have low probabilities of getting confused. As a practical matter, no text files have flat distributions like this... so I'd expect a severe impact on the analysis with false positive rate being much higher. How can you distinguish between UTF-16 and UTF-8 if the files are an even number of bytes?
  • dan04
    dan04 almost 12 years
    Yes, it's for a random distribution of octets. For real data, it's more difficult to calculate. But the point is, for a legacy-encoded (e.g., windows-1252) file to get misinterpreted as being UTF-8, it would have to contain weird character sequences like ’.
  • Ira Baxter
    Ira Baxter almost 12 years
    OK, what I expected. Can you address distinguishing UTF-8/UTF-16? PS: Thanks for a very helpful answer. +1
  • dan04
    dan04 almost 12 years
    You can't detect UTF-16 by validation like you can with UTF-8 because the false positive rate is much higher. For example, about 93.8% of random 4-byte sequences happen to be valid UTF-16, the only invalid ones being noncharacters and unpaired surrogates. And 100% of even-length ASCII strings happen to be valid UTF-16 (even if it's a nonsensical Chinese sentence like 畂桳栠摩琠敨映捡獴). The only reliable way to detect UTF-16 is to look for the BOM, FE FF or FF FE. You'd still get a false positive for Latin1 þÿ or ÿþ, but those are unlikely combinations.
  • Daniel Bişar
    Daniel Bişar almost 12 years
    This version also only checks for BOM
  • Dan W
    Dan W over 11 years
    For UTF-16BE text files, if a certain percentage of even bytes are zeroed (or check odd bytes for UTF-16LE), then there's a good probability that the encoding is UTF-16. What do you think?
  • dan04
    dan04 over 11 years
    Yes, that would work most of the time. You might still get some false negatives from files that contain only non-Latin1 characters.
  • EProgrammerNotFound
    EProgrammerNotFound about 11 years
    @dan04 what is the | in 0x|10?? it is a constant value?
  • dan04
    dan04 about 11 years
    An "or" operator. And the "x" indicates any hexadecimal digit. That is, that byte can be any of the 17 values between hex 00 and hex 10, inclusive.
  • Carl Walsh
    Carl Walsh over 10 years
    Um, don't you have to call Read() before reading CurrentEncoding? The MSDN for CurrentEncoding says "The value can be different after the first call to any Read method of StreamReader, since encoding autodetection is not done until the first call to a Read method."
  • Geoffrey McGrath
    Geoffrey McGrath about 10 years
    My testing shows that this cannot be used reliably, therefore should not be used at all.
  • Nyerguds
    Nyerguds over 8 years
    Actually, Encoding.GetEncoding("Windows-1252") gives a different object class than Encoding.ASCII. While debugging, Windows-1252 shows as a System.Text.SBCSCodePageEncoding object, whereas ascii is a System.Text.ASCIIEncoding object. I never use the ASCII one when I need Windows-1252
  • Nyerguds
    Nyerguds over 8 years
    UTF-8 validity can nicely be detected by doing bit pattern checks; the first byte's bit pattern accurately tells you how many bytes will follow, and the following bytes also have control bits to check. The pattern are all shown here: ianthehenry.com/2015/1/17/decoding-utf-8
  • Nyerguds
    Nyerguds about 8 years
    That's just pure ascii. UTF-8 without special characters simply equals ASCII, and if there are special characters, those use specific detectable bit patterns.
  • marsze
    marsze over 7 years
    I am surprised this answer doesn't mention Windows ANSI, considering how common it is.
  • AbeyMarquez
    AbeyMarquez over 6 years
    From the MSDN docs, the StreamReader overload you mention "[...] initializes the encoding to UTF8Encoding" then the second param, detectEncodingFromByteOrderMarks, "[...] automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks." It will first default to UTF8 and then it won't autodetect all encodings, so this is not what OP's looking for. (msdn.microsoft.com/en-us/library/7bc2hwcb(v=vs.110).aspx)
  • Nyerguds
    Nyerguds over 5 years
    @marsze That's because it's a dumb one-byte-per-symbol 8-bit encoding. Every byte is technically valid in it (though most of the < 0x20 range are not advised), so it can't be detected except through language heuristics... which is a whole other mess. Not to mention, ANSI isn't one encoding; there's a Windows ANSI encoding for every language region.
  • marsze
    marsze over 5 years
    @Nyerguds I didn't say it's a good encoding. I said it's very commonly used and I'm therefore surprised you didn't mention it. The points you just made replying on my comment would have been very helpful included in your answer.
  • Nyerguds
    Nyerguds over 5 years
    @marsze This isn't my answer... and it's not mentioned because this is about detection, and, as I mentioned, you can't really detect simple one-byte-per-symbol encodings. I have personally posted an answer on this place about (vaguely) identifying it, though.
  • dan04
    dan04 over 5 years
    @marsze: There, I've added a section for Latin-1.
  • marsze
    marsze over 5 years
    Sorry @Nyerguds you just defended the answer too enthusiastically I guess. And dan04 thanks! Reads well.
  • Amit
    Amit almost 5 years
    @Nyerguds maybe not. I have an UTF-8 text file (without "specific detectable bit patterns" - and mostly all English characters). If I read it with ASCII, it fails to read one particular "-" symbol.
  • Nyerguds
    Nyerguds almost 5 years
    Impossible. If the character is not ascii, then it will be encoded using those specific detectable bit patterns; that's how utf-8 works. More likely, your text is neither ascii nor utf-8, but just an 8-bit encoding like Windows-1252.
  • Amr Ali
    Amr Ali almost 4 years
    To match regular expressions against binary data (bytes), the correct method is: string data = Encoding.GetEncoding("iso-8859-1").GetString(bytes); Because it is the only single-byte encoding that has 1-to-1 byte mapping to string.
  • Michael Hutter
    Michael Hutter over 2 years
    How can I use this function if I only have a file name? I need a function like this: public static Encoding DetectEncoding(string sFilename)
  • Alex from Jitbit
    Alex from Jitbit over 2 years
    @MichaelHutter use File.Open(sFilename) to get the file stream. Then proceed
  • Michael Hutter
    Michael Hutter over 2 years
    File.Open(sFilename) opens a file and determines the Encoding according to the BOM inside the file. If the BOM is missing it may make a mistake by assuming a wrong Encoding. This answer is doing the same "mistake". It only works if there is a BOM. In case if there is no BOM inside the file, it is necessary to analyse the whole file content like it is done here: stackoverflow.com/a/69312696/9134997
  • Elmue
    Elmue over 2 years
    This answer answers the question that was asked by Cedrik. There is no mistake in the answer. Your mistake is that you did not read the question. Any detection on the file's text content without BOM will NEVER be reliable.