How can I detect the encoding/codepage of a text file

Solution 1

You can't detect the codepage; you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results. I can't find it now, but I'm sure Notepad can be tricked into displaying English text in Chinese.

Anyway, this is what you need to read: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Specifically Joel says:

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII. There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

Solution 2

If you're looking to detect non-UTF encodings (i.e. no BOM), you're basically down to heuristics and statistical analysis of the text. You might want to take a look at the Mozilla paper on universal charset detection (an archived copy with better formatting is available via the Wayback Machine).
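
As a very rough illustration of the heuristic idea (a hedged sketch only, not the Mozilla algorithm): a common first pass is to check whether the raw bytes even form valid UTF-8, since text saved in a single-byte codepage that uses accented characters will almost never survive a strict UTF-8 decode, and only fall back to statistical guessing among the legacy codepages when that check fails. In C# the strict check can be done with a throwing UTF8Encoding:

using System.Text;

// Returns true if the bytes form a valid UTF-8 sequence. Text saved in a
// single-byte codepage with accented characters will usually fail this check.
static bool LooksLikeValidUtf8(byte[] bytes) {
    var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
    try {
        strictUtf8.GetString(bytes);
        return true;
    } catch (DecoderFallbackException) {
        return false;
    }
}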

Solution 3

Have you tried the C# port of the Mozilla Universal Charset Detector?

Example from http://code.google.com/p/ude/

using System;
using System.IO;

// Requires a reference to the UDE library (C# port of the Mozilla Universal Charset Detector).
public class CharsetDetectionExample
{
    public static void Main(String[] args)
    {
        string filename = args[0];
        using (FileStream fs = File.OpenRead(filename)) {
            // Feed the file's bytes to the detector, then signal end of data.
            Ude.CharsetDetector cdet = new Ude.CharsetDetector();
            cdet.Feed(fs);
            cdet.DataEnd();
            if (cdet.Charset != null) {
                Console.WriteLine("Charset: {0}, confidence: {1}",
                    cdet.Charset, cdet.Confidence);
            } else {
                Console.WriteLine("Detection failed.");
            }
        }
    }
}
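
Note that the detector only gives you the charset name as a string; to actually read the file you still have to map that name to a System.Text.Encoding yourself. A minimal sketch of that step (hedged: Encoding.GetEncoding will not necessarily recognize every name the detector can report, hence the fallback):

using System;
using System.IO;
using System.Text;

// Hypothetical helper: turn a detected charset name into an Encoding and read the file.
static string ReadAllTextWithDetectedCharset(string filename, string detectedCharset, Encoding fallback) {
    Encoding encoding;
    try {
        // Map the detector's name (e.g. "windows-1252") to a .NET encoding.
        encoding = Encoding.GetEncoding(detectedCharset);
    } catch (ArgumentException) {
        // .NET does not know this name; fall back to a caller-supplied default.
        encoding = fallback;
    }
    return File.ReadAllText(filename, encoding);
}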

Solution 4

You can't detect the codepage

This is clearly false. Every web browser has some kind of universal charset detector to deal with pages which have no indication whatsoever of an encoding. Firefox has one. You can download the code and see how it does it. See some documentation here. Basically, it is a heuristic, but one that works really well.

Given a reasonable amount of text, it is even possible to detect the language.

Solution 5

I know it's very late for this question and this solution won't appeal to some (because of its English-centric bias and its lack of statistical/empirical testing), but it's worked very well for me, especially for processing uploaded CSV data:

http://www.architectshack.com/TextFileEncodingDetector.ashx

Advantages:

  • BOM detection built-in
  • Default/fallback encoding customizable
  • Pretty reliable (in my experience) for Western European files containing some exotic data (e.g. French names) with a mixture of UTF-8 and Latin-1-style files - basically the bulk of US and Western European environments.

Note: I'm the one who wrote this class, so obviously take it with a grain of salt! :)
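
For reference, the overall shape of that approach is: check for a BOM first (the only unambiguous signal), otherwise run a cheap content heuristic over the bytes, and finally fall back to a caller-supplied default. A minimal sketch along those lines (an illustration of the idea, not the actual API of the class above; the names are made up):

using System.IO;
using System.Text;

static class SimpleEncodingGuesser {
    // Hypothetical helper showing the BOM-then-heuristic-then-fallback idea.
    public static Encoding Guess(string path, Encoding fallback) {
        byte[] bytes = File.ReadAllBytes(path);

        // 1. A byte order mark is the only unambiguous signal.
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            return new UTF8Encoding(true);
        if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
            return Encoding.Unicode;            // UTF-16 little-endian
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Encoding.BigEndianUnicode;   // UTF-16 big-endian

        // 2. No BOM: a real implementation would run a content heuristic here
        //    (e.g. the strict UTF-8 validity check sketched under Solution 2)
        //    before giving up and returning the caller-supplied default,
        //    e.g. Windows-1252 for Western European environments.
        return fallback;
    }
}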

Comments

  • user1149201
    user1149201 over 2 years

    In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When read, these files sometimes contain garbage, because the files were created in a different/unknown codepage.

    Is there a way to (automatically) detect the codepage of a text file?

    The detectEncodingFromByteOrderMarks option on the StreamReader constructor works for UTF-8 and other Unicode files that carry a byte order mark, but I'm looking for a way to detect code pages like ibm850 or windows1252.


    Thanks for your answers, this is what I've done.

    The files we receive come from end-users who do not have a clue about codepages. The receivers are also end-users; by now, this is all they know about codepages: codepages exist, and they are annoying.

    Solution:

    • Open the received file in Notepad and look at a garbled piece of text. If somebody is called François or something similar, your human intelligence can guess that.
    • I've created a small app that the user can use to open the file and enter a piece of text that the user knows will appear in the file when the correct codepage is used.
    • Loop through all codepages and display the ones that decode the file so that it contains the user-provided text (a rough sketch of this idea is shown below).
    • If more than one codepage pops up, ask the user to specify more text.
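
    A rough sketch of that brute-force idea (a hypothetical helper, assuming the whole file fits in memory; on modern .NET the legacy codepages only show up after registering the CodePagesEncodingProvider from the System.Text.Encoding.CodePages package):

    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    static class CodepageSearch {
        // Returns every installed codepage under which the file's bytes
        // decode to text containing the snippet the user expects to see.
        public static IEnumerable<EncodingInfo> FindCandidates(string path, string expectedText) {
            byte[] bytes = File.ReadAllBytes(path);
            foreach (EncodingInfo info in Encoding.GetEncodings()) {
                string decoded = info.GetEncoding().GetString(bytes);
                if (decoded.Contains(expectedText))
                    yield return info;   // candidate, e.g. ibm850 or windows-1252
            }
        }
    }
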
  • JV.
    JV. over 15 years
    "heuristics" - so the browser isn't quite detecting it, it's making an educated guess. "works really well" - so it doesn't work all the time then? Sounds to me like we're in agreement.
  • Jon Trauntvein
    Jon Trauntvein over 15 years
    The standard for HTML dictates that, if the character set is not defined by the document, then it should be considered to be encoded as UTF-8.
  • sina
    sina over 15 years
    Funnily enough my Firefox 3.05 installation detects that page as UTF-8, showing a number of question-mark-in-a-diamond glyphs, although the source has a meta tag for Windows-1252. Manually changing the character encoding shows the document correctly.
  • Tao
    Tao about 13 years
    Your sentence "If you're looking to detect non-UTF encodings (i.e. no BOM)" is slightly misleading; the unicode standard does not recommend adding a BOM to utf-8 documents! (and this recommendation, or lack thereof, is the source of many headaches). ref: en.wikipedia.org/wiki/Byte_order_mark#UTF-8
  • Kos
    Kos over 11 years
    Which is cool unless we're reading non-standard HTML documents. Or non-HTML documents.
  • sashoalm
    sashoalm almost 11 years
    This is done so you can concatenate UTF-8 strings without accumulating redundant BOMs. Besides, a Byte-Order Mark is not needed for UTF-8, unlike UTF-16 for example.
  • geneorama
    geneorama over 10 years
    I downvoted this answer for two reasons. First, saying that "you need to be told" is not helpful. Who would tell me, and through what medium would they do so? If I'm the one who saved the file, who would I ask? Myself? Second, the article is not especially helpful as a resource for answering the question. The article is more of a history of encoding written in a David Sedaris style. I appreciate the narrative, but it doesn't simply / directly answer the question.
  • JV.
    JV. over 10 years
    @geneorama, I think Joel's article addresses your questions better than I ever could, but here goes... The medium surely depends on the environment in which the text is received. Better that the file (or whatever) contains that information (I'm thinking HTML and XML). Otherwise the person sending the text should be allowed to supply that information. If you were the one who created the file, how can you not know what encoding it uses?
  • JV.
    JV. over 10 years
    @geneorama, continued... Finally, I suppose the main reason the article doesn't answer the question simply is because there is no simple answer to that question. If the question were "How can I guess..." then I would have answered differently.
  • geneorama
    geneorama over 10 years
    @JV I later learned that xml/html can specify character encoding, thanks for mentioning that useful tidbit.
  • geneorama
    geneorama over 10 years
    @JV "Create a file" may be a poor choice of words. I assume that a user can specify the encoding of a file that the user generates. Recently I "created" a file from a Hadoop Cluster using Hive, and passed it to an FTP before downloading it to various client machines. The result had some unicode garbage in it, but I don't know which step created the issue. I didn't ever explicitly specify the encoding. I wish that I could check the encoding at each step.
  • Erik Aronesty
    Erik Aronesty over 10 years
    Use the tool "uchardet". It does this. Larger files have more confidence (obviously), up to 6 9's. Naysaying isn't an answer.
  • seebiscuit
    seebiscuit almost 10 years
    Worked flawlessly for Windows-1252 type.
  • Bartosz
    Bartosz over 9 years
    And how can you use it to read a text file into a string? CharsetDetector returns the name of the encoding as a string, and that's it...
  • SurajS
    SurajS about 9 years
    It's just "encoding" link here.. and the description says we have to provide the Encoding..
  • leppie
    leppie about 9 years
    @SurajS: Look at the other overloads.
  • ibondre
    ibondre almost 9 years
    The original author wants to detect the encoding of a file which would potentially not have a BOM marker. The StreamReader only detects the encoding from a BOM header, as per its signature: public StreamReader( Stream stream, bool detectEncodingFromByteOrderMarks )
  • PrivatePyle
    PrivatePyle over 8 years
    @Bartosz private Encoding GetEncodingFromString(string encoding) { try { return Encoding.GetEncoding(encoding); } catch { return Encoding.ASCII; } }
  • tripleee
    tripleee over 8 years
    @JonTrauntvein This depends on which revision of the HTML standard you are reading. It used to declare ISO-8859-1 as the default; if the HTML declares itself to be in a version which had this default, then that's obviously what it means.
  • tripleee
    tripleee over 8 years
    @JonTrauntvein Furthermore, I could not quickly corroborate your claim for HTML 5, either. I see a massive amount of heuristics to maintain interoperability with legacy encodings, and nothing to finally settle the matter if the heuristics fail. See e.g. html.spec.whatwg.org/multipage/…
  • tripleee
    tripleee over 8 years
    @JohnMachin I agree that it is rare, but it is mandated e.g. in some parts of the IMAP protocol. If that's where you are, you would not have to guess, though.
  • ViRuSTriNiTy
    ViRuSTriNiTy almost 8 years
    Very nice solution. One can easily wrap the body of ReadAsString() in a loop of allowed encodings if more than 2 encodings (UTF-8 and ANSI 1252) should be allowed.
  • z80crew
    z80crew about 7 years
    This answer is wrong, so I had to downvote. Claiming that "you cannot detect the codepage" is false is itself wrong. You can guess, and your guesses can be rather good, but you cannot "detect" a codepage.
  • z80crew
    z80crew about 7 years
    @JonTrauntvein According to the HTML5 specs a character encoding declaration is required even if the encoding is US-ASCII – a lacking declaration results in using a heuristic algorithm, not in falling back to UTF8.
  • Paul B
    Paul B about 7 years
    On Mac via homebrew: brew install uchardet
  • SedJ601
    SedJ601 over 6 years
    After trying tons of examples, I finally got to yours. I am in a happy place right now. lol Thanks!!!!!!!
  • chuckc
    chuckc over 4 years
    This may not be the answer to how to detect 1252 vs 1250, but it should absolutely be the answer for "How to detect UTF-8" with or without a BOM !!
  • yucer
    yucer about 4 years
    But it is still useful for support/debugging. Suppose everybody agreed that the input files of a process will use a given encoding, but then a user complains that your system is failing to read a file. Both parties need a library or tool to prove to the other that the file contains the intended encoding. Or your software can perform the validation first and give a message like "the encoding of the file order_2345.xml is not UTF-8".
  • Nyerguds
    Nyerguds over 3 years
    @chuckc There is no decent way to distinguish between different no-BOM one-byte-per-symbol encodings. At that level, you're purely down to heuristics.