Detect UTF-16 file content


Solution 1

Ditto to what Brian Agnew said about reading the byte order mark, a special two bytes that might appear at the beginning of the file.

You can also tell whether it is ASCII by scanning every byte in the file and checking that each is less than 128. If they are all less than 128, then it's just an ASCII file. If any byte is 128 or greater, there is some other encoding in there.
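That scan can be sketched in a few lines (Python 3 assumed, where iterating over `bytes` yields integers):

```python
def is_ascii(data: bytes) -> bool:
    """Return True if every byte is < 128, i.e. pure 7-bit ASCII."""
    return all(b < 128 for b in data)

assert is_ascii(b"plain text")
assert not is_ascii("caf\u00e9".encode("utf-8"))  # 0xC3 0xA9 are >= 128
```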

Solution 2

You may be able to read a byte-order-mark, if the file has this present.
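A minimal sketch of such a check (the function name is illustrative; note that a UTF-32 BOM begins with the same bytes as the UTF-16 little-endian one, so this simple version only distinguishes the marks shown):

```python
def detect_bom(data: bytes):
    """Return the encoding implied by a leading byte order mark, or None."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    return None  # no BOM present; fall back to other heuristics
```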

Solution 3

UTF-16 code units are 16 bits each, with some characters taking two units (a surrogate pair, whose first unit lies in the range 0xD800 to 0xDBFF). So simply scanning each byte to see if it is less than 128 won't work. For example, the two bytes 0x20 0x20 encode two spaces in ASCII and UTF-8, but a single character, 0x2020 (dagger), in UTF-16. If the text is known to be English with the occasional non-ASCII character, then most every other byte will be zero. But without some a priori knowledge about the text and/or its encoding, there is no reliable way to distinguish a general ASCII string from a general UTF-16 string.
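The 0x20 0x20 ambiguity can be demonstrated directly (Python shown; `utf-16-be` is used so the byte order matches the 0x2020 reading):

```python
data = bytes([0x20, 0x20])

# The same two bytes decode differently depending on the assumed encoding.
assert data.decode("ascii") == "  "           # two spaces
assert data.decode("utf-16-be") == "\u2020"   # one dagger character
```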

Solution 4

First off, ASCII is 7-bit, so if any byte has its high bit set you know the file isn't ASCII.

The various "common" character sets such as ISO-8859-x, Windows-1252, etc. are 8-bit, so if every other byte is 0, you know that you're dealing with UTF-16 text that only uses the ISO-8859 characters.

You'll run into problems when you're trying to distinguish between UTF-16 and a multi-byte encoding such as UTF-8. In this case, almost every byte will have a value, so you can't make an easy decision. You can, as Pascal says, do some sort of statistical analysis of the content: Arabic and Ancient Greek probably won't be in the same file. However, this is probably more work than it's worth.


Edit in response to OP's comment:

I think that it will be sufficient to check for the presence of 0-value bytes (ASCII NUL) within your content, and make the choice based on that. The reason being that JavaScript keywords are ASCII, and ASCII is a subset of Unicode. Therefore any Unicode representation of those keywords will consist of one byte containing the ASCII character (low byte), and another containing 0 (the high byte).
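That NUL-byte check is a one-liner; here is a sketch (the function name is illustrative, and the heuristic assumes the content is essentially ASCII-range source code, as the JavaScript keywords argument above does):

```python
def looks_like_utf16(data: bytes) -> bool:
    """Heuristic: ASCII-range text stored as UTF-16 contains 0x00 high bytes;
    the same text stored as ASCII or UTF-8 contains no NUL bytes at all."""
    return b"\x00" in data

script = "var x = 1;"
assert not looks_like_utf16(script.encode("ascii"))
assert looks_like_utf16(script.encode("utf-16-le"))
```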

My one caveat is that you carefully read the documentation to ensure that their use of the word "Unicode" is correct (I looked at this page to understand the function, did not look any further).

Solution 5

Unicode is an alphabet, not an encoding. You probably meant UTF-16. There are lots of libraries around (python-chardet comes to mind instantly) to autodetect the encoding of text, though they all use heuristics.

Author: Franck Freiburger

A computer enthusiast, I studied in this field and have gained twenty years of business experience as a seasoned software engineer in web and desktop technologies. I have also built many innovative personal and open-source projects, such as jslibs. Managing these projects led me to go independent and start my own company, so that I could vary my projects and offer my expertise in computing. I offer the development of professional applications and websites using innovative technologies such as HTML5, CSS3, ES6 and Node.js, guaranteeing an optimal and tailored result.

Updated on July 22, 2022

Comments

  • Franck Freiburger
    Franck Freiburger almost 2 years

Is it possible to know if a file has Unicode (16-bit per char) or 8-bit ASCII content?

  • Amit Patil
    Amit Patil over 14 years
    Unfortunately Microsoft have really confused this issue by consistently calling the UTF-16LE encoding “Unicode”.
  • Franck Freiburger
    Franck Freiburger over 14 years
    I have to choose between JS_CompileScript() and JS_CompileUCScript() to compile JavaScript files for my native embedding (code.google.com/p/jslibs)
  • Victor Engel
    Victor Engel almost 7 years
    Unicode is not an alphabet. It is an encoding, which encodes many alphabets. Think of it as a mapping from alphabets to a representation of those alphabets in digital form.
  • Gustaf Liljegren
    Gustaf Liljegren over 6 years
    Unicode is neither an alphabet nor an encoding, but a coded character set, offering multiple character encodings (UTF-8, UTF-16 and UTF-32).
  • Victor Engel
    Victor Engel over 6 years
    Shall I disagree one more time? It's not an alphabet, an encoding, or a coded character set as ISO/IEC 10646 is, but a standard for the encoding, handling, and representation of writing systems. In addition to the character set, Unicode adds rules for collation, normalization forms, and the bidirectional algorithm for right-to-left scripts such as Arabic and Hebrew. en.wikipedia.org/wiki/…