How to determine the encoding of a CSV file?
Note: In general, identifying the original encoding of a text file is not a deterministic problem. If there is no metadata (e.g. an HTML content-type header), you can only guess. There are tools and libraries that help you guess, and some of them do a pretty good job, but you can't be 100% sure. This is especially true if 8-bit encodings (like Latin-1, Windows CP1252 etc.) are involved.
But if you already know that the encoding must be either UTF-8 or UTF-16, then you're in a good situation.
In practice, UTF-16-encoded text files begin with a BOM (byte order mark), since the byte order would otherwise be ambiguous. You can use this fact to detect the encoding. There are two different "flavors" of UTF-16: Big Endian (BE) and Little Endian (LE). Since UTF-16 uses two-byte code units (16 bits), there are two ways to order the bytes: high byte first (BE) or low byte first (LE). You can tell from the BOM, i.e. by looking at the very first two bytes of the file:
- FE FF → UTF-16 BE
- FF FE → UTF-16 LE
For UTF-8, a BOM is not needed: the Unicode Standard permits it but neither requires nor recommends it. However, many Windows applications long refused to recognise UTF-8 unless the file started with a BOM, which led to the pseudo-standard "UTF-8 with BOM". If the BOM is present, it occupies the first three bytes of the file:
- EF BB BF → UTF-8 with BOM
If your file starts with something different, then you either have BOM-less UTF-8, or some non-UTF encoding (ASCII, Latin-1...).
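The byte checks above can be sketched in Python as follows. This is a minimal sketch, assuming the file is already known to be either UTF-8 or UTF-16; the function names `bom_to_codec`, `detect_bom_encoding` and `read_rows` are made up for this illustration. It relies on the standard-library `utf-16` and `utf-8-sig` codecs, which consume the BOM while decoding.

```python
import codecs
import csv


def bom_to_codec(head, default="utf-8"):
    """Map the first bytes of a file to a Python codec name.

    Assumption: the data is either UTF-8 or UTF-16. Without a BOM we
    fall back to `default`, which is only a guess.
    """
    if head.startswith(codecs.BOM_UTF8):          # EF BB BF
        return "utf-8-sig"                        # decodes UTF-8 and drops the BOM
    if head.startswith((codecs.BOM_UTF16_BE,      # FE FF
                        codecs.BOM_UTF16_LE)):    # FF FE
        return "utf-16"                           # byte order is taken from the BOM
    return default


def detect_bom_encoding(path, default="utf-8"):
    """Read the first three bytes of `path` and classify them by BOM."""
    with open(path, "rb") as f:
        return bom_to_codec(f.read(3), default)


def read_rows(path):
    """Read a CSV file using the codec suggested by its BOM."""
    encoding = detect_bom_encoding(path)
    with open(path, newline="", encoding=encoding) as f:
        return list(csv.reader(f))
```

A call such as `read_rows("data.csv")` (with a hypothetical file name) would then return the rows as lists of strings. Note that the fallback to UTF-8 is still only a guess: a file without a BOM could just as well be Latin-1 or another 8-bit encoding.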
Bornegio
Updated on June 05, 2022
Comments
- Bornegio almost 2 years: I'm writing a script that has to perform some operations on a CSV file, but I have no idea whether the file will be encoded with UTF-8 or UTF-16. How can I check if a given CSV file contains a UTF-16 BOM?
- martineau about 5 years: Sounds like it may be impossible; see "How to determine the encoding of text?"
- Giacomo Catenazzi about 5 years: UTF-16 is not used much for exchanging data. Try opening the file in an editor (or a browser) with different encodings: when the data looks right, that is probably the correct encoding. If you see many 00 bytes, it is almost certainly UTF-16 (or another encoding with 16 or more bits per unit). [A CSV file needs to contain a comma, U+002C, so in UTF-16 it must contain a 00 byte.]
- Tom Blodget about 5 years: It might be more straightforward to tell the sender that you only accept UTF-8 (or whatever). Or accept a file format where the character encoding is not separated from the file, like .xlsx.