What's the best way to identify unicode encoded text files in Windows?

Solution 1

See “How to detect the character encoding of a text-file?” or “How to reliably guess the encoding [...]?”

  • UTF-8 can be detected with validation. You can also look for the BOM EF BB BF, but don't rely on it.
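  • UTF-16 can be detected by looking for the BOM (FF FE for little-endian, FE FF for big-endian).
  • UTF-32 can be detected by validation, or by the BOM (FF FE 00 00 for little-endian, 00 00 FE FF for big-endian).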
  • Otherwise assume the ANSI code page.
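The checks above can be sketched in Python roughly like this. It is a minimal sketch, not a definitive detector: the function name guess_encoding and the labels it returns are made up for illustration, and it skips BOM-less UTF-16/UTF-32 entirely. Note that UTF-32 has to be tested before UTF-16, because the UTF-32 little-endian BOM starts with the same two bytes as the UTF-16 little-endian BOM.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the UTF-32 entries must come first.
# The labels on the right are illustrative names, not Python codec names.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-bom"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(data: bytes) -> str:
    """Guess the encoding of raw file bytes (hypothetical helper)."""
    # 1. A BOM, if present, settles the question.
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    # 2. Only 7-bit bytes: plain ASCII (which is also valid UTF-8).
    if all(b < 0x80 for b in data):
        return "ascii"
    # 3. Non-ASCII bytes that validate as UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # 4. Otherwise assume the ANSI code page.
        return "ansi"
```

Usage would be something like guess_encoding(open(path, "rb").read()); the file must be opened in binary mode so that Python does not decode it first.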

Our codebase doesn't include any non-ASCII chars. I will try to grep for the BOM in files in our codebase. Thanks for the clarification.

Well that makes things a lot simpler. UTF-8 without non-ASCII chars is ASCII.
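If the codebase really is supposed to be pure ASCII, then any file containing a BOM, a NUL byte, or a byte of 0x80 or above is a candidate for conversion. Here is a rough sketch of a scanner along those lines; the function name find_non_ascii_files and the default extension list are assumptions for illustration, so adjust them to your tree.

```python
import os


def find_non_ascii_files(root, extensions=(".c", ".h", ".cpp", ".txt")):
    """Yield paths under root whose bytes are not plain 7-bit ASCII text.

    The extension list is only an example; pass your own.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                data = f.read()
            # Every BOM contains bytes >= 0x80, and BOM-less UTF-16/UTF-32
            # text made of ASCII characters contains NUL bytes, so both
            # cases are caught by this one test.
            if b"\x00" in data or any(b >= 0x80 for b in data):
                yield path
```

Something like `for p in find_non_ascii_files("."): print(p)` would then list the files worth a closer look.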

Solution 2

Unicode is a standard, not an encoding. There are many encodings that implement Unicode, including UTF-8, UTF-16, UCS-2, and others. The translation of any of these encodings to ASCII depends entirely on what encoding your "different editors" use.

Some editors insert byte-order marks (BOMs) at the start of Unicode files. If your editors do that, you can use the BOMs to detect the encoding.

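ANSI is a standards body that has published several encodings for digital character data. The "ANSI" encoding in Windows is actually the system's default code page, which is Windows-1252 (CP-1252) on US and Western European systems and is not an ANSI standard. MS-DOS used a different set of OEM code pages, such as CP437.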

Does your codebase include non-ASCII characters? You may have better compatibility using a Unicode encoding rather than an ANSI one or CP-1252.

Solution 3

Actually, if you want to find out in Windows whether a file is Unicode (in practice, UTF-16), simply run findstr on the file for a string you know is in there:

findstr /I /C:"SomeKnownString" file.txt

If the file is UTF-16, this will come back empty. Then, to be sure, run findstr on a single letter or digit you know is in the file:

findstr /I /C:"P" file.txt

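You will probably get many occurrences, and the key is that the characters in the matching lines will be spaced apart. That happens because UTF-16 stores each ASCII character as two bytes, one of which is zero, and findstr matches byte by byte. This is a sign that the file is UTF-16 and not ASCII. Note that this trick does not tell you anything about UTF-8 files.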
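To see why a multi-character search fails while a single-character search succeeds, here is a small illustration in Python rather than findstr. It just encodes the example string as BOM-less little-endian UTF-16 and does plain byte searches; the variable names are made up for the example.

```python
text = "SomeKnownString"

ascii_bytes = text.encode("ascii")      # one byte per character
utf16_bytes = text.encode("utf-16-le")  # two bytes per character, no BOM

# Each ASCII character is followed by a zero byte in UTF-16 LE.
print(utf16_bytes[:8])  # b'S\x00o\x00m\x00e\x00'

# A byte-wise search for the whole string fails, because the zero
# bytes sit between the letters...
whole_string_found = ascii_bytes in utf16_bytes

# ...but a search for a single letter still succeeds.
single_letter_found = b"S" in utf16_bytes

print(whole_string_found, single_letter_found)  # False True
```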

Hope this helps.

Solution 4

If you're looking for a programmatic solution, IsTextUnicode() might be an option.
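For what it's worth, here is a rough sketch of calling it from Python through ctypes. Treat it as an assumption-laden example: the wrapper name is_text_unicode is made up, it only does anything on Windows (it returns None elsewhere), and passing NULL for the third parameter asks the API to run all of its tests. Keep in mind that IsTextUnicode is a statistical heuristic aimed at UTF-16, so it can guess wrong on short inputs.

```python
import ctypes
import sys


def is_text_unicode(data: bytes):
    """Return True/False from the Win32 IsTextUnicode heuristic,
    or None when not running on Windows (hypothetical wrapper)."""
    if sys.platform != "win32":
        return None
    fn = ctypes.windll.advapi32.IsTextUnicode
    # BOOL IsTextUnicode(const VOID *lpv, int iSize, LPINT lpiResult)
    fn.argtypes = [ctypes.c_char_p, ctypes.c_int, ctypes.POINTER(ctypes.c_int)]
    fn.restype = ctypes.c_int
    # lpiResult = NULL: let the API apply every available test.
    return bool(fn(data, len(data), None))
```

You would feed it the first few kilobytes of a file read in binary mode, for example is_text_unicode(open(path, "rb").read(4096)).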

Author: HOCA

Updated on August 05, 2022

Comments

  • HOCA
    HOCA over 1 year

    I am working on a codebase that has some Unicode-encoded files scattered throughout, as a result of multiple team members developing with different editors (and default settings). I would like to clean up our codebase by finding all the Unicode-encoded files and converting them back to ANSI encoding.

    Any thoughts on how to accomplish the "finding" part of this task would be truly appreciated.