How can I see which encoding is used in a file?


Solution 1

You cannot really find out automatically whether a file was originally written with encoding X.

What you can easily do, though, is verify whether the complete file can be successfully decoded (though not necessarily correctly) using a specific codec. If you find any bytes that are not valid for a given encoding, the file must be something else.

The problem is that many codecs are similar and share the same "valid byte patterns", just interpreting them as different characters. For example, an ä in one encoding might correspond to an é in another or an ø in a third. The computer can't really detect which interpretation of the bytes results in correct, human-readable text (unless maybe you add a dictionary for all kinds of languages and let it perform spell checks...). You must also know that some character sets are actually subsets of others; for example, the ASCII encoding is part of the most commonly used codecs, like some of the ANSI family or UTF-8. That means, for example, that a text saved as UTF-8 which contains only simple Latin characters is byte-for-byte identical to the same file saved as ASCII.
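
You can see that subset relationship directly in Python, for instance (a minimal sketch; the sample string is just an illustration):

# ASCII-only text produces the same bytes whether you encode it as ASCII
# or as UTF-8, so no tool can tell which encoding was "intended".
text = "Just text and numb3rs and simple punctuation..."
assert text.encode("ascii") == text.encode("utf-8")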


However, let's get back from explaining what you can't do to what you actually can do:

For a basic check of ASCII vs. non-ASCII (normally UTF-8) text files, you can use the file command. It does not know many codecs, though, and it only examines the first few kB of a file, assuming that the rest will not contain any new characters. On the other hand, it also recognizes other common file types like various scripts, HTML/XML documents and many binary data formats (all of which is uninteresting when comparing text files), and it might print additional information, such as whether there are extremely long lines or which type of newline sequence (e.g. UNIX: LF, Windows: CR+LF) is used.

$ cat ascii.txt 
I am an ASCII file.
Just text and numb3rs and simple punctuation...

$ cat utf8.txt 
I am a Unicode file.
Special characters like Ω€®Ŧ¥↑ıØÞöäüß¡!

$ file ascii.txt utf8.txt 
ascii.txt: ASCII text
utf8.txt:  UTF-8 Unicode text

If that is not enough, I can offer you the Python script I wrote for this answer, which scans complete files and tries to decode them using a specified character set. If it succeeds, that encoding is a potential candidate. Otherwise, if there are any bytes that cannot be decoded with it, you can remove that character set from your list.
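
The linked script is not reproduced here, but the core idea is simple enough to sketch (a minimal version with hypothetical command-line arguments, not the actual script; note that some single-byte codecs such as latin-1 accept every possible byte sequence and therefore can never be ruled out this way):

#!/usr/bin/env python3
# Try to decode a whole file with each candidate codec.
# A clean decode only means the encoding is *possible*,
# not that it is the one the file was written with.
import sys

def possible_encodings(path, candidates):
    with open(path, "rb") as f:
        data = f.read()
    survivors = []
    for codec in candidates:
        try:
            data.decode(codec)
            survivors.append(codec)   # decodes cleanly: keep as candidate
        except (UnicodeDecodeError, LookupError):
            pass                      # invalid bytes (or unknown codec): rule it out
    return survivors

if __name__ == "__main__":
    # Example: python3 check_encoding.py utf8.txt utf-8 ascii cp1250
    print(possible_encodings(sys.argv[1], sys.argv[2:]))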

Solution 2

A program named file can do this. Example:

$ echo aaa >> FILE
$ file FILE
FILE: ASCII text, with CRLF, LF line terminators
$ echo öäü >> FILE
$ file FILE
FILE: UTF-8 Unicode text, with CRLF, LF line terminators

If you're interested in how it's done, see src/encoding.c.
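
In rough terms, and simplified a lot compared to the real C code, detecting UTF-8 means walking the multibyte structure of the file and checking that every sequence is well-formed. Here is a Python sketch of that kind of test (an illustration of the idea, not a translation of encoding.c):

def looks_like_utf8(data: bytes) -> bool:
    # Simplified check: validates lead/continuation byte patterns only;
    # it does not reject every overlong or surrogate sequence.
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:              # 0xxxxxxx: plain ASCII byte
            n = 0
        elif 0xC2 <= b <= 0xDF:   # 110xxxxx: start of 2-byte sequence
            n = 1
        elif 0xE0 <= b <= 0xEF:   # 1110xxxx: start of 3-byte sequence
            n = 2
        elif 0xF0 <= b <= 0xF4:   # 11110xxx: start of 4-byte sequence
            n = 3
        else:
            return False          # invalid lead byte
        for j in range(1, n + 1): # every following byte must be 10xxxxxx
            if i + j >= len(data) or not 0x80 <= data[i + j] <= 0xBF:
                return False
        i += n + 1
    return True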

Solution 3

If you're looking for an alternative to file, I really recommend detect-file-encoding-and-language!

The downside is that it requires some extra steps: you need Node.js and NPM installed in order to use it.

You can install Node.js and NPM like this:

$ sudo apt install nodejs npm

Then install detect-file-encoding-and-language:

$ npm install -g detect-file-encoding-and-language

Finally, detect the encoding like so:

$ dfeal "/home/user name/Documents/subtitle file.srt"

Comments

  • NonStandardModel
    NonStandardModel about 1 year

    I had some problems with subtitle files in the video player omxplayer. To solve it I had to convert from windows-1250 to UTF-8 encoding. My question is: how can I see which encoding is used for some specific file?

    • Rinzwind
      Rinzwind almost 7 years
      piconv to change the encoding ;)
    • NonStandardModel
      NonStandardModel almost 7 years
      Yes. I have already changed the encoding (in 1 file). But I have many of these and wanted to make a small script that would check all of them and then convert if needed. But I suppose I can just convert all of them. No harm will be done if some are already in UTF-8, right?
    • Rinzwind
      Rinzwind almost 7 years
      Not a problem no :) Just use a wildcard
  • hobbs
    hobbs almost 7 years
    It can guess, anyway.
  • Mark
    Mark almost 7 years
    file makes a guess, and frequently it isn't a very good one. For example, in my testing, it mis-identified both MacRoman and CP-1252 as ISO-8859, with the result that "š" and "ß" were scrambled.
  • kasperd
    kasperd almost 7 years
    ANSI is not really the name of any character encoding. Perhaps you are thinking of ANSI escape codes, which can be expressed with the ASCII character encoding.
  • user1937198
    user1937198 almost 7 years
    @kasperd Most likely he is referring to one of the ISO 8859 or Windows code page families. For North American Windows developers, "ANSI encoding" often means the Windows-1252 encoding, for historical reasons.
  • Byte Commander
    Byte Commander almost 7 years
    Yeah, well, ANSI is basically ASCII (codes 0-127) plus a locale-specific codepage (codes 128-255). So you're right...
  • Amirreza Nasiri
    Amirreza Nasiri almost 4 years
    Great 👌! I spent some hours trying to find the correct encoding of my old .sql file, and file showed me that it was actually a gzip-compressed file!
  • Seamus
    Seamus almost 3 years
    I like this answer because it's honest. I've just wasted too much time trying to sort out the encoding of what turned out to be encoded using "Unicode (UTF-16 Little Endian)". I never even knew such a thing existed!! The file utility was a big disappointment: its conclusion was charset=binary :) I wish you would write a replacement for file!!
  • Kaspacainoombro
    Kaspacainoombro over 2 years
    This does not work for a windows-1252 file.
  • Kaspacainoombro
    Kaspacainoombro over 2 years
    file -i does not work for windows-1252. It returns unknown-8bit