Determining ISO-8859-1 vs US-ASCII charset


Solution 1

If the file contains only 7-bit US-ASCII characters, it can be read as US-ASCII. That doesn't tell you anything about which charset was actually intended; it may simply be a coincidence that no characters requiring a different encoding were present.

ISO-8859-1 (and -15) is a common European encoding, able to encode äöåéü and other characters, with its first 128 characters being the same as in US-ASCII (as is often the case, for compatibility reasons).

However, you can't just pick an encoding and assume that "everything will be OK". The very common UTF-8 encoding also includes US-ASCII as a subset, but it will encode characters such as äöå as two bytes instead of ISO-8859-1's one byte.
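A quick way to see that difference is to encode the same string with both charsets and compare the raw bytes. This is a minimal sketch; the class and variable names are just for illustration:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class EncodingDemo {
        public static void main(String[] args) {
            String text = "ä";
            // ISO-8859-1 stores 'ä' as a single byte: 0xE4
            System.out.println(Arrays.toString(text.getBytes(StandardCharsets.ISO_8859_1)));
            // UTF-8 stores the same character as two bytes: 0xC3 0xA4
            System.out.println(Arrays.toString(text.getBytes(StandardCharsets.UTF_8)));
        }
    }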

TL;DR: Don't make assumptions about encodings. Find out what was intended and use that. If you can't find that out, observe the data to figure out which charset is correct to use (as you noted yourself, more than one encoding may appear to work, at least temporarily).
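If you do need to inspect the data yourself, one simple observation is whether any byte has its high bit set; if not, the file is valid US-ASCII (and therefore also valid ISO-8859-1 and UTF-8). A minimal sketch, assuming the file fits in memory; the class and helper name are hypothetical:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class CharsetSniffer {
        // Returns true when every byte is in the 7-bit range, i.e. the data
        // reads identically as US-ASCII, ISO-8859-1 or UTF-8.
        static boolean isPureAscii(Path file) throws IOException {
            for (byte b : Files.readAllBytes(file)) {
                if ((b & 0x80) != 0) {
                    return false; // high bit set: an 8-bit encoding is in use
                }
            }
            return true;
        }
    }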

Solution 2

It depends on which characters appear in the document. ASCII is a 7-bit charset, while ISO-8859-1 is an 8-bit charset that supports some additional characters. In most cases, if you are going to reproduce the document from an InputStream, I recommend the ISO-8859-1 charset. It will work for plain text files such as those produced by Notepad or MS Word.

If the document uses other international characters, you need to pick a charset that supports those particular characters, such as UTF-8.
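Whichever charset you settle on, name it explicitly when wrapping the stream rather than relying on the platform default. A minimal sketch using ISO-8859-1, with the file name taken from the question:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    public class ReadExample {
        public static void main(String[] args) throws IOException {
            // InputStreamReader converts bytes to characters using the given charset
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new FileInputStream("example.txt"),
                                          StandardCharsets.ISO_8859_1))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }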

Comments

  • vikingsteve almost 2 years

    I am trying to determine whether to use

    PrintWriter pw = new PrintWriter(outputFilename, "ISO-8859-1");
    

    or

    PrintWriter pw = new PrintWriter(outputFilename, "US-ASCII");
    

    I was reading All about character sets to determine the character set of an example file, which I must reproduce in the same encoding via Java code.

    When my example file contains "European" letters (Norwegian: å ø æ), then the following command tells me the file encoding is "iso-8859-1"

    file -bi example.txt
    

    However, when I take a copy of the same example file and modify it to contain different data, without any Norwegian text (let's say, I replace "Bjørn" with "Bjorn"), then the same command tells me the file encoding is "us-ascii".

    file -bi example-no-european-letters.txt
    

    What does this mean? Is ISO-8859-1 in practice the same as US-ASCII if there are no "European" characters in it?

    Should I just use the charset "ISO-8859-1" and everything will be OK?