Determining ISO-8859-1 vs US-ASCII charset
Solution 1
If the file contains only the 7-bit US-ASCII
characters it can be read as US-ASCII
. It doesn't tell anything about what was intended as the charset. It may be just a coincidence that there were no characters that would require a different coding.
ISO-8859-1 (and -15) is a common european encoding, able to encode äöåéü and other characters, the first 127 characters being the same as in US-ASCII (as often is, for convenience reasons).
However you can't just pick an encoding and assume that "everything will be OK". The very common UTF-8 encoding also contains the US-ASCII charset, but it will encode for example äöå
characters as two bytes instead of ISO-8859-1's one byte.
TL;DR: Don't assume things with encodings. Find out what was intended and use that. If you can't find it out, observe the data to try to figure out what is a correct charset to use (as you noted yourself, multiple encodings may work at least temporarily).
Solution 2
It depends on different types of characters we use in the respective document. ASCII is 7-bit charset and ISO-8859-1 is 8-bit charset which supports some additional characters. But, mostly, if you are going to reproduce the document from inputstream, I recommend the ISO-8859-1 charset. It will work for textfile like notepad and MS word.
If you are using some different international characters, we need to check the corresponding charset which supports that particular character like UTF-8..
vikingsteve
A Java enthusiast from Australia specializing in enterprise integration and cloud / big data technologies. Passionate aboute applying modern software architecture principles to produce software that is scalable, reliable and maintainable. My philosophy is to use technology to simplify things (not complicate), test early and often, use the right tools, explain clearly what your goals are and always ensure quality in what you deliver. Have you seen my plugins for IntelliJ IDEA? Search 'vikingsteve' in the Jetbrains Plugin Repository.
Updated on June 04, 2022Comments
-
vikingsteve almost 2 years
I am trying to determine whether to use
PrintWriter pw = new PrintWriter(outputFilename, "ISO-8859-1");
or
PrintWriter pw = new PrintWriter(outputFilename, "US-ASCII");
I was reading All about character sets to determine the character set of an example file which I must create in the same encoding via java code.
When my example file contains "European" letters (Norwegian: å ø æ), then the following command tells me the file encoding is "iso-8859-1"
file -bi example.txt
However, when I take a copy of the same example file and modify it to contain different data, without any Norwegian text (let's say, I replace "Bjørn" with "Bjorn"), then the same command tells me the file encoding is "us-ascii".
file -bi example-no-european-letters.txt
What does this mean? Is ISO-8859-1 in practise the same as US-ASCII if there are no "European" characters in it?
Should I just use a charset "ISO-8559-1" and everything will be ok?