Java compiler platform file encoding problem


Solution 1

There is no such thing as a String that is encoded as ISO-8859-1 in memory. Java Strings in memory are always Unicode strings. (Encoded in UTF-16 (as of 2011 – I think it changed with later Java versions), but you don't really need to know this.)

The encoding only comes into play when you input or output the string - then, given no explicit encoding, it uses the system default (which on some systems depends on user settings).
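For illustration, a minimal sketch (the string is made up) of what that boundary looks like: the in-memory String is the same, but the bytes it produces depend on the charset you pick - or on the system default if you pick none.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class EncodingAtTheBoundary {
        public static void main(String[] args) {
            String s = "Grüße";                                       // always Unicode in memory
            byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);  // 5 bytes
            byte[] utf8   = s.getBytes(StandardCharsets.UTF_8);       // 7 bytes
            byte[] dflt   = s.getBytes();                             // whatever the system default is
            System.out.println(Arrays.toString(latin1));
            System.out.println(Arrays.toString(utf8));
            System.out.println(Arrays.toString(dflt));
        }
    }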

As McDowell said, the actual encoding of your source file should match the encoding your compiler assumes for that file; otherwise you get the problems you observed. You can achieve this in several ways:

  • Use the -encoding option of the compiler, giving the encoding of your source file. (With ant, you set the encoding= parameter.)
  • Use your editor or any other tool (like recode) to change the encoding of your file to the compiler default.
  • Use native2ascii (with the right -encoding option) to translate your source file to ASCII with \uXXXX escapes.

In the last case, you can later compile this file anywhere with any default encoding, so this may be the way to go if you hand the source code to encoding-unaware people to compile somewhere.
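For example, a minimal sketch (file name, paths and literal are made up) tying the first and third options together:

    // Umlaute.java, saved as UTF-8.
    // Option 1:  javac -encoding UTF-8 Umlaute.java
    // Option 3:  native2ascii -encoding UTF-8 Umlaute.java ascii/Umlaute.java
    //            rewrites the literal below as "Gr\u00fc\u00dfe", so the converted
    //            file compiles correctly under any default encoding.
    public class Umlaute {
        public static void main(String[] args) {
            String greeting = "Grüße";  // non-ASCII literal: corrupted if javac
                                        // assumes the wrong source encoding
            System.out.println(greeting);
        }
    }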

If you have a bigger project consisting of more than one file, they should all have the same encoding, since the compiler has only one such switch, not several.

In all the projects I have had in recent years, I encode all my files in UTF-8, and in my ant buildfile I set the encoding="utf-8" parameter on the javac task. (My editor is smart enough to recognize the encoding automatically, but I set the default to UTF-8.)

The encoding matters to other source-code handling tools too, like javadoc. (There you should additionally set the -charset and -docencoding options for the output - they should match each other, but can differ from the source encoding.)

Solution 2

I'd hazard a guess that there is a transcoding issue during the compilation stage and the compiler lacks direction as to the encoding of a source file (e.g. see the javac -encoding switch).

Compilers generally use the system default encoding if you aren't specific, which can lead to string and char literals being corrupted (internally, Java bytecode uses a modified UTF-8 form, so the binaries themselves are portable). This is the only way I can imagine that problems are being introduced at compile time.

I've written a bit about this here.

Solution 3

I've had similar issues when using variable names that aren't ASCII (Σ, σ, Δ, etc.) in math formulas. On Linux, the compiler interpreted the source as UTF-8. On Windows it complained about invalid names because Windows defaults to ISO-LATIN-1. The solution was to specify the encoding in the ant script I used to compile these files.
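A minimal sketch (class name and formula are made up) of the kind of source involved - it compiles with javac -encoding UTF-8, but a compiler assuming a Latin-1 default may reject the identifier:

    // Discriminant.java, saved as UTF-8
    public class Discriminant {
        public static void main(String[] args) {
            double a = 1.0, b = 5.0, c = 6.0;
            double Δ = b * b - 4 * a * c;  // Greek identifier: legal Java, but the
                                           // compiler must decode the file correctly
            System.out.println("Δ = " + Δ);
        }
    }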

Solution 4

Always use escape codes (e.g. \uXXXX) in your source files and this will not be a problem. @Paulo mentioned this, but I wanted to call it out explicitly.
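For example, a minimal sketch (the literals are made up) where all non-ASCII content is written as escapes, so the source file itself contains only ASCII and compiles identically under any default encoding:

    public class EscapedLiterals {
        public static void main(String[] args) {
            // \u0394 is the Greek capital delta; \u00fc and \u00df are the umlaut-u and sharp-s
            String delta = "\u0394";
            String greeting = "Gr\u00fc\u00dfe";
            System.out.println(delta + " " + greeting);
        }
    }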

Author: Richard Brewster

Updated on June 14, 2022

Comments

  • Richard Brewster, almost 2 years

    Recently I encountered a file character encoding issue that I cannot remember ever having faced. It's quite common to have to be aware of the character encoding of text files and write code that handles encoding correctly when run on different platforms. But the problem I found was caused by compilation on a different platform from the execution platform. That was entirely unexpected, because in my experience when javac creates a class file, the important parameters are the java source and target params, and the version of the JDK doing the compile. In my case, classes compiled with JDK 1.6.0_22 on Mac OS X behaved differently than classes compiled with 1.6.0_23-b05 on Linux, when run on Mac OS X. The specified source and target were 1.4.

    A String that was encoded as ISO-8859-1 in memory was written to disk using a PrintStream println method. Depending on which platform the Java code was COMPILED on, the string was written differently. This led to a bug. The fix for the bug was to specify the file encoding explicitly when writing and reading the file (see the sketch after these comments).

    What surprised me was that the behavior differed depending on where the classes were compiled, not on which platform the class was run. I'm quite familiar with Java code behaving differently when run on different platforms. But it is a bit scary when the same code, compiled on different platforms, runs differently on the same platform.

    Has anyone encountered this specific problem? It would seem to bode ill for any Java code that reads and writes strings to file without explicitly specifying the character encoding. And how often is that done?

  • Paŭlo Ebermann, over 13 years
    Nice, I think usually people would write Sigma (or sum), sigma, delta and so on instead of using the right Greek letters. I once created a variable named ℕ. I wanted to call it ℕ₀, but javac did not accept this, since ₀ is not a digit for Java.
  • Richard Brewster, over 13 years
    This doesn't have to do with source encoding. No string literals are involved. A string is read from a network connection and then written to a file. What I meant by 'encoded in memory as ISO-8859-1' is that the input stream is read using that character set, because that is how it's encoded.
  • Richard Brewster, over 13 years
    "given no explicit encoding, it uses the system default" Yes, but the system default of the runtime VM, right? In this case the encoding was apparently determined by the compile platform. A PrintStream behaves differently, depending on the compile platform. This is not portable behavior. Do you see my point yet?
  • Paŭlo Ebermann, over 13 years
    I think we need a minimal example for your code. This looks like the two compilers on the two systems selected different methods.
  • KitsuneYMG, over 13 years
    @Paŭlo Ebermann The issue I had was that there were so many variables, and the equations were complex enough, that documentation was a PITA. Then I used the special characters, and the documentation/proof of correctness was "See: Skolnik, pp XXX-XXX". The fact that the variables were the same as in the text made it much easier for others to understand.
  • Richard Brewster, over 13 years
    Sorry, NDA prevents including source.
  • Paŭlo Ebermann, over 13 years
    This was why I said minimal example ... minimize your code until either the problem disappears (then you have found the culprit), or until there is nothing secret left (and still the problem).
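As a footnote to the question above, here is a minimal sketch of the kind of fix the author describes (the class name, the stream sources and the choice of ISO-8859-1 are illustrative): pinning the charset explicitly on both the reading and the writing side removes any dependence on platform or compiler defaults.

    import java.io.*;

    public class ExplicitCharsetCopy {
        // Copies text from 'in' to 'out' with the charset named explicitly on both
        // sides, so no default encoding is ever consulted.
        static void copy(InputStream in, OutputStream out) throws IOException {
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(in, "ISO-8859-1"));
            PrintStream writer = new PrintStream(out, true, "ISO-8859-1");
            String line;
            while ((line = reader.readLine()) != null) {
                writer.println(line);
            }
            writer.flush();
        }
    }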