Should source code be saved in UTF-8 format?


Solution 1

What is your goal? Balance your needs against the pros and cons of this choice.

UTF-8 Pros

  • allows use of all character literals without \uHHHH escaping

UTF-8 Cons

  • using non-ASCII character literals without \uHHHH increases risk of character corruption
    • font and keyboard issues can arise
    • need to document and enforce use of UTF-8 in all tools (editors, compilers, build scripts, diff tools)
  • beware the byte order mark
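To make the escaping trade-off concrete, here is a minimal sketch (class and variable names are arbitrary) showing that a \uHHHH escape and the raw character produce the same string at runtime; the escape just keeps the source file pure ASCII, at the cost of readability:

```java
public class EscapeDemo {
    public static void main(String[] args) {
        // The literal form is readable, but the file must be decoded with
        // the right encoding by every tool that touches it.
        String literal = "é";
        // The escaped form is pure ASCII and survives any encoding mix-up,
        // but you can no longer read the text at a glance.
        String escaped = "\u00E9";
        // Identical strings when the file was decoded correctly.
        System.out.println(literal.equals(escaped));
    }
}
```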

ASCII Pros

  • character/byte mappings are shared by a wide range of encodings
    • makes source files very portable
    • often obviates the need for specifying encoding meta-data (since the files would be identical if they were re-encoded as UTF-8, Windows-1252, ISO 8859-1 and most things short of UTF-16 and/or EBCDIC)

ASCII Cons

  • limited character set
  • this isn't the 1960s

Note: ASCII is 7-bit, not "extended" and not to be confused with Windows-1252, ISO 8859-1, or anything else.
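The portability point above can be checked directly. This sketch (names are arbitrary; "windows-1252" is looked up by name since it is not in StandardCharsets) encodes a pure-ASCII snippet under several charsets and shows the bytes come out identical, while a single accented character makes them diverge:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiPortability {
    public static void main(String[] args) {
        String ascii = "int x = 42;";
        byte[] utf8   = ascii.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = ascii.getBytes(StandardCharsets.ISO_8859_1);
        byte[] cp1252 = ascii.getBytes(Charset.forName("windows-1252"));
        // Pure ASCII: byte-for-byte identical in all three encodings.
        System.out.println(Arrays.equals(utf8, latin1) && Arrays.equals(utf8, cp1252));

        // One non-ASCII character breaks the equivalence: é is a single
        // byte (0xE9) in ISO-8859-1 but two bytes (0xC3 0xA9) in UTF-8.
        String accented = "caf\u00E9";
        System.out.println(accented.getBytes(StandardCharsets.ISO_8859_1).length);
        System.out.println(accented.getBytes(StandardCharsets.UTF_8).length);
    }
}
```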

Solution 2

What is important, at the least, is that you are consistent with the encoding used, to avoid red herrings. So not X here, Y there and Z elsewhere. Save source code in encoding X. Set code input to encoding X. Set code output to encoding X. Set character-based FTP transfer to encoding X. Et cetera.

Nowadays UTF-8 is a good choice, as it covers every character the human world is aware of and is supported pretty much everywhere. So, yes, I would set the workspace encoding to it as well. I use it myself, too.
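A minimal sketch of "encoding X everywhere" in the code itself (the file name is arbitrary; Files.readString/writeString require Java 11+): name the charset explicitly at every read and write, rather than relying on the platform default:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ConsistentEncoding {
    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("demo", ".txt");
        // Output in encoding X (here UTF-8), stated explicitly.
        Files.writeString(p, "caf\u00E9", StandardCharsets.UTF_8);
        // Input in the same encoding X, stated explicitly.
        String back = Files.readString(p, StandardCharsets.UTF_8);
        // Round-trips intact because both ends agree on the encoding.
        System.out.println(back.equals("caf\u00E9"));
        Files.delete(p);
    }
}
```

The overload without a charset argument falls back to a default (UTF-8 since Java 18, the platform default before that), which is exactly the inconsistency this answer warns against.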

Solution 3

Eclipse's default setting of using the platform default encoding is a poor decision IMHO. I found it necessary to change the default to UTF-8 shortly after installing it because some of my existing source files used it (probably from snippets copied/pasted from web pages.)

The Java Language and API specs require UTF-8 support so you're definitely okay as far as the standard tools go, and it's a long time since I've seen a decent editor that did not support UTF-8.
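For reference, the guaranteed charsets are exposed as constants in java.nio.charset.StandardCharsets (since Java 7), whose documentation states that every implementation of the Java platform must support them; UTF-8 is among them, so it is always available regardless of platform:

```java
import java.nio.charset.StandardCharsets;

public class GuaranteedCharsets {
    public static void main(String[] args) {
        // Charsets required on every Java implementation.
        System.out.println(StandardCharsets.US_ASCII);
        System.out.println(StandardCharsets.ISO_8859_1);
        System.out.println(StandardCharsets.UTF_8);
        System.out.println(StandardCharsets.UTF_16);
    }
}
```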

Even in projects that use JNI, your C sources will normally be in US-ASCII which is a subset of UTF-8 so having both open in the same IDE will not be a problem.

Solution 4

Yes, unless your compiler/interpreter is not able to work with UTF-8 files, it is definitely the way to go.

Solution 5

I don't think there's really a straight yes or no answer to this question. I would say that the following guidelines should be used to pick an encoding format, in order of priority listed (highest to lowest):

1) Pick an encoding your tool chain supports. This is a lot easier than it used to be. Even in recent memory a lot of compilers and languages basically only supported ASCII, which more or less forced developers into coding in Western European languages. These days, many of the newer languages support other encodings, and almost all decent editors and IDEs support a tremendously long list of encodings. Still... there are just enough holdouts that you need to double check before you settle on an encoding.

2) Pick an encoding that supports as many of the alphabets you wish to use as possible. I place this as a secondary priority because frankly, if your tools don't support it it doesn't really matter whether you like the encoding better or not.

UTF-8 is an excellent choice in many of today's circumstances. It's an ugly, inelegant format, but it solves a whole host of problems (namely dealing with legacy code) that break other encodings, and it seems to be becoming more and more the de facto standard of character encodings. It supports every major alphabet, darn near every editor on the planet supports it now, and a whole host of languages/compilers support it, too. But as I mentioned above, there are just enough legacy holdouts that you need to double check your tool chain from end to end before you settle on it definitively.

Author: JARC

Updated on February 02, 2020

Comments

  • JARC
    JARC about 4 years

    How important is it to save your source code in UTF-8 format?

    Eclipse on Windows uses the CP1252 character encoding by default. The CP1252 format means non-UTF-8 characters can be saved, and I have seen this happen when copying and pasting from a Word document into a comment.

    The reason I ask is because, out of habit, I set up the Maven encoding to be UTF-8, and recently it has caught a few unmappable-character errors.

    (update) Please add your reasons for doing so: are there some common gotchas that should be known?

    (update) What is your goal? To find the best practice, so that when asked why we should use UTF-8 I have a good answer; right now I don't.

  • BalusC
    BalusC about 14 years
    ...which in javac can be controlled with the -encoding argument, by the way. Good point though, +1.
  • JARC
    JARC about 14 years
    What herrings? If source is built on Windows and executed on *nix, would that be a good reason to define your encoding?
  • JARC
    JARC about 14 years
    What is your goal? To find the best practice, so that when asked why we should use UTF-8 I have a good answer - thanks for the post.
  • JARC
    JARC about 14 years
    I assume these are rare but very possible.
  • BalusC
    BalusC about 14 years
    For example, yes: the default encoding differs between the two platforms. This does not affect the technical functionality of Java code in any way, however (Java literals/keywords are already part of ASCII, which is basically the base of all other encodings, except for EBCDIC, but that's a different story), but it may result in erroneous input/output.
  • penpen
    penpen about 14 years
    No, Java identifiers are not necessarily ASCII-only characters. This is a valid int declaration (at least javac and Eclipse accept it): int é\u1212;
  • BalusC
    BalusC about 14 years
    @penpen: I was talking about literals/keywords like public, class, null, etc, not about identifiers.
  • Cowan
    Cowan about 14 years
    Strongly disagree with the "ugly, inelegant format" part. UTF-8 is pretty much a masterpiece as far as I'm concerned: backwards-compatible, more space-efficient than most people think (yes, even for Asian languages), can be picked up mid-stream, easily identifiable in most cases, doesn't require a BOM, binary-sortable...
  • penpen
    penpen about 14 years
    Sorry, I should have taken my time before commenting.
  • Russell Newquist
    Russell Newquist about 14 years
    Don't misunderstand me - given the constraints under which they were working, I'm quite impressed with the format. But the honest reality is that if we were starting from scratch today, we'd just be using a straight 32 or 64-bit character set, end of story. Pure elegance in its simplest form.
  • Mihai Nita
    Mihai Nita about 14 years
    There is only one good reason to store sources as UTF-8: if you comment in a language that needs non-ASCII characters. For UI/messages the strings should be stored in some kind of resource files/message catalogs. Good internationalization practice.
  • AgilePro
    AgilePro about 11 years
    You really should NOT pick any encoding other than UTF-8 or ASCII. UTF-8 supports all the Java characters (that is important). ASCII does not, but is portable everywhere. Any other choice for encoding is likely to be a problem somewhere along the line.
  • Admin
    Admin over 9 years
    UTF-8 does not use a byte order mark. While it can use multiple bytes to represent a single Unicode code point, its code unit is a single byte, so byte order is irrelevant. UTF-16 uses two bytes per code unit (or four with a surrogate pair), so byte order matters there. Think of it this way: UTF-8 "consumes" one byte at a time from an input stream, possibly consuming multiple bytes in succession to put together a code point, while UTF-16 consumes two bytes at a time, so the order matters.
  • Powerlord
    Powerlord about 9 years
    @Snowman While it's true that UTF-8 doesn't need a byte order mark, it still has one: the byte sequence EF BB BF, the UTF-8 encoding of U+FEFF (yes, the byte order mark for UTF-8 is longer than the byte order marks for UTF-16 despite being a no-op). All it does is mark that a file is UTF-8 and not plain ASCII.
  • diynevala
    diynevala almost 9 years
    Good point regarding the 1960s. There was nothing wrong with the 1960s, except that computing kind of sucked.
  • Adam Kurkiewicz
    Adam Kurkiewicz over 6 years
    What about users trying to compile their old source files with special characters in them? Eclipse's decision seems to be directly linked to the behaviour of javac, which by default uses the platform's default encoding.