Guessing the encoding of text represented as byte[] in Java
Solution 1
The following method solves the problem using juniversalchardet, which is a Java port of Mozilla's encoding detection library.
public static String guessEncoding(byte[] bytes) {
    // Fallback used when the detector cannot make a decision.
    final String DEFAULT_ENCODING = "UTF-8";
    org.mozilla.universalchardet.UniversalDetector detector =
        new org.mozilla.universalchardet.UniversalDetector(null);
    // Feed the whole array to the detector, then signal end of input.
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    // getDetectedCharset() returns null if no encoding was recognized.
    String encoding = detector.getDetectedCharset();
    detector.reset();
    if (encoding == null) {
        encoding = DEFAULT_ENCODING;
    }
    return encoding;
}
The code above has been tested and works as intended. Simply add juniversalchardet-1.0.3.jar to the classpath.
I've tested both juniversalchardet and jchardet. My general impression is that juniversalchardet provides the better detection accuracy and the nicer API of the two libraries.
Solution 2
There is also Apache Tika - a content analysis toolkit. It can guess the mime type, and it can guess the encoding. Usually the guess is correct with a very high probability.
Solution 3
Here's my favorite: https://github.com/codehaus/guessencoding
It works like this:
- If there's a UTF-8 or UTF-16 BOM, return that encoding.
- If none of the bytes have the high-order bit set, return ASCII (or you can force it to return a default 8-bit encoding instead).
- If there are bytes with the high bit set but they're arranged in the correct patterns for UTF-8, return UTF-8.
- Otherwise, return the platform default encoding (e.g., windows-1252 on an English-locale Windows system).
It may sound overly simplistic, but in my day-to-day work it's well over 90% accurate.
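The steps above can be sketched in plain Java roughly as follows. This is not the guessencoding library's actual code, just a minimal illustration using only the JDK. The class name SimpleCharsetGuesser and the method guess are made up for this example, and the fallback charset is passed in explicitly instead of being read from the platform default. UTF-8 validity is checked with a strict CharsetDecoder, which is one way to test for "the correct patterns".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class SimpleCharsetGuesser {

    /** Guesses a charset using only BOM, ASCII, and UTF-8 validity checks. */
    static Charset guess(byte[] bytes, Charset fallback) {
        // Step 1: a byte order mark identifies UTF-8 or UTF-16 directly.
        if (startsWith(bytes, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(bytes, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(bytes, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // Step 2: if no byte has the high-order bit set, the data is ASCII.
        boolean highBitSeen = false;
        for (byte b : bytes) {
            if (b < 0) { // signed byte < 0 means the high bit is set
                highBitSeen = true;
                break;
            }
        }
        if (!highBitSeen) return StandardCharsets.US_ASCII;

        // Step 3: a strict decoder throws on any malformed UTF-8 sequence.
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Step 4: not valid UTF-8, so fall back to the caller's default.
            return fallback;
        }
    }

    private static boolean startsWith(byte[] bytes, int... prefix) {
        if (bytes.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((bytes[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

To mimic the last step as described, pass Charset.defaultCharset() as the fallback. Note that on recent JDKs the default charset may itself be UTF-8, so an explicit 8-bit fallback such as ISO-8859-1 or windows-1252 may be the more useful choice.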
Solution 4
Chi's answer seems most promising for real use. I just want to add that, according to Joel Spolsky, Internet Explorer used a frequency-based guessing algorithm in its day:
http://www.joelonsoftware.com/articles/Unicode.html
Roughly speaking, all the assumed-to-be-text is copied and parsed in every encoding imaginable. Whichever parse best fits a language's average word (and letter?) frequency profile wins. I cannot quickly tell whether jchardet uses the same kind of approach, so I thought I'd mention this just in case.
knorv
Updated on October 22, 2020
Comments
-
knorv over 3 years
Given an array of bytes representing text in some unknown encoding (usually UTF-8 or ISO-8859-1, but not necessarily so), what is the best way to obtain a guess for the most likely encoding used (in Java)?
Worth noting:
- No additional meta-data is available. The byte array is literally the only available input.
- The detection algorithm will obviously not be 100 % correct. If the algorithm is correct in more than say 80 % of the cases that is good enough.
-
knorv over 14 years
What about the cases where it is not UTF-8?
-
knorv over 14 years
I kind of know how to use Google, but the question specifically asks for "what is the best way [..]". So which is best, icu4j, jchardet or some other library?
-
knorv over 14 years
Please elaborate - why do you consider jchardet to be the best library around?
-
coding_idiot about 11 years
My project requirement is that if the data is not in UTF-8 (after detection), it should be converted to UTF-8. How can I do this?
-
coding_idiot about 11 years
@chi How do I convert to UTF-8 if the encoding is not UTF-8?
-
Brett Okken almost 10 years
@coding_idiot Use the "guessed" encoding to convert the bytes to a String, then get the UTF-8 bytes:
new String(bytes, guessedEncoding).getBytes("utf-8")
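A slightly fuller sketch of that one-liner, using only the JDK. The class name Utf8Transcoder and the method toUtf8 are made up for illustration. It assumes the detector returns a charset name that Charset.forName can resolve on the running JVM; if it cannot, an unchecked exception is thrown. Using Charset objects instead of charset name strings also avoids the checked UnsupportedEncodingException.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class Utf8Transcoder {

    /** Re-encodes bytes from the guessed encoding to UTF-8. */
    static byte[] toUtf8(byte[] bytes, String guessedEncoding) {
        // Throws an unchecked exception if the name is illegal or unsupported.
        Charset source = Charset.forName(guessedEncoding);
        if (StandardCharsets.UTF_8.equals(source)) {
            return bytes; // already UTF-8, nothing to convert
        }
        // Decode with the guessed charset, then encode as UTF-8.
        return new String(bytes, source).getBytes(StandardCharsets.UTF_8);
    }
}
```

If the guess was wrong, the conversion still succeeds but produces wrong characters, so the result is only as good as the detection step.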
-
james.garriss over 8 years
If it's not UTF-8, blindly calling it Latin-1 isn't a good idea. It would be better to use ICU, jchardet, or one of the other tools listed on this page to make an intelligent guess.
-
Sxilderik over 6 years
Not very happy with this. See github.com/albfernandez/juniversalchardet/issues/22
-
Aleksandr Erokhin over 5 years
juniversalchardet is also available in Maven. groupId: com.googlecode.juniversalchardet, artifactId: juniversalchardet.