Convert Windows-1252 to UTF-16 in Java
Solution 1
You could try using java.nio.charset.Charset:
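final Charset windowsCharset = Charset.forName("windows-1252");
final Charset utfCharset = Charset.forName("UTF-16");
// Byte 0x91 is the left single quotation mark (U+2018) in Windows-1252.
final CharBuffer windowsEncoded = windowsCharset.decode(ByteBuffer.wrap(new byte[] {(byte) 0x91}));
final ByteBuffer utfBuffer = utfCharset.encode(windowsEncoded);
// Copy only the encoded bytes; array() may return a larger backing array.
final byte[] utfEncoded = new byte[utfBuffer.remaining()];
utfBuffer.get(utfEncoded);
System.out.println(new String(utfEncoded, utfCharset));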
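To see what the conversion above produces at the byte level, here is a minimal sketch. The class name `Utf16BytesDemo` and the helpers `convert` and `toHex` are made up for illustration and are not part of the answer. It assumes the JDK's usual behavior that the plain "UTF-16" charset writes a big-endian byte-order mark when encoding, while UTF-16BE and UTF-16LE write none.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16BytesDemo {
    // Hypothetical helper: decode Windows-1252 bytes, then encode them with the target charset.
    static byte[] convert(byte[] cp1252Bytes, Charset target) {
        CharBuffer chars = Charset.forName("windows-1252").decode(ByteBuffer.wrap(cp1252Bytes));
        ByteBuffer encoded = target.encode(chars);
        byte[] out = new byte[encoded.remaining()]; // copy only the bytes actually written
        encoded.get(out);
        return out;
    }

    // Hypothetical helper: render bytes as space-separated uppercase hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] leftQuote = {(byte) 0x91}; // U+2018 in Windows-1252
        System.out.println(toHex(convert(leftQuote, StandardCharsets.UTF_16)));   // FE FF 20 18
        System.out.println(toHex(convert(leftQuote, StandardCharsets.UTF_16BE))); // 20 18
        System.out.println(toHex(convert(leftQuote, StandardCharsets.UTF_16LE))); // 18 20
    }
}
```

If the receiving side does not expect a byte-order mark, choosing UTF-16BE or UTF-16LE explicitly may be the safer option.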
Solution 2
Use the following steps:
- Create an InputStreamReader using the source file's encoding (Windows-1252).
- Create an OutputStreamWriter using the destination file's encoding (UTF-16).
- Copy the information read from the reader to the writer. You can use BufferedReader and BufferedWriter to write contents line-by-line.
So your code may look like this:
public void reencode(InputStream source, OutputStream dest,
        String sourceEncoding, String destEncoding)
        throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
    BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(dest, destEncoding));
    String in;
    while ((in = reader.readLine()) != null) {
        writer.write(in);
        writer.newLine(); // writes the platform line separator
    }
    // Push buffered characters through to dest; the caller still owns and closes the streams.
    writer.flush();
}
This, of course, excludes try/catch stuff and delegates it to the caller.
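For comparison, here is a sketch of the same reader-to-writer copy with the resource handling included. It is a variation rather than the code above: it copies character blocks instead of lines, so the original line terminators survive, and it closes both streams itself via try-with-resources. The class name `ReencodeDemo` and the in-memory wrapper `reencodeBytes` are hypothetical and exist only to make the example runnable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReencodeDemo {
    // Copies characters from source to dest, re-encoding on the way.
    // Unlike the version above, this closes both streams when it finishes.
    static void reencode(InputStream source, OutputStream dest,
                         Charset sourceCharset, Charset destCharset) throws IOException {
        try (Reader reader = new InputStreamReader(source, sourceCharset);
             Writer writer = new OutputStreamWriter(dest, destCharset)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        } // closing the writer flushes any pending bytes
    }

    // Hypothetical in-memory wrapper so the example runs without files.
    static byte[] reencodeBytes(byte[] input, Charset sourceCharset, Charset destCharset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            reencode(new ByteArrayInputStream(input), out, sourceCharset, destCharset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Windows-1252 bytes for: left double quote, "hi", right double quote, CRLF
        byte[] cp1252 = {(byte) 0x93, 'h', 'i', (byte) 0x94, '\r', '\n'};
        byte[] utf16 = reencodeBytes(cp1252, Charset.forName("windows-1252"), StandardCharsets.UTF_16);
        String text = new String(utf16, StandardCharsets.UTF_16);
        System.out.println(text.equals("\u201Chi\u201D\r\n")); // true
    }
}
```

With real files, the two streams would typically be a FileInputStream and a FileOutputStream passed to `reencode` in the same way.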
If you're just trying to get the contents as a string of sorts, you can replace the writer with StringWriter and return its toString value. Then you don't need a destination stream or encoding, just a place to dump characters:
public String decode(InputStream source, String sourceEncoding)
        throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
    StringWriter writer = new StringWriter();
    String in;
    while ((in = reader.readLine()) != null) {
        writer.write(in);
        writer.write('\n'); // Java newline should be fine, test this just in case
    }
    return writer.toString();
}
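Note that the line-by-line version normalizes every line ending to '\n' and appends one after the last line. If the text should come through unchanged, one possible alternative is to decode all bytes at once. This is a sketch, assuming Java 9 or later for InputStream.readAllBytes; `DecodeDemo` and `decodeAll` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class DecodeDemo {
    // Hypothetical helper: read the whole stream and decode it in one step.
    // Line terminators are kept exactly as they appear in the input.
    static String decodeAll(InputStream source, Charset sourceCharset) {
        try {
            return new String(source.readAllBytes(), sourceCharset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Windows-1252 bytes for an ellipsis (0x85) followed by CRLF
        byte[] cp1252 = {(byte) 0x85, '\r', '\n'};
        String text = decodeAll(new ByteArrayInputStream(cp1252), Charset.forName("windows-1252"));
        System.out.println(text.equals("\u2026\r\n")); // true
    }
}
```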
Solution 3
What seems to work so far for everything I've tested is:
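private String replaceWordChars(String text_in) {
    String s = text_in;
    final Charset windowsCharset = Charset.forName("windows-1252");
    final Charset utfCharset = Charset.forName("UTF-16");
    // Note: getBytes() with no argument uses the platform default charset,
    // so the result depends on how the JVM is configured.
    byte[] incomingBytes = s.getBytes();
    final CharBuffer windowsEncoded =
            windowsCharset.decode(ByteBuffer.wrap(incomingBytes));
    // Note: array() returns the whole backing array, which may be longer than the encoded data.
    final byte[] utfEncoded = utfCharset.encode(windowsEncoded).array();
    // Note: new String(byte[]) also decodes with the platform default charset.
    s = new String(utfEncoded);
    return s;
}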
idonaldson
Updated on June 17, 2022
Comments
-
idonaldson almost 2 years
I am trying to convert all Windows special characters to their Unicode equivalent. We have a Flex application, where a user saves some Rich Text, and then it is emailed through a Java Emailer to their recipient. However, we keep running into Word's special characters that just show up in the email as a ?.
So far I've tried
private String replaceWordChars(String text_in) {
    String s = text_in;
    // smart single quotes and apostrophe
    s = s.replaceAll("[\\u2018|\\u2019|\\u201A]", "\'");
    // smart double quotes
    s = s.replaceAll("[\\u201C|\\u201D|\\u201E]", "\"");
    // ellipsis
    s = s.replaceAll("\\u2026", "...");
    // dashes
    s = s.replaceAll("[\\u2013|\\u2014]", "-");
    // circumflex
    s = s.replaceAll("\\u02C6", "^");
    // open angle bracket
    s = s.replaceAll("\\u2039", "<");
    // close angle bracket
    s = s.replaceAll("\\u203A", ">");
    // spaces
    s = s.replaceAll("[\\u02DC|\\u00A0]", " ");
    return s;
}
Which works, but I don't want to hand-encode all Windows-1252 characters to their equivalent UTF-16 (assuming that's what the default Java character set is).
However, our users keep finding more characters from Microsoft Word that Java just can't handle. So I searched and searched, and found this example:
private String replaceWordChars(String text_in) {
    String s = text_in;
    try {
        byte[] b = s.getBytes("Cp1252");
        byte[] encoded = new String(b, "Cp1252").getBytes("UTF-16");
        s = new String(encoded, "UTF-16");
    } catch (UnsupportedEncodingException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return s;
}
But when I watch the encoding happen in the Eclipse debugger, nothing changes.
There has to be a simple solution to dealing with Microsoft's lovely encoding with Java.
Any thoughts?
-
Jon Skeet over 11 years
In the first case you're just replacing non-ASCII characters with ASCII characters. You're not changing the encoding at all. In the second piece of code you're really doing nothing except converting all characters which can't be handled by Cp1252 into "?".
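A small sketch of that second point, using a hypothetical `RoundTripDemo` class that repeats the same getBytes / new String steps with Charset objects instead of charset names: text that Cp1252 can represent comes back unchanged, and anything it cannot represent comes back as "?".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    // Same steps as the question's second attempt; a Java String is already
    // a sequence of UTF-16 code units, so there is nothing left to convert.
    static String roundTrip(String s) {
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] b = s.getBytes(cp1252); // unmappable characters become '?'
        byte[] encoded = new String(b, cp1252).getBytes(StandardCharsets.UTF_16);
        return new String(encoded, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String smartQuotes = "\u2018quoted\u2019"; // both characters exist in Windows-1252
        System.out.println(roundTrip(smartQuotes).equals(smartQuotes)); // true

        String cjk = "\u4E2D"; // not representable in Windows-1252
        System.out.println(roundTrip(cjk).equals("?")); // true
    }
}
```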
-
Shadow Man about 10 years
It sounds like he is reading "Cp1252" data with a reader set to use "UTF-8" encoding; that results in similar funny behavior wrt non-compatible characters (those whose "Cp1252" encoding differs from that of their "UTF-8" encoding).
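The mismatch described in this comment can be reproduced in a few lines. This is only an illustration of the general effect, not a diagnosis of the asker's setup; `MisdecodeDemo` and `decodeAs` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {
    // Hypothetical helper: decode raw bytes with an explicitly chosen charset.
    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // "It's" with a curly apostrophe; in Windows-1252 the apostrophe is the single byte 0x92.
        byte[] cp1252Bytes = "It\u2019s".getBytes(cp1252);

        // 0x92 on its own is not valid UTF-8, so the decoder substitutes U+FFFD.
        System.out.println(decodeAs(cp1252Bytes, StandardCharsets.UTF_8).equals("It\uFFFDs")); // true

        // Decoding with the charset the bytes were written in recovers the text.
        System.out.println(decodeAs(cp1252Bytes, cp1252).equals("It\u2019s")); // true
    }
}
```

The practical takeaway is to name the charset explicitly at every point where bytes become characters or the reverse.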
-
-
Brian over 11 years
Why the downvote? No code? Writing it now. Comment first, downvote later, please.
-
idonaldson over 11 years
Step 1 won't work. This is coming from a Flex RIA on the web. The user is more than likely going to type up their nice-looking email in Word, then copy-paste into our app and fire off the email. I will give the Streams a try and see what happens.
-
Brian over 11 years
I just recommended it. The code I'm writing is actually just using streams. All the better. I'll include it in my edit.
-
idonaldson over 11 years
I wasn't the downvote. Seems like that would work for a desktop application?
-
Brian over 11 years
Fair enough. Please see my update :) This should give you an idea of where to start.
-
Brian over 11 years
Also, if you're just trying to get the contents as a string of sorts, you can replace the writer with StringWriter and return its toString value. Then you don't need a destination stream or encoding, just a place to dump characters.