Convert Windows-1252 to UTF-16 in Java

java eclipse apache-flex

10,531

Solution 1

You could try using java.nio.charset.Charset:

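// Needs java.nio.ByteBuffer, java.nio.CharBuffer, and java.nio.charset.Charset
final Charset windowsCharset = Charset.forName("windows-1252");
final Charset utfCharset = Charset.forName("UTF-16");
// 0x91 is the left single quotation mark (U+2018) in Windows-1252
final CharBuffer decoded = windowsCharset.decode(ByteBuffer.wrap(new byte[] {(byte) 0x91}));
final ByteBuffer encoded = utfCharset.encode(decoded);
// Copy only the bytes actually written; array() may include unused trailing capacity
final byte[] utfEncoded = new byte[encoded.remaining()];
encoded.get(utfEncoded);
System.out.println(new String(utfEncoded, utfCharset));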
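
One thing worth keeping in mind with the snippet above: a Java String is already a sequence of UTF-16 code units, so decoding the Windows-1252 bytes is usually the only step needed, and encoding matters only when raw bytes have to leave the JVM. Here is a minimal sketch of that idea. The class and method names are made up for illustration, and it encodes with UTF-16BE because the plain "UTF-16" charset prepends a byte order mark when encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Cp1252ToUtf16 {
    // Hypothetical helper: re-encodes Windows-1252 bytes as UTF-16 (big-endian, no BOM).
    static byte[] toUtf16Be(byte[] cp1252Bytes) {
        // Decoding yields a String, which Java already stores as UTF-16 code units.
        String text = new String(cp1252Bytes, Charset.forName("windows-1252"));
        // Encoding is only needed when raw bytes must be handed to something else.
        return text.getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        byte[] out = toUtf16Be(new byte[] {(byte) 0x91});
        // 0x91 in Windows-1252 is U+2018, so this prints "20 18".
        System.out.printf("%02X %02X%n", out[0], out[1]);
    }
}
```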

Solution 2

Use the following steps:

Create an InputStreamReader using the source file's encoding (Windows-1252)
Create an OutputStreamWriter using the destination file's encoding (UTF-16)
Copy the information read from the reader to the writer. You can use BufferedReader and BufferedWriter to write contents line-by-line.

So your code may look like this:

public void reencode(InputStream source, OutputStream dest,
        String sourceEncoding, String destEncoding)
        throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
    BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(dest, destEncoding));
    String in;
    while ((in = reader.readLine()) != null) {
        writer.write(in);
        writer.newLine();
    }
    // Flush the buffered output; otherwise trailing data may never reach dest
    writer.flush();
}

This, of course, omits exception handling and closing the streams, leaving both to the caller.
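
If you'd rather have the method own the cleanup, a variation along these lines might work. It is only a sketch: it takes Charset objects instead of encoding names, closes both streams when done (which may or may not be what your caller expects), and copies fixed-size character chunks so the original line endings survive instead of being normalized by readLine/newLine. The byte-array overload is a hypothetical convenience for trying it out in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

final class Reencoder {
    // Copies characters in fixed-size chunks so original line endings are preserved.
    static void reencode(InputStream source, OutputStream dest,
            Charset sourceCharset, Charset destCharset) throws IOException {
        // try-with-resources flushes and closes the writer, and closes both streams.
        try (Reader reader = new InputStreamReader(source, sourceCharset);
             Writer writer = new OutputStreamWriter(dest, destCharset)) {
            char[] buffer = new char[8192];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, count);
            }
        }
    }

    // Hypothetical in-memory convenience wrapper, mainly for trying the method out.
    static byte[] reencode(byte[] input, Charset sourceCharset, Charset destCharset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            reencode(new ByteArrayInputStream(input), out, sourceCharset, destCharset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Note that passing the plain "UTF-16" charset as the destination writes a byte order mark first; UTF-16BE or UTF-16LE avoids that if the consumer doesn't want one.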

If you're just trying to get the contents as a string of sorts, you can replace the writer with StringWriter and return its toString value. Then you don't need a destination stream or encoding, just a place to dump characters:

public String decode(InputStream source, String sourceEncoding)
        throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
    StringWriter writer = new StringWriter();
    String in;
    while ((in = reader.readLine()) != null) {
        writer.write(in);
        writer.write('\n'); // Java newline should be fine, test this just in case
    }
    return writer.toString();
}

Solution 3

What seems to work so far for everything I've tested is:

private String replaceWordChars(String text_in) {
    String s = text_in;
    
    final Charset windowsCharset = Charset.forName("windows-1252");
    final Charset utfCharset     = Charset.forName("UTF-16");
    
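    // Note: getBytes() with no argument uses the platform default charset
    byte[] incomingBytes = s.getBytes();
    final CharBuffer windowsEncoded = 
        windowsCharset.decode(ByteBuffer.wrap(incomingBytes)); 
    
    // Copy only the encoded bytes; array() can include unused trailing capacity
    final ByteBuffer utfBuffer = utfCharset.encode(windowsEncoded);
    final byte[] utfEncoded = new byte[utfBuffer.remaining()];
    utfBuffer.get(utfEncoded);
    // Decode with the same charset that was used to encode
    s = new String(utfEncoded, utfCharset);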
    
    return s;
}
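
A caveat on the approach above: a no-argument getBytes() depends on the platform default charset, so the result can differ between machines. If the real problem is that Windows-1252 bytes were decoded as ISO-8859-1 somewhere upstream (an assumption on my part, which would show up as characters in the U+0080 to U+009F range), a sketch with explicit charsets could look like the following. The class and method names are hypothetical, and any character outside Latin-1 would be turned into "?" by this, so it only suits that specific situation.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Cp1252Repair {
    // Hypothetical repair step: assumes the text was Windows-1252 data
    // that was mistakenly decoded as ISO-8859-1.
    static String repair(String misdecoded) {
        // ISO-8859-1 maps chars U+0000..U+00FF straight back to the original bytes.
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Decode those bytes with the charset they were really written in.
        return new String(originalBytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // U+0091 and U+0093 are what bytes 0x91 and 0x93 become under ISO-8859-1.
        String repaired = repair("\u0091abc\u0093");
        System.out.println(repaired.equals("\u2018abc\u201C")); // prints true
    }
}
```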


idonaldson

Updated on June 17, 2022

Comments

idonaldson almost 2 years
I am trying to convert all Windows special characters to their Unicode equivalent. We have a Flex application, where a user saves some Rich Text, and then it is emailed through a Java Emailer to their recipient. However, we keep running into Word's special characters that just show up in the email as a ?.

So far I've tried
```
private String replaceWordChars(String text_in) {
    String s = text_in;

    // smart single quotes and apostrophe
    // (no '|' inside the brackets: in a character class it would match a literal pipe)
    s = s.replaceAll("[\\u2018\\u2019\\u201A]", "'");
    // smart double quotes
    s = s.replaceAll("[\\u201C\\u201D\\u201E]", "\"");
    // ellipsis
    s = s.replaceAll("\\u2026", "...");
    // dashes
    s = s.replaceAll("[\\u2013\\u2014]", "-");
    // circumflex
    s = s.replaceAll("\\u02C6", "^");
    // open angle bracket
    s = s.replaceAll("\\u2039", "<");
    // close angle bracket
    s = s.replaceAll("\\u203A", ">");
    // small tilde and non-breaking space
    s = s.replaceAll("[\\u02DC\\u00A0]", " ");

    return s;
}
```
This works, but I don't want to hand-map every Windows-1252 character to its UTF-16 equivalent (assuming UTF-16 is Java's default character set).

However, our users keep finding more characters from Microsoft Word that Java just can't handle. So I searched and searched, and found this example:
```
private String replaceWordChars(String text_in) {
    String s = text_in;
    try {
        byte[] b = s.getBytes("Cp1252");
        byte[] encoded = new String(b, "Cp1252").getBytes("UTF-16");
        s = new String(encoded, "UTF-16");


    } catch (UnsupportedEncodingException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    return s;
}
```
But when I watch the encoding happen in the Eclipse debugger, nothing changes.

There has to be a simple solution to dealing with Microsoft's lovely encoding with Java.

Any thoughts?
- Jon Skeet over 11 years
  
  In the first case you're just replacing non-ASCII characters with ASCII characters. You're not changing the encoding at all. In the second piece of code you're really doing nothing except converting all characters which can't be handled by Cp1252 into "?".
- Shadow Man about 10 years
  
  It sounds like he is reading "Cp1252" data with a reader set to use "UTF-8" encoding, that results in similar funny behavior wrt non-compatible characters (those whose "Cp1252" encoding differs from that of their "UTF-8" encoding).
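
Jon Skeet's point about the Cp1252 round trip can be checked with a small sketch like the one below (class and method names are made up for illustration). Characters that Windows-1252 can represent come back unchanged, which would explain why nothing appears to change in the debugger, while anything it cannot represent is replaced with "?".

```java
import java.nio.charset.Charset;

final class Cp1252RoundTrip {
    // Encodes the text as Windows-1252 bytes and decodes it straight back.
    static String roundTrip(String text) {
        Charset cp1252 = Charset.forName("windows-1252");
        return new String(text.getBytes(cp1252), cp1252);
    }

    public static void main(String[] args) {
        // Smart quotes exist in Windows-1252, so they survive the round trip.
        System.out.println(roundTrip("\u2018quoted\u2019").equals("\u2018quoted\u2019")); // prints true
        // U+4E2D has no Windows-1252 mapping, so it is replaced.
        System.out.println(roundTrip("\u4E2D")); // prints ?
    }
}
```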
Brian over 11 years

Why the downvote? No code? Writing it now. Comment first, downvote later, please.
idonaldson over 11 years

Step 1 won't work. This is coming from a Flex RIA on the web. The user is more than likely going to type up their nice looking email in word, then copy-paste into our app and fire off the email. I will give the Streams a try and see what happens.
Brian over 11 years

I just recommended it. The code I'm writing is actually just using streams. All the better. I'll include it in my edit.
idonaldson over 11 years

I wasn't the downvoter. Seems like that would work for a desktop application?
Brian over 11 years

Fair enough. Please see my update :) This should give you an idea of where to start.
Brian over 11 years

Also, if you're just trying to get the contents as a string of sorts, you can replace the writer with StringWriter and return its toString value. Then you don't need a destination stream or encoding, just a place to dump characters.