Java; Trying to convert a String which contains ISO-8859-1 encoding to UTF-8 but file is UTF-8

Solution 1

I finally got it to display the way I specified in the question. I was just using the wrong charset.

intento2 = new String(input.getBytes(Charset.forName("UTF-8")), Charset.forName("Windows-1252"));

This displayed it the way I needed it.
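Why this works: encoding the string as UTF-8 and then decoding those bytes as Windows-1252 turns each accented character into two characters, which is the garbled-looking form the question asks for. A minimal sketch, assuming Java 7 or later; the class and method names are illustrative only, and Unicode escapes are used so the result does not depend on the source file's encoding:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DoubleEncodeSketch {
    // Hypothetical helper: deliberately misreads UTF-8 bytes as Windows-1252.
    static String utf8ReadAsWindows1252(String input) {
        byte[] utf8 = input.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // U+00C9 (E with acute) is C3 89 in UTF-8.
        // Windows-1252 reads C3 as U+00C3 and 89 as U+2030 (per mille sign).
        String result = utf8ReadAsWindows1252("M\u00C9XICO");
        System.out.println(result.equals("M\u00C3\u2030XICO")); // prints true
    }
}
```

Plain ASCII characters such as the pipes and digits pass through unchanged, because UTF-8 and Windows-1252 agree on that range.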

Solution 2

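In simple words, if your text is really UTF-8 but was decoded as ISO-8859-1, you can recover it by getting the ISO-8859-1 bytes back and decoding them as UTF-8 (note that a Java String itself is UTF-16 internally, not UTF-8):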

 String response= new String(input.getBytes("ISO-8859-1"),"UTF-8");
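This line only helps when the string is already garbled, that is, when UTF-8 bytes were decoded as ISO-8859-1 somewhere upstream. A small sketch of that repair, with an illustrative helper name that is not part of any library:

```java
import java.nio.charset.StandardCharsets;

class MojibakeRepairSketch {
    // Hypothetical helper: recovers the original bytes, then decodes them as UTF-8.
    static String reinterpretLatin1AsUtf8(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00C3 U+0089 is what the UTF-8 bytes C3 89 look like when read as ISO-8859-1.
        String garbled = "M\u00C3\u0089XICO";
        String repaired = reinterpretLatin1AsUtf8(garbled);
        System.out.println(repaired.equals("M\u00C9XICO")); // prints true
    }
}
```

Applied to a string that was decoded correctly in the first place, the same call damages the text instead of repairing it.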

Solution 3

I think the fundamental problem here is your expectations.

If I understand you correctly, you expect to be able to change Á to à by changing character encodings. That cannot happen. Those are different characters; i.e. different code points - Á is Unicode codepoint 00C1 (or C1 in ISO-8859-1) and à is 00C3 / C3.

So when you transcode a Á in ISO-8859-1 to Unicode to UTF-8 you should get exactly the same character Á. If you don't then the translation would be broken.
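To make that concrete, here is a sketch of a correct transcoding from ISO-8859-1 bytes to UTF-8 bytes. The byte representation changes, but the character stays U+00C1. The class and method names are illustrative assumptions:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TranscodeSketch {
    // Decode with the encoding the bytes are actually in, then encode with the target one.
    static byte[] latin1ToUtf8(byte[] latin1Bytes) {
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xC1};            // A with acute in ISO-8859-1
        byte[] utf8 = latin1ToUtf8(latin1);       // two bytes in UTF-8: C3 81
        String decoded = new String(utf8, StandardCharsets.UTF_8);

        System.out.println(Arrays.equals(utf8, new byte[] {(byte) 0xC3, (byte) 0x81})); // prints true
        System.out.println(decoded.codePointAt(0) == 0x00C1);                            // prints true
    }
}
```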

You also expect MÉXICO to translate to MÉXICO ... which seems totally bizarre to me. Perhaps there's a problem in your transcription of the characters into the Question ...

Now if the lexicography rules for your language / region say that Á and à are actually equivalent, then it would be reasonable to "normalize" to a preferred form. However, it is not the role of character encoding / decoding to do such locale-related translations. You need to code it yourself ... or find some other library that does it.


Messing around at the byte level (encoding with one charset and decoding with a different one) is not going to "fix" this. If anything it is going to make things worse. Your messing around is generating byte sequences that can't be mapped to the target encoding scheme ... and hence the question marks.
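The question marks can be reproduced directly. In the sketch below (names are illustrative), a correctly decoded string is encoded as ISO-8859-1 and the bytes are then decoded as UTF-8. The lone byte C9 is not a valid UTF-8 sequence, so the String constructor substitutes the replacement character U+FFFD, which many consoles print as ?:

```java
import java.nio.charset.StandardCharsets;

class BrokenRoundTripSketch {
    // Hypothetical helper mirroring the mismatched encode/decode pair described above.
    static String latin1BytesReadAsUtf8(String input) {
        byte[] latin1 = input.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String result = latin1BytesReadAsUtf8("M\u00C9XICO");
        // The accented letter is gone; U+FFFD stands in for the undecodable byte.
        System.out.println(result.indexOf('\uFFFD') >= 0);   // prints true
        System.out.println(result.equals("M\u00C9XICO"));    // prints false
    }
}
```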

Author by

alexhg11

Updated on January 10, 2020

Comments

  • alexhg11
    alexhg11 over 4 years

    I don't know if this is going to make sense but this is what I make of it.

    I'm working with Eclipse using UTF-8 encoding for all my files. In one of them I need to convert a String from ISO-8859-1 to UTF-8. However, that String is built within the file itself (it doesn't come from input), which is why I believe my String starts out as UTF-8 and the conversion doesn't go the way I expected.

    The String's original content is:

    ||3.2|2013-01-25T17:24:00|ingreso|PAGO EN UNA SOLA EXHIBICION|6386.21|MXN|7408.00|No identificado|NAUCALPAN DE JUÁREZ, ESTADO DE MEXICO|CAOS640116HT5|OSCAR MARTIN CARRERA|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JUÁREZ|ESTADO DE MEXICO|MÉXICO|53100|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JUÁREZ|ESTADO DE MEXICO|MÉXICO|53100|Persona Física con Actividad Empresarial|BAÑ930616R66|BAÑOMOBIL, S.A. DE C.V.|Av. 1° de Mayo|197|San. Lorenzo|TLALNEPANTLA DE BAZ|ESTADO DE MEXICO|MÉXICO|54047|1|NO APLICA|Dominio .com|Dominio por 1 año www.sanitariosportatiles.com|586.21|586.21|1|NO APLICA|Hospedaje 2 Gb|Hospedaje 2 Gb por 1 año www.sanitariosportatiles.com|5800.00|5800.00|IVA|16.00|1021.79|1021.79||
    

    Its original encoding should be ISO-8859-1, and when I convert it to UTF-8 it should produce:

    ||3.2|2013-01-25T17:05:06|ingreso|PAGO EN UNA SOLA EXHIBICION|6386.21|MXN|7408.00|No identificado|NAUCALPAN DE JUÃREZ, ESTADO DE MEXICO|CAOS640116HT5|OSCAR MARTIN CARRERA|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JUÃREZ|ESTADO DE MEXICO|MÉXICO|53100|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JUÃREZ|ESTADO DE MEXICO|MÉXICO|53100|Persona Física con Actividad Empresarial|BAÑ930616R66|BAÑOMOBIL, S.A. DE C.V.|Av. 1° de Mayo|197|San. Lorenzo|TLALNEPANTLA DE BAZ|ESTADO DE MEXICO|MÉXICO|54047|1|NO APLICA|Dominio .com|Dominio por 1 año www.sanitariosportatiles.com|586.21|586.21|1|NO APLICA|Hospedaje 2 Gb|Hospedaje 2 Gb por 1 año www.sanitariosportatiles.com|5800.00|5800.00|IVA|16.00|1021.79|1021.79||
    

    That is what I need, and I'm not achieving it.

    This is what I have tried so far:

        String input = null;
        input = "||3.2|2013-01-25T17:24:00|ingreso|PAGO EN UNA SOLA EXHIBICION|6386.21|MXN|7408.00|No identificado|NAUCALPAN DE JUÁREZ, ESTADO DE MEXICO|CAOS640116HT5|OSCAR MARTIN CARRERA|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JUÁREZ|ESTADO DE MEXICO|MÉXICO|53100|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JUÁREZ|ESTADO DE MEXICO|MÉXICO|53100|Persona Física con Actividad Empresarial|BAÑ930616R66|BAÑOMOBIL, S.A. DE C.V.|Av. 1° de Mayo|197|San. Lorenzo|TLALNEPANTLA DE BAZ|ESTADO DE MEXICO|MÉXICO|54047|1|NO APLICA|Dominio .com|Dominio por 1 año www.sanitariosportatiles.com|586.21|586.21|1|NO APLICA|Hospedaje 2 Gb|Hospedaje 2 Gb por 1 año www.sanitariosportatiles.com|5800.00|5800.00|IVA|16.00|1021.79|1021.79||";
        String intento1 = null, intento2 = null, intento3 = null;
        try {
            intento1 = new String(input.getBytes("ISO-8859-1"),"UTF-8");
            intento2 = new String(intento1.getBytes(), "UTF-8");
            intento3 = new String(input.getBytes(),"UTF-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        System.out.println(intento1);
        System.out.println(intento2); 
        System.out.println(intento3);   
    

    Which prints:

    ||3.2|2013-01-25T17:24:00|ingreso|PAGO EN UNA SOLA EXHIBICION|6386.21|MXN|7408.00|No identificado|NAUCALPAN DE JU?REZ, ESTADO DE MEXICO|CAOS640116HT5|OSCAR MARTIN CARRERA|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JU?REZ|ESTADO DE MEXICO|M?XICO|53100|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JU?REZ|ESTADO DE MEXICO|M?XICO|53100|Persona F?sica con Actividad Empresarial|BA?930616R66|BA?OMOBIL, S.A. DE C.V.|Av. 1? de Mayo|197|San. Lorenzo|TLALNEPANTLA DE BAZ|ESTADO DE MEXICO|M?XICO|54047|1|NO APLICA|Dominio .com|Dominio por 1 a?o www.sanitariosportatiles.com|586.21|586.21|1|NO APLICA|Hospedaje 2 Gb|Hospedaje 2 Gb por 1 a?o www.sanitariosportatiles.com|5800.00|5800.00|IVA|16.00|1021.79|1021.79||
    ||3.2|2013-01-25T17:24:00|ingreso|PAGO EN UNA SOLA EXHIBICION|6386.21|MXN|7408.00|No identificado|NAUCALPAN DE JU?REZ, ESTADO DE MEXICO|CAOS640116HT5|OSCAR MARTIN CARRERA|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JU?REZ|ESTADO DE MEXICO|M?XICO|53100|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JU?REZ|ESTADO DE MEXICO|M?XICO|53100|Persona F?sica con Actividad Empresarial|BA?930616R66|BA?OMOBIL, S.A. DE C.V.|Av. 1? de Mayo|197|San. Lorenzo|TLALNEPANTLA DE BAZ|ESTADO DE MEXICO|M?XICO|54047|1|NO APLICA|Dominio .com|Dominio por 1 a?o www.sanitariosportatiles.com|586.21|586.21|1|NO APLICA|Hospedaje 2 Gb|Hospedaje 2 Gb por 1 a?o www.sanitariosportatiles.com|5800.00|5800.00|IVA|16.00|1021.79|1021.79||
    ||3.2|2013-01-25T17:24:00|ingreso|PAGO EN UNA SOLA EXHIBICION|6386.21|MXN|7408.00|No identificado|NAUCALPAN DE JU?REZ, ESTADO DE MEXICO|CAOS640116HT5|OSCAR MARTIN CARRERA|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JU?REZ|ESTADO DE MEXICO|M?XICO|53100|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JU?REZ|ESTADO DE MEXICO|M?XICO|53100|Persona F?sica con Actividad Empresarial|BA?930616R66|BA?OMOBIL, S.A. DE C.V.|Av. 1? de Mayo|197|San. Lorenzo|TLALNEPANTLA DE BAZ|ESTADO DE MEXICO|M?XICO|54047|1|NO APLICA|Dominio .com|Dominio por 1 a?o www.sanitariosportatiles.com|586.21|586.21|1|NO APLICA|Hospedaje 2 Gb|Hospedaje 2 Gb por 1 a?o www.sanitariosportatiles.com|5800.00|5800.00|IVA|16.00|1021.79|1021.79||
    

    Which is nowhere near what I want.

    EDIT 1: When I get the String from an input source, one of the conversions works fine, but I need it to work when the String is declared inside the file.

    EDIT 2: This is basically what I need, http://cryptosys.net/cgi-bin/manual.cgi?m=pki&name=CNV_UTF8FromLatin1 but in Java

  • Samuel Edwin Ward
    Samuel Edwin Ward about 11 years
    I don't believe the internal representation of String data in Java is UTF-8.
  • JimN
    JimN about 11 years
    Ergh. My bad. They are UTF-16. :) But yeah, they're not ISO-8859-1.
  • alexhg11
    alexhg11 about 11 years
    The thing is, when I get it from an input source (for example, if I paste it into the command line instead of declaring it inside the file), I do get the desired results. I know what I need to achieve seems strange, but it's a requirement from another service, which is why I need it to behave that way. I thought a charset encoding and decoding would do the trick, and in a way it does (I've only achieved it through an input source), but I can't manage to do it from within the file.
  • alexhg11
    alexhg11 about 11 years
    Is there a way to declare a String in a different encoding?
  • Stephen C
    Stephen C about 11 years
    @alexhg11 - I believe you are going to have to write Java code to translate individual characters as per your requirements. (But is that translation of MÉXICO for real? It looks bizarre ... in my browser)
  • JimN
    JimN about 11 years
    I'll repeat: Java strings are internally UTF-16. When constructing a Java string from character data, you may specify an encoding to use for converting to the internal UTF-16 encoding.
  • alexhg11
    alexhg11 about 11 years
    Apparently, it is all too bizarre to me as well, but that's how a WebService requires it to be, and that's what I'm having trouble with :/. How would I go about doing that? Do I have to do it at a byte level?
  • JimN
    JimN about 11 years
    You may want to put all of your localized strings in a separate data file (instead of pasting into your Java source file), and then load that data file at runtime, using the same encoding which was used to create/edit the file. Look at java.util.Properties for a convenient way to store key/value pairs in a file. When loading the file, use a Reader which allows you to specify an encoding to use. For example: new InputStreamReader(new FileInputStream(file), "ISO-8859-1"). For deployment purposes, it may be easier to load the data as a resource rather than a file.
  • Reza Hamzehei
    Reza Hamzehei almost 6 years
    Worked well ! upVote :)
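
JimN's suggestion in the comments above, keeping the localized strings in a data file and loading it with an explicit encoding, could look roughly like the following. This is a sketch under assumptions: the file name, the `country` key, and the class and method names are made up for illustration, and the demo writes its own temporary file so it can run on its own. It uses `Files.newInputStream` where the comment uses `FileInputStream`; either works with `InputStreamReader`.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

class EncodedPropertiesSketch {
    // Loads key/value pairs, decoding the file with the charset it was saved in.
    static Properties load(Path file, Charset charset) throws IOException {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(Files.newInputStream(file), charset)) {
            props.load(reader);
        }
        return props;
    }

    // Demo: writes an ISO-8859-1 file, reads it back, and returns the loaded value.
    static String demo() {
        try {
            Path file = Files.createTempFile("strings", ".properties");
            try {
                Files.write(file, "country=M\u00C9XICO\n".getBytes(StandardCharsets.ISO_8859_1));
                return load(file, StandardCharsets.ISO_8859_1).getProperty("country");
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("M\u00C9XICO".equals(demo())); // prints true
    }
}
```

The point of the design is that the charset is stated once, at the place where bytes become characters, so the result no longer depends on how the Java source file itself was saved.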