Handle French Characters in Java

Solution 1

This is an encoding problem, and the Ã© clearly identifies it as UTF-8 text interpreted as ISO-Latin-1 (or one of its cousins).

Ensure that your JSP page declares, at the top, that it uses UTF-8 encoding.
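
The diagnosis above can be reproduced in a few lines. This is only a minimal sketch: the class and method names (MojibakeDemo, misdecode) are made up for illustration, and the non-ASCII characters are written as Unicode escapes so the result does not depend on how the source file is saved.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes the text as UTF-8, then (wrongly) decodes those bytes as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "ABC Farmac\u00e9utica Corporation";
        String garbled = misdecode(original);

        // The single e-acute (U+00E9) turns into two characters, U+00C3 and U+00A9.
        System.out.println(garbled.length() - original.length()); // prints 1
        System.out.println(garbled.equals("ABC Farmac\u00c3\u00a9utica Corporation")); // prints true
    }
}
```

Booleans are printed rather than the strings themselves so that the console encoding cannot distort the result.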

Solution 2

You get "ABC FarmacÃ©utica Corporation" because the UTF-8 bytes sent by the client were decoded as ISO-8859-1. You need to turn the string back into those bytes and decode them as UTF-8 before you URL-decode it, like this:

    bbb = URLDecoder.decode(new String(bbb.getBytes("ISO-8859-1"), "UTF-8"), "UTF-8");

NOTE: some encodings cannot be converted to and from other encodings without risking data loss. For example, Thai characters (TIS-620) cannot be converted to an encoding such as ISO-8859-1 that has no code points for them; only conversion to a Unicode encoding such as UTF-8 preserves them. Likewise, a string that was already decoded with the wrong charset cannot always be repaired by re-encoding it. For this reason, avoid converting from one encoding to another unless it is really necessary (i.e. the data comes from an external, third-party, or proprietary source, etc.). This is only a way to convert from one encoding to another when you know the source encoding.
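
To make the caveat concrete, here is a hedged sketch of both outcomes: a repair that works, and one that silently loses data. RecodeDemo and repair are hypothetical names, and the Thai letter is just one example of a character that ISO-8859-1 cannot represent.

```java
import java.nio.charset.StandardCharsets;

class RecodeDemo {
    // Reverses "UTF-8 bytes decoded as ISO-8859-1" by undoing the wrong decode.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Works: every character of the garbled text exists in ISO-8859-1.
        String garbled = "Farmac\u00c3\u00a9utica";
        System.out.println(repair(garbled).equals("Farmac\u00e9utica")); // prints true

        // Fails: THAI CHARACTER KO KAI (U+0E01) has no ISO-8859-1 byte,
        // so getBytes substitutes '?' and the original character is gone.
        String thai = "\u0e01";
        System.out.println(repair(thai).equals("?")); // prints true
    }
}
```

The second case is why re-encoding a string is a workaround rather than a fix: once a character has been replaced, nothing downstream can recover it.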

Solution 3

I suspect the problem is with the character encoding of the page. Make sure the page you submit from and the one you display to use the same character set, and make sure that you set it explicitly. For instance, if your server runs on Linux the default encoding will typically be UTF-8, but if you view the page on Windows and no encoding is specified, the browser may assume ISO-8859-1. Also, when the submitted text arrives on the server side, the server will assume its default character set when building the string, whereas your user's browser might have used a different encoding if you didn't specify one.
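
The difference between relying on a default and naming the charset can be sketched in plain Java. The class name ExplicitCharsetDemo and the helper decodeUtf8 are illustrative only; in a servlet container the analogous step would be calling request.setCharacterEncoding("UTF-8") before reading any parameters, which is not shown here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Decodes bytes with a named charset instead of the platform default.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The two bytes a UTF-8 client would send for e-acute.
        byte[] submitted = {(byte) 0xC3, (byte) 0xA9};

        System.out.println("platform default: " + Charset.defaultCharset());

        // Result depends on the machine's default charset.
        String implicit = new String(submitted);
        // Result is the same on every machine.
        String explicit = decodeUtf8(submitted);

        System.out.println(explicit.equals("\u00e9")); // prints true
        System.out.println(implicit.equals(explicit)); // true only if the default is UTF-8
    }
}
```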

Solution 4

As I understand it, the text is hardcoded in controller code like this:

    ModelAndView mav = new ModelAndView("hello");
    mav.addObject("message", "ABC Farmacéutica Corporation");
    return mav;

I expect this would work:

    ModelAndView mav = new ModelAndView("hello");
    mav.addObject("message", "ABC Farmac\u00e9utica Corporation");
    return mav;

If so, the problem is due to a mismatch between the character encoding your Java editor is using and the encoding your compiler uses to read the source code.

For example, if your editor saves the Java file as UTF-8 and you compile on a system where UTF-8 is not the default encoding, then you would need to tell your compiler to use that encoding:

    javac -cp foo.jar -encoding UTF-8 Bar.java

Your build scripts and IDE settings need to be consistent when handling character data.
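
If you prefer to keep source files pure ASCII so that the compiler's encoding setting cannot matter, the escapes can be generated rather than typed. The following is a small sketch, not an existing utility: UnicodeEscaper and escapeNonAscii are made-up names.

```java
class UnicodeEscaper {
    // Replaces every non-ASCII UTF-16 code unit with a backslash-u escape
    // followed by four lowercase hex digits; ASCII characters pass through.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints the ASCII-only form that can be pasted into a Java string literal.
        System.out.println(escapeNonAscii("ABC Farmac\u00e9utica Corporation"));
    }
}
```

Because the loop works on UTF-16 code units, a character outside the Basic Multilingual Plane comes out as two escapes, which is also how Java source expects it.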

If your text editor saved your file as UTF-8 then, in a hex editor, é would be the byte sequence C3 A9; in many other encodings, it would have the value E9. ISO-8859-1 and windows-1252 would encode é as E9, and would display the bytes C3 A9 as Ã©. You can read about character encoding in Java source files here.
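
You can check these byte values without a hex editor. This is a minimal sketch with a hypothetical helper (ByteDump.hex); it assumes the runtime provides the windows-1252 charset, which standard JDK builds normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteDump {
    // Returns the bytes of the text in the given charset as space-separated hex pairs.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";
        System.out.println(hex(eAcute, StandardCharsets.UTF_8));          // prints C3 A9
        System.out.println(hex(eAcute, StandardCharsets.ISO_8859_1));     // prints E9
        System.out.println(hex(eAcute, Charset.forName("windows-1252"))); // prints E9
    }
}
```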

Author: Max

Updated on July 07, 2022

Comments

  • Max
    Max almost 2 years

    I have a page where I search for a term, and the results display perfectly, whatever the character type.

    Now I have a few checkboxes in a JSP, which I check and submit. One of these checkboxes has a name like ABC Farmacéutica Corporation.

    When I click the submit button, I call a function that sets all the parameters on a form and submits that form. (I tested this by putting in an alert to display the special character before the submit, and it displays correctly.)

    Now, coming to the Java end, I use the Spring Framework. When I print the term in the controller, it is displayed like ABC FarmacÃ©utica Corporation.

    Please help... Thanks in advance.

    EDIT :

    Please try this sample Example

    import java.net.*;
    class sample{
        public static void main(String[] args){
            try{
                String aaa = "ABC Farmacéutica Corporation";
                String bbb = "ABC FarmacÃ©utica Corporation";
    
                aaa = URLEncoder.encode(aaa, "UTF-8");
                bbb = URLDecoder.decode(bbb, "UTF-8");
    
                System.out.println("aaa   "+aaa);
                System.out.println("bbb   "+bbb);
    
            }catch(Exception e){
                System.out.println(e);      
            }
        }
    }
    

    I am getting output as,

    aaa   ABC+Farmac%C3%A9utica+Corporation
    bbb   ABC FarmacÃ©utica Corporation
    

    Try to print the string aaa as it is.

  • Max
    Max almost 13 years
    I have to pass that term to web services. Initially, when I get the complete data, all the terms are displayed correctly. The only problem is when I send it to the web service: I am not able to send the same term to it.
  • Max
    Max almost 13 years
    Yes I am using UTF-8 for JSP. Still the problem persists
  • matbrgz
    matbrgz almost 13 years
    Then follow this particular text snippet from where it is generated to where it ends up in the output stream. It might also be because you have written a property file in UTF-8 and then read it under Windows.
  • Liv
    Liv almost 13 years
    are you setting the correct encoding when dealing with your webservices?
  • Max
    Max almost 13 years
    I am not encoding on the Java end, because I tried using URLEncoder.encode(term, "UTF-8"). If I then print it in the logger, it is displayed as ABC+Farmac%C3%A9utica+Corporation. This is not recognized by the web service.
  • Liv
    Liv almost 13 years
    it's not about url encoding the data -- if you are using a webservice (SOAP I guess?) when you pass the data is the encoding of the data sent (posted) and received set correctly?
  • Max
    Max almost 13 years
    I don't have idea on that, because I will put all the fields in one object and will pass that object to a web service link
  • Liv
    Liv almost 13 years
    what do you use for the webservices call -- Axis?
  • Max
    Max almost 13 years
    I use spring annotation @RequestWrapper and set localName and targetNamespace and className
  • matbrgz
    matbrgz almost 13 years
    Where does the "ABC Farmacéutica Corporation" string come from? Where is it physically defined?
  • Max
    Max almost 13 years
    It's part of my table data only. In that multiple terms I got this special character data.
  • matbrgz
    matbrgz almost 13 years
    In which physical file is the characters "ABC Farmacéutica Corporation" found? The JSP page? A property file? Java code?
  • Max
    Max almost 13 years
    I can say that I see first In JSP. well, I am confused what you are expecting. From Page1 I do a search, then in Page2 I will display these terms in table with checkboxes. Now I click on the checkboxes and will submit to page3. So while submitting getting the problem because one of the checkbox term is having this special character
  • matbrgz
    matbrgz almost 13 years
    Somewhere, the actual characters that make up the string "ABC Farmacéutica Corporation" are typed by you or somebody else into a file. If you needed to change it into "Carperation" where would you edit?
  • matbrgz
    matbrgz almost 13 years
    So this is defined as a String inside a SomeClass.java file?
  • matbrgz
    matbrgz almost 13 years
    In that case replace "é" with "\u00E9" in your source and try again.
  • Max
    Max almost 13 years
    So, \u00e9 is which encoding part. So that I will try sending the term from JSP by converting these kind of é to \u00e9
  • Max
    Max almost 13 years
    Hi, This makes sense to me. But I see you are changing "é" with "\u00E9". So which part of encoding is that. So that I can use dynamically all these kind of characters.
  • matbrgz
    matbrgz almost 13 years
    No, in the physical file where you defined "ABC etc" as a string constant, you change that physical constant to have "\u00E9" instead of "é".
  • McDowell
    McDowell almost 13 years
    @Max - \u00e9 is a (UTF-16) Unicode escape sequence. I have an app here that will display the escapes for any graphemes you enter.
  • Max
    Max almost 13 years
    Getting error as ,org.apache.cxf.binding.soap.SoapFault: Error performing Ms FAST Search.EX class: com.fastsearch.esp.search.SearchEngineException. EX Cause: null. EX Message : parsefql: Query Error: line 1:92: unexpected char: 'u'.
  • matbrgz
    matbrgz almost 13 years
    You have done this incorrectly. The character sequence should be expanded when read from the Java source file. I would suggest that you create a minimal but fully functional example showing just this behaviour, and open a new question.
  • Max
    Max almost 13 years
    Hey, Thanks Andersen. I just found the error with this discussion. Till now my webservice team is not accepting the encoded terms.
  • Paŭlo Ebermann
    Paŭlo Ebermann almost 13 years
    No, one doesn't want to recode existing strings (since there are cases where you simply get a ? instead). Better make sure the string does not arrive in the wrong encoding.
  • Yanick Rochon
    Yanick Rochon almost 13 years
    @Paŭlo, ok.... why the downvote? I already mentioned in a comment on the question that all files should be encoded in UTF-8. However, seeing that no one could provide a suitable solution for the OP, I'm suggesting this, which is valid Java to convert a string into a different encoding. The string displayed in his controller is clearly an ISO-8859-1 encoded string output in a UTF-8 environment. I'm not arguing the use of an encoding (I never use ISO-8859-1), I'm simply suggesting a solution that might work.
  • Paŭlo Ebermann
    Paŭlo Ebermann almost 13 years
    (It's the other way around, a UTF-8-encoded string decoded as ISO-8859-1.) The conversion should start at a lower point, where the data enters the program (in byte[] form). If you have a wrongly decoded String, it is most often too late, and encoding and decoding the string again does help in many, but not in all cases, since these encodings do not have the same range of valid bytes. (If you edit your post to say something like this as a disclaimer, I will remove my downvote - now I simply can't, until your post is edited again.)
  • Yanick Rochon
    Yanick Rochon almost 13 years
    yes, an UTF-8 string displayed as an ISO-8859-1 encoded string. In any case, disclaimer added.
  • matbrgz
    matbrgz almost 13 years
    I do not understand your explanation. What was the problem?