How to determine if a String contains invalid encoded characters

105,750

Solution 1

I asked the same question,

Handling Character Encoding in URI on Tomcat

I recently found a solution, and it works pretty well for me. You might want to give it a try. Here is what you need to do:

  1. Leave your URI encoding as Latin-1. On Tomcat, add URIEncoding="ISO-8859-1" to the Connector in server.xml.
  2. If you have to manually URL decode, use Latin1 as charset also.
  3. Use the fixEncoding() function to fix up encodings.

For example, to get a parameter from query string,

  String name = fixEncoding(request.getParameter("name"));

You can always do this: a string that is already correctly encoded is left unchanged.

The code is attached. Good luck!

 // Requires: import java.io.UnsupportedEncodingException;
 public static String fixEncoding(String latin1) {
     try {
         byte[] bytes = latin1.getBytes("ISO-8859-1");
         if (!validUTF8(bytes)) {
             return latin1;
         }
         return new String(bytes, "UTF-8");
     } catch (UnsupportedEncodingException e) {
         // Impossible: Latin-1 and UTF-8 are required charsets on every JVM
         throw new IllegalStateException("No Latin1 or UTF-8: " + e.getMessage());
     }
 }

 public static boolean validUTF8(byte[] input) {
     int i = 0;
     // Skip a UTF-8 byte-order mark (EF BB BF) if present
     if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
             && (input[1] & 0xFF) == 0xBB && (input[2] & 0xFF) == 0xBF) {
         i = 3;
     }

     int end;
     for (int j = input.length; i < j; ++i) {
         int octet = input[i];
         if ((octet & 0x80) == 0) {
             continue; // ASCII
         }

         // Check for UTF-8 leading byte
         if ((octet & 0xE0) == 0xC0) {
             end = i + 1;
         } else if ((octet & 0xF0) == 0xE0) {
             end = i + 2;
         } else if ((octet & 0xF8) == 0xF0) {
             end = i + 3;
         } else {
             // Not a valid leading byte
             return false;
         }

         if (end >= input.length) {
             // Multi-byte sequence is truncated at the end of the input
             return false;
         }

         while (i < end) {
             i++;
             octet = input[i];
             if ((octet & 0xC0) != 0x80) {
                 // Not a valid continuation byte
                 return false;
             }
         }
     }
     return true;
 }
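
The reason this works is that Latin-1 maps bytes to the first 256 code points one-to-one, so a string the container mis-decoded as Latin-1 still carries the original bytes and can be re-decoded. A minimal round-trip sketch of that idea (class name is mine, for illustration):

```java
import java.nio.charset.StandardCharsets;

public class FixEncodingDemo {
    public static void main(String[] args) {
        // Simulate a UTF-8-encoded "ä" (bytes 0xC3 0xA4) that the container
        // wrongly decoded as Latin-1, yielding the two characters "Ã¤".
        String garbled = new String(new byte[]{(byte) 0xC3, (byte) 0xA4},
                StandardCharsets.ISO_8859_1);
        // Latin-1 maps bytes to code points 1:1, so getBytes() recovers the
        // original bytes, which can then be decoded as UTF-8.
        String fixed = new String(garbled.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
        System.out.println(fixed); // prints "ä"
    }
}
```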

EDIT: Your approach doesn't work, for several reasons. When there are encoding errors, you can't count on what you get from Tomcat. Sometimes you get � or ?; other times you get nothing at all: getParameter() returns null. And even if you could check for "?", what happens when your query string contains a legitimate "?"?

Besides, you shouldn't reject any request. This is not your user's fault. As I mentioned in my original question, the browser may encode the URL in either UTF-8 or Latin-1, and the user has no control over that. You need to accept both. Switching your servlet to Latin-1 preserves all the bytes, even when they are wrong, which gives us a chance to fix them up or throw them away.

The solution I posted here is not perfect but it's the best one we found so far.

Solution 2

You can use a CharsetDecoder configured to throw an exception if invalid characters are found:

 CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder()
         .onMalformedInput(CodingErrorAction.REPORT);

See CodingErrorAction.REPORT
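
Applied to the question's use case, a minimal sketch (class and method names are mine) that uses such a decoder to test whether a byte sequence is valid UTF-8:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Validator {

    // Returns true if the bytes decode cleanly as UTF-8.
    // A fresh decoder is created per call because CharsetDecoder is
    // stateful and not thread-safe.
    public static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA4};  // "ä" in UTF-8
        byte[] latin1 = {(byte) 0xE4};             // "ä" in Latin-1, invalid as UTF-8
        System.out.println(isValidUtf8(utf8));     // true
        System.out.println(isValidUtf8(latin1));   // false
    }
}
```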

Solution 3

This is what I used to check the encoding:

CharsetDecoder ebcdicDecoder = Charset.forName("IBM1047").newDecoder();
ebcdicDecoder.onMalformedInput(CodingErrorAction.REPORT);
ebcdicDecoder.onUnmappableCharacter(CodingErrorAction.REPORT);

CharBuffer out = CharBuffer.wrap(new char[3200]);
CoderResult result = ebcdicDecoder.decode(ByteBuffer.wrap(bytes), out, true);
// On success, decode() returns an underflow result (all input consumed).
// An error result means malformed or unmappable input; overflow means
// the output buffer was too small.
if (result.isError() || result.isOverflow())
{
    System.out.println("Cannot decode EBCDIC");
}
else
{
    result = ebcdicDecoder.flush(out);
    if (result.isOverflow())
        System.out.println("Cannot decode EBCDIC");
    if (result.isUnderflow())
        System.out.println("EBCDIC decoded successfully");
}

Edit: updated with Vouze suggestion

Solution 4

Strip all control characters by replacing them with the empty string:

value = value.replaceAll("\\p{Cntrl}", "");
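
For example (note that `\p{Cntrl}` also matches tabs and newlines, which may not be what you want):

```java
public class StripControl {
    public static void main(String[] args) {
        // BEL (\u0007) and the newline are both control characters
        String value = "abc\u0007def\n";
        String cleaned = value.replaceAll("\\p{Cntrl}", "");
        System.out.println(cleaned); // prints "abcdef"
    }
}
```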

Solution 5

I've been working on a similar "guess the encoding" problem. The best solution involves knowing the encoding. Barring that, you can make educated guesses to distinguish between UTF-8 and ISO-8859-1.

To answer the general question of how to detect if a string is properly encoded UTF-8, you can verify the following things:

  1. No byte is 0x00, 0xC0, 0xC1, or in the range 0xF5-0xFF.
  2. Tail bytes (0x80-0xBF) are always preceded by a head byte 0xC2-0xF4 or another tail byte.
  3. Head bytes should correctly predict the number of tail bytes (e.g., any byte in 0xC2-0xDF should be followed by exactly one byte in the range 0x80-0xBF).

If a string passes all those tests, then it's interpretable as valid UTF-8. That doesn't guarantee that it is UTF-8, but it's a good predictor.
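
The three rules above can be sketched in Java roughly as follows (a simplified check; class and method names are mine, and it does not catch every overlong or surrogate encoding):

```java
public class Utf8Heuristic {

    // Applies the three rules: no forbidden byte values, tail bytes only
    // where expected, and head bytes followed by the right number of tails.
    public static boolean looksLikeUtf8(byte[] input) {
        int i = 0;
        while (i < input.length) {
            int b = input[i] & 0xFF;
            int tails;
            if (b == 0x00 || b == 0xC0 || b == 0xC1 || b >= 0xF5) {
                return false;      // rule 1: forbidden byte values
            } else if (b <= 0x7F) {
                tails = 0;         // ASCII
            } else if (b >= 0xC2 && b <= 0xDF) {
                tails = 1;         // 2-byte sequence
            } else if (b >= 0xE0 && b <= 0xEF) {
                tails = 2;         // 3-byte sequence
            } else if (b >= 0xF0 && b <= 0xF4) {
                tails = 3;         // 4-byte sequence
            } else {
                return false;      // rule 2: stray tail byte (0x80-0xBF)
            }
            for (int j = 0; j < tails; j++) {
                i++;
                if (i >= input.length || (input[i] & 0xC0) != 0x80) {
                    return false;  // rule 3: wrong number of tail bytes
                }
            }
            i++;
        }
        return true;
    }
}
```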

Legal input in ISO-8859-1 will likely have no control characters (0x00-0x1F and 0x80-0x9F) other than line separators. Looks like 0x7F isn't defined in ISO-8859-1 either.

(I'm basing this off of Wikipedia pages for UTF-8 and ISO-8859-1.)

Author: Daniel Hiller

I transform ☕ into stuff that is hopefully useful to others.

Updated on December 02, 2020

Comments

  • Daniel Hiller, over 3 years ago

    Usage scenario

    We have implemented a webservice that our web frontend developers use internally (via a PHP API) to display product data. On the website the user enters something (i.e. a query string), and the web site then calls the service through the API.

    Note: We use restlet, not tomcat

    Original Problem

    Firefox 3.0.10 seems to respect the character encoding selected in the browser and encodes the URL accordingly. This results in different query strings for ISO-8859-1 and UTF-8.

    Our web site forwards the user's input without converting it (which it should), so it may end up calling the webservice via the API with a query string that contains German umlauts.

    I.e. for a query part looking like

        ...v=abcädef
    

    if "ISO-8859-1" is selected, the sent query part looks like

    ...v=abc%E4def
    

    but if "UTF-8" is selected, the sent query part looks like

    ...v=abc%C3%A4def
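
    Both forms can be reproduced with URLEncoder, which shows the same string producing different query strings depending on the charset:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // The same string yields different percent-encodings per charset
        System.out.println(URLEncoder.encode("abcädef", "ISO-8859-1")); // abc%E4def
        System.out.println(URLEncoder.encode("abcädef", "UTF-8"));      // abc%C3%A4def
    }
}
```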
    

    Desired Solution

    As we control the service (we implemented it ourselves), we want to check on the server side whether the call contains non-UTF-8 characters and, if so, respond with a 4xx HTTP status.

    Current Solution In Detail

    For each character (i.e. string.substring(i, i+1)), check:

    1. whether character.getBytes()[0] equals 63 (the byte for '?')
    2. whether Character.getType(character.charAt(0)) returns OTHER_SYMBOL

    Code

    protected List<String> getNonUnicodeCharacters(String s) {
      final List<String> result = new ArrayList<String>();
      for (int i = 0, n = s.length(); i < n; i++) {
        final String character = s.substring(i, i + 1);
        final boolean isOtherSymbol =
            (int) Character.OTHER_SYMBOL == Character.getType(character.charAt(0));
        final boolean isNonUnicode = isOtherSymbol
            && character.getBytes()[0] == (byte) 63;
        if (isNonUnicode)
          result.add(character);
      }
      return result;
    }
    

    Question

    Will this catch all invalid (non-UTF-8-encoded) characters? Does anyone have a better (easier) solution?

    Note: I checked URLDecoder with the following code

    final String[] test = new String[]{
      "v=abc%E4def",
      "v=abc%C3%A4def"
    };
    for ( int i = 0 , n = test.length ; i < n ; i++ ) {
        System.out.println( java.net.URLDecoder.decode(test[i],"UTF-8") );
        System.out.println( java.net.URLDecoder.decode(test[i],"ISO-8859-1") );
    }
    

    This prints:

    v=abc?def
    v=abcädef
    v=abcädef
    v=abcÃ¤def
    

    and it does not throw an IllegalArgumentException, sigh.