Check if a String contains encoded characters
Solution 1
It sounds like you want to check whether a string that was decoded from bytes as Latin-1 could also have been decoded as UTF-8. That's easy, because illegal byte sequences are replaced by the character \ufffd during decoding:
String recoded = new String(encoded.getBytes("iso-8859-1"), "UTF-8");
return recoded.indexOf('\uFFFD') == -1; // No replacement character found
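A self-contained sketch of the same idea might look like the following. The class and method names are hypothetical, and the two extra guards (rejecting pure-ASCII input and characters outside Latin-1) are assumptions added here, not part of the original answer.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeCheck {
    // Hypothetical helper: true if `s` looks like UTF-8 bytes that were decoded as ISO-8859-1.
    static boolean looksLikeMisdecodedUtf8(String s) {
        boolean sawNonAscii = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // A char above U+00FF cannot come from an ISO-8859-1 decode.
            if (c > 0xFF) {
                return false;
            }
            if (c >= 0x80) {
                sawNonAscii = true;
            }
        }
        // Pure ASCII is identical in both charsets, so it carries no evidence either way.
        if (!sawNonAscii) {
            return false;
        }
        // Recover the original bytes, then decode them as UTF-8.
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        String recoded = new String(bytes, StandardCharsets.UTF_8);
        // Malformed UTF-8 sequences are replaced with U+FFFD while decoding.
        return recoded.indexOf('\uFFFD') == -1;
    }
}
```

This is only a heuristic: a genuine Latin-1 string can, by coincidence, also form valid UTF-8 byte sequences.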
Solution 2
If I understood your question correctly, this code may help you. The function isEncoded checks whether its parameter can be encoded as ASCII; it returns true if the text contains non-ASCII characters.
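import java.nio.charset.Charset;
import org.junit.Assert;
import org.junit.Test;

// Round-trips the text through US-ASCII; unmappable characters become '?',
// so the result differs from the input when non-ASCII characters are present.
public boolean isEncoded(String text) {
    Charset charset = Charset.forName("US-ASCII");
    String checked = new String(text.getBytes(charset), charset);
    return !checked.equals(text);
}

@Test
public void testAscii() throws Exception {
    Assert.assertFalse(isEncoded("Hello world"));
}

@Test
public void testNonAscii() throws Exception {
    Assert.assertTrue(isEncoded("Hellä world"));
}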
You can also check against another charset by changing the charset variable or turning it into a parameter.
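One way to take the charset as a parameter is sketched below. It swaps the round-trip comparison for CharsetEncoder.canEncode, which is a different technique from the answer's code; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;

final class CharsetFit {
    // Hypothetical helper: true if every character of `text` can be encoded in `charset`.
    static boolean fitsIn(String text, Charset charset) {
        // canEncode reports whether the encoder could encode the whole sequence.
        return charset.newEncoder().canEncode(text);
    }
}
```

With this sketch, fitsIn("Hellä world", StandardCharsets.US_ASCII) is false, while the same call with StandardCharsets.ISO_8859_1 is true.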
Solution 3
String name = "Hellä world";
String encoded = new String(name.getBytes("utf-8"), "iso8859-1");
This code is just a character corruption bug. You take a UTF-16 string, transcode it to UTF-8, pretend it is ISO-8859-1 and transcode it back to UTF-16, resulting in incorrectly encoded characters.
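The corruption described above, and its reversal, can be sketched as follows. The class and method names are hypothetical, and repair only works if the string went through exactly this mistake once.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeRoundTrip {
    // Reproduces the bug: UTF-8 bytes misread as ISO-8859-1.
    static String corrupt(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undoes it: ISO-8859-1 maps every byte to one char, so the bytes are recoverable.
    static String repair(String corrupted) {
        byte[] bytes = corrupted.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

For "Hellä world", corrupt yields a 12-character string in which "ä" has become the two characters "Ã¤", and repair turns it back into the original.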
Solution 4
Your question doesn't make sense. A Java String is a list of characters. They don't have an encoding until you convert them into bytes, at which point you need to specify one (although you will see a lot of code that uses the platform default, which is what e.g. String.getBytes() with no argument does).
I suggest you read this http://kunststube.net/encoding/.
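As a small illustration of the point that an encoding only enters the picture when a String becomes bytes, the sketch below encodes the same String with three charsets and compares the byte counts. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class EncodingSizes {
    // Byte counts for the same String in ISO-8859-1, UTF-8 and UTF-16BE, in that order.
    static int[] byteLengths(String s) {
        return new int[] {
            s.getBytes(StandardCharsets.ISO_8859_1).length,
            s.getBytes(StandardCharsets.UTF_8).length,
            s.getBytes(StandardCharsets.UTF_16BE).length
        };
    }
}
```

For the 11-character string "Hellä world" this gives 11, 12 and 22 bytes: the String is the same, only its byte representation differs.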
Solution 5
I'm not really sure what you are trying to do or what your problem is.
This line doesn't make any sense:
String encoded = new String(name.getBytes("utf-8"), "iso8859-1");
You are encoding your name into "UTF-8" and then trying to decode it as "iso8859-1".
If you want to encode your name as "iso8859-1", just do name.getBytes("iso8859-1").
Please tell us what problem you encountered so that we can help more.
Decrypter
Updated on July 04, 2020
Comments
-
Decrypter almost 4 years
Hello, I am looking for a way to detect whether a string has been encoded.
For example:
String name = "Hellä world";
String encoded = new String(name.getBytes("utf-8"), "iso8859-1");
The output of this encoded variable is: HellÃ¤ world
As you can see, there is an A with a tilde and another symbol. Is there a way to check if the output contains encoded characters?
-
Andrea Parodi almost 12 years
I think you are only testing if the String contains a char in the "other letter" Unicode group. But Character.getType('ä') == Character.LOWERCASE_LETTER and Character.getType('a') == Character.LOWERCASE_LETTER
-
Pooya almost 12 years
Yes, because I think the question is how to find out whether a string contains encoded chars or not, and this method returns that
-
Andrea Parodi almost 12 years
But Character.getType('ä') == Character.LOWERCASE_LETTER and Character.getType('ä') != Character.OTHER_LETTER, so your code does not work. Character.OTHER_LETTER does not contain all Unicode chars, only a particular subgroup.
-
Admin over 10 years
This answer is absolutely correct, but may still be somewhat cryptic to newbies. The question, really, is "How can I tell if a String has been encoded with a certain encoding?" The short answer is: trial and error. You can set up a CharsetDecoder configured for a particular target encoding (UTF-8/ISO-8859-1, etc.), and try to run your String through that decoder. If the decoding fails or throws an exception, you know your String contains 1+ characters that aren't that target encoding. If the decoder decodes without error, then you know your String meets the criteria for that encoding.
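The trial-and-error approach from this comment might be sketched as below. A CharsetDecoder consumes bytes rather than a String, so this version assumes the raw bytes are still available; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class StrictDecode {
    // Hypothetical helper: true if `bytes` decode cleanly in `charset`.
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   // REPORT makes the decoder throw instead of substituting U+FFFD.
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A clean decode only shows that the bytes are valid in that charset, not that they were actually written in it: every byte sequence is valid ISO-8859-1, for example.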