Remove or Encode Non-UTF-8 Characters
If you have a UTF-8 string that might contain invalid characters, you can use iconv to remove those. This should work:
$text = iconv("utf-8", "utf-8//ignore", $text);
Making them visible with an arbitrary placeholder is a bit tougher. I can't think of any easy way to do that, short of walking through every byte and checking whether it's part of a valid character. The Wikipedia article on UTF-8 provides more info on how to do that.
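For what it's worth, the byte-walking idea can be sketched with a regular expression that matches either one well-formed UTF-8 sequence or a single stray byte. This is only a rough sketch: the function name `replace_invalid_utf8` and the `?` default placeholder are made up for illustration, and the byte ranges are the usual well-formed UTF-8 ranges (no overlong forms, no surrogates, nothing above U+10FFFF).

```php
<?php
// Hypothetical helper (not part of PHP): replaces every byte that is not
// part of a well-formed UTF-8 sequence with a visible placeholder.
function replace_invalid_utf8(string $text, string $placeholder = '?'): string
{
    // Group 1: exactly one well-formed UTF-8 sequence.
    // Group 2: any single byte that group 1 could not match.
    $pattern = '/
        ( [\x00-\x7F]                          # ASCII
        | [\xC2-\xDF][\x80-\xBF]               # 2-byte sequences
        | \xE0[\xA0-\xBF][\x80-\xBF]           # 3-byte, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3-byte, general
        | \xED[\x80-\x9F][\x80-\xBF]           # 3-byte, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4-byte, no overlongs
        | [\xF1-\xF3][\x80-\xBF]{3}            # 4-byte, general
        | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4-byte, up to U+10FFFF
        )
        | (.)
    /xs';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($placeholder): string {
            // Group 1 is non-empty only when a valid sequence matched.
            return ($m[1] ?? '') !== '' ? $m[1] : $placeholder;
        },
        $text
    );

    if ($result === null) {
        throw new RuntimeException('preg_replace_callback failed');
    }
    return $result;
}

// Usage: the stray 0xFF byte becomes visible, valid text is untouched.
echo replace_invalid_utf8("a\xFFb"), "\n";
```

Passing an empty string as the placeholder makes this behave like a plain "strip invalid bytes" function, so it can stand in for the iconv call if iconv misbehaves on a given system.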
Comments
-
itsme almost 2 years
Is there a function to remove all non UTF-8 characters from a string?
-
itsme over 12 years
btw, this code lets me show special chars, right? As far as I can see it doesn't remove them but encodes them to UTF-8. Am I right? :P
-
Pekka over 12 years
@Ispuk Nope, this should remove only non-UTF-8 characters from a UTF-8 string. If you need to do something else (like converting characters from some other encoding), you need to know what the original encoding is.
-
itsme over 12 years
Do I need to check the HTTP header? Which parameter specifies the exact charset of the request? :)
-
itsme over 12 years
Because your code actually lets me show the chars rather than remove them. That's not a problem, maybe even better :), but I'd like to understand why :P
-
Pekka over 12 years
@Ispuk It depends where your data comes from. If it's from an HTTP request, the Content-Type header should contain the character set. If it's from a file, there may not be a content type defined at all; in that case, you need to get the information separately. Trying to detect the character set from the data itself is relatively unreliable.
-
itsme over 12 years
In this case the data is passed from an XHR request, so I guess the charset is the same as for HTTP, right? If so, I'm passing UTF-8 data.
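For the HTTP/XHR case discussed here, a rough sketch of reading the declared charset from the request's Content-Type header might look like this. The helper name `charset_from_content_type` is made up for illustration, and it assumes the server exposes the header as `$_SERVER['CONTENT_TYPE']`, which is the usual case but depends on the setup. Keep in mind the header only tells you what the client claims to send.

```php
<?php
// Hypothetical helper: pulls the charset parameter out of a Content-Type
// header value such as "application/json; charset=UTF-8".
// Returns the charset in lowercase, or null if none is declared.
function charset_from_content_type(?string $contentType): ?string
{
    if ($contentType === null) {
        return null;
    }
    // Match ";charset=<value>", allowing whitespace and optional quotes.
    if (preg_match('/;\s*charset\s*=\s*"?([^\s;"]+)"?/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

// Assumption: the request header is available under this key.
$declared = charset_from_content_type($_SERVER['CONTENT_TYPE'] ?? null);
```

If `$declared` comes back as `utf-8` (or null, when you already know your own page sends UTF-8), the iconv cleanup from the answer is the right tool; for any other value the data would need converting from that encoding first.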
-
itsme over 12 years
Just out of curiosity, do you know if there's a list of all UTF-8 chars out there?
-
Pekka over 12 years
@Ispuk UTF-8 can encode every Unicode character, and there are well over a hundred thousand of them. There are several attempts to document them all, e.g. fileformat.info/info/charset/UTF-8/list.htm