Remove or Encode Non-UTF-8 Characters

If you have a UTF-8 string that might contain invalid characters, you can use iconv to remove those. This should work:

// Convert UTF-8 to UTF-8; //IGNORE tells iconv to drop bytes it cannot convert.
$text = iconv("UTF-8", "UTF-8//IGNORE", $text);

Making them visible with an arbitrary placeholder is a bit tougher - I can't think of any easy way to do that, short of walking through every byte and checking whether it is part of a valid character. The Wikipedia article on UTF-8 provides more info on how to do that.
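The byte-by-byte check described above can be sketched with a regular expression that matches either one well-formed UTF-8 sequence (following the byte ranges in RFC 3629) or, failing that, a single stray byte. This is only a rough sketch and not part of the original answer: the function name `replace_invalid_utf8` and the default placeholder U+FFFD are assumptions made for illustration.

```php
<?php
// Hypothetical helper (not from the original answer): replace every byte that
// is not part of a well-formed UTF-8 sequence with a visible placeholder.
function replace_invalid_utf8(string $text, string $placeholder = "\u{FFFD}"): string
{
    // Group 1: one well-formed UTF-8 character (byte ranges per RFC 3629).
    // Group 2: any single byte that group 1 could not consume.
    // No /u modifier, so PCRE works on raw bytes.
    $pattern = '/
        ( [\x00-\x7F]
        | [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}
        )
        | (.)
    /xs';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($placeholder): string {
            // Keep valid characters as they are; swap stray bytes for the placeholder.
            return $m[1] !== '' ? $m[1] : $placeholder;
        },
        $text
    );

    if ($result === null) {
        throw new RuntimeException('preg_replace_callback failed: ' . preg_last_error());
    }
    return $result;
}
```

Calling `replace_invalid_utf8("a\xFFz", "?")` returns `a?z`, while a valid string such as `"caf\xC3\xA9"` comes back unchanged. A truncated multi-byte sequence yields one placeholder per stray byte, which may or may not be what you want.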

Author: itsme

Updated on June 14, 2022

Comments

  • itsme
    itsme almost 2 years

    Is there a function to remove all non-UTF-8 characters from a string?

  • itsme
    itsme over 12 years
    Btw, this code lets me show special chars, right? As far as I can see it doesn't remove them but encodes them to UTF-8; am I right? :P
  • Pekka
    Pekka over 12 years
    @Ispuk Nope, this should remove only byte sequences that are not valid UTF-8 from a UTF-8 string. If you need to do something else (like converting characters from some other encoding), you need to know what the original encoding is.
  • itsme
    itsme over 12 years
    Do I need to check the HTTP header? Which parameter specifies the exact charset of the request? :)
  • itsme
    itsme over 12 years
    Because your code actually lets me display the chars rather than removing them. That's not a problem, maybe even better :), but I'd like to understand why :P
  • Pekka
    Pekka over 12 years
    @Ispuk It depends on where your data comes from. If it's from an HTTP request, the Content-Type header should contain the character set. If it's from a file, there may not be a content type defined at all - in that case, you need to get that information separately. Trying to detect the character set from the data itself is relatively unreliable.
  • itsme
    itsme over 12 years
    In this case the data comes from an XHR request, so I guess the charset is the same as for HTTP, right? If so, I'm passing UTF-8 data.
  • itsme
    itsme over 12 years
    Just out of curiosity, do you know if there is a list of UTF-8 chars out there?
  • Pekka
    Pekka over 12 years
    @Ispuk UTF-8 can encode every Unicode character, and there are well over a hundred thousand of them. There are several attempts to document them all, e.g. fileformat.info/info/charset/UTF-8/list.htm