The ultimate emoji encoding scheme


Solution 1

MySQL's utf8 charset is not actually UTF-8; it is a subset of UTF-8 that supports only the Basic Multilingual Plane (characters up to U+FFFF, at most three bytes each). Most emoji use code points above U+FFFF, which take four bytes in UTF-8. MySQL's utf8mb4 is real UTF-8 and can encode all of those code points. Outside of MySQL there's no such thing as "utf8mb4"; there's just UTF-8. So:
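
To make the U+FFFF boundary concrete, here is a minimal sketch that reports the code point and UTF-8 byte length of a few characters. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded (mb_ord needs PHP 7.2 or later); needsUtf8mb4() is a hypothetical helper name, not a built-in.

```php
<?php
// Hypothetical helper: does this UTF-8 string contain a character
// outside the Basic Multilingual Plane (and so need utf8mb4 in MySQL)?
function needsUtf8mb4(string $text): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // U+10000..U+10FFFF are the supplementary planes (4-byte sequences).
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

foreach (['á', '¿', '€', '😀'] as $char) {
    printf(
        "U+%04X  %d byte(s)  needs utf8mb4: %s\n",
        mb_ord($char, 'UTF-8'), // numeric code point
        strlen($char),          // strlen() counts bytes, not characters
        needsUtf8mb4($char) ? 'yes' : 'no'
    );
}
```

Here á and ¿ take two bytes, € takes three, and only 😀 (U+1F600) takes four, which is why it is the one that MySQL's utf8 cannot hold.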

Does POST allow utf8mb4, or should I convert the data in the client to plain utf8?

Again, there is no such thing as "utf8mb4" outside MySQL. HTTP POST requests can carry any raw bytes; if your client sends UTF-8 encoded data, you're fine.
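
As a rough illustration, this is what a form-encoded POST body carrying an emoji looks like, and how it decodes back to the same UTF-8 string. The field name message is made up for the example, and parse_str() only stands in for the decoding PHP does when it fills $_POST.

```php
<?php
// Build the body a client would send as application/x-www-form-urlencoded.
$body = http_build_query(['message' => 'Hi 😀']);
echo $body, "\n"; // message=Hi+%F0%9F%98%80

// Decode it the way PHP does for $_POST (emulated here with parse_str).
parse_str($body, $fields);

var_dump($fields['message'] === 'Hi 😀');                 // bool(true)
var_dump(mb_check_encoding($fields['message'], 'UTF-8')); // bool(true)
```

The four percent-escaped bytes F0 9F 98 80 are simply the UTF-8 encoding of 😀; nothing on the HTTP side needs to know about emoji.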

If my DB has collation and character set utf8mb4, does it mean I should be able to store 'raw' emojis?

Yes.

Should I try to work in the DB with utf8mb4 or is it safer/better/more supported to work in utf8 and encode symbols?

God no; for all that is holy, use raw UTF-8 (utf8mb4).

When I retrieve these symbols in PHP I first need to execute SET CHARACTER SET utf8

Well, there's your problem: channeling your data through MySQL's utf8 charset loses any character above U+FFFF. Use utf8mb4 all the way through MySQL, including the connection charset.
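
One way to do that from PHP is to ask for utf8mb4 when opening the connection, rather than running SET CHARACTER SET utf8 afterwards. The following is only a sketch using PDO; buildDsn() and connect() are hypothetical helper names, and the host, database name, and credentials are placeholders.

```php
<?php
// Hypothetical helper: build a pdo_mysql DSN that requests utf8mb4
// as the connection charset.
function buildDsn(string $host, string $dbName): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

// Hypothetical helper: open the connection with exceptions enabled.
function connect(string $host, string $dbName, string $user, string $password): PDO
{
    return new PDO(buildDsn($host, $dbName), $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION, // throw on SQL errors
    ]);
}

// Usage (needs a reachable MySQL server, so it is left commented out):
// $pdo = connect('localhost', 'mydb', 'app_user', 'secret');
// $pdo->prepare('INSERT INTO example (`text`) VALUES (?)')->execute(['😀']);
```

With mysqli, the equivalent step is calling set_charset('utf8mb4') on the connection object. Neither approach requires access to my.cnf.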

if I get them in utf8mb4 the json_decode function doesn't work

You'll have to specify what that means exactly. PHP's JSON functions should be able to handle any Unicode code point just fine, as long as it's valid UTF-8:

echo json_encode('😀');
"\ud83d\ude00"

echo json_decode('"\ud83d\ude00"');
😀
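
If json_encode or json_decode really does fail, one way to narrow it down is to check json_last_error(), since invalid UTF-8 input is reported explicitly. A small sketch, where the truncated string is just a stand-in for bytes that were mangled somewhere along the way:

```php
<?php
// Valid UTF-8 encodes fine, either escaped (the default) or raw.
$escaped = json_encode('😀');                         // "\ud83d\ude00"
$raw     = json_encode('😀', JSON_UNESCAPED_UNICODE); // "😀"
var_dump(json_decode($escaped) === json_decode($raw)); // bool(true)

// The first two of the emoji's four bytes are not valid UTF-8 on their own.
$broken = substr('😀', 0, 2);
$result = json_encode($broken);

var_dump($result);                             // bool(false)
var_dump(json_last_error() === JSON_ERROR_UTF8); // bool(true)
echo json_last_error_msg(), "\n";              // human-readable reason
```

If the error is JSON_ERROR_UTF8, the problem is the bytes you got from the database, not the JSON functions.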

Solution 2

Use utf8mb4 throughout MySQL:

  • SET NAMES utf8mb4
  • Declare the table/columns CHARACTER SET utf8mb4
  • Emoji and certain Chinese characters will work in utf8mb4, but not in MySQL's utf8.
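
The MySQL side of that list could be sketched from PHP roughly as follows. convertTableSql() is a hypothetical helper, example is a placeholder table name, and the utf8mb4_unicode_ci collation is just the one mentioned in the question; the statements themselves are standard MySQL.

```php
<?php
// Hypothetical helper: build the statement that converts an existing
// table (and its text columns) to utf8mb4.
function convertTableSql(string $table): string
{
    // Quote the identifier with backticks, doubling any backtick inside it.
    $quoted = '`' . str_replace('`', '``', $table) . '`';
    return "ALTER TABLE {$quoted} CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci";
}

echo convertTableSql('example'), "\n";

// Usage against a live connection (commented out; $pdo is assumed to be
// an already opened PDO connection to MySQL):
// $pdo->exec('SET NAMES utf8mb4');
// $pdo->exec(convertTableSql('example'));
// foreach ($pdo->query("SHOW VARIABLES LIKE 'character_set_%'") as $row) {
//     echo $row[0], ' = ', $row[1], "\n"; // client, connection, results should say utf8mb4
// }
```

SET NAMES changes the session's character_set_client, character_set_connection, and character_set_results, so checking those three after connecting is a quick sanity test.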

Use UTF-8 throughout other things:

  • HTML:

¿ and á are (or at least can be) encoded in utf8 as well as utf8mb4.

Author: Carlos Navarro Astiasarán

Updated on July 11, 2022

Comments

  • Carlos Navarro Astiasarán, almost 2 years ago

    This is my environment: Client -> iOS app, Server -> PHP and MySQL.

    Data is sent from the client to the server via HTTP POST.

    Data is returned from the server to the client as JSON.

    I would like to add support for emojis or any utf8mb4 character in general. I'm looking for the right way for dealing with this under my scenario.

    My questions are the following:

    1. Does POST allow utf8mb4, or should I convert the data in the client to plain utf8?

    2. If my DB has collation and character set utf8mb4, does it mean I should be able to store 'raw' emojis?

    3. Should I try to work in the DB with utf8mb4 or is it safer/better/more supported to work in utf8 and encode symbols? If so, which encoding method should I use so that it works flawlessly in Objective-C and PHP (and Java for the future Android version)?

    Right now I have the DB with utf8mb4 but I get errors when trying to store a raw emoji. On the other hand, I can store non-utf8 symbols such as ¿ or á.

    When I retrieve these symbols in PHP I first need to execute SET CHARACTER SET utf8 (if I get them in utf8mb4 the json_decode function doesn't work), and then such symbols are encoded (e.g., ¿ is encoded to \u00bf).

  • Carlos Navarro Astiasarán, over 8 years ago
    Thanks for your answer. I think my real problem has to do with the DB configuration. For now I'm using a shared host, so I can't fully change from utf8 to utf8mb4. For instance, the only MySQL variables I'm able to change to utf8mb4 are character_set_database and collation_database, but the remaining ones (character_set_client, character_set_connection, character_set_results, character_set_server, character_set_system, collation_connection and collation_server) are still utf8. I guess this is game over until I have access to my.cnf?
  • Gromski, over 8 years ago
    You should be able to choose your charset when connecting to the database, in your PHP code. No need to futz around with my.cnf at all.
  • Carlos Navarro Astiasarán, over 8 years ago
    The thing is: with ALTER DATABASE mydb CHARACTER SET = utf8mb4 COLLATE utf8mb4_unicode_ci, and with CREATE TABLE `example` ( `text` varchar(60) COLLATE utf8mb4_unicode_ci NOT NULL ) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci, if I then run (from phpMyAdmin) INSERT INTO `mydb`.`example` (`text`) VALUES ('😀'); I get two warnings (Invalid utf8 character string and Incorrect string value: '\xF0\x9F\x98\x80') and the stored data is just ????.
  • Wei Jing, almost 4 years ago
    The solution for me was to use base64_encode to encode the text before saving it to the DB, and to decode the value when I need to use it.
  • Asif Thebepotra, over 3 years ago
    @WeiJing I think this is the best solution!