How can I filter Emoji characters from my input so I can save in MySQL <5.5?

Solution 1

It turns out this has been answered a few times; I just hadn't quite had the right Google-fu to find the existing questions.

Thanks to Martijn Pieters, the solution came from the world of regular expressions, specifically this code (based on his answer to the first link above):

import re
try:
    # UCS-4
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # UCS-2
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
# mytext = u'<some string containing 4-byte chars>'
mytext = highpoints.sub(u'\u25FD', mytext)

The character I'm replacing with is the WHITE MEDIUM SMALL SQUARE (U+25FD), FYI, but could be anything.

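For those unfamiliar with UCS, like me: UCS-2 and UCS-4 describe how a given build of Python 2 stores Unicode strings internally. A UCS-2 ("narrow") build uses 16-bit code units, so any character above U+FFFF is represented as a surrogate pair, while a UCS-4 ("wide") build stores every code point directly. The two variants therefore have different upper bounds on character support and need different patterns.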
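
As a rough illustration, the build type can be checked at runtime through `sys.maxunicode`. The helper name `is_wide_build` is hypothetical, and on Python 3.3 and later every build reports 1114111, so the narrow case only arises on older interpreters.

```python
import sys

def is_wide_build():
    """Return True when this interpreter stores full code points (UCS-4)."""
    # Narrow (UCS-2) builds report 0xFFFF; wide builds report 0x10FFFF.
    return sys.maxunicode > 0xFFFF

if is_wide_build():
    print("wide build: one code point per character")
else:
    print("narrow build: characters above U+FFFF are surrogate pairs")
```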

With the addition of this code, the strings seem to persist in MySQL 5.1 just fine.
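
To see why these particular characters cause trouble, it can help to count the bytes each one needs in UTF-8, since MySQL's `utf8` charset before 5.5 stores at most three bytes per character. This is a minimal sketch, assuming Python 3 or a wide Python 2 build; `find_four_byte_chars` is a hypothetical helper, not part of the original answer.

```python
def find_four_byte_chars(text):
    """Return the characters in text whose UTF-8 encoding is four bytes long."""
    return [ch for ch in text if len(ch.encode('utf-8')) == 4]

sample = u'sad \U0001F62D square \u25FD'
# U+1F62D encodes as F0 9F 98 AD (four bytes); U+25FD needs only three.
offenders = find_four_byte_chars(sample)
print(len(offenders))  # 1
```

Running something like this over incoming text before and after the substitution is one way to confirm that nothing four bytes wide is left.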

Hope this helps anyone else in the same situation!

Solution 2

I tried the solution by BigglesZX and it wasn't working for the heart emoji (❤). After reading the Wikipedia article on emoji, I saw that the regular expression doesn't cover all emoji, while it also covers other Unicode ranges that are not emoji.

The following code creates five regular expressions that cover the five emoji blocks in the standard:

emoji_symbols_pictograms = re.compile(u'[\U0001f300-\U0001f5fF]')
emoji_emoticons = re.compile(u'[\U0001f600-\U0001f64F]')
emoji_transport_maps = re.compile(u'[\U0001f680-\U0001f6FF]')
emoji_symbols = re.compile(u'[\U00002600-\U000026FF]')
emoji_dingbats = re.compile(u'[\U00002700-\U000027BF]')

Because adjacent blocks are contiguous, they can be merged into three ranges (UCS-4):

emoji_block0 = re.compile(u'[\U00002600-\U000027BF]')
emoji_block1 = re.compile(u'[\U0001f300-\U0001f64F]')
emoji_block2 = re.compile(u'[\U0001f680-\U0001f6FF]')

Their equivalents in UCS-2 are:

emoji_block0 = re.compile(u'[\u2600-\u27BF]')
emoji_block1 = re.compile(u'[\uD83C][\uDF00-\uDFFF]')
emoji_block1b = re.compile(u'[\uD83D][\uDC00-\uDE4F]')
emoji_block2 = re.compile(u'[\uD83D][\uDE80-\uDEFF]')
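
The surrogate values above can be derived from the code points with the standard UTF-16 arithmetic. The function below is only a sketch for checking the ranges by hand; `to_surrogate_pair` is a hypothetical name.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary planes")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

# The ends of the merged range U+1F300 to U+1F64F:
print([hex(v) for v in to_surrogate_pair(0x1F300)])  # ['0xd83c', '0xdf00']
print([hex(v) for v in to_surrogate_pair(0x1F64F)])  # ['0xd83d', '0xde4f']
```

Because the range crosses from high surrogate D83C to D83D, it has to be split into two UCS-2 patterns, which is why there is both an `emoji_block1` and an `emoji_block1b`.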

So finally we can define a single regular expression with all the cases together:

import re
try:
    # UCS-4
    highpoints = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
except re.error:
    # UCS-2
    highpoints = re.compile(u'([\u2600-\u27BF])|([\uD83C][\uDF00-\uDFFF])|([\uD83D][\uDC00-\uDE4F])|([\uD83D][\uDE80-\uDEFF])')
# mytext = u'<some string containing 4-byte chars>'
mytext = highpoints.sub(u'\u25FD', mytext)
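
A short usage sketch of the combined expression, assuming a wide build (or Python 3). `EMOJI_PATTERN` and `replace_emoji` are hypothetical wrappers around the UCS-4 pattern, not names from the answer.

```python
import re

# Same three ranges as the UCS-4 branch, written as one alternation.
EMOJI_PATTERN = re.compile(
    u'[\u2600-\u27BF]'
    u'|[\U0001F300-\U0001F64F]'
    u'|[\U0001F680-\U0001F6FF]'
)

def replace_emoji(text, replacement=u'\u25FD'):
    """Replace every character matched by EMOJI_PATTERN with replacement."""
    return EMOJI_PATTERN.sub(replacement, text)

cleaned = replace_emoji(u'I \u2764 Python \U0001F62D')
print(cleaned == u'I \u25FD Python \u25FD')  # True
```

The heart (U+2764) is replaced here because it falls in the Dingbats block, which the pattern from Solution 1 did not cover.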
Author by

BigglesZX


Updated on June 20, 2022

Comments

  • BigglesZX almost 2 years

    I have a Django app that takes tweet data from Twitter's API and saves it in a MySQL database. As far as I know (I'm still getting my head around the finer points of character encoding) I'm using UTF-8 everywhere, including MySQL encoding and collation, which works fine except when a tweet contains Emoji characters, which I understand use a four-byte encoding. Trying to save them produces the following warnings from Django:

    /home/biggleszx/.virtualenvs/myvirtualenv/lib/python2.6/site-packages/django/db/backends/mysql/base.py:86: Warning: Incorrect string value: '\xF0\x9F\x98\xAD I...' for column 'text' at row 1
      return self.cursor.execute(query, args)

    I'm using MySQL 5.1, so using utf8mb4 isn't an option unless I upgrade to 5.5, which I'd rather not do just yet (also, from what I've read, Django's support for this isn't quite production-ready, though this might no longer be accurate). I've also seen folks advising the use of BLOB instead of TEXT on affected columns, which I'd also rather not do as I figure it would harm performance.

    My question is, then, assuming I'm not too bothered about 100% preservation of the tweet contents, is there a way I can filter out all Emoji characters and replace them with a character that fits in three bytes or fewer, such as the venerable WHITE MEDIUM SMALL SQUARE (U+25FD)? I figure this is the easiest way to save that data given my current setup, though if I'm missing another obvious solution, I'd love to hear it!

    FYI, I'm using the stock Python 2.6.5 on Ubuntu 10.04.4 LTS. sys.maxunicode is 1114111, so it's a UCS-4 build.

    Thanks for reading.

  • Paulo Malvar about 6 years
    I had been using the incomplete regular expression for years, but this extended one appears to be working just fine and provides a lot more coverage for newer emojis.
  • Peter about 5 years
    The flag emojis are missing here. These are made up of two regional indicator symbols, which are all in the block \U0001F1E6-\U0001F1FF.
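
Following Peter's note, the regional indicator range could be handled with one more pattern. This is a sketch assuming a wide build (or Python 3); `FLAG_PATTERN` and `replace_flags` are hypothetical names.

```python
import re

# Regional indicator symbols; a flag is a pair of these.
FLAG_PATTERN = re.compile(u'[\U0001F1E6-\U0001F1FF]')

def replace_flags(text, replacement=u'\u25FD'):
    """Replace each regional indicator symbol with replacement."""
    return FLAG_PATTERN.sub(replacement, text)

# U+1F1EC U+1F1E7 is the pair for "GB"; each half is replaced separately.
print(replace_flags(u'\U0001F1EC\U0001F1E7 flag') == u'\u25FD\u25FD flag')  # True
```

Since each flag is two symbols, this leaves two replacement characters per flag; matching the pair as a unit would need a slightly different pattern.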