How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?
Solution 1
Unicode characters in the ranges \u0000-\uD7FF and \uE000-\uFFFF have 3-byte (or shorter) encodings in UTF-8. The \uD800-\uDFFF range is reserved for the surrogate code units used by UTF-16. I do not know Python, but you should be able to set up a regular expression to match outside those ranges.
pattern = re.compile("[\uD800-\uDFFF].", re.UNICODE)
pattern = re.compile("[^\u0000-\uFFFF]", re.UNICODE)
Edit: adding Python from Denilson Sá's script in the question body:
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
filtered_string = re_pattern.sub(u'\uFFFD', unicode_string)
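For completeness, a self-contained Python 3 sketch of the same regex approach (in Python 3 the u prefix and the re.UNICODE flag are unnecessary; the function name filter_to_3byte is my own):

```python
import re

# Any code point NOT in U+0000-U+D7FF or U+E000-U+FFFF either needs more
# than 3 bytes in UTF-8 or is a surrogate, which MySQL's utf8 also rejects.
_re_pattern = re.compile('[^\u0000-\uD7FF\uE000-\uFFFF]')

def filter_to_3byte(unicode_string):
    """Replace every character outside the 3-byte UTF-8 range with U+FFFD."""
    return _re_pattern.sub('\uFFFD', unicode_string)
```

For example, filter_to_3byte('a\U0001F600b') returns 'a\uFFFDb', while strings that already fit in 3-byte UTF-8 pass through unchanged.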
Solution 2
You may skip the decoding and encoding steps and directly detect the value of the first byte of each character in the encoded (8-bit) string. According to UTF-8:
1-byte characters have the following format: 0xxxxxxx
2-byte characters have the following format: 110xxxxx 10xxxxxx
3-byte characters have the following format: 1110xxxx 10xxxxxx 10xxxxxx
4-byte characters have the following format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
According to that, you need to check only the value of the first byte of each character to filter out 4-byte characters:
def filter_4byte_chars(s):
    i = 0
    j = len(s)
    # you need to convert the immutable string to a mutable list first
    s = list(s)
    while i < j:
        # get the value of this byte
        k = ord(s[i])
        # this is a 1-byte character, skip to the next byte
        if k <= 127:
            i += 1
        # this is a 2-byte character, skip ahead by 2 bytes
        elif k < 224:
            i += 2
        # this is a 3-byte character, skip ahead by 3 bytes
        elif k < 240:
            i += 3
        # this is a 4-byte character, remove it and update
        # the length of the string we need to check
        else:
            s[i:i+4] = []
            j -= 4
    return ''.join(s)
Skipping the decoding and encoding parts will save you some time, and for smaller strings that mostly contain 1-byte characters this could even be faster than the regular-expression filtering.
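A byte-level sketch of the same idea in Python 3, where indexing a bytes object already yields integers, so no ord() call or list conversion is needed (this assumes the input is valid UTF-8; the function is my adaptation, not the original answer's code):

```python
def filter_4byte_chars(data):
    """Drop every 4-byte UTF-8 sequence (lead byte 0xF0 or above) from `data`."""
    out = bytearray()
    i = 0
    while i < len(data):
        k = data[i]          # lead byte of the next character
        if k <= 0x7F:        # 1-byte character
            n = 1
        elif k < 0xE0:       # 2-byte character
            n = 2
        elif k < 0xF0:       # 3-byte character
            n = 3
        else:                # 4-byte character: skip it entirely
            i += 4
            continue
        out += data[i:i + n]
        i += n
    return bytes(out)
```

For example, filter_4byte_chars('a€\U0001F600b'.encode('utf-8')) drops only the 4-byte emoji sequence and keeps the 3-byte euro sign.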
Solution 3
Encode as UTF-16, then reencode as UTF-8.
>>> import struct
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
Note that you can't encode after joining, since the surrogate pairs may be decoded before reencoding.
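In Python 3 the equivalent trick needs errors='surrogatepass', because the stock UTF-8 codec refuses to encode lone surrogates. A sketch (encode_cesu8 is a name I made up; it produces the same surrogate-pair byte sequences as the session above):

```python
import struct

def encode_cesu8(text):
    """Encode each UTF-16 code unit of `text` as its own UTF-8 sequence."""
    e = text.encode('utf-16-le')
    units = struct.unpack('<%dH' % (len(e) // 2), e)
    # 'surrogatepass' lets us encode the D800-DFFF halves individually
    return b''.join(chr(u).encode('utf-8', 'surrogatepass') for u in units)
```

For example, encode_cesu8('\U0001D41F') (the character 𝐟) yields b'\xed\xa0\xb5\xed\xb0\x9f'.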
EDIT:
MySQL (at least 5.1.47) has no problem dealing with surrogate pairs:
mysql> create table utf8test (t character(128)) collate utf8_general_ci;
Query OK, 0 rows affected (0.12 sec)
...
>>> cxn = MySQLdb.connect(..., charset='utf8')
>>> csr = cxn.cursor()
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> v = ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
>>> v
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
>>> csr.execute('insert into utf8test (t) values (%s)', (v,))
1L
>>> csr.execute('select * from utf8test')
1L
>>> r = csr.fetchone()
>>> r
(u'\ud835\udc1f\ud835\udc28\ud835\udc28',)
>>> print r[0]
𝐟𝐨𝐨
Solution 4
And just for the fun of it, an itertools
monstrosity :)
import itertools as it, operator as op
import functools as ft

def max3bytes(unicode_string):
    # sequence of pairs of (char_in_string, u'\N{REPLACEMENT CHARACTER}')
    pairs = it.izip(unicode_string, it.repeat(u'\ufffd'))
    # True when the ordinal is above 65535, i.e. the character is outside the BMP
    selector = ft.partial(op.lt, 65535)
    # using the character ordinals, return False or True based on `selector`
    indexer = it.imap(selector, it.imap(ord, unicode_string))
    # now pick the correct item for all pairs
    return u''.join(it.imap(tuple.__getitem__, pairs, indexer))
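For what it's worth, a Python 3 rendering of the same monstrosity (izip and imap are gone in Python 3, since zip and map are already lazy, and functools must be imported for partial):

```python
import functools
import itertools
import operator

def max3bytes(unicode_string):
    # pair every character with U+FFFD; a boolean index then picks
    # pair[0] (keep the character) or pair[1] (the replacement)
    pairs = zip(unicode_string, itertools.repeat('\ufffd'))
    # True when 0xFFFF < ord(c), i.e. the character is outside the BMP
    selector = functools.partial(operator.lt, 0xFFFF)
    indexer = map(selector, map(ord, unicode_string))
    return ''.join(map(tuple.__getitem__, pairs, indexer))
```

This works because booleans are integers in Python, so tuple.__getitem__(pair, False) returns the original character and tuple.__getitem__(pair, True) returns the replacement.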
Solution 5
According to the MySQL 5.1 documentation: "The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP." This indicates that there might be a problem with surrogate pairs.
Note that chapter 3 of the Unicode standard (version 5.2) actually forbids encoding a surrogate pair as two 3-byte UTF-8 sequences instead of one 4-byte UTF-8 sequence; see for example page 93: "Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed." However, this proscription is, as far as I know, largely unknown or ignored.
It may well be a good idea to check what MySQL does with surrogate pairs. If they are not to be retained, this code will provide a simple-enough check:
all(uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' for uc in unicode_string)
and this code will replace any "nasties" with u'\ufffd':
u''.join(
uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
for uc in unicode_string
)
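The same pure-Python check, wrapped as a Python 3 helper (the u prefixes are optional in Python 3; replace_nasties is a hypothetical name):

```python
def replace_nasties(unicode_string):
    """Replace surrogates and astral (non-BMP) characters with U+FFFD."""
    return ''.join(
        uc if uc < '\ud800' or '\ue000' <= uc <= '\uffff' else '\ufffd'
        for uc in unicode_string
    )
```

Characters below U+D800 and in the U+E000-U+FFFF range pass through; everything else (surrogates and code points above U+FFFF) becomes U+FFFD.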
Denilson Sá Maia
Software developer || software engineer || programmer || developer. Whatever job title you want to call me. http://denilson.sa.nom.br/
Updated on July 09, 2022

Comments
- Denilson Sá Maia almost 2 years:

I'm using Python and Django, but I'm having a problem caused by a limitation of MySQL. According to the MySQL 5.1 documentation, their utf8 implementation does not support 4-byte characters. MySQL 5.5 will support 4-byte characters using utf8mb4; and, someday in the future, utf8 might support it as well.

But my server is not ready to upgrade to MySQL 5.5, and thus I'm limited to UTF-8 characters that take 3 bytes or less.

My question is: how to filter (or replace) Unicode characters that would take more than 3 bytes?

I want to replace all 4-byte characters with the official \ufffd (U+FFFD REPLACEMENT CHARACTER), or with ?.

In other words, I want a behavior quite similar to Python's own str.encode() method (when passing the 'replace' parameter). Edit: I want a behavior similar to encode(), but I don't want to actually encode the string; I want to still have a Unicode string after filtering.

I DON'T want to escape the character before storing it in MySQL, because that would mean I would need to unescape all strings I get from the database, which is very annoying and unfeasible.
See also:
- "Incorrect string value" warning when saving some unicode characters to MySQL (at Django ticket system)
- ‘𠂉’ Not a valid unicode character, but in the unicode character set? (at Stack Overflow)
[EDIT] Added tests about the proposed solutions
So I got good answers so far. Thanks, people! Now, in order to choose one of them, I did a quick test to find the simplest and fastest one.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# vi:ts=4 sw=4 et

import cProfile
import random
import re

# How many times to repeat each filtering
repeat_count = 256

# Percentage of "normal" chars, when compared to "large" unicode chars
normal_chars = 90

# Total number of characters in this string
string_size = 8 * 1024

# Generating a random testing string
test_string = u''.join(
    unichr(random.randrange(32,
        0x10ffff if random.randrange(100) > normal_chars else 0x0fff
    )) for i in xrange(string_size)
)

# RegEx to find invalid characters
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def filter_using_re(unicode_string):
    return re_pattern.sub(u'\uFFFD', unicode_string)

def filter_using_python(unicode_string):
    return u''.join(
        uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
        for uc in unicode_string
    )

def repeat_test(func, unicode_string):
    for i in xrange(repeat_count):
        tmp = func(unicode_string)

print '='*10 + ' filter_using_re() ' + '='*10
cProfile.run('repeat_test(filter_using_re, test_string)')
print '='*10 + ' filter_using_python() ' + '='*10
cProfile.run('repeat_test(filter_using_python, test_string)')

#print test_string.encode('utf8')
#print filter_using_re(test_string).encode('utf8')
#print filter_using_python(test_string).encode('utf8')
The results:
- filter_using_re() did 515 function calls in 0.139 CPU seconds (0.138 CPU seconds in the sub() built-in)
- filter_using_python() did 2097923 function calls in 3.413 CPU seconds (1.511 CPU seconds in the join() call and 1.900 CPU seconds evaluating the generator expression)
- I did no test using itertools because... well... that solution, although interesting, was quite big and complex.
Conclusion
The RegEx solution was, by far, the fastest one.
- John Machin almost 14 years: Perhaps it should exclude surrogates. Also: uc <= u'\uffff' might be better than ord(uc) < 65536.
- Ishbir almost 14 years: Perhaps struct.unpack('<%dH' % (len(e)//2), e)?
- Philipp almost 14 years: “However this proscription is as far as I know largely unknown or ignored.” Hopefully not! At least Python 3 refuses to encode surrogate code points (try chr(55349).encode("utf-8")).
- John Machin almost 14 years: @Philipp: Python 3 does seem to do the "right thing" -- however your example is a LONE surrogate, which is a different problem; Python 2 passes that test but not this one: "\xed\xa0\x80\xed\xb0\x80".decode('utf8') produces u'\U00010000' instead of an exception.
- John Machin almost 14 years: (1) The MySQL docs that I referred to declare the charset as part of the column definition: t character(128) character set utf8 ... are you sure that what you have is equivalent? (2) Try your UTF-16 stunt with Python 3.1 :-)
- Ignacio Vazquez-Abrams almost 14 years: @John: (1) Retested with character set utf8 on 2.6. Results were the same. (2) That's just a limitation of the stock UTF-8 codec. It can be worked around with a custom codec. Or with MySQL doing the right thing in the first place.
- Denilson Sá Maia almost 14 years: Hmmm... You forgot to add the u prefix to all strings! It should have been u'\ufffd'. ;)
- Flimm almost 7 years: Note that the strings "[^\u0000-\uFFFF]" etc. are not raw strings, that is, the string literals are not prefixed with r!
- Rolando Urquiza almost 7 years: I had to change the first range end in u'[^\u0000-\uD7FF\uE000-\uFFFF]' from '\uD7FF' to '\u07FF' because there were still some chars not being cleaned.