enforce column encoding with sqlalchemy

16,910

Solution 1

To specify a specific collation per column, use the collation parameter on the data type:

class MyTableName(Base):
    __tablename__ = "mytablename2"
    test_column = Column(Unicode(2),
                         primary_key=True)
    test_column2 = Column(Unicode(2, collation='utf8_bin'))
#                                    ^^^^^^^^^^^^^^^^^^^^

Mind that MySQL understands this as both the set of codepoints to describe the text as well as the sort order the text will be indexed with; the usual suspects like 'utf8' or 'utf-8' won't be familiar to MySQL (use SHOW COLLATION to see the full list)

mysql> show full columns from mytablename2;
+--------------+------------+-------------------+------+-----+---------+-------+---------------------------------+---------+
| Field        | Type       | Collation         | Null | Key | Default | Extra | Privileges                      | Comment |
+--------------+------------+-------------------+------+-----+---------+-------+---------------------------------+---------+
| test_column  | varchar(2) | latin1_swedish_ci | NO   | PRI | NULL    |       | select,insert,update,references |         |
| test_column2 | varchar(2) | utf8_bin          | YES  |     | NULL    |       | select,insert,update,references |         |
+--------------+------------+-------------------+------+-----+---------+-------+---------------------------------+---------+
2 rows in set (0.00 sec)

mysql> 

Solution 2

For me the collation parameter did not work.

My connection string was:

db = create_engine('mysql+pymysql://user:pass@dbhost/schema?charset=utf8')

Pymysql was executing set names utf8 because of the charset, and the database was converting the utf8 to the encoding of the table, resulting in data loss.

If i left the charset out, the charset was defaulting to latin1, and pymysql was trying to encode my utf8 strings to latin1 before sending them to the database, thus throwing UnicodeEncode errors.

This worked for me :session.execute(text("SET NAMES latin1")) to make the database assume the utf8 strings i was sending did not need to be converted.

Share:
16,910
Jealie
Author by

Jealie

Updated on June 03, 2022

Comments

  • Jealie
    Jealie almost 2 years

    I am using sqlalchemy to create the schema of my database. I have no success in enforcing the use of utf-8, no matter what I tried.

    Here is a minimal python script that recreates my problem:

    from sqlalchemy import create_engine, Column, Unicode
    from sqlalchemy.ext.declarative import declarative_base
    engine = create_engine('mysql+mysqldb://user:password@localhost/multidic?charset=utf8', echo=True)
    Base = declarative_base()
    class MyTableName(Base):
        __tablename__ = "mytablename"
        test_column = Column(Unicode(2),primary_key=True)
    Base.metadata.create_all(engine)
    

    After running this script, when I look into the database, I see that the encoding is latin1 instead of utf-8:

    mysql> SHOW FULL COLUMNS FROM mytablename;
    +-------------+------------+-------------------+------+-----+---------+-------+---------------------------------+---------+
    | Field       | Type       | Collation         | Null | Key | Default | Extra | Privileges                      | Comment |
    +-------------+------------+-------------------+------+-----+---------+-------+---------------------------------+---------+
    | test_column | varchar(2) | latin1_swedish_ci | NO   | PRI | NULL    |       | select,insert,update,references |         |
    +-------------+------------+-------------------+------+-----+---------+-------+---------------------------------+---------+
    1 row in set (0.00 sec)
    

    I have tried changing the type of the column created (String instead of Unicode), and tried also to add the argument encoding = "utf8" in the call to create_engine, but none of it worked.

    So, my question is:

    How to enforce the use of a given character encoding (utf-8 in my case) in MySQL, with sqlalchemy ?

    Thank you :)

    Notes:

    I am using sqlalchemy 0.7 and python 2.7; I can possibly upgrade one or both, but only if it is the only solution!

    I have mysql 5, and it supports utf-8:

    mysql> show character set where charset="utf8";
    +---------+---------------+-------------------+--------+
    | Charset | Description   | Default collation | Maxlen |
    +---------+---------------+-------------------+--------+
    | utf8    | UTF-8 Unicode | utf8_general_ci   |      3 |
    +---------+---------------+-------------------+--------+
    1 row in set (0.00 sec)
    
  • Jealie
    Jealie over 10 years
    Trying test_column2 = Column(Unicode(2, collation='utf8_bin')) throws an error: TypeError: __init__() got an unexpected keyword argument 'collation' ... In addition, trying test_column2 = Column(Unicode(2), collation='utf8_bin') throws another error: sqlalchemy.exc.ArgumentError: Unknown arguments passed to Column: ['collation']
  • flow
    flow over 10 years
    exactly what the world needs—A DATABASE THAT CONFUSES ENCODING WITH COLLATION, AND CALLS ENCODING 'CHARSET'. did you know Latin-1 is seriously OUTDATED ever since it got replaced by Latin-9 = ISO-8859-15 in 1999? Latin-1 doesn't even have a Euro-sign.
  • Jealie
    Jealie over 10 years
    Upgrading to the latest version of SQLalchemy (i.e. 0.8.2) made your solution work... Thank you !!