Python & MySql: Unicode and Encoding

Solution 1

I think your MySQLdb Python library doesn't know it's supposed to encode to utf8, and is instead encoding to latin1, which is the codec named in your traceback.

When you connect() to your database, pass the charset='utf8' parameter. This should also make a manual SET NAMES or SET character_set_client unnecessary.
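The failure can be reproduced without a database, which may help confirm that the connection charset is the culprit. The sketch below is only an illustration: `try_encode` is a hypothetical helper, not part of MySQLdb, and the sample string is borrowed from the INSERT example in Solution 2. It attempts the same kind of encode call the traceback shows, once with latin-1 and once with utf-8.

```python
# -*- coding: utf-8 -*-
# Code point 300 (U+012C) lies outside latin-1, which only covers 0-255.
text = u'Some string that contains unicode: \u012c'


def try_encode(value, charset):
    """Return the encoded bytes, or None if charset cannot represent value.

    Hypothetical helper for illustration only; not a MySQLdb function.
    """
    try:
        return value.encode(charset)
    except UnicodeEncodeError:
        return None


latin1_result = try_encode(text, 'latin-1')  # cannot represent U+012C
utf8_result = try_encode(text, 'utf-8')      # can represent any code point

print(latin1_result)
print(utf8_result)
```

If the latin-1 attempt fails on your own data in the same way, passing charset='utf8' to connect() is the likely fix.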

Solution 2

First, make sure you are assigning the charset and use_unicode parameters when making your MySQL connection:

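import MySQLdb as mysql  # assumed alias; the snippet refers to the module as "mysql"

conn = mysql.connect(host='127.0.0.1',
                     user='user',
                     passwd='passwd',
                     db='db',
                     charset='utf8',      # encode outgoing text as UTF-8
                     use_unicode=True)    # return text columns as unicode objects
cursor = conn.cursor()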

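Secondly, use parameterized queries when actually querying the database: pass the values as a separate tuple instead of formatting them into the SQL string yourself, so the library handles escaping and encoding. Below is an example INSERT query of a string containing a non-ASCII Unicode character.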

# unichr() is Python 2; on Python 3 the equivalent is chr(300).
cursor.execute('INSERT INTO mytable VALUES (null, %s)',
               ('Some string that contains unicode: ' + unichr(300),))
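To try the parameterized pattern without a MySQL server, here is a rough sketch that uses the standard-library sqlite3 module purely as a stand-in. This is an assumption made for illustration: sqlite3 uses "?" placeholders where MySQLdb uses "%s", and it has no charset option, so it only shows that a unicode value passed as a bound parameter round-trips intact. The table name mytable comes from the example above; the column names are made up.

```python
# -*- coding: utf-8 -*-
import sqlite3

# In-memory SQLite database standing in for MySQL (illustration only).
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE mytable (id INTEGER PRIMARY KEY, body TEXT)')

value = u'Some string that contains unicode: \u012c'

# The value is passed as a bound parameter, never spliced into the SQL text.
# sqlite3 uses "?" placeholders; MySQLdb would use "%s" here.
cursor.execute('INSERT INTO mytable VALUES (null, ?)', (value,))
conn.commit()

# Read the row back to check that the text survived unchanged.
cursor.execute('SELECT body FROM mytable')
stored = cursor.fetchone()[0]
conn.close()

print(stored == value)
```

With MySQLdb the same shape applies, provided the connection was opened with charset='utf8' as described above.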
Author by

add-semi-colons

Updated on December 08, 2020

Comments

  • add-semi-colons
    add-semi-colons over 3 years

    I am parsing JSON data and trying to store some of it in a MySQL database. I am currently getting the following Unicode error. My question is how I should handle this.

    • Should I handle it from the database side, and if so, how can I modify my table to do so?
    • Should I handle it from the Python side?

    Here is my table structure:

    CREATE TABLE yahoo_questions (
       question_id varchar(40) NOT NULL, 
       question_subj varbinary(255), 
       question_content varbinary(255),
       question_userId varchar(40) NOT NULL,
       question_timestamp varchar(40),
       category_id varbinary(20) NOT NULL,
       category_name varchar(40) NOT NULL,
       choosen_answer varbinary(255),
       choosen_userId varchar(40),
       choosen_usernick varchar(40),
       choosen_ans_timestamp varchar(40),
       UNIQUE (question_id)
    );
    

    Error while inserting via Python code:

    Traceback (most recent call last):
      File "YahooQueryData.py", line 78, in <module>
        +"VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)", (row[2], row[5], row[6], quserId, questionTime, categoryId, categoryName, qChosenAnswer, choosenUserId, choosenNickName, choosenTimeStamp))
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/MySQLdb/cursors.py", line 159, in execute
        query = query % db.literal(args)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/MySQLdb/connections.py", line 264, in literal
        return self.escape(o, self.encoders)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/MySQLdb/connections.py", line 202, in unicode_literal
        return db.literal(u.encode(unicode_literal.charset))
    UnicodeEncodeError: 'latin-1' codec can't encode characters in position 204-230: ordinal not in range(256)
    

    Python code segment:

        # push the user id into the URL to get the full JSON stack
        urlobject = urllib.urlopen(base_url.format(row[2]))
        qnadatajson = urlobject.read()
        data = json.loads(qnadatajson)
        cur.execute("INSERT INTO yahoo_questions (question_id, question_subj, question_content, question_userId, question_timestamp,"
                    +"category_id, category_name, choosen_answer, choosen_userId, choosen_usernick, choosen_ans_timestamp)"
                    +"VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)", (row[2], row[5], row[6], quserId, questionTime, categoryId, categoryName, qChosenAnswer, choosenUserId, choosenNickName, choosenTimeStamp))
    

    JSON structure:

    questions: [
    {
    Id: "20111201185322AA5HTDc",
    Subject: "what are the new pokemon call?",
    Content: "I used to know them I stop at dialga and palkia version and I heard there's new ones what's it call
    ",
    Date: "2011-12-01 18:53:22",
    Timestamp: "1322794402",
    

    Prior to running the query, I also executed the following on MySQL: SET character_set_client = utf8

    And this is what the MySQL variables look like:

    mysql> SHOW variables LIKE '%character_set%';
    +--------------------------+--------------------------------------------------------+
    | Variable_name            | Value                                                  |
    +--------------------------+--------------------------------------------------------+
    | character_set_client     | utf8                                                   |
    | character_set_connection | utf8                                                   |
    | character_set_database   | latin1                                                 |
    | character_set_filesystem | binary                                                 |
    | character_set_results    | utf8                                                   |
    | character_set_server     | latin1                                                 |
    | character_set_system     | utf8                                                   |
    | character_sets_dir       | /usr/local/mysql-5.5.10-osx10.6-x86_64/share/charsets/ |
    +--------------------------+--------------------------------------------------------+
    8 rows in set (0.00 sec)