psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0xa0

10,601

Solution 1

Turns out there are a few options to solve this problem.

The option to change the clients encoding suggested by Phil does work.

cur.execute("SET CLIENT_ENCODING TO 'LATIN1';")

Another option is to convert the data in the fly. I used a python module called codecs to do this.

f = open(file, "w")
fx = codecs.EncodedFile(f,"LATIN1", "UTF8")
cur = connection.cursor()
cur.execute("SHOW client_encoding;")
print cur.fetchone()
cur.copy_to(fx, table)

The key line being

fx = codecs.EncodedFile(f,"LATIN1", "UTF8")

My main problem was that I was not committing my changes to the database! Silly me :)

Solution 2

I'm in the process of migrating from an SQL_ASCII database to a UTF8 database, and ran into the same problem. Based on this answer, I simply added this statement to the start of my import script:

set client_encoding to 'latin1'

and everything appears to have imported correctly.

Share:
10,601
Justin S
Author by

Justin S

Updated on July 22, 2022

Comments

  • Justin S
    Justin S almost 2 years

    I have done quite a bit of googling on this error and have boiled it down to the fact that the databases I am working with are in different encodings.

    The AIX server I am working with is running

    psql 8.2.4
    
     server_encoding                 | LATIN1          |      | Client Connection Defaults / Locale and Formatting                | Sets the server (database) character set encoding.
    

    The windows 2008 R2 server I am working with is running

    psql (9.3.4)

    CREATE DATABASE postgres
    WITH OWNER = postgres
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'English_Australia.1252'
       LC_CTYPE = 'English_Australia.1252'
       CONNECTION LIMIT = -1;
    
    COMMENT ON DATABASE postgres
    IS 'default administrative connection database';
    

    Now when i try execute my below python script I get this error

    Traceback (most recent call last):
    File "datamain.py", line 39, in <module>
    sys.exit(main())
    File "datamain.py", line 33, in main
    write_file_to_table("cms_jobdef.txt", "cms_jobdef", con_S104838)
    File "datamain.py", line 21, in write_file_to_table
    cur.copy_from(f, table, ",")
    psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0xa0
    CONTEXT:  COPY cms_jobdef, line 15209    
    

    Here is my script

    import psycopg2
    import StringIO
    import sys
    import pdb
    
    def connect_db(db, usr, pw, hst, prt):
        conn = psycopg2.connect(database=db, user=usr,
        password=pw, host=hst, port=prt)
        return conn
    
    def write_table_to_file(file, table, connection):
        f = open(file, "w")
        cur = connection.cursor()
        cur.copy_to(f, table, ",")
        f.close()
        cur.close()
    
    def write_file_to_table(file, table, connection):
        f = open(file,"r")
        cur = connection.cursor()
        cur.copy_from(f, table, ",")
        f.close()
        cur.close()
    
    def main():
        login = open('login.txt','r')
        con_tctmsv64 = connect_db("x", "y",
        login.readline().strip(),
        "d.domain", "c")
        con_S104838 = connect_db("x", "y", "z", "a", "b")
        try:
            write_table_to_file("cms_jobdef.txt", "cms_jobdef", con_tctmsv64)
            write_file_to_table("cms_jobdef.txt", "cms_jobdef", con_S104838)
        finally:
            con_tctmsv64.close()
            con_S104838.close()
    
    if __name__ == "__main__":
        sys.exit(main())
    

    have removed some sensitive data.

    So I'm not sure how I can proceed. As far as I can tell the copy_expert method might help by exporting as a UTF8 encoding. But because the server I am pulling the data from is running 8.2.4 I Dont think it supports COPY encoding format.

    I think my best shot is to try and reinstall the postgre database with an encoding of LATIN1 on the windows server. When I try and do that I get the below error.

    psql error

    So im quite stuck,any help would be greatly appreciated!

    Update I installed the postgre db on the windows as LATIN1 encoding by changing the default local to 'C'. This however gave me the below error and doesnt seem like a likely successful/correct approach

    enter image description here

    I have also tried encoding the files in BINARY using the PSQL COPY function

    def write_table_to_file(file, table, connection):
        f = open(file, "w")
        cur = connection.cursor()
        #cur.copy_to(f, table, ",")
        cur.copy_expert("COPY cms_jobdef TO STDOUT WITH BINARY", f)
        f.close()
        cur.close()
    
    def write_file_to_table(file, table, connection):
        f = open(file,"r")
        cur = connection.cursor()
        #cur.copy_from(f, table)
        cur.copy_expert("COPY cms_jobdef FROM STDOUT WITH BINARY", f)
        f.close()
        cur.close()
    

    Still no luck I get the same error

    DataError: invalid byte sequence for encoding "UTF8": 0xa0
    CONTEXT:  COPY cms_jobdef, line 15209, column descript
    

    In relation to Phils answer I have tried this approach with still no success.

    import psycopg2
    import StringIO
    import sys
    import pdb
    import codecs
    
    def connect_db(db, usr, pw, hst, prt):
        conn = psycopg2.connect(database=db, user=usr,
        password=pw, host=hst, port=prt)
        return conn
    
    def write_table_to_file(file, table, connection):
        f = open(file, "w")
        #fx = codecs.EncodedFile(f,"LATIN1", "UTF8")
        cur = connection.cursor()
        cur.execute("SHOW client_encoding;")
        print cur.fetchone()
        cur.copy_to(f, table)
        #cur.copy_expert("COPY cms_jobdef TO STDOUT WITH BINARY", f)
        f.close()
        cur.close()
    
    def write_file_to_table(file, table, connection):
        f = open(file,"r")
        cur = connection.cursor() 
        cur.execute("SET CLIENT_ENCODING TO 'LATIN1';")
        cur.execute("SHOW client_encoding;")
        print cur.fetchone()
        cur.copy_from(f, table)
        #cur.copy_expert("COPY cms_jobdef FROM STDOUT WITH BINARY", f)
        f.close()
        cur.close()
    
    def main(): 
        login = open('login.txt','r')
        con_tctmsv64 = connect_db("x", "y",
        login.readline().strip(),
        "ctmtest1.int.corp.sun", "5436")
        con_S104838 = connect_db("x", "y", "z", "t", "5432")
        try:
            write_table_to_file("cms_jobdef.txt", "cms_jobdef", con_tctmsv64)
            write_file_to_table("cms_jobdef.txt", "cms_jobdef", con_S104838)
        finally:
            con_tctmsv64.close()
            con_S104838.close()
    
    if __name__ == "__main__":
        sys.exit(main())
    

    output

    In [4]: %run datamain.py
    ('sql_ascii',)
    ('LATIN1',)
    
    In [5]: 
    

    This completes successfully but when i run a

    select * from cms_jobdef;
    

    Nothing is in the new database

    enter image description here

    I have even tried converting the file format from LATIN1 to UTF8. Still no luck

    The weird thing is when I do this process manually by only using the postgre COPY function it works. I have no idea why. Once again any help would be greatly appreciated.