psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0xa0
Solution 1
Turns out there are a few options to solve this problem.
The option to change the clients encoding suggested by Phil does work.
cur.execute("SET CLIENT_ENCODING TO 'LATIN1';")
Another option is to convert the data in the fly. I used a python module called codecs to do this.
f = open(file, "w")
fx = codecs.EncodedFile(f,"LATIN1", "UTF8")
cur = connection.cursor()
cur.execute("SHOW client_encoding;")
print cur.fetchone()
cur.copy_to(fx, table)
The key line being
fx = codecs.EncodedFile(f,"LATIN1", "UTF8")
My main problem was that I was not committing my changes to the database! Silly me :)
Solution 2
I'm in the process of migrating from an SQL_ASCII database to a UTF8 database, and ran into the same problem. Based on this answer, I simply added this statement to the start of my import script:
set client_encoding to 'latin1'
and everything appears to have imported correctly.
Justin S
Updated on July 22, 2022Comments
-
Justin S almost 2 years
I have done quite a bit of googling on this error and have boiled it down to the fact that the databases I am working with are in different encodings.
The AIX server I am working with is running
psql 8.2.4 server_encoding | LATIN1 | | Client Connection Defaults / Locale and Formatting | Sets the server (database) character set encoding.
The windows 2008 R2 server I am working with is running
psql (9.3.4)
CREATE DATABASE postgres WITH OWNER = postgres ENCODING = 'UTF8' TABLESPACE = pg_default LC_COLLATE = 'English_Australia.1252' LC_CTYPE = 'English_Australia.1252' CONNECTION LIMIT = -1; COMMENT ON DATABASE postgres IS 'default administrative connection database';
Now when i try execute my below python script I get this error
Traceback (most recent call last): File "datamain.py", line 39, in <module> sys.exit(main()) File "datamain.py", line 33, in main write_file_to_table("cms_jobdef.txt", "cms_jobdef", con_S104838) File "datamain.py", line 21, in write_file_to_table cur.copy_from(f, table, ",") psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0xa0 CONTEXT: COPY cms_jobdef, line 15209
Here is my script
import psycopg2 import StringIO import sys import pdb def connect_db(db, usr, pw, hst, prt): conn = psycopg2.connect(database=db, user=usr, password=pw, host=hst, port=prt) return conn def write_table_to_file(file, table, connection): f = open(file, "w") cur = connection.cursor() cur.copy_to(f, table, ",") f.close() cur.close() def write_file_to_table(file, table, connection): f = open(file,"r") cur = connection.cursor() cur.copy_from(f, table, ",") f.close() cur.close() def main(): login = open('login.txt','r') con_tctmsv64 = connect_db("x", "y", login.readline().strip(), "d.domain", "c") con_S104838 = connect_db("x", "y", "z", "a", "b") try: write_table_to_file("cms_jobdef.txt", "cms_jobdef", con_tctmsv64) write_file_to_table("cms_jobdef.txt", "cms_jobdef", con_S104838) finally: con_tctmsv64.close() con_S104838.close() if __name__ == "__main__": sys.exit(main())
have removed some sensitive data.
So I'm not sure how I can proceed. As far as I can tell the copy_expert method might help by exporting as a UTF8 encoding. But because the server I am pulling the data from is running 8.2.4 I Dont think it supports COPY encoding format.
I think my best shot is to try and reinstall the postgre database with an encoding of LATIN1 on the windows server. When I try and do that I get the below error.
So im quite stuck,any help would be greatly appreciated!
Update I installed the postgre db on the windows as LATIN1 encoding by changing the default local to 'C'. This however gave me the below error and doesnt seem like a likely successful/correct approach
I have also tried encoding the files in BINARY using the PSQL COPY function
def write_table_to_file(file, table, connection): f = open(file, "w") cur = connection.cursor() #cur.copy_to(f, table, ",") cur.copy_expert("COPY cms_jobdef TO STDOUT WITH BINARY", f) f.close() cur.close() def write_file_to_table(file, table, connection): f = open(file,"r") cur = connection.cursor() #cur.copy_from(f, table) cur.copy_expert("COPY cms_jobdef FROM STDOUT WITH BINARY", f) f.close() cur.close()
Still no luck I get the same error
DataError: invalid byte sequence for encoding "UTF8": 0xa0 CONTEXT: COPY cms_jobdef, line 15209, column descript
In relation to Phils answer I have tried this approach with still no success.
import psycopg2 import StringIO import sys import pdb import codecs def connect_db(db, usr, pw, hst, prt): conn = psycopg2.connect(database=db, user=usr, password=pw, host=hst, port=prt) return conn def write_table_to_file(file, table, connection): f = open(file, "w") #fx = codecs.EncodedFile(f,"LATIN1", "UTF8") cur = connection.cursor() cur.execute("SHOW client_encoding;") print cur.fetchone() cur.copy_to(f, table) #cur.copy_expert("COPY cms_jobdef TO STDOUT WITH BINARY", f) f.close() cur.close() def write_file_to_table(file, table, connection): f = open(file,"r") cur = connection.cursor() cur.execute("SET CLIENT_ENCODING TO 'LATIN1';") cur.execute("SHOW client_encoding;") print cur.fetchone() cur.copy_from(f, table) #cur.copy_expert("COPY cms_jobdef FROM STDOUT WITH BINARY", f) f.close() cur.close() def main(): login = open('login.txt','r') con_tctmsv64 = connect_db("x", "y", login.readline().strip(), "ctmtest1.int.corp.sun", "5436") con_S104838 = connect_db("x", "y", "z", "t", "5432") try: write_table_to_file("cms_jobdef.txt", "cms_jobdef", con_tctmsv64) write_file_to_table("cms_jobdef.txt", "cms_jobdef", con_S104838) finally: con_tctmsv64.close() con_S104838.close() if __name__ == "__main__": sys.exit(main())
output
In [4]: %run datamain.py ('sql_ascii',) ('LATIN1',) In [5]:
This completes successfully but when i run a
select * from cms_jobdef;
Nothing is in the new database
I have even tried converting the file format from LATIN1 to UTF8. Still no luck
The weird thing is when I do this process manually by only using the postgre COPY function it works. I have no idea why. Once again any help would be greatly appreciated.