Python: open and read a file containing German umlauts as Unicode

python sqlite unicode utf-8 diacritics

19,788

Solution 1

I managed to sort out the problem. Thanks for the help.

Here it is:

# -*- coding: iso-8859-1 -*-

import sys 
import codecs
import sqlite3

f = codecs.open("suess_sweet.txt", "r", "utf-8")    # suess_sweet.txt file contains two
text_in_unicode = f.read()                          # comma-separated words: süß, sweet 
f.close()

stdout_encoding = sys.stdout.encoding or sys.getfilesystemencoding()

con = sqlite3.connect('dict1.db')
cur = con.cursor()
cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,German,English)''')    

[ger,eng] = text_in_unicode.split(',')

cur.execute('''insert into table1 (id,German,English) VALUES (NULL,?,?)''',(ger,eng))       

con.commit()

sentence = "The German word is: %s" %(ger,)

print sentence.encode(stdout_encoding)

con.close()

I got some help from this page (it's in German)

and the output is:

The German word is: ?süß

One small problem remains: the '?'. I thought that the u' prefix of the unicode string was being replaced by ? after encoding. Inspecting sentence gives:

>>> sentence
u'The German word is: \ufeffs\xfc\xdf '

and encoded sentence gives:

>>> sentence.encode(stdout_encoding)
'The German word is: ?s\xfc\xdf '

So it was not what I thought: the ? takes the place of the leading \ufeff character, not of the u' prefix.

A simple way to get rid of the question mark is to use the replace function:

sentence = "The German word is: %s" %(ger,)
to_print = sentence.encode(stdout_encoding)
to_print = to_print.replace('?','')

>>> print(to_print)
The German word is: süß
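Replacing every '?' also removes question marks that genuinely belong to the text. A minimal sketch of an alternative, written for Python 3 and assuming the stray character is the leading \ufeff shown in the repr above: strip it from the Unicode string before formatting or encoding. The helper name strip_bom is hypothetical and not part of the original answer.

```python
def strip_bom(text):
    """Remove a leading U+FEFF (byte order mark) if present."""
    return text.lstrip("\ufeff")  # hypothetical helper, for illustration only

ger = "\ufeffsüß"  # what f.read() returned in the answer above
sentence = "The German word is: %s" % (strip_bom(ger),)
print(sentence)

# A genuine question mark in the text is left alone.
question = strip_bom("\ufeffWie süß?")
print(question)
```

Because the unwanted character is removed by name rather than by its '?' stand-in, the result does not depend on the console encoding.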

Thank you SO :)

Solution 2

In Python 2, when you open and read a file with the built-in open, you get 8-bit strings, not Unicode. To get Unicode strings instead, use codecs.open to open the file:

f=codecs.open(filename, 'r', 'utf-8')

Hopefully, though, you've moved on to Python 3, where the encoding parameter is part of the regular open call. There, unless you open the file with the 'b' flag for binary, you always get Unicode strings rather than 8-bit byte strings, and a default encoding is used if you don't specify one.

f=open(filename, 'r', encoding='utf-8')

Of course, depending on how the file was written, you may need to use 'iso-8859-15' instead.
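If it is not known in advance how the file was written, one possible approach is to try the strict encoding first and fall back to the permissive one. The following is only a sketch for Python 3; the helper read_text and the temporary file names are assumptions made for illustration.

```python
import os
import tempfile

def read_text(path, encodings=("utf-8", "iso-8859-15")):
    """Return the file's text, trying each encoding in order (hypothetical helper)."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the encodings matched: %r" % (encodings,))

# Write the same word in two encodings to throwaway files and read both back.
tmpdir = tempfile.mkdtemp()
utf8_path = os.path.join(tmpdir, "utf8.txt")
latin_path = os.path.join(tmpdir, "latin9.txt")
with open(utf8_path, "w", encoding="utf-8") as f:
    f.write("süß")
with open(latin_path, "w", encoding="iso-8859-15") as f:
    f.write("süß")

from_utf8 = read_text(utf8_path)
from_latin = read_text(latin_path)
```

UTF-8 is tried first because it rejects byte sequences that are not valid UTF-8, so a failure there is a usable signal. The iso-8859-15 codec only makes sense as the last resort, since it does not reject the input.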

Edit: one big difference between your test code and the commented out code is that reading from the file produces a list, while the test is a single string. Perhaps your problem isn't related to Unicode at all. Try making this substitution in your test code and see if it produces the same error:

text = [u'süß']

Unfortunately I don't have enough experience with SQL in Python to help you further.

Also when you print a list instead of a single string, the Unicode characters will be replaced with their equivalent escape sequences. To see what the strings really look like, print them one at a time. If you're curious it's the difference between __str__ and __repr__.
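A small Python 3 sketch of that difference. In Python 3 the repr of a string keeps printable non-ASCII characters, so ascii() is used here to reproduce the escaped form that Python 2 showed inside a list; the sample words are assumptions based on the question.

```python
words = ["süß", "\ufeffsüß"]

# ascii() mimics what Python 2 displayed for a unicode string inside a list.
escaped = ascii(words[0])
print(escaped)

# repr() is what print() uses for the items of a list; it still escapes
# non-printable characters such as U+FEFF.
bom_repr = repr(words[1])
print(bom_repr)

# Printing the items one at a time uses str(), i.e. the real characters.
for w in words:
    print(w)
```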

Edit 2: The character u'\ufeff' is known as a Byte Order Mark or BOM and is inserted by some editors to indicate that the file is truly UTF-8. You should get rid of it before you use the string. There should only be one at the very beginning of the file. See e.g. Reading Unicode file data with BOM chars in Python
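One way to drop the BOM at read time is Python's 'utf-8-sig' codec, which skips a leading BOM if there is one. The sketch below is Python 3 and writes its own sample file, so the path and file contents are assumptions; in Python 2 the same codec name should work with codecs.open.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "suess_sweet.txt")

# Simulate an editor that saves UTF-8 with a BOM.
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "süß, sweet".encode("utf-8"))

with open(path, "r", encoding="utf-8") as f:
    with_bom = f.read()      # the BOM survives as the first character

with open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()   # the BOM is consumed by the codec

ger, eng = [part.strip() for part in without_bom.split(",")]
```

Files without a BOM are read unchanged by 'utf-8-sig', so it is safe to use when only some of the input files come from an editor that adds one.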


Amin

I am a physical chemist, interested in scientific and general programming.

Updated on June 04, 2022

Comments

Amin almost 2 years
I have written my program to read words from a text file and enter them into a sqlite database and also treat them as strings. But I need to enter some words containing German umlauts: ä, ö, ü, ß.

Here is a prepared piece of code:

I tried both with # -*- coding: iso-8859-15 -*- and # -*- coding: utf-8 -*-. No difference(!)
```
    # -*- coding: iso-8859-15 -*-
    import sqlite3
    
    dbname = 'sampledb.db'
    filename ='text.txt'


    con = sqlite3.connect(dbname)
    cur = con.cursor()
    cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,name)''')    

    #f=open(filename)
    #text = f.readlines()
    #f.close()

    text = u'süß'

    print (text)
    cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,))       

    con.commit()

    sentence = "The name is: %s" %(text,)

    print (sentence)
    con.close()
```
The above code runs well. But I need to read 'text' from a file containing the word 'süß'. So when I uncomment the 3 lines (f=open(filename) ...) and comment out text = u'süß', it raises the error
```
    sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type.
```
I tried the codecs module to read the file as utf-8 and as iso-8859-15, but I could not decode the contents to the string 'süß', which I need to complete my sentence at the end of the code.

Once I tried decoding to utf-8 before inserting into the database. It worked, but I could not use it as string.

Is there a way I can import süß from a file and use it both for inserting to sqlite and using as string?

more detail:

Here I add more details for clarification. I have used codecs.open before. The text file containing the word süß is saved as utf-8. Using f=codecs.open(filename, 'r', 'utf-8') and text=f.read(), I read the file as unicode u'\ufeffs\xfc\xdf'. Inserting this unicode in sqlite3 is smoothly done: cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,)).

The problem is here: sentence = "The name is: %s" %(text,) gives u'The name is: \ufeffs\xfc\xdf'. I also need print(text) to output süß, but print(text) raises this error: UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>.

Thank you.
- Mark Ransom about 10 years
  
  The coding parameter should have made a big difference in your text literal.
- jfs about 10 years
  
  to clarify: the coding declaration at the top of the module affects text = u'süß' specified in the source code. It has no effect on the text read from the file. You could use codecs.open() for the latter.
- alexis about 10 years
  
  readlines returns a list. Use f.read().strip() to get the text of the file as a string. Then you can start worrying about encodings.
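The list-versus-string point can be checked without any file. This Python 3 sketch uses an in-memory database and sample values that stand in for the results of readlines() and read().strip(); the exact exception class and message for the failed binding vary between Python versions, so it catches the common base class sqlite3.Error.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table table1 (id INTEGER PRIMARY KEY, name)")

lines = ["süß\n"]        # what f.readlines() would return
text = "süß\n".strip()   # what f.read().strip() would return

try:
    cur.execute("insert into table1 (id, name) VALUES (NULL, ?)", (lines,))
    list_error = None
except sqlite3.Error as exc:  # a list is not a supported parameter type
    list_error = exc

cur.execute("insert into table1 (id, name) VALUES (NULL, ?)", (text,))
con.commit()
stored = cur.execute("select name from table1").fetchone()[0]
con.close()
```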
Amin about 10 years

I added more details to the question.
Mark Ransom about 10 years

@Amin, next time you tell somebody you've added details to the question, do it after your edit. I completely missed it before I did my own edit.
GregarityNow almost 4 years

Using 'iso-8859-15' instead of utf-8 worked for me, thanks for that tip!
Mark Ransom almost 4 years

@Uzebeckatrente because of the way iso-8859 is structured, it won't generate an error on any input. But that doesn't mean you're getting the proper characters, you should double-check for that.
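A short Python 3 sketch of that caveat, using the word from the question as assumed sample data: decoding UTF-8 bytes as iso-8859-15 raises no error but does not give back the original text.

```python
raw = "süß".encode("utf-8")        # bytes as stored in a UTF-8 file

wrong = raw.decode("iso-8859-15")  # succeeds, but yields the wrong characters
right = raw.decode("utf-8")

print(ascii(wrong))
print(right == "süß", wrong == "süß")
```

Each of the two-byte UTF-8 sequences for ü and ß turns into two separate characters, so comparing the decoded text against a known word is a quick sanity check.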
GregarityNow almost 4 years

I did some random checks, looked ziemlich schön :)