Reading UTF-8 XML and writing it to a file with Python

You'll need to remove the call to encode() - that is, replace nodeValue.encode("utf-8") with nodeValue - and then change the call to open() to

with open("uiStrings-fi.py", "w", "utf-8") as f:

This uses a "Unicode-aware" version of open() which you will need to import from the codecs module, so also add

from codecs import open

to the top of the file.
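Putting those changes together, here is a minimal sketch of how the corrected script might look. A few things in it are assumptions made for illustration and are not part of the original script: it uses io.open (which takes an encoding keyword and behaves the same on Python 2.6+ and Python 3) in place of codecs.open, it parses an inline sample with parseString instead of reading ResourceFile.fi-FI.resx, it writes into a temporary directory, and the names get_strings and write_to_file are hypothetical.

```python
import io
import os
import tempfile
from xml.dom.minidom import parseString

# Inline stand-in for the .resx file (assumption: same shape as the sample
# in the question). \u00f6 and \u00e4 are the characters o-umlaut and a-umlaut.
SAMPLE_RESX = u"""<?xml version="1.0" encoding="utf-8"?>
<root>
  <data name="lorem" xml:space="preserve">
    <value>ipsum \u00f6\u00e4</value>
  </data>
</root>
""".encode("utf-8")


def get_strings(xml_bytes):
    """Return (name, value) pairs as Unicode text; nothing is encoded here."""
    dom = parseString(xml_bytes)
    pairs = []
    for node in dom.getElementsByTagName("data"):
        name = node.getAttribute("name")
        # No .encode("utf-8") here: keep the value as Unicode text.
        value = node.getElementsByTagName("value")[0].firstChild.nodeValue
        pairs.append((name, value))
    return pairs


def write_to_file(pairs, path):
    # The file object encodes the text to UTF-8 as it is written.
    with io.open(path, "w", encoding="utf-8") as f:
        for name, value in pairs:
            f.write(u'%s="%s"\n' % (name, value))


pairs = get_strings(SAMPLE_RESX)
out_path = os.path.join(tempfile.mkdtemp(), "uiStrings-fi.py")
write_to_file(pairs, out_path)
```

The only place where text becomes bytes is inside the file object, so the line being built never mixes the two string types.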

The issue is that calling nodeValue.encode("utf-8") converts a Unicode string (Python's internal representation, which can store any Unicode character) into a regular string, which is just a sequence of bytes (values 0-255). Later on, when you construct the line to write to the output file, names[i] is still a Unicode string but values[i] is a regular string.

Python tries to convert the regular string to Unicode, which is the more general type, but because you don't specify an explicit conversion, it uses the default codec, ASCII, and ASCII can't decode bytes with values greater than 127. Unfortunately, several of those do occur in values[i], because UTF-8 encodes every non-ASCII character (such as ö and ä) using bytes in that upper range. So Python raises the UnicodeDecodeError from your traceback.

The solution, as I said above, is to defer the conversion from Unicode to bytes until the last possible moment, and you do that by using the Unicode-aware version of open() (which will handle the encoding for you).
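To see the failure in isolation, here is a small sketch, assuming a sample value that contains ö and ä. Python 2 performs the ASCII decode implicitly while concatenating; the sketch calls decode("ascii") explicitly so that it behaves the same way on Python 3, where mixing the two types raises a TypeError instead.

```python
# Sample value (assumption): "ipsum " followed by o-umlaut and a-umlaut.
value = u"ipsum \u00f6\u00e4"

# This is what nodeValue.encode("utf-8") produces: a byte string in which
# each of the two non-ASCII characters takes two bytes.
encoded = value.encode("utf-8")

# Python 2 does the equivalent of this decode behind the scenes when a
# Unicode string and a byte string are concatenated.
try:
    encoded.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the codec that was actually used round-trips cleanly.
round_trip = encoded.decode("utf-8")
```

Here error holds a UnicodeDecodeError for the first byte above 127 (0xc3, the lead byte of ö), which is the same complaint as in the traceback.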

Now that I think about it, an alternate solution would be to keep your original calls to encode() and open() and instead replace names[i] with names[i].encode("utf-8"). That way, you convert names[i] into a regular string as well, and Python has no reason to try to convert values[i] back to Unicode. Still, one could argue that it's good practice to keep your strings as Unicode objects until you write them out to the file. If nothing else, Unicode strings are the default string type in Python 3.
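A minimal sketch of that alternative, under some assumptions: the names and values lists are filled in by hand to mimic what the question's script collects, the output goes to a temporary directory, and the file is opened in "wb" mode rather than "w" so that the same code also runs on Python 3 (on Python 2, "w" would work too).

```python
import os
import tempfile

# Hand-built stand-ins for the lists in the question (assumption):
# names are Unicode strings, values were already encoded to UTF-8 bytes.
names = [u"lorem"]
values = [u"ipsum \u00f6\u00e4".encode("utf-8")]

out_path = os.path.join(tempfile.mkdtemp(), "uiStrings-fi.py")

# Binary mode: the file receives the bytes exactly as they are built.
with open(out_path, "wb") as f:
    for name, value in zip(names, values):
        # Encode the name too, so every piece of the line is a byte string
        # and no implicit ASCII decode is ever attempted.
        line = name.encode("utf-8") + b'="' + value + b'"\n'
        f.write(line)
```

The trade-off is that every piece of text has to be encoded by hand before it is combined, which is easy to forget as the script grows.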

Author by

Harri

Updated on June 28, 2022

Comments

  • Harri, almost 2 years ago

    I'm trying to parse a UTF-8 XML file and save some parts of it to another file. The problem is that this is my first Python script ever, and I'm totally confused by the character encoding problems I'm running into.

    My script fails immediately when it tries to write a non-ASCII character to a file, but it can print it to the command prompt (at least to some extent).

    Here's the XML (the parts that matter, at least; it's a *.resx file which contains UI strings):

    <?xml version="1.0" encoding="utf-8"?>
    <root>
         <resheader name="foo">
              <value>bar</value>
         </resheader>
         <data name="lorem" xml:space="preserve">
              <value>ipsum öä</value>
         </data>
    </root>
    

    And here's my Python script:

    from xml.dom.minidom import parse
    
    names = []
    values = []
    
    def getStrings(path):
        dom = parse(path)
        data = dom.getElementsByTagName("data")
    
        for i in range(len(data)):
            name = data[i].getAttribute("name")
            names.append(name)
            value = data[i].getElementsByTagName("value")
            values.append(value[0].firstChild.nodeValue.encode("utf-8"))
    
    def writeToFile():
        with open("uiStrings-fi.py", "w") as f:
            for i in range(len(names)):
                line = names[i] + '="'+ values[i] + '"' #varName='varValue'
                f.write(line)
                f.write("\n")
    
    getStrings("ResourceFile.fi-FI.resx")
    writeToFile()
    

    And here's the traceback:

    Traceback (most recent call last):
      File "GenerateLanguageFiles.py", line 24, in <module>
        writeToFile()
      File "GenerateLanguageFiles.py", line 19, in writeToFile
        line = names[i] + '="'+ values[i] + '"' #varName='varValue'
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
    

    How should I fix my script so that it reads and writes UTF-8 characters properly? The files I'm trying to generate will be used in test automation with Robot Framework.