"surrogateescape" cannot escape certain characters


Solution 1

Why might the surrogateescape Unicode Error Handler be returning a character that is not ASCII?

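Because that is exactly what it is designed to do: each undecodable byte is mapped to a lone surrogate code point. That way you can use the same error handler when encoding, and it will know how to turn those code points back into the original bytes.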

>>> b"'Zo\xc3\xab\\'s'".decode('ascii', errors='surrogateescape')
"'Zo\udcc3\udcab\\'s'"
>>> "'Zo\udcc3\udcab\\'s'".encode('ascii', errors='surrogateescape')
b"'Zo\xc3\xab\\'s'"
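
Applied to the files from the question, the same idea might look like the sketch below. It assumes the output should carry the undecodable bytes through unchanged; the temporary directory and the sample bytes are only there to keep the example self-contained.

```python
import os
import tempfile

# Sample input: the problematic line from the question, as raw bytes.
raw = b"'Zo\xc3\xab\\'s Coffee House'\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "someFile.txt")
    dst = os.path.join(tmp, "anotherFile.txt")
    with open(src, "wb") as f:
        f.write(raw)

    # Use the same codec and error handler on both sides, so the escaped
    # bytes are translated back instead of hitting the strict default.
    with open(src, "r", encoding="ascii", errors="surrogateescape", newline="") as fin, \
         open(dst, "w", encoding="ascii", errors="surrogateescape", newline="") as fout:
        for line in fin:
            fout.write(line)

    with open(dst, "rb") as f:
        copied = f.read()

print(copied == raw)
```

Note that the output file then contains the same non-ASCII bytes as the input; it is a byte-for-byte copy, not an ASCII-only file.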

Solution 2

A lone surrogate should NOT be encoded in UTF-8 -- which is precisely why it was used for the internal representation of invalid input.

In real life, it is pretty common to get data that is invalid for the encoding it is "supposed" to be in. For example, this question was inspired by text that appears to be in Latin-1, when ASCII or UTF-8 was expected. I put "supposed" in quotes, because it is pretty common for the "encoding information" to just be a guess, perhaps unrelated to the actual file.

By default, XML processing (and most Unicode processing) is strict -- the entire process gives up even though it could process hundreds of other lines just fine.

Decoding with errors='replace' and writing the result out as ASCII would turn that line into something like "Zo??'s Coffee House" (one mark per undecodable byte), which is an improvement. (Well, unless you tried to replace invalid characters with something else that isn't valid either -- and the official Unicode replacement character, U+FFFD, isn't valid in ASCII, which is why a '?' is typically used for encoding.)
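
A small sketch of that behaviour, using the bytes of the line from the question (the variable names are illustrative only):

```python
raw = b"'Zo\xc3\xab\\'s Coffee House'"

# Decoding: each undecodable byte becomes U+FFFD, the replacement character.
decoded = raw.decode("ascii", errors="replace")
print(decoded.count("\ufffd"))

# Encoding: U+FFFD is not ASCII either, so 'replace' substitutes '?'.
ascii_only = decoded.encode("ascii", errors="replace")
print(ascii_only)
```

The original bytes cannot be recovered from this output, which is the trade-off compared with surrogateescape.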

surrogateescape is used when the programmer decides "You know what? I don't care if the data is garbage. Maybe I have the wrong codec ... so I'll just pass the unknown bytes along as-is." Python does have to store (but avoid interpreting) those bytes internally until they are passed along.

Using unpaired surrogates allows Python to store the invalid bytes without extra escaping. Precisely because unpaired surrogates are invalid, they will never appear in valid input. (And if they occur anyhow, they'll be interpreted as a pair of unrecognized bytes, both of which get preserved for output.)

The original poster's problem is that he was trying to write out that internal representation directly, instead of reversing the mapping first. The internal representation contained code points that (intentionally) aren't valid, so the default (strict) error handler of the output file refused to encode them.
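
As a rough illustration, the snippet below reproduces the failure on a single string and then shows two ways out. The first reverses the mapping, as described above. The second, using 'replace', is an assumption on my part about what fits the stated requirement that the output be ASCII and that losing characters is acceptable.

```python
line = b"'Zo\xc3\xab\\'s Coffee House'".decode("ascii", errors="surrogateescape")

# Strict UTF-8 encoding refuses the lone surrogates, as in the traceback.
try:
    line.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Reversing the mapping restores the original bytes exactly.
restored = line.encode("ascii", errors="surrogateescape")

# If the output must be pure ASCII and losing characters is acceptable,
# 'replace' substitutes '?' for each escaped byte.
lossy = line.encode("ascii", errors="replace")

print(strict_failed, restored, lossy)
```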

Author by

dotancohen

Updated on July 18, 2022

Comments

  • dotancohen
    dotancohen over 1 year

    Regarding reading and writing text files in Python, one of the main Python contributors mentions this regarding the surrogateescape Unicode Error Handler:

    [surrogateescape] handles decoding errors by squirreling the data away in a little used part of the Unicode code point space. When encoding, it translates those hidden away values back into the exact original byte sequence that failed to decode correctly.

    However, while opening a file and then attempting to write the output to another file:

    input_file = open('someFile.txt', 'r', encoding="ascii", errors="surrogateescape")
    output_file = open('anotherFile.txt', 'w')
    
    for line in input_file:
        output_file.write(line)
    

    Results in:

      File "./break-50000.py", line 37, in main
        output_file.write(line)
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 3: surrogates not allowed
    

    Note that the input file is not ASCII. However, it traverses hundreds of lines that contain non-ASCII characters just fine before it throws the exception on one particular line. The output file must be ASCII, and losing some characters is just fine.

    This is the line that is throwing the error when decoded as UTF-8:

    'Zoë\'s Coffee House'

    This is the hex encoding:

    $ cat z.txt | hd
    00000000  27 5a 6f c3 ab 5c 27 73  20 43 6f 66 66 65 65 20  |'Zo..\'s Coffee |
    00000010  48 6f 75 73 65 27 0a                              |House'.|
    00000017
    

    Why might the surrogateescape Unicode Error Handler be returning a character that is not ASCII? This is with Python 3.2.3 on Kubuntu Linux 12.10.

  • dotancohen
    dotancohen almost 10 years
    Thank you Ignacio, that did it!
  • dotancohen
    dotancohen almost 10 years
    Thank you brighty. What is a DCC3? I've tried to search for what it might be, but I see nothing that seems relevant. I don't understand the rest of the answer either, but hopefully I'll be able to make sense of it after I learn what DCC3 is. Thanks.
  • brighty
    brighty almost 10 years
    OK, I'll explain. UTF-16 is a stream of words. A word has 16 bits, hence the name UTF-16. Words in the range U+DC00 to U+DFFF are low surrogates, and words in the range U+D800 to U+DBFF are high surrogates. That's why 0xDCC3 is a low surrogate.
  • brighty
    brighty almost 10 years
    Given those ranges, the first 6 bits of a surrogate identify it as a surrogate. The remaining 10 bits carry half of the encoded code point value of the character; the other half normally sits in the high surrogate, which is why a high and a low surrogate usually appear as a pair. The 10 bits of the high surrogate and the 10 bits of the low surrogate are combined into 20 bits, the constant 0x10000 is added, and the result is the code point, which has at most 21 bits (up to 0x10FFFF).
  • brighty
    brighty almost 10 years
    That's why a single high or low surrogate is just a container: a pair of them encodes a code point higher than 0xFFFF, forming a DWORD with the encoded code point inside. In UTF-32 you have 32 bits, so surrogates are not needed and therefore not allowed. In UTF-8 you encode code points higher than 127 as 2-, 3- or 4-byte sequences, so surrogates are disallowed in UTF-8 as well.
  • brighty
    brighty almost 10 years
    Conclusion: in a word stream, which is what UTF-16 is, you get into trouble when you have to place a character with a code point higher than 0xFFFF, because a word's range stops at 0xFFFF. So the solution is to use 2 words. But how can we tell that two words form a pair encoding a code point higher than 0xFFFF? The answer is that the Unicode designers reserved blocks of word values, the so-called high and low surrogates, which must appear as a pair within a UTF-16 stream. Hope this helps.
  • brighty
    brighty almost 10 years
    The low surrogate DCC3 is 1101110011000011 in binary: 110111 identifies it as a low surrogate, and 0011000011 are the 10 payload bits that form part of the character's code point. To get the complete code point you need the high surrogate's 10 payload bits as well. Remember that surrogates appear as a pair; with just your low surrogate DCC3 you cannot decode the original code point. Think of it this way: in UTF-16 a word is a single child that knows the code point number up to 0xFFFF; for everything higher than 0xFFFF you'll see twins, 2 kids that together know the code point number.
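
The surrogate arithmetic described in these comments can be sketched in Python as follows. The helper `combine_surrogates` is a hypothetical name introduced for illustration, and U+1F600 is just a sample code point above 0xFFFF. The last lines relate this back to the question: there, U+DCC3 is not half of a real pair, since surrogateescape stores an undecodable byte b as the lone surrogate 0xDC00 + b.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # 10 payload bits from each half, plus the 0x10000 offset.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+1F600 is encoded in UTF-16 as the pair D83D DE00.
pair_value = combine_surrogates(0xD83D, 0xDE00)
print(hex(pair_value))

# Splitting DCC3 into its 6 marker bits and 10 payload bits.
low = 0xDCC3
marker_bits = format(low >> 10, "06b")
payload_bits = format(low & 0x3FF, "010b")
print(marker_bits, payload_bits)

# Under surrogateescape the payload is simply the original byte value.
escaped_byte = "\udcc3".encode("ascii", errors="surrogateescape")
print(hex(low - 0xDC00), escaped_byte)
```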