How to convert \xXY encoded characters to UTF-8 in Python?


Solution 1

Your file is already a UTF-8 encoded file.

# saved encoding-sample to /tmp/encoding-sample
import sys
import codecs
import unicodedata as ud

fp = codecs.open("/tmp/encoding-sample", "r", "utf8")
data = fp.read()

chars = sorted(set(data))
for char in chars:
    try:
        charname = ud.name(char)
    except ValueError:
        charname = "<unknown>"
    sys.stdout.write("char U%04x %s\n" % (ord(char), charname))

And manually filling in the unknown names:
char U000a LINE FEED
char U001e INFORMATION SEPARATOR TWO
char U001f INFORMATION SEPARATOR ONE
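Those two control characters are consistent with MARC/ISO 2709 data, where U+001E terminates each field and U+001F introduces each subfield. A minimal sketch with a made-up record fragment (the sample string is illustrative, not from the actual file):

```python
# Hypothetical MARC-style fragment: U+001E ends each field,
# U+001F precedes each subfield.
sample = "\x1faTitle\x1fbSubtitle\x1e\x1faAuthor\x1e"
for field in sample.split("\x1e"):
    if not field:
        continue  # skip the empty piece after the trailing separator
    subfields = [s for s in field.split("\x1f") if s]
    print(subfields)
```

So the separators are structure, not mojibake, which is why an encoding detector reports "UTF-8 interspersed with non-text characters".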

Solution 2

.encode is for converting a Unicode string (unicode in 2.x, str in 3.x) to a byte string (str in 2.x, bytes in 3.x).

In 2.x, it's legal to call .encode on a str object. Python implicitly decodes the string to Unicode first: s.encode(e) works as if you had written s.decode(sys.getdefaultencoding()).encode(e).

The problem is that the default encoding is "ascii", and your string contains non-ASCII characters. You can solve this by explicitly specifying the correct encoding.

>>> '\xAF \xBE'.decode('ISO-8859-1').encode('UTF-8')
'\xc2\xaf \xc2\xbe'
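In Python 3 there is no implicit decode step: bytes and str are distinct types, so the same round-trip must be written explicitly on a bytes literal:

```python
# Python 3: decode the raw bytes with their actual encoding,
# then re-encode the resulting str as UTF-8.
raw = b"\xaf \xbe"
utf8 = raw.decode("iso-8859-1").encode("utf-8")
print(utf8)  # b'\xc2\xaf \xc2\xbe'
```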

Solution 3

It's not ASCII (ASCII codes only go up to 127; \xaf is 175). You first need to find out the correct encoding, decode that, and then re-encode in UTF-8.

Could you provide an actual string sample? Then we can probably guess the current encoding.

Author: Jindřich Mynarz, (Linked) data engineer supporting pharmaceutical research

Updated on June 23, 2022

Comments

  • Jindřich Mynarz
    Jindřich Mynarz almost 2 years

    I have a text which contains characters such as "\xaf", "\xbe", which, as I understand it from this question, are ASCII encoded characters.

    I want to convert them in Python to their UTF-8 equivalents. The usual string.encode("utf-8") throws UnicodeDecodeError. Is there some better way, e.g., with the codecs standard library?

    Sample 200 characters here.

  • Tim Pietzcker
    Tim Pietzcker over 13 years
    That sample doesn't look like an encoded text to me, more like a proprietary format.
  • Jindřich Mynarz
    Jindřich Mynarz over 13 years
    It should be in the MARC format (loc.gov/marc). When I tried to detect its encoding with enca, I got a response saying that it's mostly UTF-8 interspersed with non-text characters.
  • Jindřich Mynarz
    Jindřich Mynarz over 13 years
    That's fine, but the rest of the text is encoded as UTF-8 (at least that's what enca reported). So this procedure cannot be applied to the whole text.
  • Tim Pietzcker
    Tim Pietzcker over 13 years
    So it definitely is not a text format/encoding. This is not a problem you can solve with a correct encoding; you need a library that can read this "database". Something like this perhaps.
  • Jindřich Mynarz
    Jindřich Mynarz over 13 years
    Yes, I'm already using the pymarc library to parse the file. The problem is that it can't parse it correctly because of these characters (\xaf...). So I'm trying to repair the file before passing it to the parser.
  • Jindřich Mynarz
    Jindřich Mynarz about 13 years
    So the \xXY characters are in ISO-8859-1?
  • Jindřich Mynarz
    Jindřich Mynarz about 13 years
    Thanks, you're right: the short sample I've provided is UTF-8. However (unfortunately), in the whole file there are parts encoded in various other encodings (mostly windows-1250). I have solved this by trying "string".decode() with the most common encodings and, if everything failed, guessing the encoding with the chardet library.
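The approach described in the last comment, trying the most common encodings in order and guessing with chardet only as a last resort, could be sketched like this (the function name and encoding list are illustrative; chardet is used only if it is installed):

```python
def guess_decode(raw, encodings=("utf-8", "windows-1250")):
    """Try likely encodings in order; fall back to chardet, then latin-1."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            pass
    try:
        import chardet  # optional third-party fallback, if installed
        enc = chardet.detect(raw)["encoding"]
        if enc:
            return raw.decode(enc), enc
    except ImportError:
        pass
    # latin-1 maps every byte to a code point, so this never fails
    # (though the result may be wrong for other encodings)
    return raw.decode("latin-1"), "latin-1"

text, enc = guess_decode(b"\xaf \xbe")
print(enc)
```

Note that order matters: a decoder like latin-1 accepts every byte sequence, so it must come last or it will mask better guesses.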