How to convert \xXY encoded characters to UTF-8 in Python?


Solution 1

Your file is already a UTF-8 encoded file.

# saved encoding-sample to /tmp/encoding-sample
import sys
import codecs
import unicodedata as ud

fp = codecs.open("/tmp/encoding-sample", "r", "utf8")
data = fp.read()

chars = sorted(set(data))
for char in chars:
    try:
        charname = ud.name(char)
    except ValueError:
        charname = "<unknown>"
    sys.stdout.write("char U%04x %s\n" % (ord(char), charname))

And manually filling in the unknown names:
char U000a LINE FEED
char U001e INFORMATION SEPARATOR TWO
char U001f INFORMATION SEPARATOR ONE
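Those two control characters are consistent with MARC/ISO 2709 data, where U+001E terminates each field and U+001F introduces each subfield. A minimal sketch with a made-up record fragment (the sample string is illustrative, not from the actual file):

```python
# Hypothetical MARC-style fragment: U+001E ends each field,
# U+001F precedes each subfield.
sample = "\x1faTitle\x1fbSubtitle\x1e\x1faAuthor\x1e"
for field in sample.split("\x1e"):
    if not field:
        continue  # skip the empty piece after the trailing separator
    subfields = [s for s in field.split("\x1f") if s]
    print(subfields)
```

So the separators are structure, not mojibake, which is why an encoding detector reports "UTF-8 interspersed with non-text characters".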

Solution 2

.encode is for converting a Unicode string (unicode in 2.x, str in 3.x) to a byte string (str in 2.x, bytes in 3.x).

In 2.x, it's legal to call .encode on a str object. Python implicitly decodes the string to Unicode first: s.encode(e) works as if you had written s.decode(sys.getdefaultencoding()).encode(e).

The problem is that the default encoding is "ascii", and your string contains non-ASCII characters. You can solve this by explicitly specifying the correct encoding.

>>> '\xAF \xBE'.decode('ISO-8859-1').encode('UTF-8')
'\xc2\xaf \xc2\xbe'
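In Python 3 there is no implicit decode step: bytes and str are distinct types, so the same round-trip must be written explicitly on a bytes literal:

```python
# Python 3: decode the raw bytes with their actual encoding,
# then re-encode the resulting str as UTF-8.
raw = b"\xaf \xbe"
utf8 = raw.decode("iso-8859-1").encode("utf-8")
print(utf8)  # b'\xc2\xaf \xc2\xbe'
```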

Solution 3

It's not ASCII (ASCII codes only go up to 127; \xaf is 175). You first need to find out the correct encoding, decode that, and then re-encode in UTF-8.

Could you provide an actual string sample? Then we can probably guess the current encoding.

Author: Jindřich Mynarz, (Linked) data engineer supporting pharmaceutical research

Updated on June 23, 2022

Comments

  • Jindřich Mynarz
    Jindřich Mynarz almost 2 years

    I have a text which contains characters such as "\xaf", "\xbe", which, as I understand it from this question, are ASCII encoded characters.

    I want to convert them in Python to their UTF-8 equivalents. The usual string.encode("utf-8") throws UnicodeDecodeError. Is there some better way, e.g., with the codecs standard library?

    Sample 200 characters here.

  • Tim Pietzcker
    Tim Pietzcker over 13 years
    That sample doesn't look like an encoded text to me, more like a proprietary format.
  • Jindřich Mynarz
    Jindřich Mynarz over 13 years
    It should be in the MARC format (loc.gov/marc). When I tried to detect its encoding with enca, I got a response saying that it's mostly UTF-8 interspersed with non-text characters.
  • Jindřich Mynarz
    Jindřich Mynarz over 13 years
    That's fine, but the rest of the text is encoded as UTF-8 (at least that's what enca reported). So this procedure cannot be applied to the whole text.
  • Tim Pietzcker
    Tim Pietzcker over 13 years
    So it definitely is not a text format/encoding. This is not a problem you can solve with a correct encoding; you need a library that can read this "database". Something like this perhaps.
  • Jindřich Mynarz
    Jindřich Mynarz over 13 years
    Yes, I'm already using the pymarc library to parse the file. The problem is that it can't parse it correctly because of these characters (\xaf...). So I'm trying to repair the file before passing it to the parser.
  • Jindřich Mynarz
    Jindřich Mynarz about 13 years
    So the \xXY characters are in ISO-8859-1?
  • Jindřich Mynarz
    Jindřich Mynarz about 13 years
    Thanks, you're right: the short sample I've provided is UTF-8. However (unfortunately), in the whole file there are parts encoded in various other encodings (mostly windows-1250). I have solved this by trying "string".decode() with the most common encodings and, if everything failed, guessing the encoding with the chardet library.
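The approach described in the last comment, trying the most common encodings in order and guessing with chardet only as a last resort, could be sketched like this (the function name and encoding list are illustrative; chardet is used only if it is installed):

```python
def guess_decode(raw, encodings=("utf-8", "windows-1250")):
    """Try likely encodings in order; fall back to chardet, then latin-1."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            pass
    try:
        import chardet  # optional third-party fallback, if installed
        enc = chardet.detect(raw)["encoding"]
        if enc:
            return raw.decode(enc), enc
    except ImportError:
        pass
    # latin-1 maps every byte to a code point, so this never fails
    # (though the result may be wrong for other encodings)
    return raw.decode("latin-1"), "latin-1"

text, enc = guess_decode(b"\xaf \xbe")
print(enc)
```

Note that order matters: a decoder like latin-1 accepts every byte sequence, so it must come last or it will mask better guesses.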