How to convert \xXY encoded characters to UTF-8 in Python?
Solution 1
Your file is already a UTF-8 encoded file.
# saved encoding-sample to /tmp/encoding-sample
import sys
import codecs
import unicodedata as ud

fp = codecs.open("/tmp/encoding-sample", "r", "utf8")
data = fp.read()

chars = sorted(set(data))
for char in chars:
    try:
        charname = ud.name(char)
    except ValueError:
        charname = "<unknown>"
    sys.stdout.write("char U%04x %s\n" % (ord(char), charname))
And manually filling in the unknown names:
char U000a LINE FEED
char U001e INFORMATION SEPARATOR TWO
char U001f INFORMATION SEPARATOR ONE
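The script above is Python 2-flavoured; on Python 3 the built-in open() decodes directly, so codecs.open is unnecessary. A minimal sketch of the same inventory (the helper name char_inventory and the demo string are illustrative, not from the original):

```python
import sys
import unicodedata as ud

def char_inventory(text):
    """Return (codepoint, name) pairs for each distinct character in text."""
    rows = []
    for char in sorted(set(text)):
        try:
            name = ud.name(char)
        except ValueError:
            # control characters such as U+001E have no name in unicodedata
            name = "<unknown>"
        rows.append((ord(char), name))
    return rows

# On Python 3, reading the file would simply be:
# with open("/tmp/encoding-sample", encoding="utf-8") as fp:
#     data = fp.read()
for cp, name in char_inventory("a\n\x1e"):
    sys.stdout.write("char U%04x %s\n" % (cp, name))
```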
Solution 2
.encode is for converting a Unicode string (unicode in 2.x, str in 3.x) to a byte string (str in 2.x, bytes in 3.x).

In 2.x, it's legal to call .encode on a str object. Python implicitly decodes the string to Unicode first: s.encode(e) works as if you had written s.decode(sys.getdefaultencoding()).encode(e).
The problem is that the default encoding is "ascii", and your string contains non-ASCII characters. You can solve this by explicitly specifying the correct encoding.
>>> '\xAF \xBE'.decode('ISO-8859-1').encode('UTF-8')
'\xc2\xaf \xc2\xbe'
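On Python 3 there is no implicit decode step; the same round trip starts explicitly from a bytes literal (a sketch, assuming the input really is ISO-8859-1):

```python
# Bytes in a single-byte encoding (assumed ISO-8859-1 here) ...
raw = b'\xAF \xBE'
# ... decoded to text (U+00AF MACRON, U+00BE VULGAR FRACTION THREE QUARTERS),
# then re-encoded as UTF-8, where each of those characters takes two bytes:
utf8 = raw.decode('ISO-8859-1').encode('UTF-8')
print(utf8)  # b'\xc2\xaf \xc2\xbe'
```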
Solution 3
It's not ASCII (ASCII codes only go up to 127; \xaf is 175). You first need to find out the correct encoding, decode that, and then re-encode in UTF-8.
Could you provide an actual string sample? Then we can probably guess the current encoding.
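When the encoding is unknown, one way to "find out the correct encoding" is to try a few likely candidates in order. A sketch (the candidate list is an assumption; windows-1250 comes up in the comments below):

```python
def decode_with_fallback(raw, candidates=("utf-8", "windows-1250", "iso-8859-1")):
    """Try each candidate encoding in turn; return (text, encoding_used).

    iso-8859-1 maps every byte value, so with it last the loop
    always succeeds before reaching the final raise.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")

# Valid UTF-8 is recognized first; a lone high byte fails UTF-8
# decoding and falls through to the next candidate:
print(decode_with_fallback(b'\xc2\xaf'))  # decoded as utf-8
print(decode_with_fallback(b'\xaf'))      # falls through to windows-1250
```

Note the order matters: iso-8859-1 would "succeed" on any input, so it must come last as a catch-all.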
Jindřich Mynarz
Updated on June 23, 2022

Comments
-
Jindřich Mynarz almost 2 years
I have a text which contains characters such as "\xaf" and "\xbe", which, as I understand it from this question, are ASCII-encoded characters. I want to convert them in Python to their UTF-8 equivalents. The usual string.encode("utf-8") throws UnicodeDecodeError. Is there some better way, e.g., with the codecs standard library? Sample 200 characters here.
-
Tim Pietzcker over 13 years That sample doesn't look like encoded text to me, more like a proprietary format.
-
Jindřich Mynarz over 13 years It should be in the MARC format (loc.gov/marc). When I tried to detect its encoding with enca, I got a response saying that it's mostly UTF-8 interspersed with non-text characters.
-
Jindřich Mynarz over 13 years That's fine, but the rest of the text is encoded as UTF-8 (at least according to enca), so this procedure cannot be applied to the whole text.
-
Tim Pietzcker over 13 years So it definitely is not a text format/encoding. This is not a problem you can solve with the correct encoding; you need a library that can read this "database". Something like this perhaps.
-
Jindřich Mynarz over 13 years Yes, I'm already using the pymarc library to parse the file. The problem is that it can't parse it correctly because of these characters (\xaf...), so I'm trying to repair the file before passing it to the parser.
-
Jindřich Mynarz about 13 years So the \xXY characters are in ISO-8859-1?
-
Jindřich Mynarz about 13 years Thanks, you're right: the short sample I've provided is UTF-8. However (unfortunately), the whole file contains parts encoded in various other encodings (mostly windows-1250). I have solved this by trying to "string".decode() with the most common encodings and, if everything failed, guessing the encoding with the chardet library.