Python Arabic encoding issue
Solution 1
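It looks like you have accidentally decoded the input as Windows-1252 rather than Windows-1256. Encoding the garbled text back to cp1252 recovers the original bytes, which can then be decoded as cp1256: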
>>> "Óæí Ïæã ÈíåÞí".encode('cp1252').decode('cp1256')
'سوي دوم بيهقي'
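The same round trip can be wrapped in a small helper. This is only a sketch for Python 3: the function name fix_mojibake and its parameter names are made up for illustration, and it assumes the text really was cp1256 data mis-decoded as cp1252.

```python
def fix_mojibake(garbled, wrong="cp1252", right="cp1256"):
    """Undo a wrong decode: get the original bytes back, then decode them properly.

    Hypothetical helper; `wrong` is the codec that was applied by mistake,
    `right` is the codec the bytes were actually written in.
    """
    return garbled.encode(wrong).decode(right)


garbled = "Óæí Ïæã ÈíåÞí"
fixed = fix_mojibake(garbled)
print(fixed)

# Once the text is a proper str, producing UTF-8 is a plain encode.
utf8_bytes = fixed.encode("utf-8")
```

If the garbled text contains characters that cp1252 cannot encode, the encode step raises UnicodeEncodeError, which is a hint that the guess about the wrong codec does not hold.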
Solution 2
I would like to add the Python 2 case to @josh-lee's answer.
If you are using Python 2, add the unicode prefix u to the string literal.
>>> u"Óæí Ïæã ÈíåÞí".encode('cp1252').decode('cp1256')
u'\u0633\u0648\u064a \u062f\u0648\u0645 \u0628\u064a\u0647\u0642\u064a'
>>> print _
سوي دوم بيهقي
Author by
Amir Mohsen
Updated on September 16, 2022
Comments
-
Amir Mohsen over 1 year
I have text in Windows-1256 encoding and want to convert it from Arabic (Windows-1256) to UTF-8.
Sample text:
Óæí Ïæã ÈíåÞí
Expected result:
سوي دوم بيهقي
I use this code to decode and encode to UTF-8:
# -*- coding: utf-8 -*-
data = "Óæí Ïæã ÈíåÞí"
print data.decode("windows-1256", "replace")
print data.encode("windows-1256")
That code returns this result:
أ“أ¦أ أڈأ¦أ£ أˆأأ¥أأ
Traceback (most recent call last):
  File "mohmal2.py", line 5, in <module>
    print data.encode("windows-1256")
  File "/usr/lib/python2.7/encodings/cp1256.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
I found a site that can convert this text:
-
MaddTheSane about 7 years
Ó isn't in Windows-1256. What's your actual question?
-
Amir Mohsen about 7 years
I have an XML response with the header
<?xml version="1.0" encoding="windows-1256"?>
and I want to parse it!
-
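For an XML response like the one described here, one possible approach is to hand the raw, undecoded bytes to the parser so it can honour the encoding declaration itself. A minimal sketch, assuming Python 3 and the standard xml.etree.ElementTree module; the title element and its content are made up for illustration, and the bytes are built locally only to stand in for the real response body.

```python
import xml.etree.ElementTree as ET

# Stand-in for the raw response body (hypothetical payload).
xml_text = '<?xml version="1.0" encoding="windows-1256"?><title>سوي دوم بيهقي</title>'
xml_bytes = xml_text.encode("cp1256")

# Pass bytes, not an already-decoded string, so the parser reads the
# encoding from the XML declaration.
root = ET.fromstring(xml_bytes)
print(root.text)
```

Decoding the response to a string by hand before parsing is where a wrong codec usually slips in, so keeping it as bytes until the parser sees it avoids that step.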
Martijn Pieters about 7 years
Perhaps your system is configured to expect CP1252 in the console, so CP1256 data is simply misdisplayed when printed. I think you need to take several steps back and first learn how to correctly assess what data you have (e.g. don't trust your console; use repr() to show debug information about your data).
-
MaddTheSane
Also, upgrade to Python 3 if you have the chance at all.
-
Martijn Pieters
You do have mojibake, but that's not the right codec. The UTF-8 encoded version of your input, mis-decoded as Windows-1256, would look like this:
'ط³ظˆظٹ ط¯ظˆظ… ط¨ظٹظ‡ظ‚ظٹ'
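The two kinds of mojibake discussed in this thread can be reproduced side by side. This is a sketch for Python 3, starting from the intended text; the variable names are made up for illustration.

```python
original = "سوي دوم بيهقي"

# UTF-8 bytes wrongly read as Windows-1256: the pattern shown in this comment.
utf8_read_as_cp1256 = original.encode("utf-8").decode("cp1256")

# Windows-1256 bytes wrongly read as Windows-1252: the pattern in the question.
cp1256_read_as_cp1252 = original.encode("cp1256").decode("cp1252")

print(utf8_read_as_cp1256)
print(cp1256_read_as_cp1252)
```

Comparing the garbled sample against candidates like these is one way to work out which pair of codecs was involved before attempting a repair.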
-
Martijn Pieters
Also, before you continue, teach yourself about the difference between bytes and Unicode text. Start with Pragmatic Unicode, perhaps also read The Absolute Minimum... and end with the Python Unicode HOWTO.
-
-
Martijn Pieters about 7 years
Nice catch! I can indeed confirm that the mojibake can be reproduced with original.encode('cp1256').decode('cp1252').
-
Amir Mohsen about 7 years
I tested it, but it does not work!
>>> "Óæí Ïæã ÈíåÞí".encode('cp1252').decode('cp1256')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
-
Martijn Pieters about 7 years
@AmirMohsen: you are using a byte string, so its contents are already encoded, and the codec used depends on your local console settings. It may well be UTF-8. Use u"Óæí Ïæã ÈíåÞí" instead.
-
Martijn Pieters about 7 years
@AmirMohsen: specifically, in Python 2 str.encode() must first decode the bytes to get a unicode object that can then be encoded. That implicit decode uses ASCII, which is why you get a UnicodeDecodeError for ASCII.
-
thethiny over 4 years
I know this is old, but I wanted to add to @AmirMohsen's issue. You can add encoding='cp1252' to your file open call and it will bypass that error. The gibberish you're getting is because you saved Arabic text as ANSI instead of UTF-8.
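Passing an explicit encoding to open() can also be used to convert a legacy file directly. The sketch below assumes Python 3 and assumes the file really holds Windows-1256 bytes, so it uses cp1256 rather than the cp1252 mentioned above; the file names are hypothetical and a temporary directory stands in for real paths.

```python
import os
import tempfile

text = "سوي دوم بيهقي"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "legacy.txt")      # hypothetical input file
    dst = os.path.join(tmp, "converted.txt")   # hypothetical output file

    # Create a sample file saved in the legacy encoding.
    with open(src, "wb") as f:
        f.write(text.encode("cp1256"))

    # Read it with the encoding it was actually written in...
    with open(src, encoding="cp1256") as f:
        decoded = f.read()

    # ...and write it back out as UTF-8.
    with open(dst, "w", encoding="utf-8") as f:
        f.write(decoded)

    with open(dst, "rb") as f:
        utf8_bytes = f.read()

print(decoded)
```

Opening with the wrong single-byte codec usually does not raise an error; it just produces mojibake like the sample in the question, so the codec passed to open() should match how the file was saved.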