Python Arabic encoding issue

python encoding

10,820

Solution 1

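It looks like the text was encoded as Windows-1256 but accidentally decoded as Windows-1252. Encoding it back to cp1252 recovers the original bytes, which can then be decoded correctly as cp1256: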

>>> "Óæí Ïæã ÈíåÞí".encode('cp1252').decode('cp1256')
'سوي دوم بيهقي'
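
The same round trip can be wrapped in a small helper. This is a minimal Python 3 sketch; the function name `fix_mojibake` and its parameters are made up for illustration, and it assumes the garbled text really came from cp1256 bytes read as cp1252.

```python
def fix_mojibake(garbled: str, wrong: str = "cp1252", right: str = "cp1256") -> str:
    """Undo a mis-decoding: recover the original bytes, then decode them properly."""
    # encode() with the wrong codec gives back the bytes that were originally read;
    # decode() with the right codec then yields the intended text.
    return garbled.encode(wrong).decode(right)


garbled = "Óæí Ïæã ÈíåÞí"
fixed = fix_mojibake(garbled)

# ascii() shows code points without depending on the console encoding.
print(ascii(fixed))

# Swapping the two codecs reproduces the garbled text, which confirms the diagnosis.
assert fixed.encode("cp1256").decode("cp1252") == garbled
```

If either step raises `UnicodeEncodeError` or `UnicodeDecodeError`, the assumed codec pair is probably wrong for that input.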

Solution 2

I would like to add the Python 2 case to @josh-lee's answer.
If you are using Python 2, add the unicode prefix u to the string literal.

>>> u"Óæí Ïæã ÈíåÞí".encode('cp1252').decode('cp1256')
u'\u0633\u0648\u064a \u062f\u0648\u0645 \u0628\u064a\u0647\u0642\u064a'
>>> print _
سوي دوم بيهقي

Amir Mohsen

Updated on September 16, 2022

Comments

Amir Mohsen over 1 year
I have text in the windows-1256 encoding, and I want to convert it from Arabic (windows-1256) to UTF-8.

Sample text:
```
Óæí Ïæã ÈíåÞí
```
Expected result:
```
سوي دوم بيهقي
```
I use this code to decode the text and encode it to UTF-8:
```
# -*- coding: utf-8 -*-

data = "Óæí Ïæã ÈíåÞí"
print data.decode("windows-1256", "replace")
print data.encode("windows-1256")
```
That code returns this result:
```
أ“أ¦أ أڈأ¦أ£ أˆأأ¥أ‍أ
Traceback (most recent call last):
  File "mohmal2.py", line 5, in <module>
    print data.encode("windows-1256")
  File "/usr/lib/python2.7/encodings/cp1256.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
```
I found a site that can convert this text:

http://www.iosart.com
- MaddTheSane about 7 years
  
  Ó isn't in Windows-1256. What's your actual question?
- Amir Mohsen about 7 years
  
  I have an XML response with the header <?xml version="1.0" encoding="windows-1256"?> and I want to parse it!
- Martijn Pieters about 7 years
  
  Perhaps you have your system configured to expect CP1252 in the console and thus printing CP1256 data is simply misdisplayed. I think you need to take several steps back and first learn how to correctly assess what data you have (e.g. don't trust your console, use repr() to show you debug information about your data).
- MaddTheSane
  
  Also, upgrade to Python 3 if you have the chance at all.
- Martijn Pieters
  
  You do have a mojibake, but that's not the right codec. A mis-decoding as windows-1256 of the UTF8 encoded version of your input would look like this: 'ط³ظˆظٹ ط¯ظˆظ… ط¨ظٹظ‡ظ‚ظٹ'‎
- Martijn Pieters
  
  Also, before you continue, teach yourself about the difference between bytes and Unicode text. Start with Pragmatic Unicode, perhaps also read The Absolute Minimum... and end with the Python Unicode HOWTO.
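
For the XML case raised in the comments above, one option in Python 3 is to hand the raw bytes to the parser and let it honour the encoding declaration, inspecting values with `ascii()` rather than trusting the console. This is only a sketch: the `<title>` element and the sample payload are invented for illustration, and the real response will have a different structure.

```python
import xml.etree.ElementTree as ET

original = "سوي دوم بيهقي"

# Hypothetical response body, built here as windows-1256 bytes for the demo.
payload = (
    '<?xml version="1.0" encoding="windows-1256"?>'
    "<title>" + original + "</title>"
).encode("cp1256")

# Inspect the raw bytes without relying on how the console renders them.
print(ascii(payload))

# Parsing from bytes lets the parser apply the declared encoding itself.
root = ET.fromstring(payload)
print(ascii(root.text))
```

The key point is to keep the response as bytes until the parser sees it; decoding it yourself with a guessed codec is where mojibake usually creeps in.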
Martijn Pieters about 7 years

Nice catch! I can indeed confirm that the Mojibake can be reproduced with original.encode('cp1256').decode('cp1252').
Amir Mohsen about 7 years

I tested it, but it does not work!
>>> "Óæí Ïæã ÈíåÞí".encode('cp1252').decode('cp1256')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Martijn Pieters about 7 years

@AmirMohsen: you are using a byte string, which is already encoded; the codec used for it depends on your local console settings. It may well be UTF-8. Use u"Óæí Ïæã ÈíåÞí" instead.
Martijn Pieters about 7 years

@AmirMohsen: specifically, str.encode() must first decode to get a unicode object that can then be encoded, which is why you get a UnicodeDecodeError for ASCII.
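
The implicit step described above can be imitated in Python 3 by doing the ASCII decode explicitly. This sketch assumes the console supplied UTF-8 bytes, as suggested in the previous comment; the variable names are illustrative only.

```python
# The bytes a UTF-8 console would hand to Python 2 for a plain "..." literal.
raw = "Óæí Ïæã ÈíåÞí".encode("utf-8")

error = None
try:
    # Python 2's str.encode() performed this ASCII decode implicitly
    # before it could apply the requested codec.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# The first byte of the UTF-8 encoding of "Ó" is the one named in the traceback.
print(hex(raw[0]))
```

The failure happens before cp1252 or cp1256 is ever consulted, which is why the traceback mentions the ASCII codec.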
thethiny over 4 years

I know this is old, but I wanted to add to @AmirMohsen's issue. You can pass encoding='cp1252' when opening the file, and it will bypass that error. The gibberish you're getting is because the Arabic text was saved as ANSI instead of UTF-8.
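
Passing an explicit encoding when opening the file also covers the conversion asked about in the question. The sketch below is Python 3 and uses cp1256 rather than cp1252, on the assumption that the file really holds windows-1256 bytes as the question states; the helper `convert_to_utf8` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, source_encoding: str = "cp1256") -> None:
    """Read src using source_encoding and write the same text to dst as UTF-8."""
    text = src.read_text(encoding=source_encoding)
    dst.write_text(text, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "arabic_cp1256.txt"  # hypothetical input file
    dst = Path(tmp) / "arabic_utf8.txt"    # hypothetical output file

    # Create a windows-1256 file to convert, standing in for the real data.
    src.write_bytes("سوي دوم بيهقي".encode("cp1256"))

    convert_to_utf8(src, dst)
    converted = dst.read_bytes()

print(ascii(converted.decode("utf-8")))
```

Naming the source encoding explicitly avoids depending on the platform default, which is how text ends up read as "ANSI" by accident.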