Python Arabic encoding issue

10,820

Solution 1

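It looks like the input is Windows-1256 text that was accidentally decoded as Windows-1252. Re-encoding it as cp1252 recovers the original bytes, which can then be decoded as cp1256: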

>>> "Óæí Ïæã ÈíåÞí".encode('cp1252').decode('cp1256')
'سوي دوم بيهقي'
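The same round trip can be wrapped in a small function. This is only a minimal sketch for Python 3: the name `repair_cp1256_mojibake` is made up for illustration, and it assumes the text was garbled exactly as described above (cp1256 bytes read as cp1252).

```python
def repair_cp1256_mojibake(garbled: str) -> str:
    """Undo a cp1256 -> cp1252 mis-decoding (hypothetical helper)."""
    try:
        # Recover the original bytes by reversing the wrong decoding step.
        raw = garbled.encode("cp1252")
    except UnicodeEncodeError:
        # The text contains characters outside cp1252, so it is not this
        # kind of mojibake; return it unchanged rather than guess.
        return garbled
    # Decode the recovered bytes with the encoding they were written in.
    return raw.decode("cp1256")


fixed = repair_cp1256_mojibake("Óæí Ïæã ÈíåÞí")
print(fixed)
# The result is an ordinary str; encode it to get UTF-8 bytes.
utf8_bytes = fixed.encode("utf-8")
```

One caveat: Python's cp1252 codec leaves the bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined, so text whose original cp1256 bytes included those values may not survive the round trip, and this sketch would then return the input unchanged.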

Solution 2

I would like to add the Python 2 case to @josh-lee's answer.
If you are using Python 2, add the unicode prefix u to the string literal.

>>> u"Óæí Ïæã ÈíåÞí".encode('cp1252').decode('cp1256')
u'\u0633\u0648\u064a \u062f\u0648\u0645 \u0628\u064a\u0647\u0642\u064a'
>>> print _
سوي دوم بيهقي

Author by

Amir Mohsen

Updated on September 16, 2022

Comments

  • Amir Mohsen
    Amir Mohsen over 1 year

    I have text in the Windows-1256 encoding and want to convert it from Arabic (Windows-1256) to UTF-8.

    Sample text:

    Óæí Ïæã ÈíåÞí
    

    Expected result:

    سوي دوم بيهقي
    

    I use this code to decode the text and encode it to UTF-8:

    # -*- coding: utf-8 -*-
    
    data = "Óæí Ïæã ÈíåÞí"
    print data.decode("windows-1256", "replace")
    print data.encode("windows-1256")
    

    That code returns this result:

    أ“أ¦أ­ أڈأ¦أ£ أˆأ­أ¥أ‍أ­
    Traceback (most recent call last):
      File "mohmal2.py", line 5, in <module>
        print data.encode("windows-1256")
      File "/usr/lib/python2.7/encodings/cp1256.py", line 12, in encode
        return codecs.charmap_encode(input,errors,encoding_table)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
    

    I found a site that can convert this text:

    http://www.iosart.com

    • MaddTheSane
      MaddTheSane about 7 years
      Ó isn't in Windows-1256. What's your actual question?
    • Amir Mohsen
      Amir Mohsen about 7 years
      I have an XML response with the header <?xml version="1.0" encoding="windows-1256"?> and I want to parse it.
    • Martijn Pieters
      Martijn Pieters about 7 years
      Perhaps you have your system configured to expect CP1252 in the console and thus printing CP1256 data is simply misdisplayed. I think you need to take several steps back and first learn how to correctly assess what data you have (e.g. don't trust your console, use repr() to show you debug information about your data).
    • MaddTheSane
      MaddTheSane
      Also, upgrade to Python 3 if you have the chance at all.
    • Martijn Pieters
      Martijn Pieters
      You do have a mojibake, but that's not the right codec. A mis-decoding as windows-1256 of the UTF8 encoded version of your input would look like this: 'ط³ظˆظٹ ط¯ظˆظ… ط¨ظٹظ‡ظ‚ظٹ'
    • Martijn Pieters
      Martijn Pieters
      Also, before you continue, teach yourself about the difference between bytes and Unicode text. Start with Pragmatic Unicode, perhaps also read The Absolute Minimum... and end with the Python Unicode HOWTO.
  • Martijn Pieters
    Martijn Pieters about 7 years
    Nice catch! I can indeed confirm that the Mojibake can be reproduced with original.encode('cp1256').decode('cp1252').
  • Amir Mohsen
    Amir Mohsen about 7 years
    I tested it, but it does not work:

    >>> "Óæí Ïæã ÈíåÞí".encode('cp1252').decode('cp1256')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.7/encodings/cp1252.py", line 12, in encode
        return codecs.charmap_encode(input,errors,encoding_table)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
  • Martijn Pieters
    Martijn Pieters about 7 years
    @AmirMohsen: you are using bytes, so the string is already encoded, and the codec used for it depends on your local console settings. It may well be UTF-8. Use u"Óæí Ïæã ÈíåÞí" instead.
  • Martijn Pieters
    Martijn Pieters about 7 years
    @AmirMohsen: specifically, str.encode() must first decode to get a unicode object that can then be encoded, which is why you get a UnicodeDecodeError for ASCII.
  • thethiny
    thethiny over 4 years
    I know this is old, but I wanted to add to @AmirMohsen's issue. You can pass encoding='cp1252' when opening the file, and it will bypass that error. The gibberish you're getting is because you saved the Arabic text as ANSI instead of UTF-8.
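
Several comments above point towards working from the raw bytes instead of repairing text after the fact. The following is a rough sketch of that idea for Python 3, not the asker's actual code: the `payload` value is an invented stand-in for the XML response, and the element name `title` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body, built here only so the sketch is runnable.
# In practice these bytes would come straight from the network or a file
# opened in binary mode.
payload = (
    '<?xml version="1.0" encoding="windows-1256"?>'
    "<title>\u0633\u0648\u064a \u062f\u0648\u0645</title>"
).encode("cp1256")

# repr() shows the raw byte values without depending on how the console
# is configured, which helps to check what data you really have.
print(repr(payload[-20:]))

# Passing bytes (not str) lets the parser apply the encoding declared in
# the XML header itself.
root = ET.fromstring(payload)
text = root.text

# The parsed text is a normal str; encode it to obtain UTF-8 bytes.
utf8_bytes = text.encode("utf-8")
print(text)
```

The design point is that the bytes are decoded exactly once, with the encoding the document declares, so no mojibake is created in the first place.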