Python3 convert Unicode String to int representation

Solution 1

You are looking for the ord() function, I think:

>>> ord('a')
97
>>> ord('\u00c2')
194

This gives you the integer number for the Unicode codepoint.

To convert a whole set of characters use a list comprehension:

>>> [ord(c) for c in 'Hello World!']
[72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33]

Its inverse is the chr() function:

>>> chr(97)
'a'
>>> chr(193)
'Á'
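Putting the two together, a round trip through code points could be sketched like this. The helper names to_codepoints and from_codepoints are made up for this illustration; they are not part of Python.

```python
def to_codepoints(text):
    """Return the Unicode code point of every character in text."""
    return [ord(c) for c in text]


def from_codepoints(codepoints):
    """Rebuild a string from a sequence of code points."""
    return ''.join(chr(n) for n in codepoints)


codes = to_codepoints('Hello World!')
restored = from_codepoints(codes)
print(codes)
print(restored == 'Hello World!')  # True
```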

Note that when you encrypt and decrypt text, you usually first encode the text to a binary representation with a character encoding. Unicode text can be encoded with different encodings, each with its own advantages and disadvantages. These days the most commonly used encoding for Unicode text is UTF-8, but others exist too.

In Python 3, binary data is represented in the bytes object, and you encode text to bytes with the str.encode() method and go back by using bytes.decode():

>>> 'Hello World!'.encode('utf8')
b'Hello World!'
>>> b'Hello World!'.decode('utf8')
'Hello World!'

bytes values are really just sequences, like lists and tuples and strings, but consisting of integer numbers from 0-255:

>>> list('Hello World!'.encode('utf8'))
[72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33]
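For plain ASCII text the byte values happen to equal the code points, which can hide the difference between the two. A small sketch with a non-ASCII character (the variable names are only for illustration) shows that one character can become several bytes:

```python
text = '\u00c1'                 # 'Á', a single character
data = text.encode('utf-8')     # UTF-8 needs two bytes for it

print(ord(text))                # 193, the code point
print(list(data))               # [195, 129], the two UTF-8 byte values

# bytes() accepts a list of integers in range(256), so the trip is reversible.
print(bytes([195, 129]).decode('utf-8') == text)  # True
```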

Personally, when encrypting, I would encode the text and encrypt the resulting bytes.
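As a rough sketch of that order of operations: toy_xor below is a made-up stand-in for whatever algorithm you actually use, and it is not secure in any way. The point is only that the cipher sees bytes, and text is encoded before and decoded after.

```python
def toy_xor(data, key):
    """Toy stand-in for a real cipher: XOR every byte with a one-byte key."""
    return bytes(b ^ key for b in data)


plaintext = 'Hello World!'
ciphertext = toy_xor(plaintext.encode('utf-8'), 0x5A)   # encode, then "encrypt"
recovered = toy_xor(ciphertext, 0x5A).decode('utf-8')   # "decrypt", then decode

print(recovered == plaintext)  # True
```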

If all this seems overwhelming or hard to follow, perhaps these articles on Unicode and character encodings can help out:

Solution 2

The usual way to convert a Unicode string to a number is to convert it to a sequence of bytes. Unicode characters are a pure abstraction; each character has its own number, but there are several ways to convert those numbers to a stream of bytes. Probably the most versatile is to encode the string as UTF-8. From the bytes, there are many ways to get an integer. Here is one (I have borrowed the nice string from Ivella -- I hope no bad words are inside :) :

Python 3.2.1 (default, Jul 10 2011, 20:02:51) [MSC v.1500 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> s = "Hello, World, عالَم, ދުނިޔެ, जगत, 世界"
>>> b = s.encode('utf-8')
>>> b
b'Hello, World, \xd8\xb9\xd8\xa7\xd9\x84\xd9\x8e\xd9\x85, \xde\x8b\xde\xaa\xde\x82\xde\xa8\xde\x94\xde\xac, \xe0\xa4\x9c\xe0\xa4\x97\xe0\xa4\xa4, \xe4\xb8\x96\xe7\x95\x8c'

Now we have a sequence of bytes in which the ones with values from 128 to 255 are displayed as hex-coded escape sequences. Let's convert all bytes to their hex codes as a bytestring.

>>> import binascii
>>> h = binascii.hexlify(b)
>>> h
b'48656c6c6f2c20576f726c642c20d8b9d8a7d984d98ed9852c20de8bdeaade82dea8de94deac2c20e0a49ce0a497e0a4a42c20e4b896e7958c'

We can look at h as one big number written (as text) in hexadecimal notation. int() lets us convert it to the abstract number, which is usually shown in decimal notation when printed.

>>> i = int(h, 16)
>>> i
52620351230730152682202055464811384749235956796562762198329268116226267262806875102376740945811764490696968801603738907493997296927348108
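If you do not need the hex text in between, int.from_bytes() (available since Python 3.2) should give the same integer in one step. A small sketch, using a shorter string than the session above, compares the two routes:

```python
import binascii

b = 'Hello, World, 世界'.encode('utf-8')

via_hex = int(binascii.hexlify(b), 16)   # the two-step route: bytes -> hex -> int
direct = int.from_bytes(b, 'big')        # big-endian: first byte is most significant

print(via_hex == direct)  # True
```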

Now you can store it as a number, encrypt it (although it is more usual to encrypt the earlier sequence of bytes), and later convert it back to the integer. Beware: not many languages (and probably no database) can work with integers that big.

Let's go back to the original string. Firstly convert it to the hexadecimal representation (string).

>>> h2 = hex(i)
>>> h2
'0x48656c6c6f2c20576f726c642c20d8b9d8a7d984d98ed9852c20de8bdeaade82dea8de94deac2c20e0a49ce0a497e0a4a42c20e4b896e7958c'
>>> h3 = h2[2:]   # remove the 0x from the beginning
>>> h3
'48656c6c6f2c20576f726c642c20d8b9d8a7d984d98ed9852c20de8bdeaade82dea8de94deac2c20e0a49ce0a497e0a4a42c20e4b896e7958c'
>>> type(h3)
<class 'str'>

We had to remove the 0x because it only marks the rest as hexadecimal digits representing the number. Notice that h3 is of the str type. As we are in Python 3 (see the top), str means a Unicode string. The next step is to convert the pairs of hex digits back to bytes. Let's try unhexlify():

>>> binascii.unhexlify(h3)
Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    binascii.unhexlify(h3)
TypeError: 'str' does not support the buffer interface

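Oops! It accepts only bytestrings here. So we have to turn each hex digit in the Unicode string into a hex digit in a bytestring. The way to do that is to encode, and since hex digits are all ASCII, encoding to ASCII is trivial. (Python 3.3 and later also accept an ASCII-only str in unhexlify(); this session is from Python 3.2.)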

>>> b2 = h3.encode('ascii')  # character by character; subset of ascii only
>>> b2
b'48656c6c6f2c20576f726c642c20d8b9d8a7d984d98ed9852c20de8bdeaade82dea8de94deac2c20e0a49ce0a497e0a4a42c20e4b896e7958c'
>>> b3 = binascii.unhexlify(b2)
>>> b3
b'Hello, World, \xd8\xb9\xd8\xa7\xd9\x84\xd9\x8e\xd9\x85, \xde\x8b\xde\xaa\xde\x82\xde\xa8\xde\x94\xde\xac, \xe0\xa4\x9c\xe0\xa4\x97\xe0\xa4\xa4, \xe4\xb8\x96\xe7\x95\x8c'

Now b3 is the same bytestring as the one we got from the first .encode('utf-8'). Let's use the inverse operation -- decode from UTF-8. We should get the same Unicode string that we started with.

>>> s2 = b3.decode('utf-8')
>>> s2
'Hello, World, عالَم, ދުނިޔެ, जगत, 世界'
>>> s == s2   # is the original equal to the result?
True

:)
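For comparison, here is a shorter sketch of the way back using only built-in methods. It assumes a shorter example string, and the zero-padding step is my addition: format() drops a leading zero digit, which would leave an odd number of hex digits whenever the first byte is below 0x10.

```python
s = 'Hello, World, 世界'
i = int.from_bytes(s.encode('utf-8'), 'big')

# Route 1: through hex text. bytes.fromhex() takes a str, so no extra
# .encode('ascii') step is needed.
hex_digits = format(i, 'x')
if len(hex_digits) % 2:                  # pad to whole bytes if needed
    hex_digits = '0' + hex_digits
s_via_hex = bytes.fromhex(hex_digits).decode('utf-8')

# Route 2: no hex at all. Work out how many bytes the integer needs.
length = (i.bit_length() + 7) // 8
s_via_bytes = i.to_bytes(length, 'big').decode('utf-8')

print(s == s_via_hex == s_via_bytes)  # True
```

Both routes lose leading NUL characters ('\x00') at the start of the string, because leading zero bytes do not change the value of the integer.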

Solution 3

From Python's documentation:

The binascii module contains a number of methods to convert between binary and various ASCII-encoded binary representations.

For example, you may use binascii.hexlify to obtain a hexadecimal representation of the binary string "LOL", and turn it into an integer through the int built-in function:

>>> binascii.hexlify(b"LOL")
b'4c4f4c'
>>> int(binascii.hexlify(b"LOL"), 16)
5001036

Since you need to apply this to unicode strings, you'll need first to encode them as binary strings. You can use the method str.encode for this purpose:

>>> int(binascii.hexlify("fiŝaĵo".encode("utf-8")), 16)
7379646744164087151

That's it.

To go the other way, reverse each step. First turn the integer into its hexadecimal representation as a binary string (you can use format(number, "x") and then encode the result), then turn the hex digits back into raw bytes with binascii.unhexlify, and finally decode them as UTF-8:

>>> binascii.unhexlify(format(7379646744164087151, "x").encode("utf-8")).decode("utf-8")
'fiŝaĵo'

This was a step-by-step explanation; if you are really going to use these facilities, it would be a good idea to wrap them in functions.
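One possible way to wrap the steps, as a sketch rather than a finished library: the names text_to_int and int_to_text are made up here, and the handling of the empty string and of odd-length hex is my own addition (format() drops a leading zero digit, and unhexlify rejects an odd number of digits).

```python
import binascii


def text_to_int(text, encoding='utf-8'):
    """Encode text to bytes and read the bytes as one big hexadecimal number."""
    data = text.encode(encoding)
    if not data:
        return 0                          # int(b'', 16) would raise ValueError
    return int(binascii.hexlify(data), 16)


def int_to_text(number, encoding='utf-8'):
    """Inverse of text_to_int for text that does not start with '\\x00'."""
    if number == 0:
        return ''
    hex_digits = format(number, 'x')
    if len(hex_digits) % 2:               # restore the dropped leading zero
        hex_digits = '0' + hex_digits
    return binascii.unhexlify(hex_digits.encode('ascii')).decode(encoding)


n = text_to_int('fiŝaĵo')
print(n)                 # 7379646744164087151
print(int_to_text(n))    # fiŝaĵo
```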

Solution 4

Building on the solution given by Martijn Pieters, you can turn your string into one huge number, which Python 3 handles very well, since its int type is arbitrarily large (that is not "how computers work"; see my comment on your question).

Given the list of character numerical codes:

>>> a = [ord(c) for c in 'Hello World!']
>>> print(a)
[72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33]

And knowing, from Wikipedia's page on Unicode, that the greatest Unicode code point is 10FFFF (in hexadecimal), you can do:

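def numfy(s):
    # Treat the string as a number in base 0x110000, one digit per character.
    number = 0
    for c in s:
        number = number * 0x110000 + ord(c)
    return number

def denumfy(number):
    # Peel off base-0x110000 digits, least significant first, then reverse.
    chars = []
    while number != 0:
        number, code = divmod(number, 0x110000)
        chars.append(chr(code))
    return ''.join(reversed(chars))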

Thus:

>>> a = numfy("Hello, World, عالَم, ދުނިޔެ, जगत, 世界")
>>> a
31611336900126021[...]08666956
>>> denumfy(a)
'Hello, World, عالَم, ދުނިޔެ, जगत, 世界'

Here 0x110000 (that is, 10FFFF + 1) is the number of possible Unicode code points (1114112 in decimal). If you are sure you are only using ASCII characters (for example, plain English text), you can use 128 instead, and if every character has a code point below 256 (the Latin-1 range, which covers the accented letters of many, but not all, Latin-script languages), you can use 256. Either way your number will be much smaller, but the scheme will be unable to represent every Unicode character.
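A sketch of that variation, with the base passed as a parameter. The names numfy_base and denumfy_base and the range check are my additions; the check makes a too-small base fail loudly instead of silently producing a number that decodes to different text.

```python
def numfy_base(s, base=0x110000):
    """Read s as a number in the given base, one digit per character."""
    number = 0
    for c in s:
        code = ord(c)
        if code >= base:
            raise ValueError('%r does not fit in base %d' % (c, base))
        number = number * base + code
    return number


def denumfy_base(number, base=0x110000):
    """Inverse of numfy_base for strings that do not start with '\\x00'."""
    chars = []
    while number:
        number, code = divmod(number, base)
        chars.append(chr(code))
    return ''.join(reversed(chars))


small = numfy_base('Hi', 128)
print(small)                      # 9321, i.e. 72 * 128 + 105
print(denumfy_base(small, 128))   # Hi
```

The same base has to be used in both directions; decoding with a different base gives garbage.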

Admin
Author by

Admin

Updated on July 09, 2022

Comments

  • Admin
    Admin almost 2 years

    As we all know, a computer works with numbers. I'm typing this text right now, the server makes a number out of it and when you want to read it, you'll get text from the server.

    How can I do this on my own?

    I want to encrypt something with my own algorithm and my algorithm works fine with integers, but now I want to encrypt a String and I don't know how to convert a Unicode string to an integer number and vice versa.

    I'm using Python 3. Is there anybody who knows an elegant solution for my problem?