Python: Convert utf-8 string to byte string


If a utf-8 => byte string conversion is what you want, then you can use str.encode, but first you need to mark the source string in your example as Unicode by prefixing it with u:

    # coding: utf-8
    import random

    def random_utf8_seq(length):
        # from http://www.w3.org/2001/06/utf-8-test/postscript-utf-8.html
        test_charset = u" !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĂ㥹ĆćČčĎďĐđĘęĚěĹ弾ŁłŃńŇňŐőŒœŔŕŘřŚśŞşŠšŢţŤťŮůŰűŸŹźŻżŽžƒˆˇ˘˙˛˜˝–—‘’‚“”„†‡•…‰‹›€™"

        utf8_seq = u''

        for i in range(length):
            utf8_seq += random.choice(test_charset)

        print utf8_seq.encode('utf-8')
        return utf8_seq.encode('utf-8')

    print( type(random_utf8_seq(200)) )

-- output --

õ3×sÔP{Ć.s(Ë°˙ě÷xÓ@bűV—û´ő¢uZÓČn˜0|_"Ðyø`êš·ÏÝhunÍÅ=ä?
óP{tlÇűpb¸7s´ňƒG—čøň\zčłŢXÂYqLĆúěă(ÿî ¥PyÐÔŇnל¦Ì˝+•ì›
ŻÛ°Ñ^ÝC÷ŢŐIñJĹţÒył­"MťÆ‹ČČ4þ!»šåŮ@Öhň-
ÈLGĄ¢ß˛Đ¯.ªÆź˘Ř^ĽÛŹËaĂŕ¹#¢éüÜńlÊqš=VřU…‚–MŽÎÉèoÙŹŠ¨Ð
<type 'str'>
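
For readers on Python 3, where every str is already Unicode text and the u prefix is optional, the same conversion can be sketched as follows. The function name random_text and the shortened CHARSET are illustrative assumptions, not part of the original answer.

```python
import random

# Shortened, illustrative character set (assumption: any Unicode text works here).
CHARSET = "abcXYZ0123 ¡¢£©®ÀÉßñöĆčŁłŽž€™"


def random_text(length, rng=random):
    """Return `length` randomly chosen characters as a Unicode str."""
    return "".join(rng.choice(CHARSET) for _ in range(length))


text = random_text(200)
encoded = text.encode("utf-8")   # str -> bytes

print(type(text).__name__, type(encoded).__name__)
```

Because the characters are chosen from text rather than from encoded bytes, each one stays intact, and decoding the result with utf-8 gives back the original string.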
Author by

mythander889

Updated on June 27, 2022

Comments

  • mythander889
    mythander889 almost 2 years

    I have the following function to parse a utf-8 string from a sequence of bytes

    Note: 'length_size' is the number of bytes it takes to represent the length of the utf-8 string.

    def parse_utf8(self, bytes, length_size):
    
        length = bytes2int(bytes[0:length_size])
        value = ''.join(['%c' % b for b in bytes[length_size:length_size+length]])
        return value
    
    
    import struct

    def bytes2int(raw_bytes, signed=False):
        """
        Convert a string of bytes to an integer (assumes little-endian byte order)
        """
        if len(raw_bytes) == 0:
            return None
        fmt = {1:'B', 2:'H', 4:'I', 8:'Q'}[len(raw_bytes)]
        if signed:
            fmt = fmt.lower()
        return struct.unpack('<'+fmt, raw_bytes)[0]
    

    I'd like to write the reverse function, i.e. one that takes a utf-8 encoded string and returns its representation as a byte string.

    So far, I have the following:

    def create_utf8(self, utf8_string):
        return utf8_string.encode('utf-8')
    

    I run into the following error when attempting to test it:

      File "writer.py", line 229, in create_utf8
    return utf8_string.encode('utf-8')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x98 in position 0: ordinal not in range(128)
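
In Python 2, calling .encode() on a byte str first decodes it implicitly with the ascii codec, which is where this UnicodeDecodeError comes from. Python 3 removed the implicit step, but the same failure can be reproduced by making it explicit. A minimal Python 3 sketch: the byte 0x98 is taken from the traceback above, while the three-byte sequence 0xe2 0x98 0x83 is an illustrative assumption.

```python
raw = b"\x98"  # a lone non-ASCII byte, as in the traceback above

try:
    raw.decode("ascii")          # what Python 2 did implicitly before encoding
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Decoding with the codec the bytes were actually written in avoids the error.
snowman = b"\xe2\x98\x83".decode("utf-8")
```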
    

    If possible, I'd like to adopt a structure for the code similar to the parse_utf8 example. What am I doing wrong?

    Thank you for your help!
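
A writer that mirrors parse_utf8 could encode the text and then prepend the byte count as a little-endian integer. This is a Python 3 sketch under assumptions: the signature create_utf8(text, length_size) and the unsigned length prefix are inferred from the parser above, not confirmed by the original post.

```python
import struct

# struct format codes for unsigned integers, keyed by width in bytes
_FORMATS = {1: "B", 2: "H", 4: "I", 8: "Q"}


def create_utf8(text, length_size):
    """Encode `text` as UTF-8, prefixed with its byte length (little-endian)."""
    payload = text.encode("utf-8")
    prefix = struct.pack("<" + _FORMATS[length_size], len(payload))
    return prefix + payload


def parse_utf8(data, length_size):
    """Inverse of create_utf8: read the length prefix, then decode the payload."""
    (length,) = struct.unpack("<" + _FORMATS[length_size], data[:length_size])
    return data[length_size:length_size + length].decode("utf-8")


packed = create_utf8("žluťoučký", 2)
assert parse_utf8(packed, 2) == "žluťoučký"
```

Note that the prefix counts encoded bytes, not characters, and that struct.pack raises struct.error if the payload is too long for the chosen prefix width.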

    UPDATE: test driver, now correct

    def random_utf8_seq(self, length):
        # from http://www.w3.org/2001/06/utf-8-test/postscript-utf-8.html
        test_charset = u" !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĂ㥹ĆćČčĎďĐđĘęĚěĹ弾ŁłŃńŇňŐőŒœŔŕŘřŚśŞşŠšŢţŤťŮůŰűŸŹźŻżŽžƒˆˇ˘˙˛˜˝–—‘’‚“”„†‡•…‰‹›€™"
    
        utf8_seq = u""
    
        for i in range(length):
            utf8_seq += random.choice(test_charset)
    
        return utf8_seq
    

    I get the following error:

    input_str = self.random_utf8_seq(200)
      File "writer.py", line 226, in random_utf8_seq
    print unicode(utf8_seq, "utf-8")
      UnicodeDecodeError: 'utf8' codec can't decode byte 0xbb in position 0: invalid start byte
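
The "invalid start byte" message is what a split multi-byte UTF-8 sequence looks like: picking random elements from an encoded byte string yields single bytes, and a continuation byte such as 0xbb cannot begin a character. A Python 3 sketch; the sample character » (encoded as 0xc2 0xbb) is an illustrative assumption.

```python
encoded = "»".encode("utf-8")    # two bytes: 0xc2 0xbb
lone = encoded[1:]               # just the continuation byte 0xbb

try:
    lone.decode("utf-8")
except UnicodeDecodeError as exc:
    reason = exc.reason
    print(reason)

# Choosing from Unicode text instead keeps every character intact.
chars = list("»€ž")
assert all(c.encode("utf-8").decode("utf-8") == c for c in chars)
```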
    
  • mythander889
    mythander889 about 10 years
    That did it, thank you! In the end, my string generation driver was incorrect, as you suggested