How to convert between bytes and strings in Python 3?


Solution 1

The 'mangler' in the question's code sample was doing the equivalent of this:

bytesThing = stringThing.encode(encoding='UTF-8')

There are other ways to write this (notably using bytes(stringThing, encoding='UTF-8')), but the above syntax makes it obvious what is going on, and also what to do to recover the string:

newStringThing = bytesThing.decode(encoding='UTF-8')

When we do this, the original string is recovered.
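As a quick check of the round trip described above, here is a minimal sketch. The sample string and the snake_case variable names are illustrative stand-ins, not the ones from the question.

```python
# Round trip: str -> bytes -> str, with the encoding spelled out both ways.
string_thing = "<Greeting>你好</Greeting>"

bytes_thing = string_thing.encode(encoding="UTF-8")
new_string_thing = bytes_thing.decode(encoding="UTF-8")

# The bytes object is longer than the string because each Chinese
# character takes three bytes in UTF-8.
print(type(bytes_thing).__name__, len(string_thing), len(bytes_thing))
print(new_string_thing == string_thing)
```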

Note that str(bytesThing) just transcribes all the gobbledegook (it returns the printable representation of the bytes object) without converting it back into Unicode, unless you specifically request an encoding, viz., str(bytesThing, encoding='UTF-8'). No error is reported if the encoding is not specified.
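The difference between the two str() calls can be seen in a small sketch like the following; the sample bytes are an assumption chosen to match the characters in the question.

```python
bytes_thing = "你好".encode(encoding="UTF-8")

as_repr = str(bytes_thing)                     # no encoding: same text as repr()
as_text = str(bytes_thing, encoding="UTF-8")   # decodes, like bytes_thing.decode()

print(as_repr)   # the b'...' form with \x escapes
print(as_text)   # the original characters
```

If the silent fallback is a concern, running the interpreter with the -b option makes str() on a bytes object issue a BytesWarning.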

Solution 2

In Python 3, the bytes() constructor takes the same encoding argument as encode().

str1 = b'hello world'
str2 = bytes("hello world", encoding="UTF-8")
print(str1 == str2) # Prints True

I didn't read anything about this in the docs, but perhaps I wasn't looking in the right place. This way you can explicitly turn strings into bytes objects, which I find more readable than using encode and decode, and you don't have to prefix the quotes with b.
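A short sketch comparing the two spellings. One difference worth knowing: unlike str.encode(), bytes() has no default encoding when given a string, so leaving it out raises a TypeError.

```python
text = "hello world"

via_constructor = bytes(text, encoding="UTF-8")
via_method = text.encode(encoding="UTF-8")
print(via_constructor == via_method)

# bytes() refuses a str argument when no encoding is given.
try:
    bytes(text)
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)
```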

Solution 3

"This is a Python 101 type question,"

It's a simple question but one where the answer is not so simple.


In Python 3, a "bytes" object represents a sequence of bytes, and a "string" (str) object represents a sequence of Unicode code points.

To convert from "bytes" to "string" and from "string" back to "bytes" you use the bytes.decode and str.encode methods. These methods take two parameters: an encoding and an error handling policy.
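To illustrate the two parameters, here is a minimal sketch using the standard error handlers "strict" (the default), "replace" and "backslashreplace". The sample data is made up for the example.

```python
text = "你好"

# ASCII cannot represent these characters, so the strict default raises.
try:
    text.encode("ascii")
except UnicodeEncodeError:
    print("strict: UnicodeEncodeError")

replaced = text.encode("ascii", errors="replace")           # b'??'
escaped = text.encode("ascii", errors="backslashreplace")   # b'\\u4f60\\u597d'

# The same policy argument exists when decoding. The byte 0xff can never
# occur in valid UTF-8, so "replace" substitutes U+FFFD for it.
decoded = b"abc\xff".decode("utf-8", errors="replace")

print(replaced, escaped, decoded)
```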

Sadly, there are an awful lot of cases where sequences of bytes are used to represent text but it is not necessarily well-defined what encoding is being used. Take, for example, filenames on Unix-like systems. As far as the kernel is concerned, they are a sequence of bytes with a handful of special values. On most modern distros most filenames will be UTF-8, but there is no guarantee that all filenames will be.
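For byte strings like filenames that are mostly, but not certainly, UTF-8, one option is the "surrogateescape" error handler, which keeps undecodable bytes recoverable. This is a sketch with an invented filename; on POSIX systems os.fsdecode() and os.fsencode() apply the same idea using the filesystem encoding.

```python
raw_name = b"caf\xe9.txt"   # Latin-1 bytes, not valid UTF-8

# The invalid byte 0xe9 is smuggled through as the lone surrogate U+DCE9.
name = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = name.encode("utf-8", errors="surrogateescape")

print(ascii(name))
print(round_tripped == raw_name)
```

The resulting string is not safe to print or to encode with the strict handler, so this suits passing names through rather than displaying them.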

If you want to write robust software then you need to think carefully about those parameters. You need to think carefully about what encoding the bytes are supposed to be in and how you will handle the case where they turn out not to be a valid sequence of bytes for the encoding you thought they should be in. Python defaults to UTF-8 and to raising an error on any byte sequence that is not valid UTF-8.
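One way to act on this is to catch the UnicodeDecodeError and decide explicitly what happens next. The helper below is hypothetical (its name and the candidate list are assumptions), and falling back to Latin-1 is only a guess, since Latin-1 accepts every possible byte.

```python
def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: try each candidate encoding in order."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


utf8_result = decode_with_fallback(b"caf\xc3\xa9")   # valid UTF-8
latin1_result = decode_with_fallback(b"caf\xe9")     # invalid UTF-8

print(utf8_result)
print(latin1_result)
```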

print(bytesThing)

Python uses "repr" as a fallback conversion to string. repr attempts to produce Python code that will recreate the object. In the case of a bytes object this means, among other things, escaping bytes outside the printable ASCII range.
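The following sketch shows that the printed form really is Python source for the same object; ast.literal_eval is used here only as a safe way to evaluate the literal.

```python
import ast

bytes_thing = "你好".encode("utf-8")

shown = repr(bytes_thing)          # what print(bytes_thing) displays
rebuilt = ast.literal_eval(shown)  # parse the literal back into bytes

print(shown)
print(rebuilt == bytes_thing)
```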

Solution 4

TRY THIS:

StringVariable=ByteVariable.decode('UTF-8','ignore')

TO TEST TYPE:

print(type(StringVariable))

Here 'StringVariable' is a string and 'ByteVariable' is a bytes object. These names are placeholders and are not the variables from the question.
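Be aware that 'ignore' silently drops any bytes it cannot decode. A small sketch, with made-up input, comparing it to 'backslashreplace', which keeps a visible trace of the bad byte:

```python
byte_variable = b"abc\xffdef"   # 0xff is never valid in UTF-8

ignored = byte_variable.decode("UTF-8", "ignore")            # bad byte vanishes
escaped = byte_variable.decode("UTF-8", "backslashreplace")  # bad byte shown as \xff

print(ignored)
print(escaped)
```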


Author by

Bobble

Grey-haired geek, perennial polymath. I used to develop in Java but I got tired of all the brackets and static typing, so I now play with Python. In case people wonder about why I am also registered for the English and Chinese Stack Exchange forums, I recently spent a couple of years in China teaching English.

Updated on July 08, 2022

Comments

  • Bobble
    Bobble almost 2 years

    This is a Python 101 type question, but it had me baffled for a while when I tried to use a package that seemed to convert my string input into bytes.

    As you will see below I found the answer for myself, but I felt it was worth recording here because of the time it took me to unearth what was going on. It seems to be generic to Python 3, so I have not referred to the original package I was playing with; it does not seem to be an error (just that the particular package had a .tostring() method that was clearly not producing what I understood as a string...)

    My test program goes like this:

    import mangler                                 # spoof package
    
    stringThing = """
    <Doc>
        <Greeting>Hello World</Greeting>
        <Greeting>你好</Greeting>
    </Doc>
    """
    
    # print out the input
    print('This is the string input:')
    print(stringThing)
    
    # now make the string into bytes
    bytesThing = mangler.tostring(stringThing)    # pseudo-code again
    
    # now print it out
    print('\nThis is the bytes output:')
    print(bytesThing)
    

    The output from this code gives this:

    This is the string input:
    
    <Doc>
        <Greeting>Hello World</Greeting>
        <Greeting>你好</Greeting>
    </Doc>
    
    
    This is the bytes output:
    b'\n<Doc>\n    <Greeting>Hello World</Greeting>\n    <Greeting>\xe4\xbd\xa0\xe5\xa5\xbd</Greeting>\n</Doc>\n'
    

    So, there is a need to be able to convert between bytes and strings, to avoid ending up with non-ASCII characters being turned into gobbledegook.

    • Bobble
      Bobble over 11 years
      This question gives more detail in the answer, but I think the briefer answer below is more clear.
  • ccpizza
    ccpizza almost 8 years
    If you look at the actual method implementations you'll see that utf-8 is the default encoding, therefore you can omit it given that you know that the encoding is indeed utf-8, i.e. stringThing.encode() and bytesThing.decode() will do just fine.
  • Bobble
    Bobble almost 8 years
    @ccpizza Making the encoding explicit in the above examples makes it much clearer what is going on, and IMHO is good practice. Not all unicode is UTF-8. It also avoids the silent failure referred to in the last paragraph.
  • ccpizza
    ccpizza almost 8 years
    totally agree; explicit is better than implicit, but imo it is good to know what is the implicit. Whether to use it or not is another question. Just because you can doesn't mean you should :)
  • Nagev
    Nagev over 6 years
    In Python 3 it's safer to use decode('utf-8', 'backslashreplace') to avoid an exception if the encoding is unknown. One shouldn't always assume UTF-8!