How to detect string byte encoding?

python string unicode encoding byte

94,442

Solution 1

If your filenames are encoded in either cp1252 or utf-8, there is an easy way:

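import logging
import os

def force_decode(string, codecs=('utf8', 'cp1252')):
    # Try utf8 first: it is strict, while cp1252 accepts almost any byte string.
    for i in codecs:
        try:
            return string.decode(i)
        except UnicodeDecodeError:
            pass

    # No codec matched; log it and implicitly return None.
    logging.warning("cannot decode %r", string)

# rootPath is the directory from the question.
for item in os.listdir(rootPath):
    # Convert byte strings to Unicode (bytes is an alias of str on Python 2)
    if isinstance(item, bytes):
        item = force_decode(item)
    print(item)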
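On Python 3, `os.listdir` returns already-decoded `str` names when given a `str` path, so the loop above has nothing left to decode. A minimal sketch of the same idea for Python 3, assuming you want the raw names and therefore pass a `bytes` path; `list_decoded` and `root_path` are illustrative names, not part of the original answer:

```python
import logging
import os


def force_decode(raw, codecs=("utf-8", "cp1252")):
    """Return raw decoded with the first codec that accepts it, else None."""
    for codec in codecs:
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    logging.warning("cannot decode %r", raw)
    return None


def list_decoded(root_path):
    """List root_path and decode each raw filename with force_decode."""
    # A bytes path makes os.listdir return the names as undecoded bytes.
    raw_names = os.listdir(os.fsencode(root_path))
    return [force_decode(name) for name in raw_names]
```

The codec order matters here: UTF-8 rejects most byte strings that were written as cp1252, while cp1252 accepts nearly anything, so trying cp1252 first would turn a UTF-8 "ê" into "Ãª".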

Otherwise, there is a charset detection library:

Python - detect charset and convert to utf-8

https://pypi.python.org/pypi/chardet

Solution 2

Use the chardet library. It is very easy:

import chardet

the_encoding = chardet.detect('your string')['encoding']

And that's it!

In Python 3 you need to pass bytes or a bytearray, so:

import chardet
the_encoding = chardet.detect(b'your string')['encoding']
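Several comments report that chardet guesses wrong on very short inputs (for example TIS-620 for a UTF-8 "ö"). One way to reduce that, assuming UTF-8 is the most likely encoding of your data, is to try a strict UTF-8 decode first and only consult a detector when that fails. This is a sketch, not something chardet itself prescribes; `decode_bytes`, `detect` and `default` are hypothetical names, and `detect` is assumed to be a callable shaped like `chardet.detect` (it returns a dict with an `'encoding'` key):

```python
def decode_bytes(raw, detect=None, default="cp1252"):
    """Decode raw bytes and return (text, encoding_used)."""
    # UTF-8 is strict, so a successful decode is strong evidence.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # Fall back to a detector such as chardet.detect, if one was supplied.
    guess = detect(raw) if detect is not None else None
    encoding = (guess or {}).get("encoding") or default
    try:
        return raw.decode(encoding), encoding
    except (UnicodeDecodeError, LookupError):
        # Unknown codec name or a wrong guess: decode lossily with the default.
        return raw.decode(default, errors="replace"), default
```

With chardet installed this would be called as `decode_bytes(data, detect=chardet.detect)`; without a detector it behaves like the utf-8-then-cp1252 approach from Solution 1.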

Solution 3

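You can also use the json package to detect the encoding. Note that json.detect_encoding (Python 3.6+) only distinguishes between UTF-8, UTF-16 and UTF-32, so it cannot tell cp1252 from utf-8.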

import json

json.detect_encoding(b"Hello")
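To see what this helper can and cannot tell apart, here is a small sketch. As far as I can tell `json.detect_encoding` is not part of the documented json API, so treat its behaviour as an implementation detail; the `samples` and `results` names are made up for illustration:

```python
import json

# Byte strings in different encodings; the labels are only for printing.
samples = {
    "ascii": b"Hello",
    "utf-16 with BOM": "Hello".encode("utf-16"),
    "utf-16-le without BOM": "Hello".encode("utf-16-le"),
    "cp1252": "ö".encode("cp1252"),
}

# Ask the json helper for its guess on each sample.
results = {label: json.detect_encoding(data) for label, data in samples.items()}
for label, guess in results.items():
    print(label, "->", guess)
```

The cp1252 sample is reported as utf-8 because the helper falls back to utf-8 whenever it sees no BOM or null-byte pattern, so it does not help with the utf-8 versus cp1252 filenames from the question.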


Philipp

Updated on February 24, 2022

Comments

Philipp about 2 years
I've got about 1000 filenames read by os.listdir(), some of them are encoded in UTF8 and some are CP1252.

I want to decode all of them to Unicode for further processing in my script. Is there a way to get the source encoding to correctly decode into Unicode?

Example:
```
for item in os.listdir(rootPath):

    #Convert to Unicode
    if isinstance(item, str):
        item = item.decode('cp1252')  # or item = item.decode('utf-8')
    print item
```
Taras Vaskiv almost 6 years

It seems to me that it doesn't work. I created a string variable and encoded it as utf-8. chardet returned the TIS-620 encoding.
Martin Haeberli about 5 years

I found that cchardet appears to be the current name for this or a similar library...; chardet was not findable.
Yoav Vollansky almost 5 years

A bit confused here. It seems like it isn't possible to provide an str class as an argument. Only b'your string' works for me, or directly providing a byte variable.
artfulrobot over 4 years

The problem with this answer for me is that some cp1252/latin1 characters can be interpreted as technically valid utf8 - which leads to Ãª type characters where it should have been ê. chardet seems to try utf8 first, which results in this. There may be a way to tell it which order to use, but lucemia's answer worked better for me.
artfulrobot over 4 years

↑ sorry, I think I got utf8 and cp1252 the wrong way round in my description in last comment!
HelloGoodbye over 3 years

In Python 3: TypeError: Expected object of type bytes or bytearray, got: <class 'str'>
Frederick Reynolds about 3 years

@HelloGoodbye You need to provide a byte string or bytearray, not a string to decode.
kontur about 2 years

>>> chardet.detect("ö".encode()) and {'encoding': 'TIS-620', 'confidence': 0.99, 'language': 'Thai'} — I'd say that doesn't work.