How to detect string byte encoding?

Solution 1

If your files are in either cp1252 or UTF-8, there is an easy way: try UTF-8 first and fall back to cp1252.

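import logging
import os

def force_decode(string, codecs=('utf8', 'cp1252')):
    # Try each codec in order. UTF-8 goes first because it is strict,
    # while cp1252 accepts almost any byte sequence.
    for codec in codecs:
        try:
            return string.decode(codec)
        except UnicodeDecodeError:
            pass

    # Nothing matched: log the raw bytes and implicitly return None.
    logging.warning("cannot decode filename %r", string)

for item in os.listdir(rootPath):
    # Convert to Unicode (only byte strings need decoding)
    if isinstance(item, bytes):
        item = force_decode(item)
    print(item)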
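On Python 3, os.listdir() returns str names unless it is given a bytes path, so the loop above has nothing to decode there. The following is a minimal sketch of the same UTF-8-then-cp1252 idea for Python 3. The names decode_filename and list_decoded are hypothetical, and the last-resort errors="replace" step is an assumption that is not part of the original answer.

```python
import os

def decode_filename(raw, codecs=("utf-8", "cp1252")):
    """Return raw decoded with the first codec that accepts it."""
    for codec in codecs:
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    # Assumed last resort: keep going and mark undecodable bytes
    # with U+FFFD instead of raising.
    return raw.decode("utf-8", errors="replace")

def list_decoded(root_path):
    # Passing a bytes path makes os.listdir return bytes names.
    raw_names = os.listdir(os.fsencode(root_path))
    return [decode_filename(name) for name in raw_names]
```

The order of the codecs matters: cp1252 accepts nearly every byte sequence, so trying it first would silently turn UTF-8 names into mojibake.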

Otherwise, there is a charset detection library, chardet.

Python - detect charset and convert to utf-8

https://pypi.python.org/pypi/chardet

Solution 2

Use the chardet library. It is super easy:

import chardet

the_encoding = chardet.detect('your string')['encoding']

and that's it!

In Python 3 you need to pass bytes or a bytearray, so:

import chardet
the_encoding = chardet.detect(b'your string')['encoding']
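Several comments report that chardet misidentifies short UTF-8 input (for example as TIS-620). One possible workaround, sketched here under assumptions, is to test for strict UTF-8 first and only ask chardet about bytes that fail that test. The function name guess_encoding and the cp1252 fallback are hypothetical choices, not part of chardet.

```python
try:
    import chardet  # third-party; optional in this sketch
except ImportError:
    chardet = None

def guess_encoding(data, fallback="cp1252"):
    """Guess the encoding of a bytes object."""
    # Strict UTF-8 decoding rarely succeeds by accident on non-ASCII
    # input, so accept it whenever it works.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Only non-UTF-8 input reaches the statistical detector.
    if chardet is not None:
        guess = chardet.detect(data)["encoding"]
        if guess:
            return guess
    # Assumed default when chardet is missing or has no answer.
    return fallback
```

Pure ASCII input is reported as "utf-8" here, which is harmless because ASCII decodes identically under UTF-8 and cp1252.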

Solution 3

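You can also use the json package to detect the encoding. Note that it only distinguishes UTF-8, UTF-16 and UTF-32, so it cannot tell UTF-8 from cp1252.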

import json

json.detect_encoding(b"Hello")
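To show the limits of this approach, here is a short sketch of how json.detect_encoding behaves on a few inputs in CPython 3.6 and later. The function appears to be a helper that json.loads uses for bytes input. It looks only at byte order marks and null-byte patterns, so anything that is not UTF-16 or UTF-32 is reported as UTF-8.

```python
import json

# Plain ASCII is reported as UTF-8.
ascii_guess = json.detect_encoding(b"Hello")

# UTF-16 without a BOM is recognised from the position of the null bytes.
utf16_guess = json.detect_encoding("Hello".encode("utf-16-le"))

# cp1252 bytes have no BOM and no null bytes, so the result is the
# default "utf-8", even though decoding them as UTF-8 fails.
cp1252_bytes = "café".encode("cp1252")
cp1252_guess = json.detect_encoding(cp1252_bytes)

try:
    cp1252_bytes.decode(cp1252_guess)
    decodes_ok = True
except UnicodeDecodeError:
    decodes_ok = False
```

For the filename problem in the question (UTF-8 versus cp1252), this means the json helper cannot separate the two cases.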

Author: Philipp

Updated on February 24, 2022

Comments

  • Philipp
    Philipp about 2 years

    I've got about 1000 filenames read by os.listdir(), some of them are encoded in UTF8 and some are CP1252.

    I want to decode all of them to Unicode for further processing in my script. Is there a way to get the source encoding to correctly decode into Unicode?

    Example:

    for item in os.listdir(rootPath):
    
        #Convert to Unicode
        if isinstance(item, str):
            item = item.decode('cp1252')  # or item = item.decode('utf-8')
        print item
    
  • Taras Vaskiv
    Taras Vaskiv almost 6 years
    It doesn't seem to work for me. I created a string variable and encoded it as UTF-8, but chardet returned the TIS-620 encoding.
  • Martin Haeberli
    Martin Haeberli about 5 years
    I found that cchardet appears to be the current name for this or a similar library...; chardet was not findable.
  • Yoav Vollansky
    Yoav Vollansky almost 5 years
    A bit confused here. It seems like it isn't possible to provide an str class as an argument. Only b'your string' works for me, or directly providing a byte variable.
  • artfulrobot
    artfulrobot over 4 years
    The problem with this answer for me is that some cp1252/latin1 characters can be interpreted as technically valid utf8, which leads to Ãª type characters where it should have been ê. chardet seems to try utf8 first, which results in this. There may be a way to tell it which order to use, but lucemia's answer worked better for me.
  • artfulrobot
    artfulrobot over 4 years
    ↑ sorry, I think I got utf8 and cp1252 the wrong way round in my description in last comment!
  • HelloGoodbye
    HelloGoodbye over 3 years
    In Python 3: TypeError: Expected object of type bytes or bytearray, got: <class 'str'>
  • Frederick Reynolds
    Frederick Reynolds about 3 years
    @HelloGoodbye You need to provide a byte string or bytearray, not a string to decode.
  • kontur
    kontur about 2 years
    >>> chardet.detect("ö".encode()) returns {'encoding': 'TIS-620', 'confidence': 0.99, 'language': 'Thai'}. I'd say that doesn't work.