Detect charset and convert to utf-8 in Python?

python utf-8 python-2.x

75,477

Solution 1

Use chardet, an encoding detector. Its detect() function takes a byte string and returns a guessed encoding together with a confidence score.
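A minimal sketch of how that could look, assuming chardet is installed (pip install chardet). The function name to_utf8 and its detect and fallback parameters are hypothetical names chosen for this example, and falling back to latin-1 when no encoding is detected is an assumption, not something chardet prescribes:

```python
def to_utf8(data, detect=None, fallback="latin-1"):
    """Return the byte string `data` re-encoded as UTF-8.

    `detect` is any callable that returns a dict with an 'encoding' key,
    such as chardet.detect. It defaults to chardet.detect.
    """
    if detect is None:
        import chardet  # third-party package, imported lazily
        detect = chardet.detect

    guess = detect(data)  # a dict containing 'encoding' and 'confidence'
    # The detector may report None when it cannot make a guess,
    # so use the (assumed) fallback encoding in that case.
    encoding = guess.get("encoding") or fallback

    # Undecodable bytes are replaced rather than raising an exception.
    return data.decode(encoding, errors="replace").encode("utf-8")
```

Usage would be something like utf8_bytes = to_utf8(raw_bytes). Because the detector is passed in as a parameter, another detector with the same return shape can be swapped in without changing the function. The guess can still be wrong, especially for short strings.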

Solution 2

It's a bit late, but there is another option: pyicu.

An example:

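import icu  # provided by the pyicu package

def convert_encoding(data, new_coding='UTF-8'):
    # Ask ICU for its best guess at the encoding of the byte string.
    coding = icu.CharsetDetector(data).detect().getName()
    # Re-encode only if the detected encoding differs from the target.
    if new_coding.upper() != coding.upper():
        data = data.decode(coding).encode(new_coding)
    return data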

Solution 3

If you want to do it with cchardet, you can use this function.

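import cchardet

def convert_encoding(data, new_coding='UTF-8'):
    # detect() returns a dict; 'encoding' is None when no guess could be made.
    encoding = cchardet.detect(data)['encoding']

    # Convert only when an encoding was detected and it differs from the target.
    if encoding and new_coding.upper() != encoding.upper():
        data = data.decode(encoding).encode(new_coding)

    return data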

Solution 4

There is another module called cchardet.

It is said to be faster than chardet.

Note that it requires Cython.

Author by

robos85

Updated on July 08, 2020

Comments

robos85 almost 4 years

Is there any universal method to detect a string's charset? I use IPTC tags whose encoding is unknown. I need to detect it and then convert them to utf-8.

Can anybody help?
robos85 almost 13 years

It doesn't work; I tried it before asking here. Some strings are reported with encoding None, but that can't be right. The tags must be encoded somehow, because one web portal recognizes them.
Bruno Feroleto almost 13 years

+1: chardet seems to be one of the best current ways of doing encoding detection. @robos85: It is not possible to do a perfect encoding detection: stackoverflow.com/questions/436220/….
tchrist about 12 years

I have developed a much more successful way of detecting the encoding, based on knowing the language. It gets the 8-bit encodings right. Finally.
MestreLion over 10 years

@tchrist: care to share?
coanor over 9 years

pyicu is based on icu and will sometimes misdetect an encoding: sourceforge.net/p/icu/mailman/icu-design/thread/…
MestreLion over 9 years

@coanor: any encoding detector will fail in some cases, as there is no way to accurately determine the encoding for all tests
chourn solidet about 3 years

I tried many encoding formats (base64, ...). The result is always ascii.