Detect charset and convert to utf-8 in Python?

75,477

Solution 1

You want to use chardet, an encoding detector. It returns a best guess with a confidence score, not a guarantee.
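For illustration, here is a minimal sketch of how chardet could be used for this. It relies on `chardet.detect()`, which takes bytes and returns a dict with an `encoding` key that can be `None` when detection fails. The function name `to_utf8`, the injectable `detect` parameter, and the `latin-1` fallback are assumptions made for this example, not part of chardet.

```python
def to_utf8(raw, detect=None, fallback="latin-1"):
    """Guess the encoding of `raw` (bytes) and return it re-encoded as UTF-8.

    `detect` is a hypothetical hook: any callable that takes bytes and
    returns a dict with an 'encoding' key. It defaults to chardet.detect.
    """
    if detect is None:
        import chardet  # third-party package: pip install chardet
        detect = chardet.detect

    guess = detect(raw)
    # 'encoding' may be None when the detector cannot decide, so fall back
    # to an assumed default instead of failing on None.
    encoding = guess.get("encoding") or fallback

    # errors="replace" substitutes U+FFFD for undecodable bytes, so a wrong
    # guess degrades the text instead of raising UnicodeDecodeError.
    return raw.decode(encoding, errors="replace").encode("utf-8")
```

Calling `to_utf8(tag_bytes)` on each tag would then give UTF-8 bytes. Short strings give the detector very little to work with, so the guess may well be wrong for them.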

Solution 2

It's a bit late, but there is also another solution: try to use pyicu.

An example:

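import icu

def convert_encoding(data, new_coding='UTF-8'):
    # data must be bytes; ICU returns the name of its best-guess charset
    coding = icu.CharsetDetector(data).detect().getName()
    if new_coding.upper() != coding.upper():
        # Python 3: decode the bytes, then re-encode
        # (the Python 2 version used unicode(data, coding))
        data = data.decode(coding).encode(new_coding)
    return data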

Solution 3

If you want to do it with cchardet, you can use this function.

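import cchardet

def convert_encoding(data, new_coding='UTF-8'):
    # detect() returns a dict; 'encoding' is None when detection fails
    encoding = cchardet.detect(data)['encoding']

    # Leave the data untouched if nothing was detected or it already matches
    if encoding and new_coding.upper() != encoding.upper():
        data = data.decode(encoding).encode(new_coding)

    return data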

Solution 4

There is another module called cchardet

It is said to be faster than chardet.

Note that it requires Cython

Author: robos85

Updated on July 08, 2020

Comments

  • robos85
    robos85 almost 4 years

    Is there any universal method to detect a string's charset? I use IPTC tags whose encoding is unknown. I need to detect it and then convert them to UTF-8.

    Can anybody help?

  • robos85
    robos85 almost 13 years
    It doesn't work, I tried it before asking here. Some strings get None as the encoding, but that can't be right. The tags must be encoded somehow, because one web portal recognizes them.
  • Bruno Feroleto
    Bruno Feroleto almost 13 years
    +1: chardet seems to be one of the best current ways of doing encoding detection. @robos85: It is not possible to do a perfect encoding detection: stackoverflow.com/questions/436220/….
  • tchrist
    tchrist about 12 years
    I have developed a much more successful way of detecting the encoding, based on knowing the language. It gets the 8-bit encodings right. Finally.
  • MestreLion
    MestreLion over 10 years
    @tchrist: care to share?
  • coanor
    coanor over 9 years
    pyicu is based on ICU, and it will sometimes misdetect an encoding: sourceforge.net/p/icu/mailman/icu-design/thread/…
  • MestreLion
    MestreLion over 9 years
    @coanor: any encoding detector will fail in some cases, as there is no way to accurately determine the encoding for all tests
  • chourn solidet
    chourn solidet about 3 years
    I tried many encoding formats (base64, ...). The result is always ascii.
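
As several comments point out, no detector is right on every input. One complementary approach, sketched below using only the standard library, is to try a short list of likely encodings with strict decoding and keep the first one that succeeds. The function name and the candidate list are hypothetical and should be adapted to where the data comes from. Note that `latin-1` accepts any byte sequence, so it only makes sense as the last entry.

```python
def decode_with_candidates(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_name) for the first candidate that decodes `raw`.

    Decoding is strict, so a candidate is rejected as soon as it meets a
    byte sequence that is invalid in that encoding.
    """
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


text, used = decode_with_candidates("Zürich".encode("cp1252"))
utf8_bytes = text.encode("utf-8")
```

This also explains the base64 observation: base64 output consists only of ASCII characters, so a detector reporting ascii for it is correct. Base64 is a binary-to-text encoding, not a character set.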