Detect charset and convert to utf-8 in Python?
Solution 1
You want to use chardet, an encoding detector.
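A minimal sketch of how a chardet-style detector could drive the conversion. The function name `convert_to_utf8`, the `min_confidence` threshold, and the `fallback` encoding are illustrative assumptions, not part of chardet. The `detect` argument is expected to return a dict with "encoding" and "confidence" keys, which is the shape `chardet.detect` returns.

```python
def convert_to_utf8(data, detect, min_confidence=0.5, fallback="utf-8"):
    """Re-encode bytes as UTF-8 using a chardet-style detector.

    `detect` is any callable that takes bytes and returns a dict with
    "encoding" and "confidence" keys (hypothetical parameter name).
    """
    guess = detect(data)
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0
    # The detector may return None or a weak guess; use the fallback then.
    if encoding is None or confidence < min_confidence:
        encoding = fallback
    # errors="replace" keeps the conversion from raising on undecodable bytes.
    return data.decode(encoding, errors="replace").encode("utf-8")
```

With chardet installed, this could be called as `convert_to_utf8(raw_bytes, chardet.detect)`. A wrong guess still produces wrong text, so the threshold only reduces the risk.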
Solution 2
It's a bit late, but there is also another solution: try to use pyicu.
An example:
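import icu

def convert_encoding(data, new_coding='UTF-8'):
    # Ask ICU for its best guess at the encoding of the raw bytes.
    coding = icu.CharsetDetector(data).detect().getName()
    # Re-encode only when the detected encoding differs from the target.
    if new_coding.upper() != coding.upper():
        data = data.decode(coding).encode(new_coding)
    return data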
Solution 3
If you want to do it with cchardet, you can use this function.
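import cchardet

def convert_encoding(data, new_coding='UTF-8'):
    # detect() returns a dict; 'encoding' is None when no guess is made.
    encoding = cchardet.detect(data)['encoding']
    # Leave the data untouched if nothing was detected or it already matches.
    if encoding and new_coding.upper() != encoding.upper():
        data = data.decode(encoding).encode(new_coding)
    return data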
Solution 4
There is another module called cchardet.
It is said to be faster than chardet.
Note that it requires Cython.
Author by
robos85
Updated on July 08, 2020
Comments
-
robos85 almost 4 years
Is there any universal method to detect a string's charset? I use IPTC tags, and their encoding is unknown. I need to detect it and then convert them to utf-8.
Can anybody help?
-
robos85 almost 13 years
It doesn't work; I tried it before asking here. Some strings get None as their encoding, but that is not right. The tags are encoded somehow, because one web portal recognizes them.
-
Bruno Feroleto almost 13 years
+1: chardet seems to be one of the best current ways of doing encoding detection. @robos85: It is not possible to do a perfect encoding detection: stackoverflow.com/questions/436220/….
-
tchrist about 12 years
I have developed a much more successful way of detecting the encoding, based on knowing the language. It gets the 8-bit encodings right. Finally.
-
MestreLion over 10 years
@tchrist: care to share?
-
coanor over 9 years
pyicu is based on ICU, and it will sometimes mis-detect an encoding: sourceforge.net/p/icu/mailman/icu-design/thread/…
-
MestreLion over 9 years
@coanor: any encoding detector will fail in some cases, as there is no way to accurately determine the encoding for all texts.
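A small illustration of why detection can only ever be a guess: the same bytes are valid in several 8-bit encodings and decode to different text in each. The sample bytes and the two candidate encodings are chosen here purely for illustration.

```python
# The same four bytes are valid in both encodings but mean different things.
raw = b"caf\xe9"
candidates = ["latin-1", "cp1251"]

# Decode the identical input under each candidate encoding.
decoded = {name: raw.decode(name) for name in candidates}

for name, text in decoded.items():
    print(name, repr(text))
```

Neither result raises an error, so a detector has to rely on statistics or on knowledge of the expected language to pick one.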
-
chourn solidet about 3 years
I tried many encoding formats (base64, ...). The result is always ascii.