Get a list of all the encodings Python can encode to
Solution 1
Unfortunately, encodings.aliases.aliases.keys() is NOT an appropriate answer. aliases (as one would/should expect) contains several cases where different keys are mapped to the same value, e.g. 1252 and windows_1252 are both mapped to cp1252. You could save time if, instead of aliases.keys(), you use set(aliases.values()).
BUT THERE'S A WORSE PROBLEM: aliases doesn't contain codecs that don't have aliases (like cp856, cp874, cp875, cp737, and koi8_u).
>>> from encodings.aliases import aliases
>>> def find(q):
...     return [(k,v) for k, v in aliases.items() if q in k or q in v]
...
>>> find('1252') # multiple aliases
[('1252', 'cp1252'), ('windows_1252', 'cp1252')]
>>> find('856') # no codepage 856 in aliases
[]
>>> find('koi8') # no koi8_u in aliases
[('cskoi8r', 'koi8_r')]
>>> 'x'.decode('cp856') # but cp856 is a valid codec
u'x'
>>> 'x'.decode('koi8_u') # but koi8_u is a valid codec
u'x'
>>>
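The many-to-one structure of the map is easier to see if you invert it. Here is a minimal Python 3 sketch of that, where aliases_by_codec is just an illustrative helper name and not something the standard library provides:

```python
from collections import defaultdict
from encodings.aliases import aliases


def aliases_by_codec():
    """Group alias names under the canonical codec name they map to."""
    grouped = defaultdict(list)
    for alias, codec in aliases.items():
        grouped[codec].append(alias)
    return {codec: sorted(names) for codec, names in grouped.items()}


grouped = aliases_by_codec()
# Several aliases collapse onto one codec, so there are fewer codecs than keys.
print(len(aliases), "aliases ->", len(grouped), "codecs")
print(grouped["cp1252"])
```

Remember that a codec with no alias never shows up in this mapping at all, which is the bigger problem described above.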
It's also worth noting that however you obtain a full list of codecs, it may be a good idea to ignore the codecs that aren't about encoding/decoding character sets, but do some other transformation, e.g. zlib, quopri, and base64.
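One way to weed those out on Python 3 is sketched below. It relies on str.encode() refusing codecs that are not text encodings (it raises LookupError for things like zlib_codec). The helper name is_text_encoding is made up for this example, and treating any UnicodeError as "not usable" is my own assumption, chosen so that the undefined codec gets dropped too.

```python
def is_text_encoding(name):
    """Return True if `name` can be used to turn a str into bytes."""
    try:
        "a".encode(name)
    except LookupError:
        # Unknown codec, or a bytes-to-bytes / str-to-str transform.
        return False
    except UnicodeError:
        # The codec exists but cannot even encode "a" (e.g. 'undefined').
        return False
    return True


candidates = ["utf_8", "latin_1", "zlib_codec", "base64_codec", "undefined"]
print([name for name in candidates if is_text_encoding(name)])
```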
Which brings us to the question of WHY you want to "try encoding bytes into many different encodings". If we know that, we may be able to steer you in the right direction.
For a start, that's ambiguous. One DEcodes bytes into unicode, and one ENcodes unicode into bytes. Which do you want to do?
What are you really trying to achieve? Are you trying to determine which codec to use to decode some incoming bytes, and plan to attempt this with all possible codecs? [Note: latin1 will decode anything.] Or are you trying to determine the language of some unicode text by trying to encode it with all possible codecs? [Note: utf8 will encode anything.]
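To make the first note concrete, here is a small Python 3 sketch of the "try every codec" idea for decoding. codecs_that_decode is a hypothetical helper, and the three candidate names are only a sample:

```python
def codecs_that_decode(data, candidates):
    """Return the candidate codec names that decode `data` without an error."""
    successes = []
    for name in candidates:
        try:
            data.decode(name)
        except (UnicodeError, LookupError):
            continue
        successes.append(name)
    return successes


snowman = b"\xe2\x98\x83"  # the UTF-8 bytes for U+2603
print(codecs_that_decode(snowman, ["ascii", "utf_8", "latin_1"]))
# prints ['utf_8', 'latin_1']
```

latin_1 succeeds even though the bytes are really UTF-8, so "it decoded without an exception" is weak evidence on its own.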
Solution 2
Other answers here seem to indicate that constructing this list programmatically is difficult and fraught with traps. However, doing so is probably unnecessary since the documentation contains a complete list of the standard encodings Python supports, and has done since Python 2.3.
You can find these lists (for each stable version of the language so far released) at:
- https://docs.python.org/2.3/lib/node130.html
- https://docs.python.org/2.4/lib/standard-encodings.html
- https://docs.python.org/2.5/lib/standard-encodings.html
- https://docs.python.org/2.6/library/codecs.html#standard-encodings
- https://docs.python.org/2.7/library/codecs.html#standard-encodings
- https://docs.python.org/3.0/library/codecs.html#standard-encodings
- https://docs.python.org/3.1/library/codecs.html#standard-encodings
- https://docs.python.org/3.2/library/codecs.html#standard-encodings
- https://docs.python.org/3.3/library/codecs.html#standard-encodings
- https://docs.python.org/3.4/library/codecs.html#standard-encodings
- https://docs.python.org/3.5/library/codecs.html#standard-encodings
- https://docs.python.org/3.6/library/codecs.html#standard-encodings
- https://docs.python.org/3.7/library/codecs.html#standard-encodings
Below are the lists for each documented version of Python. Note that if you want backwards-compatibility rather than just supporting a particular version of Python, you can just copy the list from the latest Python version and check whether each encoding exists in the Python running your program before trying to use it.
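The existence check could look something like this sketch, which assumes codecs.lookup() is an acceptable test for "exists in the running Python". available_encodings is an illustrative name, and the short documented list stands in for whichever full list you copy:

```python
import codecs


def available_encodings(names):
    """Keep only the names the running interpreter can resolve to a codec."""
    available = []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            continue  # not supported by this Python version or platform
        available.append(name)
    return available


# Stand-in for a list copied from the newest docs.
documented = ["ascii", "cp037", "koi8_t", "kz1048", "utf_8", "utf_8_sig"]
print(available_encodings(documented))
```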
Python 2.3 (59 encodings)
['ascii',
'cp037',
'cp424',
'cp437',
'cp500',
'cp737',
'cp775',
'cp850',
'cp852',
'cp855',
'cp856',
'cp857',
'cp860',
'cp861',
'cp862',
'cp863',
'cp864',
'cp865',
'cp869',
'cp874',
'cp875',
'cp1006',
'cp1026',
'cp1140',
'cp1250',
'cp1251',
'cp1252',
'cp1253',
'cp1254',
'cp1255',
'cp1256',
'cp1257',
'cp1258',
'latin_1',
'iso8859_2',
'iso8859_3',
'iso8859_4',
'iso8859_5',
'iso8859_6',
'iso8859_7',
'iso8859_8',
'iso8859_9',
'iso8859_10',
'iso8859_13',
'iso8859_14',
'iso8859_15',
'koi8_r',
'koi8_u',
'mac_cyrillic',
'mac_greek',
'mac_iceland',
'mac_latin2',
'mac_roman',
'mac_turkish',
'utf_16',
'utf_16_be',
'utf_16_le',
'utf_7',
'utf_8']
Python 2.4 (85 encodings)
['ascii',
'big5',
'big5hkscs',
'cp037',
'cp424',
'cp437',
'cp500',
'cp737',
'cp775',
'cp850',
'cp852',
'cp855',
'cp856',
'cp857',
'cp860',
'cp861',
'cp862',
'cp863',
'cp864',
'cp865',
'cp866',
'cp869',
'cp874',
'cp875',
'cp932',
'cp949',
'cp950',
'cp1006',
'cp1026',
'cp1140',
'cp1250',
'cp1251',
'cp1252',
'cp1253',
'cp1254',
'cp1255',
'cp1256',
'cp1257',
'cp1258',
'euc_jp',
'euc_jis_2004',
'euc_jisx0213',
'euc_kr',
'gb2312',
'gbk',
'gb18030',
'hz',
'iso2022_jp',
'iso2022_jp_1',
'iso2022_jp_2',
'iso2022_jp_2004',
'iso2022_jp_3',
'iso2022_jp_ext',
'iso2022_kr',
'latin_1',
'iso8859_2',
'iso8859_3',
'iso8859_4',
'iso8859_5',
'iso8859_6',
'iso8859_7',
'iso8859_8',
'iso8859_9',
'iso8859_10',
'iso8859_13',
'iso8859_14',
'iso8859_15',
'johab',
'koi8_r',
'koi8_u',
'mac_cyrillic',
'mac_greek',
'mac_iceland',
'mac_latin2',
'mac_roman',
'mac_turkish',
'ptcp154',
'shift_jis',
'shift_jis_2004',
'shift_jisx0213',
'utf_16',
'utf_16_be',
'utf_16_le',
'utf_7',
'utf_8']
Python 2.5 (86 encodings)
['ascii',
'big5',
'big5hkscs',
'cp037',
'cp424',
'cp437',
'cp500',
'cp737',
'cp775',
'cp850',
'cp852',
'cp855',
'cp856',
'cp857',
'cp860',
'cp861',
'cp862',
'cp863',
'cp864',
'cp865',
'cp866',
'cp869',
'cp874',
'cp875',
'cp932',
'cp949',
'cp950',
'cp1006',
'cp1026',
'cp1140',
'cp1250',
'cp1251',
'cp1252',
'cp1253',
'cp1254',
'cp1255',
'cp1256',
'cp1257',
'cp1258',
'euc_jp',
'euc_jis_2004',
'euc_jisx0213',
'euc_kr',
'gb2312',
'gbk',
'gb18030',
'hz',
'iso2022_jp',
'iso2022_jp_1',
'iso2022_jp_2',
'iso2022_jp_2004',
'iso2022_jp_3',
'iso2022_jp_ext',
'iso2022_kr',
'latin_1',
'iso8859_2',
'iso8859_3',
'iso8859_4',
'iso8859_5',
'iso8859_6',
'iso8859_7',
'iso8859_8',
'iso8859_9',
'iso8859_10',
'iso8859_13',
'iso8859_14',
'iso8859_15',
'johab',
'koi8_r',
'koi8_u',
'mac_cyrillic',
'mac_greek',
'mac_iceland',
'mac_latin2',
'mac_roman',
'mac_turkish',
'ptcp154',
'shift_jis',
'shift_jis_2004',
'shift_jisx0213',
'utf_16',
'utf_16_be',
'utf_16_le',
'utf_7',
'utf_8',
'utf_8_sig']
Python 2.6 (90 encodings)
['ascii',
'big5',
'big5hkscs',
'cp037',
'cp424',
'cp437',
'cp500',
'cp737',
'cp775',
'cp850',
'cp852',
'cp855',
'cp856',
'cp857',
'cp860',
'cp861',
'cp862',
'cp863',
'cp864',
'cp865',
'cp866',
'cp869',
'cp874',
'cp875',
'cp932',
'cp949',
'cp950',
'cp1006',
'cp1026',
'cp1140',
'cp1250',
'cp1251',
'cp1252',
'cp1253',
'cp1254',
'cp1255',
'cp1256',
'cp1257',
'cp1258',
'euc_jp',
'euc_jis_2004',
'euc_jisx0213',
'euc_kr',
'gb2312',
'gbk',
'gb18030',
'hz',
'iso2022_jp',
'iso2022_jp_1',
'iso2022_jp_2',
'iso2022_jp_2004',
'iso2022_jp_3',
'iso2022_jp_ext',
'iso2022_kr',
'latin_1',
'iso8859_2',
'iso8859_3',
'iso8859_4',
'iso8859_5',
'iso8859_6',
'iso8859_7',
'iso8859_8',
'iso8859_9',
'iso8859_10',
'iso8859_13',
'iso8859_14',
'iso8859_15',
'iso8859_16',
'johab',
'koi8_r',
'koi8_u',
'mac_cyrillic',
'mac_greek',
'mac_iceland',
'mac_latin2',
'mac_roman',
'mac_turkish',
'ptcp154',
'shift_jis',
'shift_jis_2004',
'shift_jisx0213',
'utf_32',
'utf_32_be',
'utf_32_le',
'utf_16',
'utf_16_be',
'utf_16_le',
'utf_7',
'utf_8',
'utf_8_sig']
Python 2.7 (93 encodings)
['ascii',
'big5',
'big5hkscs',
'cp037',
'cp424',
'cp437',
'cp500',
'cp720',
'cp737',
'cp775',
'cp850',
'cp852',
'cp855',
'cp856',
'cp857',
'cp858',
'cp860',
'cp861',
'cp862',
'cp863',
'cp864',
'cp865',
'cp866',
'cp869',
'cp874',
'cp875',
'cp932',
'cp949',
'cp950',
'cp1006',
'cp1026',
'cp1140',
'cp1250',
'cp1251',
'cp1252',
'cp1253',
'cp1254',
'cp1255',
'cp1256',
'cp1257',
'cp1258',
'euc_jp',
'euc_jis_2004',
'euc_jisx0213',
'euc_kr',
'gb2312',
'gbk',
'gb18030',
'hz',
'iso2022_jp',
'iso2022_jp_1',
'iso2022_jp_2',
'iso2022_jp_2004',
'iso2022_jp_3',
'iso2022_jp_ext',
'iso2022_kr',
'latin_1',
'iso8859_2',
'iso8859_3',
'iso8859_4',
'iso8859_5',
'iso8859_6',
'iso8859_7',
'iso8859_8',
'iso8859_9',
'iso8859_10',
'iso8859_11',
'iso8859_13',
'iso8859_14',
'iso8859_15',
'iso8859_16',
'johab',
'koi8_r',
'koi8_u',
'mac_cyrillic',
'mac_greek',
'mac_iceland',
'mac_latin2',
'mac_roman',
'mac_turkish',
'ptcp154',
'shift_jis',
'shift_jis_2004',
'shift_jisx0213',
'utf_32',
'utf_32_be',
'utf_32_le',
'utf_16',
'utf_16_be',
'utf_16_le',
'utf_7',
'utf_8',
'utf_8_sig']
Python 3.0 (89 encodings)
['ascii',
'big5',
'big5hkscs',
'cp037',
'cp424',
'cp437',
'cp500',
'cp737',
'cp775',
'cp850',
'cp852',
'cp855',
'cp856',
'cp857',
'cp860',
'cp861',
'cp862',
'cp863',
'cp864',
'cp865',
'cp866',
'cp869',
'cp874',
'cp875',
'cp932',
'cp949',
'cp950',
'cp1006',
'cp1026',
'cp1140',
'cp1250',
'cp1251',
'cp1252',
'cp1253',
'cp1254',
'cp1255',
'cp1256',
'cp1257',
'cp1258',
'euc_jp',
'euc_jis_2004',
'euc_jisx0213',
'euc_kr',
'gb2312',
'gbk',
'gb18030',
'hz',
'iso2022_jp',
'iso2022_jp_1',
'iso2022_jp_2',
'iso2022_jp_2004',
'iso2022_jp_3',
'iso2022_jp_ext',
'iso2022_kr',
'latin_1',
'iso8859_2',
'iso8859_3',
'iso8859_4',
'iso8859_5',
'iso8859_6',
'iso8859_7',
'iso8859_8',
'iso8859_9',
'iso8859_10',
'iso8859_13',
'iso8859_14',
'iso8859_15',
'johab',
'koi8_r',
'koi8_u',
'mac_cyrillic',
'mac_greek',
'mac_iceland',
'mac_latin2',
'mac_roman',
'mac_turkish',
'ptcp154',
'shift_jis',
'shift_jis_2004',
'shift_jisx0213',
'utf_32',
'utf_32_be',
'utf_32_le',
'utf_16',
'utf_16_be',
'utf_16_le',
'utf_7',
'utf_8',
'utf_8_sig']
Python 3.1 (90 encodings)
['ascii',
'big5',
'big5hkscs',
'cp037',
'cp424',
'cp437',
'cp500',
'cp737',
'cp775',
'cp850',
'cp852',
'cp855',
'cp856',
'cp857',
'cp860',
'cp861',
'cp862',
'cp863',
'cp864',
'cp865',
'cp866',
'cp869',
'cp874',
'cp875',
'cp932',
'cp949',
'cp950',
'cp1006',
'cp1026',
'cp1140',
'cp1250',
'cp1251',
'cp1252',
'cp1253',
'cp1254',
'cp1255',
'cp1256',
'cp1257',
'cp1258',
'euc_jp',
'euc_jis_2004',
'euc_jisx0213',
'euc_kr',
'gb2312',
'gbk',
'gb18030',
'hz',
'iso2022_jp',
'iso2022_jp_1',
'iso2022_jp_2',
'iso2022_jp_2004',
'iso2022_jp_3',
'iso2022_jp_ext',
'iso2022_kr',
'latin_1',
'iso8859_2',
'iso8859_3',
'iso8859_4',
'iso8859_5',
'iso8859_6',
'iso8859_7',
'iso8859_8',
'iso8859_9',
'iso8859_10',
'iso8859_13',
'iso8859_14',
'iso8859_15',
'iso8859_16',
'johab',
'koi8_r',
'koi8_u',
'mac_cyrillic',
'mac_greek',
'mac_iceland',
'mac_latin2',
'mac_roman',
'mac_turkish',
'ptcp154',
'shift_jis',
'shift_jis_2004',
'shift_jisx0213',
'utf_32',
'utf_32_be',
'utf_32_le',
'utf_16',
'utf_16_be',
'utf_16_le',
'utf_7',
'utf_8',
'utf_8_sig']
Python 3.2 (92 encodings)
['ascii',
'big5',
'big5hkscs',
'cp037',
'cp424',
'cp437',
'cp500',
'cp720',
'cp737',
'cp775',
'cp850',
'cp852',
'cp855',
'cp856',
'cp857',
'cp858',
'cp860',
'cp861',
'cp862',
'cp863',
'cp864',
'cp865',
'cp866',
'cp869',
'cp874',
'cp875',
'cp932',
'cp949',
'cp950',
'cp1006',
'cp1026',
'cp1140',
'cp1250',
'cp1251',
'cp1252',
'cp1253',
'cp1254',
'cp1255',
'cp1256',
'cp1257',
'cp1258',
'euc_jp',
'euc_jis_2004',
'euc_jisx0213',
'euc_kr',
'gb2312',
'gbk',
'gb18030',
'hz',
'iso2022_jp',
'iso2022_jp_1',
'iso2022_jp_2',
'iso2022_jp_2004',
'iso2022_jp_3',
'iso2022_jp_ext',
'iso2022_kr',
'latin_1',
'iso8859_2',
'iso8859_3',
'iso8859_4',
'iso8859_5',
'iso8859_6',
'iso8859_7',
'iso8859_8',
'iso8859_9',
'iso8859_10',
'iso8859_13',
'iso8859_14',
'iso8859_15',
'iso8859_16',
'johab',
'koi8_r',
'koi8_u',
'mac_cyrillic',
'mac_greek',
'mac_iceland',
'mac_latin2',
'mac_roman',
'mac_turkish',
'ptcp154',
'shift_jis',
'shift_jis_2004',
'shift_jisx0213',
'utf_32',
'utf_32_be',
'utf_32_le',
'utf_16',
'utf_16_be',
'utf_16_le',
'utf_7',
'utf_8',
'utf_8_sig']
Python 3.3 (93 encodings)
['ascii',
'big5',
'big5hkscs',
'cp037',
'cp424',
'cp437',
'cp500',
'cp720',
'cp737',
'cp775',
'cp850',
'cp852',
'cp855',
'cp856',
'cp857',
'cp858',
'cp860',
'cp861',
'cp862',
'cp863',
'cp864',
'cp865',
'cp866',
'cp869',
'cp874',
'cp875',
'cp932',
'cp949',
'cp950',
'cp1006',
'cp1026',
'cp1140',
'cp1250',
'cp1251',
'cp1252',
'cp1253',
'cp1254',
'cp1255',
'cp1256',
'cp1257',
'cp1258',
'cp65001',
'euc_jp',
'euc_jis_2004',
'euc_jisx0213',
'euc_kr',
'gb2312',
'gbk',
'gb18030',
'hz',
'iso2022_jp',
'iso2022_jp_1',
'iso2022_jp_2',
'iso2022_jp_2004',
'iso2022_jp_3',
'iso2022_jp_ext',
'iso2022_kr',
'latin_1',
'iso8859_2',
'iso8859_3',
'iso8859_4',
'iso8859_5',
'iso8859_6',
'iso8859_7',
'iso8859_8',
'iso8859_9',
'iso8859_10',
'iso8859_13',
'iso8859_14',
'iso8859_15',
'iso8859_16',
'johab',
'koi8_r',
'koi8_u',
'mac_cyrillic',
'mac_greek',
'mac_iceland',
'mac_latin2',
'mac_roman',
'mac_turkish',
'ptcp154',
'shift_jis',
'shift_jis_2004',
'shift_jisx0213',
'utf_32',
'utf_32_be',
'utf_32_le',
'utf_16',
'utf_16_be',
'utf_16_le',
'utf_7',
'utf_8',
'utf_8_sig']
Python 3.4 (96 encodings)
['ascii',
'big5',
'big5hkscs',
'cp037',
'cp273',
'cp424',
'cp437',
'cp500',
'cp720',
'cp737',
'cp775',
'cp850',
'cp852',
'cp855',
'cp856',
'cp857',
'cp858',
'cp860',
'cp861',
'cp862',
'cp863',
'cp864',
'cp865',
'cp866',
'cp869',
'cp874',
'cp875',
'cp932',
'cp949',
'cp950',
'cp1006',
'cp1026',
'cp1125',
'cp1140',
'cp1250',
'cp1251',
'cp1252',
'cp1253',
'cp1254',
'cp1255',
'cp1256',
'cp1257',
'cp1258',
'cp65001',
'euc_jp',
'euc_jis_2004',
'euc_jisx0213',
'euc_kr',
'gb2312',
'gbk',
'gb18030',
'hz',
'iso2022_jp',
'iso2022_jp_1',
'iso2022_jp_2',
'iso2022_jp_2004',
'iso2022_jp_3',
'iso2022_jp_ext',
'iso2022_kr',
'latin_1',
'iso8859_2',
'iso8859_3',
'iso8859_4',
'iso8859_5',
'iso8859_6',
'iso8859_7',
'iso8859_8',
'iso8859_9',
'iso8859_10',
'iso8859_11',
'iso8859_13',
'iso8859_14',
'iso8859_15',
'iso8859_16',
'johab',
'koi8_r',
'koi8_u',
'mac_cyrillic',
'mac_greek',
'mac_iceland',
'mac_latin2',
'mac_roman',
'mac_turkish',
'ptcp154',
'shift_jis',
'shift_jis_2004',
'shift_jisx0213',
'utf_32',
'utf_32_be',
'utf_32_le',
'utf_16',
'utf_16_be',
'utf_16_le',
'utf_7',
'utf_8',
'utf_8_sig']
Python 3.5 (98 encodings)
['ascii',
'big5',
'big5hkscs',
'cp037',
'cp273',
'cp424',
'cp437',
'cp500',
'cp720',
'cp737',
'cp775',
'cp850',
'cp852',
'cp855',
'cp856',
'cp857',
'cp858',
'cp860',
'cp861',
'cp862',
'cp863',
'cp864',
'cp865',
'cp866',
'cp869',
'cp874',
'cp875',
'cp932',
'cp949',
'cp950',
'cp1006',
'cp1026',
'cp1125',
'cp1140',
'cp1250',
'cp1251',
'cp1252',
'cp1253',
'cp1254',
'cp1255',
'cp1256',
'cp1257',
'cp1258',
'cp65001',
'euc_jp',
'euc_jis_2004',
'euc_jisx0213',
'euc_kr',
'gb2312',
'gbk',
'gb18030',
'hz',
'iso2022_jp',
'iso2022_jp_1',
'iso2022_jp_2',
'iso2022_jp_2004',
'iso2022_jp_3',
'iso2022_jp_ext',
'iso2022_kr',
'latin_1',
'iso8859_2',
'iso8859_3',
'iso8859_4',
'iso8859_5',
'iso8859_6',
'iso8859_7',
'iso8859_8',
'iso8859_9',
'iso8859_10',
'iso8859_11',
'iso8859_13',
'iso8859_14',
'iso8859_15',
'iso8859_16',
'johab',
'koi8_r',
'koi8_t',
'koi8_u',
'kz1048',
'mac_cyrillic',
'mac_greek',
'mac_iceland',
'mac_latin2',
'mac_roman',
'mac_turkish',
'ptcp154',
'shift_jis',
'shift_jis_2004',
'shift_jisx0213',
'utf_32',
'utf_32_be',
'utf_32_le',
'utf_16',
'utf_16_be',
'utf_16_le',
'utf_7',
'utf_8',
'utf_8_sig']
Python 3.6 (98 encodings)
['ascii',
'big5',
'big5hkscs',
'cp037',
'cp273',
'cp424',
'cp437',
'cp500',
'cp720',
'cp737',
'cp775',
'cp850',
'cp852',
'cp855',
'cp856',
'cp857',
'cp858',
'cp860',
'cp861',
'cp862',
'cp863',
'cp864',
'cp865',
'cp866',
'cp869',
'cp874',
'cp875',
'cp932',
'cp949',
'cp950',
'cp1006',
'cp1026',
'cp1125',
'cp1140',
'cp1250',
'cp1251',
'cp1252',
'cp1253',
'cp1254',
'cp1255',
'cp1256',
'cp1257',
'cp1258',
'cp65001',
'euc_jp',
'euc_jis_2004',
'euc_jisx0213',
'euc_kr',
'gb2312',
'gbk',
'gb18030',
'hz',
'iso2022_jp',
'iso2022_jp_1',
'iso2022_jp_2',
'iso2022_jp_2004',
'iso2022_jp_3',
'iso2022_jp_ext',
'iso2022_kr',
'latin_1',
'iso8859_2',
'iso8859_3',
'iso8859_4',
'iso8859_5',
'iso8859_6',
'iso8859_7',
'iso8859_8',
'iso8859_9',
'iso8859_10',
'iso8859_11',
'iso8859_13',
'iso8859_14',
'iso8859_15',
'iso8859_16',
'johab',
'koi8_r',
'koi8_t',
'koi8_u',
'kz1048',
'mac_cyrillic',
'mac_greek',
'mac_iceland',
'mac_latin2',
'mac_roman',
'mac_turkish',
'ptcp154',
'shift_jis',
'shift_jis_2004',
'shift_jisx0213',
'utf_32',
'utf_32_be',
'utf_32_le',
'utf_16',
'utf_16_be',
'utf_16_le',
'utf_7',
'utf_8',
'utf_8_sig']
Python 3.7 (98 encodings)
['ascii',
'big5',
'big5hkscs',
'cp037',
'cp273',
'cp424',
'cp437',
'cp500',
'cp720',
'cp737',
'cp775',
'cp850',
'cp852',
'cp855',
'cp856',
'cp857',
'cp858',
'cp860',
'cp861',
'cp862',
'cp863',
'cp864',
'cp865',
'cp866',
'cp869',
'cp874',
'cp875',
'cp932',
'cp949',
'cp950',
'cp1006',
'cp1026',
'cp1125',
'cp1140',
'cp1250',
'cp1251',
'cp1252',
'cp1253',
'cp1254',
'cp1255',
'cp1256',
'cp1257',
'cp1258',
'cp65001',
'euc_jp',
'euc_jis_2004',
'euc_jisx0213',
'euc_kr',
'gb2312',
'gbk',
'gb18030',
'hz',
'iso2022_jp',
'iso2022_jp_1',
'iso2022_jp_2',
'iso2022_jp_2004',
'iso2022_jp_3',
'iso2022_jp_ext',
'iso2022_kr',
'latin_1',
'iso8859_2',
'iso8859_3',
'iso8859_4',
'iso8859_5',
'iso8859_6',
'iso8859_7',
'iso8859_8',
'iso8859_9',
'iso8859_10',
'iso8859_11',
'iso8859_13',
'iso8859_14',
'iso8859_15',
'iso8859_16',
'johab',
'koi8_r',
'koi8_t',
'koi8_u',
'kz1048',
'mac_cyrillic',
'mac_greek',
'mac_iceland',
'mac_latin2',
'mac_roman',
'mac_turkish',
'ptcp154',
'shift_jis',
'shift_jis_2004',
'shift_jisx0213',
'utf_32',
'utf_32_be',
'utf_32_le',
'utf_16',
'utf_16_be',
'utf_16_le',
'utf_7',
'utf_8',
'utf_8_sig']
In case they're relevant to anyone's use case, note that the docs also list some Python-specific encodings, many of which seem to be primarily for use by Python's internals or are otherwise weird in some way, like the 'undefined' encoding, which always throws an exception if you try to use it. You probably want to ignore these completely if, like the question-asker here, you're trying to figure out what encoding was used for some text you've come across in the real world. As of Python 3.7, the list is as follows:
["idna",
"mbcs",
"oem",
"palmos",
"punycode",
"raw_unicode_escape",
"rot_13",
"undefined",
"unicode_escape",
"unicode_internal",
"base64_codec",
"bz2_codec",
"hex_codec",
"quopri_codec",
"uu_codec",
"zlib_codec"]
Some older Python versions had a string_escape special encoding that I've not included in the above list because it's been removed from the language.
Finally, in case you'd like to update my tables above for a newer version of Python, here's the (crude, not very robust) script I used to generate them:
import requests
import lxml.html
import pprint

# (version, URL of the docs page that holds the "Standard Encodings" table)
for version, url in [
    ('2.3', 'https://docs.python.org/2.3/lib/node130.html'),
    ('2.4', 'https://docs.python.org/2.4/lib/standard-encodings.html'),
    ('2.5', 'https://docs.python.org/2.5/lib/standard-encodings.html'),
    ('2.6', 'https://docs.python.org/2.6/library/codecs.html#standard-encodings'),
    ('2.7', 'https://docs.python.org/2.7/library/codecs.html#standard-encodings'),
    ('3.0', 'https://docs.python.org/3.0/library/codecs.html#standard-encodings'),
    ('3.1', 'https://docs.python.org/3.1/library/codecs.html#standard-encodings'),
    ('3.2', 'https://docs.python.org/3.2/library/codecs.html#standard-encodings'),
    ('3.3', 'https://docs.python.org/3.3/library/codecs.html#standard-encodings'),
    ('3.4', 'https://docs.python.org/3.4/library/codecs.html#standard-encodings'),
    ('3.5', 'https://docs.python.org/3.5/library/codecs.html#standard-encodings'),
    ('3.6', 'https://docs.python.org/3.6/library/codecs.html#standard-encodings'),
    ('3.7', 'https://docs.python.org/3.7/library/codecs.html#standard-encodings'),
]:
    html = requests.get(url).text
    doc = lxml.html.fromstring(html)
    # Take the first table that follows the "Standard Encodings" heading
    standard_encodings_table = doc.xpath(
        '//table[preceding::h2[.//text()[contains(., "Standard Encodings")]]][//th/text()="Codec"]'
    )[0]
    # The codec name is the text of the first cell in each row
    codecs = standard_encodings_table.xpath('.//td[1]/text()')
    print("## Python %s (%i encodings)" % (version, len(codecs)))
    print('<pre><code>' + pprint.pformat(codecs) + '</code></pre>')
Solution 3
Maybe you should try using the Universal Encoding Detector (chardet) library instead of implementing it yourself.
>>> import chardet
>>> s = b'\xe2\x98\x83' # ☃
>>> chardet.detect(s)
{'confidence': 0.505, 'encoding': 'utf-8'}
Solution 4
You could use a technique to list all modules in the encodings package.
import pkgutil
import encodings
false_positives = set(["aliases"])
found = set(name for imp, name, ispkg in pkgutil.iter_modules(encodings.__path__) if not ispkg)
found.difference_update(false_positives)
print(found)
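A caveat with this approach: a module being present in the package doesn't guarantee the codec is usable on your platform (mbcs, for example, only works on Windows). Below is a Python 3 sketch that lists the modules and then keeps only the names codecs.lookup() accepts. The function names and the NOT_CODECS set are my own, not part of any library.

```python
import codecs
import encodings
import pkgutil

NOT_CODECS = {"aliases"}  # helper modules that live alongside the codecs


def encoding_module_names():
    """Names of the plain modules inside the encodings package."""
    names = set()
    for _finder, name, ispkg in pkgutil.iter_modules(encodings.__path__):
        if not ispkg and name not in NOT_CODECS:
            names.add(name)
    return names


def loadable_codecs(names):
    """Subset of `names` that the running interpreter resolves to a codec."""
    loadable = set()
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            continue  # module exists but is not a usable codec here
        loadable.add(name)
    return loadable


print(sorted(loadable_codecs(encoding_module_names())))
```

This still includes the non-text codecs such as rot_13 and zlib_codec, so you may want to filter further depending on what you're doing.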
Solution 5
I doubt there is such a method in the codecs module, but if you look at encodings/__init__.py, the search function searches through the encodings package folder, so you can do the same, e.g.
>>> import os, encodings
>>> os.listdir(os.path.dirname(encodings.__file__))
['cp500.pyc', 'utf_16_le.py', 'gb18030.py', 'mbcs.pyc', 'undefined.pyc', 'idna.pyc', 'punycode.pyc', 'cp850.py', 'big5hkscs.pyc', 'mac_arabic.py', '__init__.pyc', 'string_escape.py', 'hz.py', 'cp037.py', 'cp737.py', 'iso8859_5.pyc', 'iso8859_13.pyc', 'cp861.pyc', 'cp862.py', 'iso8859_9.pyc', 'cp949.py', 'base64_codec.pyc', 'koi8_r.py', 'iso8859_2.py', 'ptcp154.pyc', 'uu_codec.pyc', 'mac_croatian.pyc', 'charmap.pyc', 'iso8859_15.pyc', 'euc_jp.py', 'cp1250.py', 'iso8859_10.pyc', 'koi8_r.pyc', 'unicode_escape.pyc', 'cp863.pyc', 'iso8859_4.pyc', 'cp852.py', 'unicode_internal.py', 'big5hkscs.py', 'cp1257.pyc', 'cp1254.py', 'shift_jisx0213.py', 'shift_jis.pyc', 'cp869.pyc', 'hp_roman8.py', 'iso8859_4.py', 'cp775.py', 'cp1251.py', 'mac_cyrillic.pyc', 'mac_greek.pyc', 'mac_roman.pyc', 'iso8859_11.pyc', 'iso8859_6.py', 'utf_8_sig.py', 'iso8859_3.py', 'iso2022_jp_1.py', 'ascii.py', 'cp1026.pyc', 'cp1250.pyc', 'cp950.py', 'raw_unicode_escape.py', 'euc_jis_2004.pyc', 'cp775.pyc', 'euc_kr.py',
'mac_greek.py', 'big5.pyc', 'shift_jis_2004.pyc', 'gbk.pyc', 'cp1254.pyc', 'cp1255.pyc', 'cp855.pyc', 'string_escape.pyc', 'cp949.pyc', 'cp1258.pyc', 'iso8859_3.pyc', 'mac_iceland.pyc', 'cp1251.pyc', 'cp860.py', 'cp856.py', 'cp874.py', 'iso2022_kr.py', 'cp856.pyc', 'rot_13.py', 'palmos.py', 'iso2022_jp_2.pyc', 'mac_farsi.py', 'koi8_u.pyc', 'cp1256.py', 'iso8859_10.py', 'tis_620.py', 'iso8859_14.pyc', 'cp1253.py', 'cp1258.py', 'cp437.py', 'cp862.pyc', 'mac_turkish.py', 'undefined.py', 'euc_kr.pyc', 'gb18030.pyc', 'aliases.pyc', 'iso8859_9.py', 'uu_codec.py', 'gbk.py', 'quopri_codec.pyc', 'iso8859_7.py', 'mac_iceland.py', 'iso8859_2.pyc', 'euc_jis_2004.py', 'iso2022_jp_3.pyc', 'cp874.pyc', '__init__.py', 'mac_roman.py', 'iso8859_16.py', 'cp866.py', 'unicode_internal.pyc', 'mac_turkish.pyc', 'johab.pyc', 'cp037.pyc', 'punycode.py', 'cp1253.pyc', 'euc_jisx0213.pyc', 'iso2022_jp_2004.pyc', 'iso2022_kr.pyc', 'zlib_codec.pyc', 'cp932.py', 'cp1255.py', 'iso2022_jp_1.pyc', 'cp857.pyc', 'cp424.pyc',
'iso2022_jp_2.py', 'iso2022_jp.pyc', 'mbcs.py', 'utf_8.py', 'palmos.pyc', 'cp1252.pyc', 'aliases.py', 'quopri_codec.py', 'latin_1.pyc', 'iso2022_jp.py', 'zlib_codec.py', 'cp1026.py', 'cp860.pyc', 'cp1252.py', 'hex_codec.pyc', 'iso8859_1.pyc', 'cp850.pyc', 'cp861.py', 'iso8859_15.py', 'cp865.pyc', 'hp_roman8.pyc', 'iso8859_7.pyc', 'mac_latin2.py', 'iso8859_11.py', 'mac_centeuro.pyc', 'iso8859_6.pyc', 'ascii.pyc', 'mac_centeuro.py', 'iso2022_jp_3.py', 'bz2_codec.py', 'mac_arabic.pyc', 'euc_jisx0213.py', 'tis_620.pyc', 'shift_jis_2004.py', 'utf_8.pyc', 'cp855.py', 'mac_romanian.pyc', 'iso8859_8.py', 'cp869.py', 'ptcp154.py', 'utf_16_be.py', 'iso2022_jp_ext.pyc', 'bz2_codec.pyc', 'base64_codec.py', 'latin_1.py', 'charmap.py', 'hz.pyc', 'cp950.pyc', 'cp875.pyc', 'cp1006.pyc', 'utf_16.py', 'shift_jisx0213.pyc', 'cp424.py', 'cp932.pyc', 'iso8859_5.py', 'mac_romanian.py', 'utf_8_sig.pyc', 'iso8859_1.py', 'cp875.py', 'cp437.pyc', 'cp865.py', 'utf_7.py', 'utf_16_be.pyc', 'rot_13.pyc',
'euc_jp.pyc', 'raw_unicode_escape.pyc', 'iso8859_8.pyc', 'utf_16.pyc', 'iso8859_14.py', 'iso8859_16.pyc', 'cp852.pyc', 'cp737.pyc', 'mac_croatian.py', 'mac_latin2.pyc', 'iso2022_jp_ext.py', 'cp1140.py', 'mac_cyrillic.py', 'cp1257.py', 'cp500.py', 'cp1140.pyc', 'shift_jis.py', 'unicode_escape.py', 'cp864.py', 'cp864.pyc', 'cp857.py', 'hex_codec.py', 'mac_farsi.pyc', 'idna.py', 'johab.py', 'utf_7.pyc', 'cp863.py', 'iso8859_13.py', 'koi8_u.py', 'gb2312.pyc', 'cp1256.pyc', 'cp866.pyc', 'iso2022_jp_2004.py', 'utf_16_le.pyc', 'gb2312.py', 'cp1006.py', 'big5.py']
But since anybody can register a codec, that won't be an exhaustive list.
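To illustrate why a directory listing can't be exhaustive, here is a Python 3 sketch that registers a toy codec through codecs.register(). The codec name demo_upper and its behaviour (upper-casing ASCII text) are invented purely for the demonstration. No file for it exists in the encodings folder, yet Python will happily use it.

```python
import codecs


def _demo_search(name):
    """Search function: return a CodecInfo for 'demo_upper', else None."""
    if name != "demo_upper":  # hypothetical codec name
        return None

    def encode(text, errors="strict"):
        # Must return (encoded bytes, number of characters consumed).
        return text.upper().encode("ascii", errors), len(text)

    def decode(data, errors="strict"):
        # Must return (decoded str, number of bytes consumed).
        return bytes(data).decode("ascii", errors), len(data)

    return codecs.CodecInfo(encode=encode, decode=decode, name="demo_upper")


codecs.register(_demo_search)

print("abc".encode("demo_upper"))
print(codecs.lookup("demo_upper").name)
```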
Comments
-
Amandasaurus almost 2 years
I am writing a script that will try encoding bytes into many different encodings in Python 2.6. Is there some way to get a list of available encodings that I can iterate over?
The reason I'm trying to do this is because a user has some text that is not encoded correctly. There are funny characters. I know the unicode character that's messing it up. I want to be able to give them an answer like "Your text editor is interpreting that string as X encoding, not Y encoding". I thought I would try to encode that character using one encoding, then decode it again using another encoding, and see if we get the same character sequence.
i.e. something like this:
for encoding1, encoding2 in itertools.permutations(encodinglist(), 2):
    try:
        unicode_string = my_unicode_character.encode(encoding1).decode(encoding2)
    except:
        pass
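A slightly fuller Python 3 sketch of that idea follows, under the assumption that you know both the intended character and the garbled text the user actually sees. find_mojibake_pairs is a hypothetical helper, and the list of encodings is whatever list you settle on from the answers.

```python
import itertools


def find_mojibake_pairs(original, garbled, encodings):
    """Find (written_as, read_as) pairs that turn `original` into `garbled`."""
    matches = []
    for written_as, read_as in itertools.permutations(encodings, 2):
        try:
            candidate = original.encode(written_as).decode(read_as)
        except (UnicodeError, LookupError):
            continue  # this pair cannot represent or interpret the text
        if candidate == garbled:
            matches.append((written_as, read_as))
    return matches


# U+00E9 written as UTF-8 but read as Latin-1 shows up as U+00C3 U+00A9.
pairs = find_mojibake_pairs("\u00e9", "\u00c3\u00a9", ["utf_8", "latin_1", "ascii"])
print(pairs)
# prints [('utf_8', 'latin_1')]
```

With a long encoding list, several pairs can match the same garbled text, so treat the result as a set of candidates rather than a single answer.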
-
John Machin over 14 years
This doesn't produce a full list -- codecs that don't have aliases aren't mentioned in the aliases map.
-
Amandasaurus over 14 years
I've expanded the question with details of what I'm trying to achieve.
-
Admin about 11 years
This is plain wrong. There is "1251" and "windows_1251", but you list "cp1251". Ahem, it does not work.
-
Mark Amery over 9 years
@user649198 I have no idea what you're talking about; cp1251 exists (windows-1251 is an alias of it) and is supported in Python 2.7 and Python 3.
-
Mark Amery over 9 years
Seems to work, but note that as well as the standard encodings Python supports, this also includes silly encodings like undefined (always throws an exception if you try to use it) and rot_13. I suggest just using the list of standard encodings from the docs instead.
-
wks over 7 years
I have a reason to encode the Unicode string in one encoding and decode it as another. Some not-so-well-internationalized legacy software writes text to files in one character encoding where another encoding is expected. One example is MP3 files and ID3 tags. Many badly written Chinese MP3 players still encode metadata in GB18030 (the default on Chinese Windows) while labeling the tag as LATIN1 or some other wrong encoding. Some Python libraries, e.g. Mutagen, blindly trust the metadata and return a wrong str rather than raw bytes. Sometimes the only way to fix the encoding is to try all combinations.
-
wks over 7 years
What's worse, MP3 files from other places of origin may use yet other encodings, such as songs in the Japanese language created by Taiwanese singers. If I only know the song is Japanese, I would never imagine the creator used the "big5" encoding (usually for traditional Chinese). That's why I need to try all possibilities. Quodlibet has a "convert encoding" plugin that does exactly this, except its list of encodings is incomplete and it sometimes cannot find the actual encoding.
-
Noctis Skytower over 6 years
I wish this answer was upvoted more. This seems to be the easiest automated way within Python to get a list of codecs usable from within the language. This code allowed me to find out that latin_1 is a one-for-one translation between ordinals and characters.
-
ingyhere over 4 years
This was incredibly slow on an 11MB file.
-
dbn over 3 years
chardet has an option to read until it reaches some confidence interval.
-
Victor Schröder over 2 years
I got 98 encodings and IBM037 ain't one...
-
Mark Amery over 2 years
@VictorSchröder Not so. A quick Google reveals that cp037 and IBM037 are aliases, and cp037 is in the list.
-
Victor Schröder over 2 years
Thanks @Mark Amery, indeed, I found the same info just about a minute after my pretentiously funny comment above... Unfortunately the EBCDIC charset has its own particularities, and simply decoding doesn't produce a valid UTF-8 equivalent as expected. It leaves several non-printable chars behind. Not Python's fault, of course, but it would be nice sometimes to have DB dumps as if we were already past 1992...
-
Shadi over 2 years
I upvoted this a few months ago and would upvote again now if I could
-
Michael about 2 years
Somewhat similar to this answer, but certainly distinct.
-
Mark Ransom almost 2 years
@NoctisSkytower that's one of the best-kept and most useful secrets in Python. If you understand the history of Unicode it makes sense though: the first 256 code points in Unicode were defined as the ISO/IEC 8859-1 character set, known by its alternate name Latin-1.