How to detect non-ASCII characters in Python?


Solution 1

# -*- coding: utf-8 -*-

import re

elements = [u'2', u'3', u'13', u'37\u201341', u'43', u'44', u'46']

for e in elements:
    if (re.sub('[ -~]', '', e)) != "":
        #do something here
        print "-"

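re.sub('[ -~]', '', e) removes every printable ASCII character (space through ~) from e by replacing it with "", so only the other characters remain. If the result is not empty, e contains at least one character outside that range. Note that the class [ -~] does not cover ASCII control characters such as tab or newline, so those are flagged as well.

Hope this helps.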
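
As a sketch of the same idea without the substitution step, the check can be written as a search for the first character outside the ASCII range. The helper name has_non_ascii is hypothetical, and the pattern [^\x00-\x7f] is a swapped-in character class, not the one from the answer: it treats control characters such as tab and newline as ASCII.

```python
import re

# Matches any single character outside the 7-bit ASCII range (0-127).
NON_ASCII = re.compile(r'[^\x00-\x7f]')

def has_non_ascii(text):
    """Return True if text contains at least one non-ASCII character."""
    return NON_ASCII.search(text) is not None

elements = [u'2', u'3', u'13', u'37\u201341', u'43', u'44', u'46']

# Keep only the elements that contain a non-ASCII character.
flagged = [e for e in elements if has_non_ascii(e)]
print(flagged)
```

Searching stops at the first match, so it avoids building a stripped copy of every string.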

Solution 2

You can check whether each character's code point is between 0 and 127.

for c in someString:
    if 0 <= ord(c) <= 127:
        pass  # this is an ASCII character.
    else:
        pass  # this is a non-ASCII character. Do something.
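
The loop above can be condensed into a small helper that also reports where the offending characters are. This is only a sketch; the function name non_ascii_positions and the sample string are assumptions made for illustration.

```python
def non_ascii_positions(text):
    """Return (index, character) pairs for every character above code point 127."""
    return [(i, c) for i, c in enumerate(text) if ord(c) > 127]

# Hypothetical sample: "37", an en dash (U+2013), then "41".
sample = u'37\u201341'
hits = non_ascii_positions(sample)
if hits:
    print("non-ASCII characters found at indexes: %s" % [i for i, _ in hits])
```

An empty list means the string is pure ASCII, so the result can be used directly in an if test.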

Solution 3

Give this a try:

>>> import re
>>> non_decimal = re.compile(r'[^\d.]+')
>>>
>>> string ="[2,3,13,37–41,43,44,46]"
>>> new_str = string.replace("[","")
>>> new_str = new_str.replace("]","")
>>> lst = new_str.split(",")
>>> for element in lst:
    if element.isdigit():
        print element
    else:
        toexpand = non_decimal.sub('f', str(element))
        toexpand = toexpand.split("f")
        for i in range(int(toexpand[0]),int(toexpand[1])+1,1):
            print i


2
3
13
37
38
39
40
41
43
44
46
>>> 
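
A variant of the session above, as a sketch, splits directly on the en dash instead of substituting a placeholder letter. It assumes the input is already a decoded unicode string (as in the elements list from the question), and the names EN_DASH and expand are hypothetical. Writing the dash as the escape u'\u2013' keeps the source file pure ASCII, which avoids the SyntaxError mentioned in the question.

```python
EN_DASH = u'\u2013'  # written as an escape so the source file stays pure ASCII

def expand(text):
    """Expand a bracketed, comma-separated list with en-dash ranges into integers."""
    numbers = []
    for element in text.strip(u'[]').split(u','):
        if EN_DASH in element:
            # A range such as "37<en dash>41": include both end points.
            start, end = element.split(EN_DASH)
            numbers.extend(range(int(start), int(end) + 1))
        else:
            numbers.append(int(element))
    return numbers

print(expand(u'[2,3,13,37\u201341,43,44,46]'))
# prints [2, 3, 13, 37, 38, 39, 40, 41, 43, 44, 46]
```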

Solution 4

This may not answer your whole question; it is very simple and not flexible, but it is what I do whenever I hit this error.

I usually open up an interactive Python shell and type in:

print [ln for ln in open("filename.py", "rb").readlines() if "\xe2" in ln]

That gives you the lines containing \xe2. Then find the character in your editor and remove it.
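
A slightly more general sketch of the same check reports the line number of every line that contains any byte above 127, not only \xe2 (which is the first byte of the UTF-8 encoding of the en dash and of several other punctuation characters). The function name find_non_ascii_lines is hypothetical.

```python
def find_non_ascii_lines(path):
    """Return (line_number, raw_line) pairs for lines containing bytes above 127."""
    hits = []
    with open(path, "rb") as handle:
        for number, line in enumerate(handle, 1):
            # bytearray yields integers on both Python 2 and Python 3.
            if any(byte > 127 for byte in bytearray(line)):
                hits.append((number, line))
    return hits
```

Calling find_non_ascii_lines("filename.py") and printing repr(line) for each hit shows the raw bytes, which makes it easier to locate the character in an editor.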

Author by

sheshkovsky

Updated on June 04, 2022

Comments

  • sheshkovsky
    sheshkovsky almost 2 years

    I'm parsing multiple XML files with Python 2.7, and they contain strings like string ="[2,3,13,37–41,43,44,46]". I split them to get a list of all elements, and then I have to detect elements with "–" like "37–41". It turns out this is not a regular dash but a non-ASCII character:

    elements = [u'2', u'3', u'13', u'37\u201341', u'43', u'44', u'46']
    

    So I need something like

    for e in elements:
      if "–" in e:
          # do something about it
    

    If I use that non-ASCII char in this if expression, I get an error: "SyntaxError: Non-ASCII character '\xe2' in file...".

    I tried to replace the if expression with this re method:

    re.search('\xe2', e)
    

    but that doesn't work either. So I'm looking for a way to either convert that non-ASCII char to a regular ASCII "-" or use its character code directly in the search expression.

  • sheshkovsky
    sheshkovsky almost 8 years
    I want to detect the element containing the non-ASCII dash, like 37-41, not ignore it.
  • sheshkovsky
    sheshkovsky almost 8 years
    I still get UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 0: ordinal not in range(128). I'm using the PyCharm IDE; could that be causing the problem?
  • sheshkovsky
    sheshkovsky almost 8 years
    It didn't work for me. I'm using the PyCharm IDE and I still get UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
  • sheshkovsky
    sheshkovsky almost 8 years
    It worked fine. I just needed to change the last print to print e, and it printed out exactly the element I was looking for. I would appreciate it if you could give me a link to the documentation, because the syntax is a little bit strange.
  • Frerich Raabe
    Frerich Raabe almost 8 years
    I think this code is actually fairly obscure - is there really anyone who sees what if (re.sub('[ -~]', '', e)) != "" does right away?
  • EbraHim
    EbraHim almost 8 years
    Oops! Updated, but late :)
  • tripleee
    tripleee over 3 years
    This made sense for Python 2, but doesn't directly solve the problem in the question even then. In Python 3, the default encoding is UTF-8 anyway; you only have to use an encoding comment if you want to use something else than UTF-8 (which you really don't want to, unless you know exactly what you are doing, in which case you would probably not be reading this).