How to detect non-ASCII character in Python?

python string python-2.7 ascii

14,853

Solution 1

# -*- coding: utf-8 -*-

import re

elements = [u'2', u'3', u'13', u'37\u201341', u'43', u'44', u'46']

for e in elements:
    # Remove printable ASCII (space through ~); anything left over is not printable ASCII.
    if re.sub('[ -~]', '', e) != "":
        # do something here
        print "-"

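re.sub('[ -~]', '', e) removes every printable ASCII character (space through ~) from e by replacing it with "". If anything remains, e contains at least one character outside that range, such as the en dash in u'37\u201341'. Note that ASCII control characters such as tab and newline also fall outside [ -~].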
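For what it's worth, here is a minimal sketch of the same idea wrapped in a small function, so the intent is easier to read at the call site. The names `PRINTABLE_ASCII` and `leftover_chars` are made up for this illustration, and it assumes the elements are unicode strings as in the question.

```python
# -*- coding: utf-8 -*-
import re

# Space (0x20) through tilde (0x7E): the printable ASCII range.
PRINTABLE_ASCII = re.compile(u'[ -~]')


def leftover_chars(text):
    """Return what remains of text once printable ASCII is removed."""
    return PRINTABLE_ASCII.sub(u'', text)


elements = [u'2', u'3', u'13', u'37\u201341', u'43', u'44', u'46']

# Keep only the elements that still have something left after stripping.
flagged = [e for e in elements if leftover_chars(e)]
print(flagged)
```

The list `flagged` ends up holding just the element with the en dash. If tabs or newlines can appear in your data and should count as ASCII, a pattern like `[\x00-\x7f]` would be the safer choice.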

Hope this helps.

Solution 2

You can check whether each character's code point is between 0 and 127.

for c in someString:
    if 0 <= ord(c) <= 127:
        # this is an ASCII character.
        pass
    else:
        # this is a non-ASCII character. Do something.
        pass
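A rough sketch of how the loop above could be turned into reusable helpers. The function names `is_ascii` and `non_ascii_positions` are hypothetical, and the sketch assumes it is given text strings (unicode objects in Python 2), not raw bytes.

```python
# -*- coding: utf-8 -*-


def is_ascii(text):
    """True if every character's code point is in the range 0-127."""
    return all(ord(c) < 128 for c in text)


def non_ascii_positions(text):
    """Return (index, character) pairs for each non-ASCII character."""
    return [(i, c) for i, c in enumerate(text) if ord(c) > 127]


elements = [u'2', u'3', u'13', u'37\u201341', u'43', u'44', u'46']

# Elements that contain at least one character above code point 127.
suspicious = [e for e in elements if not is_ascii(e)]
```

`non_ascii_positions` is handy when you also want to know where in the string the offending character sits, for example to report it or replace it.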

Solution 3

Give this a try:

>>> import re
>>> non_decimal = re.compile(r'[^\d.]+')
>>>
>>> string ="[2,3,13,37–41,43,44,46]"
>>> new_str = string.replace("[","")
>>> new_str = new_str.replace("]","")
>>> lst = new_str.split(",")
>>> for element in lst:
    if element.isdigit():
        print element
    else:
        toexpand = non_decimal.sub('f', str(element))
        toexpand = toexpand.split("f")
        for i in range(int(toexpand[0]),int(toexpand[1])+1,1):
            print i


2
3
13
37
38
39
40
41
43
44
46
>>>
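If you would rather match the dash explicitly than replace every non-digit run with a placeholder letter, something along these lines might work. It is only a sketch: `RANGE` and `expand` are names invented here, the input is assumed to be a unicode string, and only the ASCII hyphen and the en dash (U+2013) are treated as range separators.

```python
# -*- coding: utf-8 -*-
import re

# Two integers separated by a hyphen-minus or an en dash (U+2013).
# Other dash characters are not handled by this sketch.
RANGE = re.compile(u'^([0-9]+)[-\u2013]([0-9]+)$')


def expand(text):
    """Turn a bracketed, comma-separated list with ranges into a list of ints."""
    numbers = []
    for item in text.strip(u'[]').split(u','):
        item = item.strip()
        match = RANGE.match(item)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            # Ranges are inclusive on both ends.
            numbers.extend(range(start, end + 1))
        else:
            numbers.append(int(item))
    return numbers


result = expand(u'[2,3,13,37\u201341,43,44,46]')
```

Writing the dash as the escape `\u2013` keeps the source file pure ASCII, which sidesteps the "Non-ASCII character" SyntaxError from the question.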

Solution 4

This may not answer your whole question; it is very simple and not flexible, but it is what I do whenever I get this error.

I usually open up an interactive Python shell and type in:

print [ln for ln in open("filename.py", "rb").readlines() if "\xe2" in ln]

That gives you the lines containing \xe2. Then find them in your editor and remove the character.
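A slightly more general sketch of the same trick, which reports any byte above 127 (not only \xe2) together with the line number and byte column. The helper name `find_non_ascii_lines` is hypothetical.

```python
def find_non_ascii_lines(path):
    """Return (line_number, column, raw_line) for each line with a byte > 127.

    column is the 1-based byte offset of the first such byte on the line.
    """
    hits = []
    with open(path, 'rb') as handle:
        for number, line in enumerate(handle, 1):
            # bytearray yields integers on both Python 2 and Python 3.
            columns = [i for i, byte in enumerate(bytearray(line)) if byte > 127]
            if columns:
                hits.append((number, columns[0] + 1, line))
    return hits
```

You would call it as `find_non_ascii_lines("filename.py")` and then jump to the reported line and column in your editor.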


Author by

sheshkovsky

Updated on June 04, 2022

Comments

sheshkovsky almost 2 years
I'm parsing multiple XML files with Python 2.7, and some of them contain strings like string ="[2,3,13,37–41,43,44,46]". I split them to get a list of all elements, and then I have to detect elements containing "–", like "37–41". It turns out this is not a regular dash but a non-ASCII character:
```
elements = [u'2', u'3', u'13', u'37\u201341', u'43', u'44', u'46']
```
So I need something like
```
for e in elements:
  if "–" in e:
      # do something about it
```
If I use that non-ASCII char in this if expression, I get an error: "SyntaxError: Non-ASCII character '\xe2' in file...".

I tried to replace the if expression with this re method:
```
re.search('\xe2', e)
```
but that doesn't work either. So I'm looking for a way to either convert that non-ASCII char to a regular ASCII "-" or use its character code directly in the search expression.
sheshkovsky almost 8 years

I want to detect the element containing the non-ASCII dash, like 37-41, not ignore it.
sheshkovsky almost 8 years

I still get UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 0: ordinal not in range(128). I'm using the PyCharm IDE; could that be causing the problem?
sheshkovsky almost 8 years

It didn't work for me. I'm using the PyCharm IDE and I still get UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
sheshkovsky almost 8 years

It worked fine. I just needed to change the last print to print e, and it printed out exactly the element I was looking for. I would appreciate it if you could give me a link to the documentation, because the syntax is a little bit strange.
Frerich Raabe almost 8 years

I think this code is actually fairly obscure - is there really anyone who sees what if (re.sub('[ -~]', '', e)) != "" does right away?
EbraHim almost 8 years

Oops! Updated, but late :)
tripleee over 3 years

This made sense for Python 2, but doesn't directly solve the problem in the question even then. In Python 3, the default source encoding is UTF-8 anyway; you only have to use an encoding comment if you want to use something other than UTF-8 (which you really don't want to, unless you know exactly what you are doing, in which case you would probably not be reading this).