Russian symbols in re (Python)
Solution 1
To use \w+ to match alphanumeric unicode characters, you should pass both a unicode pattern and unicode text to re.findall.
- In Python2:
Assuming that you are reading bytes (not text) from the file, you should decode the bytes to obtain a unicode:
uni = 'Привет, как дела?'.decode('utf-8')
ur'(?u)\w+' is a raw unicode literal. Even though it is not necessary here, using raw unicode/string literals for regex patterns is generally a good practice: it allows you to avoid the need for double backslashes before certain characters such as \b (which a non-raw literal would read as a backspace).
The regex pattern ur'(?u)\w+' bakes in the Unicode flag, which tells re.findall to make \w dependent on the Unicode character properties database.
import re
uni = 'Привет, как дела?'.decode('utf-8')
print(re.findall(ur'(?u)\w+', uni))
yields a list containing the 3 unicode "words":
[u'\u041f\u0440\u0438\u0432\u0435\u0442', u'\u043a\u0430\u043a', u'\u0434\u0435\u043b\u0430']
- In Python3:
The general principle is the same, except that what were unicodes in Python2 are now strs in Python3, and there is no longer any attempt at automatic conversion between bytes and str. So, again assuming that you are reading bytes (not text) from the file, you should decode the bytes to obtain a str, and use a str regex pattern. (The (?u) flag is redundant for str patterns in Python3, where Unicode matching is the default, but it is harmless.)
import re
uni = b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82, \xd0\xba\xd0\xb0\xd0\xba \xd0\xb4\xd0\xb5\xd0\xbb\xd0\xb0?'.decode('utf-8')
print(re.findall(r'(?u)\w+', uni))
yields
['Привет', 'как', 'дела']
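Since the answer assumes the bytes come from a file, here is a minimal Python 3 sketch of that full round trip. The helper name find_words and the temporary file are assumptions made for illustration; they are not part of the original answer.

```python
import os
import re
import tempfile


def find_words(raw_bytes, encoding="utf-8"):
    """Decode bytes to str first, then match Unicode word characters."""
    text = raw_bytes.decode(encoding)
    return re.findall(r"\w+", text)


# Write a small UTF-8 file so the example is self-contained.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("Привет, как дела?".encode("utf-8"))
    path = tmp.name

try:
    # Option 1: read raw bytes and decode them explicitly.
    with open(path, "rb") as fh:
        words_from_bytes = find_words(fh.read())

    # Option 2: let open() decode by passing the encoding.
    with open(path, encoding="utf-8") as fh:
        words_from_text = re.findall(r"\w+", fh.read())
finally:
    os.remove(path)

print(words_from_bytes)
print(words_from_text)
```

Either way, the regex only ever sees a str, which is the point of the answer above.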
Solution 2
My solution:
txt = re.findall(r'[А-я]+', data)
А-я covers the Russian alphabet letters, except Ё and ё, which lie outside that code point range.
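One thing worth checking with a range like [А-я] is how it treats Ё and ё: the range runs from U+0410 to U+044F, while Ё is U+0401 and ё is U+0451. A small Python 3 sketch (the sample sentence is made up for illustration):

```python
import re

sample = "Ёжик ещё спит"

# The plain range splits words at Ё/ё because those letters fall outside it.
narrow = re.findall(r"[А-я]+", sample)

# Adding Ё and ё explicitly keeps the words whole.
wide = re.findall(r"[А-яЁё]+", sample)

print(narrow)
print(wide)
```

If the input may contain Ё or ё, the wider class is probably the safer choice.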
Solution 3
You are taking a string that is already unicode and encoding it to UTF-8 bytes. If you omit the encoding step, you get:
line = u"Привет, как дела?"
words = re.findall(r'[\w]+', line, re.U)
# words = [u'\u041f\u0440\u0438\u0432\u0435\u0442', u'\u043a\u0430\u043a', u'\u0434\u0435\u043b\u0430']
print words[0]
# prints Привет
Solution 4
Consult the Unicode Cyrillic block to define the regex precisely:
- https://en.wikipedia.org/wiki/Cyrillic_(Unicode_block)
- https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode
Most codepoints are in a range, but some are not:
re.compile('[А-Яа-яЁё]+')
re.fullmatch("[А-Яа-яЁё ]+", "Ёжик в тумане")
Also, you might want to include Ѣ ѣ (Ять) or other old symbols, depending on your needs.
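If the older letters matter, one option is to match the whole basic Cyrillic block by code point instead of listing letters. This is only a sketch under an assumption: it treats "Cyrillic" as the block U+0400 to U+04FF and ignores the supplementary and extended Cyrillic blocks. The name CYRILLIC_BLOCK is hypothetical.

```python
import re

# Assumption: only the basic Cyrillic block (U+0400..U+04FF) is of interest.
CYRILLIC_BLOCK = re.compile(r"[\u0400-\u04FF]+")

sample = "Ёжик в тумане, ѣ and Latin text"

# Ё, ё and ѣ all sit inside the block, so no special cases are needed.
block_words = CYRILLIC_BLOCK.findall(sample)
print(block_words)

# The same range works with fullmatch when spaces are allowed too.
is_cyrillic_phrase = re.fullmatch(r"[\u0400-\u04FF ]+", "Ёжик в тумане") is not None
print(is_cyrillic_phrase)
```

The trade-off is that the block also contains letters used by other languages written in Cyrillic, so it is broader than the Russian alphabet.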
Original question (Queen johniek, over 1 year ago)
I read data from a file:
words = re.findall(r'[\w]+',self._from.encode('utf8'),re.U)
If the file contains:
Hi, how are you?
Then result will be:
['Hi', 'how', 'are', 'you']
But if the file contains Russian text (i.e. Cyrillic symbols):
Привет, как дела?
In this case the result is:
['\xd0', '\xd1', '\xd0', '\xd0\xb2\xd0\xb5\xd1', '\xd0\xba\xd0', '\xd0\xba', '\xd0', '\xd0\xb5\xd0', '\xd0']
why? wtf? I've already added:
sys.setdefaultencoding('utf-8')
I'm using Python 2.7 on Ubuntu Linux.
Answer:
words = re.findall(r'[\w]+',self._from.decode('utf8'),re.U)
print u" ".join(words)
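For readers on Python 3, the same encode-versus-decode difference can be sketched as follows. This is an illustration, not the asker's code: the variable names are made up, and the exact failure differs from Python 2. There, the byte string was matched byte by byte, which appears to be why fragments of UTF-8 sequences came back; in Python 3, a bytes pattern makes \w ASCII-only, so nothing matches at all.

```python
import re

text = "Привет, как дела?"
encoded = text.encode("utf-8")

# Matching on encoded bytes: \w in a bytes pattern only covers ASCII,
# so none of the Cyrillic letters are found.
bytes_matches = re.findall(rb"\w+", encoded)

# Matching on the decoded str: \w follows Unicode character properties.
str_matches = re.findall(r"\w+", encoded.decode("utf-8"))

print(bytes_matches)
print(" ".join(str_matches))
```

In both Python versions the fix is the same: decode before matching rather than encode.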