How can I use a regular expression on a Unicode string in Python?


Solution 1

I ran across something similar when trying to match a string in Russian. For your situation, Michele's answer works fine. If you want to use special sequences like \w and \s, though, you have to change some things. I'm just sharing this, hoping it will be useful to someone else.

>>> string = u"</td><td>Я люблю мороженое</td><td> 40.00</td>"

Make your string Unicode by placing a u before the opening quotation mark. (This is Python 2 syntax; in Python 3 every str is already Unicode.)

>>> pattern = re.compile(ur'>([\w\s]+)<', re.UNICODE)

Set the re.UNICODE flag so that special sequences such as \w and \s follow the Unicode character properties and therefore match non-ASCII letters as well (see the re module docs).
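As a rough illustration of what the flag changes, here is a small Python 3 sketch (an assumption on my part, since the code in this answer is Python 2). In Python 3, str patterns are Unicode-aware by default and re.ASCII switches that off, so the contrast runs the other way around. The names sample, unicode_words and ascii_words are made up for the example.

```python
import re

sample = "Я люблю мороженое"

# Python 3 default for str patterns: \w follows Unicode character properties.
unicode_words = re.findall(r"\w+", sample)

# re.ASCII restricts \w to [a-zA-Z0-9_], roughly what Python 2 did
# when re.UNICODE was not set.
ascii_words = re.findall(r"\w+", sample, re.ASCII)

print(unicode_words)
print(ascii_words)
```

With the ASCII restriction no Cyrillic letter counts as a word character, so the second list comes back empty.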

(Alternatively, you can use your local language to set a range. For Russian this would be [а-яА-Я], so:

pattern = re.compile(ur'>([а-яА-Я\s]+)<')

In that case, you don't have to set the flag anymore: the letters are listed explicitly, and the one special sequence left, \s, only has to match an ordinary space here.)
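One thing to watch with an explicit range: [а-яА-Я] follows code point order, and the letters ё/Ё sit outside that block, so they are not covered. A minimal Python 3 sketch of the difference (the test word and variable names are just for illustration):

```python
import re

# "ёлка", with the first letter written as an escape so it is unambiguous
word = "\u0451лка"

# The plain range skips ё (U+0451), which lies outside а-я (U+0430 to U+044F).
plain_range = re.findall("[а-яА-Я]+", word)

# Adding ё (U+0451) and Ё (U+0401) explicitly covers the whole word.
extended_range = re.findall("[а-яА-Я\u0451\u0401]+", word)

print(plain_range)
print(extended_range)
```

If your text may contain ё, listing it in the class (or falling back to \w) avoids silently truncated matches.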

>>> match = pattern.findall(string)
>>> for i in match:
...     print i
... 
Я люблю мороженое
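For anyone on Python 3, the session above might look roughly like this. It is only a sketch: the ur prefix no longer exists there, plain str literals are Unicode, and \w is Unicode-aware without any flag. The variable names html and cells are mine.

```python
import re

html = "</td><td>Я люблю мороженое</td><td> 40.00</td>"

# No u prefix and no re.UNICODE needed: str patterns are Unicode-aware in Python 3.
pattern = re.compile(r">([\w\s]+)<")

# findall returns the captured group for every match.
cells = pattern.findall(html)

for cell in cells:
    print(cell)
```

The " 40.00" cell is not returned because the dot is neither a word character nor whitespace, and the pattern has nothing to close on after it anyway.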

Solution 2

According to PEP 263 (Defining Python Source Code Encodings), you first need to tell Python that the whole source file is UTF-8 encoded by adding a comment like this on the first or second line:

# -*- coding: utf-8 -*-

Furthermore, try adding 'ur' before the string so that it's raw and Unicode:

state = re.search(ur'td>([^<]+)</td',s)
res = state.group(1)

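I've also edited your regex to make it match. Three dots mean "exactly three characters", and the text between the tags is longer than that. On top of that, if s is a UTF-8 byte string, each dot matches a single byte rather than a whole character, because UTF-8 is a multi-byte encoding, so counting dots may not work as expected.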
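The byte-versus-character point can be checked directly. The following is a Python 3 sketch, not the asker's original code; the four-letter word and all variable names are assumptions chosen to keep the counts easy to follow.

```python
import re

word = "\u0639\u0627\u062f\u064a"   # four Arabic letters
encoded = word.encode("utf-8")       # eight bytes, two per letter

# On a str, each dot consumes one character.
chars_match = re.fullmatch(r"....", word) is not None

# On UTF-8 bytes, each dot consumes one byte, so four dots are too few.
bytes_match = re.fullmatch(rb"....", encoded) is not None
eight_bytes_match = re.fullmatch(rb".{8}", encoded) is not None

# A negated class avoids counting altogether: take everything up to the next "<".
html = "</td><td>" + word + "</td><td> 40.00</td>"
found = re.search(r"td>([^<]+)</td", html)
cell = found.group(1) if found else None

print(len(word), len(encoded), chars_match, bytes_match, eight_bytes_match)
print(cell)
```

Checking found before calling group(1) guards against the case where nothing matches and re.search returns None.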

Author by

Mahdi

Updated on June 16, 2022

Comments

  • Mahdi
    Mahdi almost 2 years

    Hi, I want to use a regular expression on a Unicode (UTF-8) string like the following:

    </td><td>عـــــــــــادي</td><td> 40.00</td>
    

    I want to pick "عـــــــــــادي" out. How can I do this?

    My code for this is:

    state = re.findall(r'td>...</td',s)
    

    Thanks

  • Jong Bor Lee
    Jong Bor Lee about 12 years
    I suggest changing that line to state = re.search(ur'td>([^<]+)</td',s) and then getting the desired output by calling state.group(1).