UTF-8 in PHP regular expressions

41,468

Solution 1

Updated answer:
This is now tested and working

$post = '9999, škofja loka';
echo preg_match('/^\\d{4},[\\s\\p{L}]+$/u', $post);

\\w will not work here: it does not cover all Unicode letters, and in addition to letters it also matches digits and the underscore ([0-9_]).

The u modifier is also important: it activates Unicode mode.
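As a rough illustration of what the u modifier changes, the sketch below counts how many times a single dot matches the one-character string 'š'. It assumes the file is saved as UTF-8, so that 'š' is stored as two bytes; the variable name is only for illustration.

```php
<?php
// Assumes this file is saved as UTF-8, so 'š' is the two bytes C5 A1.
$s = 'š';

// Without u, the subject is treated as a byte sequence: two matches.
echo preg_match_all('/./', $s), PHP_EOL;

// With u, the subject is treated as UTF-8 characters: one match.
echo preg_match_all('/./u', $s), PHP_EOL;
```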

If letters and whitespace can appear in any order after the comma, put both into the same character class. Your regex allows zero or more whitespace characters directly after the comma and then only letters, so it cannot match a space between two words.
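A small sketch of the difference, assuming a UTF-8 source file. Both patterns are anchored with $ so that the whole string has to match; the variable names are hypothetical.

```php
<?php
// Assumes this file is saved as UTF-8.
$post = '9999,škofja loka';

// Whitespace first, then letters only: the space inside "škofja loka" breaks the match.
$sequence = '/^\d{4},\s*\p{L}+$/u';

// Whitespace and letters in one character class: they may appear in any order.
$class = '/^\d{4},[\s\p{L}]+$/u';

echo preg_match($sequence, $post), PHP_EOL; // 0
echo preg_match($class, $post), PHP_EOL;    // 1
```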

See http://www.regular-expressions.info/php.html for php regex details

\\p{L} matches any Unicode letter.

The end-of-string anchor $ is also important: it ensures that the complete string is verified. Without it, the regex can succeed after matching only a prefix, for example just the first whitespace character after the comma, and ignore the rest.
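To see the effect of the $ anchor, here is a minimal sketch that runs the same pattern with and without it on a string with trailing punctuation. It assumes a UTF-8 source file; the variable names are made up for this example.

```php
<?php
// Assumes this file is saved as UTF-8.
$post = '9999,škofja loka,.(?*';

$unanchored = '/^\d{4},[\s\p{L}]+/u';
$anchored   = '/^\d{4},[\s\p{L}]+$/u';

// Without $, a valid prefix is enough for a match.
echo preg_match($unanchored, $post, $m), PHP_EOL; // 1
echo $m[0], PHP_EOL;                              // 9999,škofja loka

// With $, the trailing ",.(?*" makes the match fail.
echo preg_match($anchored, $post), PHP_EOL;       // 0
```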

Solution 2

[a-zA-Z] will match only letters in the range of a-z and A-Z. You have non-US-ASCII letters, and therefore your regex won't match, regardless of the /u modifier. You need to use the word character escape sequence (\w).

$post = '9999,škofja loka';
echo preg_match('/^[0-9]{4},[\s]*[\w]+/u', $post);
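Whether \w matches non-ASCII letters under the u modifier is disputed in the comments on this page, and the result may depend on the PHP and PCRE versions or the locale. A quick way to check a particular installation, assuming a UTF-8 source file, might look like this:

```php
<?php
// Assumes this file is saved as UTF-8.
// The outcome is deliberately not stated here: it may differ between builds.
$result = preg_match('/^\w+$/u', 'škofja');

echo 'PHP ', PHP_VERSION, ', PCRE ', PCRE_VERSION, ': ';
echo $result === 1 ? '\w matched the non-ASCII letter' : '\w did not match', PHP_EOL;
```

If \w does not match on your system, \p{L} is the explicit alternative.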

Solution 3

The problem is your regular expression. You are explicitly saying that you will only accept a b c ... z A B C ... Z. š is not in the a-z set. Remember, š is as different to s as any other character.

So if you really just want a sequence of letters, you need to test for the Unicode properties, e.g.

echo preg_match('/^[0-9]{4},[\s]*\p{L}+/u', $post);

That should work because \p{L} matches any Unicode character that is considered a letter, not just A through Z.
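As a rough comparison of the two approaches, the sketch below tests a few single characters against [a-zA-Z] and against \p{L}. It assumes a UTF-8 source file; the sample characters are chosen only for illustration.

```php
<?php
// Assumes this file is saved as UTF-8.
$chars = ['s', 'š', 'đ', '9', '_'];

foreach ($chars as $c) {
    $ascii  = preg_match('/^[a-zA-Z]$/u', $c); // ASCII letters only
    $letter = preg_match('/^\p{L}$/u', $c);    // any Unicode letter
    printf("%s: [a-zA-Z]=%d \\p{L}=%d\n", $c, $ascii, $letter);
}
```

Only 's' passes the first test, while 's', 'š' and 'đ' pass the second; the digit and the underscore pass neither.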

Author by

Gasper

Updated on January 03, 2020

Comments

  • Gasper
    Gasper over 4 years

    I need help with regular expressions. My string contains Unicode characters, and the code below doesn't work.

    The first four characters must be digits, followed by a comma and then any alphabetic characters or whitespace. I have read that adding /u at the end of the regular expression should help, but it didn't work for me.

    My code works with non-unicode characters

    $post = '9999,škofja loka';
    echo preg_match('/^[0-9]{4},[\s]*[a-zA-Z]+/', $post);
    

    Thanks for your answers!

  • jensgram
    jensgram almost 13 years
    The u modifier alone is not enough, cf. @jmz's answer.
  • Gasper
    Gasper almost 13 years
    This doesn't work right: it should return 0 but it returns 1: $post = '9999,ščćžđkofja loka,.(?*'; echo preg_match('/^[0-9]{4},[\s]*\p{L}+/', $post);
  • Gasper
    Gasper almost 13 years
    Doesn't work (returns 0): $post = '9999,škofja loka'; echo preg_match('/^[0-9]{4},[\s\w]+/u', $post);
  • Gasper
    Gasper almost 13 years
    doesn't work in my case with your code
  • searlea
    searlea almost 13 years
    Note: \w will match numbers too, and \s doesn't need the square brackets. Being concise: /^\d{4},\s*\w+/u
  • stema
    stema almost 13 years
    @gašper, I have now tested it online, and it seems that the backslashes need to be double-escaped in PHP: preg_match('/^\\d{4},[\\s\\w]+$/u', $post); but it seems that \\w does not include the Unicode characters, even with the u modifier.
  • Sodved
    Sodved almost 13 years
    One thing: is the $post string in your test program in UTF-8? Sorry, I'm not that good at PHP. But in Perl, if you just enter the character š you get a string of one byte, 9A. In UTF-8 that character needs to be two bytes, C5 A1 (which looks like Å¡ in a Latin character encoding).
  • stema
    stema almost 13 years
    @gašper, I did some more testing and updated my answer
  • Gasper
    Gasper almost 13 years
    did you test it, still doesn't work
  • Gasper
    Gasper almost 13 years
    @stema, this works perfectly, thank you!
  • Gasper
    Gasper almost 13 years
    can i use that regular expression also in js?
  • stema
    stema almost 13 years
    @gašper, I don't think so. http://www.regular-expressions.info/javascript.html explains the JavaScript regex flavour, and it says that it does not support Unicode (unless you give the characters explicitly, like ^\d{4},[\sa-zA-Zš]+$)
  • Alan Moore
    Alan Moore almost 13 years
    Even in UTF-8 mode, \w only matches [A-Za-z0-9_]. You have to use Unicode-specific constructs like \p{L} as well as the /u flag.
  • Alan Moore
    Alan Moore almost 13 years
    @jensgram: \w with the u modifier is not enough either; cf. @stema's answer. ;)
  • searlea
    searlea almost 13 years
    @alan Bah... I think I'll skip Monday morning in future...
  • jmz
    jmz almost 13 years
    @Alan: Locale affects what is a letter and what is not. For me, the regex I posted works (fi_FI.UTF-8 locale).
  • llamerr
    llamerr about 12 years
    there is a library for unicode in js and much more xregexp.com
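The encoding question raised in the comments can be checked directly. The sketch below is only an illustration: it builds 'š' once as UTF-8 (bytes C5 A1) and once as the single byte 9A mentioned above, then shows how preg_match with the u modifier reacts to each. The variable names are hypothetical.

```php
<?php
$utf8   = "\xC5\xA1"; // 'š' encoded as UTF-8
$legacy = "\x9A";     // 'š' as a single byte, as in Windows-1250/1252

// Inspect the raw bytes of the UTF-8 string.
echo bin2hex($utf8), PHP_EOL; // c5a1

// Valid UTF-8 subject: the letter matches.
var_dump(preg_match('/^\p{L}+$/u', $utf8)); // int(1)

// Invalid UTF-8 subject: preg_match fails instead of returning 0 or 1.
var_dump(preg_match('/^\p{L}+$/u', $legacy));        // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```

If preg_match returns false rather than 0, the subject string is probably not valid UTF-8, and the pattern itself may be fine.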