UTF-8 in PHP regular expressions
Solution 1
Updated answer:
This is now tested and working
$post = '9999, škofja loka';
echo preg_match('/^\\d{4},[\\s\\p{L}]+$/u', $post);
\\w will not work, because it does not cover all Unicode letters and it also matches [0-9_] in addition to letters.
The u modifier is also important, because it activates Unicode mode.
If letters or whitespace can follow the comma, put both into the same character class. Your regex allows zero or more whitespace characters after the comma and then only letters, so whitespace between words is not covered.
See http://www.regular-expressions.info/php.html for PHP regex details.
The \\p{L} (Unicode letter) is explained here
The end-of-string anchor $ is also important: it ensures that the complete string is verified. Otherwise the regex would, for example, match only up to the first whitespace and ignore the rest.
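A minimal sketch of the effect of the $ anchor, assuming the script file is saved as UTF-8. The helper name matches_post and the sample strings are only illustrative; the second sample is the one from the comments that contains trailing punctuation.

```php
<?php
// Hypothetical helper for illustration: true if the pattern matches the string.
function matches_post(string $pattern, string $post): bool
{
    return preg_match($pattern, $post) === 1;
}

$anchored   = '/^\d{4},[\s\p{L}]+$/u'; // must match the whole string
$unanchored = '/^\d{4},[\s\p{L}]+/u';  // may stop before the end of the string

$good = '9999, škofja loka';          // letters and whitespace only
$bad  = '9999,ščćžđkofja loka,.(?*';  // trailing punctuation

var_dump(matches_post($anchored, $good));  // bool(true)
var_dump(matches_post($anchored, $bad));   // bool(false)
var_dump(matches_post($unanchored, $bad)); // bool(true): stops before ",.(?*"
```

Without the anchor, the regex is satisfied as soon as the prefix matches, so the invalid tail goes unnoticed.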
Solution 2
[a-zA-Z] will match only letters in the ranges a-z and A-Z. You have non-US-ASCII letters, so your regex won't match, regardless of the /u modifier. You need to use the word character escape sequence (\w).
$post = '9999,škofja loka';
echo preg_match('/^[0-9]{4},[\s]*[\w]+/u', $post);
Solution 3
The problem is your regular expression. You are explicitly saying that you will only accept a b c ... z A B C ... Z. š is not in the a-z set. Remember, š is as different from s as any other character.
So if you really just want a sequence of letters, then you need to test for the Unicode properties, e.g.
echo preg_match('/^[0-9]{4},[\s]*\p{L}+/u', $post);
That should work because \p{L} matches any Unicode character that is considered a letter, not just A through Z.
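As a small illustration of the difference, the sketch below tests a single š against both constructs. The letter is written as explicit UTF-8 bytes (C5 A1), which is an assumption made here so that the result does not depend on how the script file is saved.

```php
<?php
// "š" as explicit UTF-8 bytes, independent of the source file encoding.
$s = "\xC5\xA1";

var_dump(preg_match('/^[a-zA-Z]$/u', $s)); // int(0): š is outside a-z and A-Z
var_dump(preg_match('/^\p{L}$/u', $s));    // int(1): š is a Unicode letter
var_dump(preg_match('/^\p{L}$/', $s));     // int(0): without u the two bytes are treated separately
```

The last line shows why \p{L} and the u modifier belong together: without u, the subject is read byte by byte, not as UTF-8 characters.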
Gasper
Updated on January 03, 2020
Comments
-
Gasper over 4 years
I need help with regular expressions. My string contains Unicode characters and the code below doesn't work.
The first four characters must be digits, followed by a comma and then any alphabetic characters or whitespace. I have read that adding /u to the end of the regular expression should help, but it didn't work for me.
My code works with non-Unicode characters:
$post = '9999,škofja loka';
echo preg_match('/^[0-9]{4},[\s]*[a-zA-Z]+/', $post);
Thanks for your answers!
-
jensgram almost 13 years
The u modifier alone is not enough, cf. @jmz's answer.
-
Gasper almost 13 years
This doesn't work right; it should return 0 but it returns 1:
$post = '9999,ščćžđkofja loka,.(?*';
echo preg_match('/^[0-9]{4},[\s]*\p{L}+/', $post);
-
Gasper almost 13 years
Doesn't work (returns 0):
$post = '9999,škofja loka';
echo preg_match('/^[0-9]{4},[\s\w]+/u', $post);
-
Gasper almost 13 years
Your code doesn't work in my case.
-
searlea almost 13 years
Note: \w will match numbers too, and \s doesn't need the square brackets. Being concise: /^\d{4},\s*\w+/u
-
stema almost 13 years
@gašper, so now I tested it online and it seems that the backslashes need to be double escaped in PHP:
preg_match('/^\\d{4},[\\s\\w]+$/u', $post);
But it seems that \\w does not include the Unicode characters, even with the u modifier.
-
Sodved almost 13 years
One thing: in your test program, is the $post string in UTF-8? Sorry, I'm not that good at PHP. But in Perl, if you just enter the character š you get a string of one byte, 9A. In UTF-8 that character needs to be two bytes, C5 A1 (which looks like Å¡ in a Latin character encoding).
-
stema almost 13 years
@gašper, I did some more testing and updated my answer.
-
Gasper almost 13 years
Did you test it? It still doesn't work.
-
Gasper almost 13 years
@stema, this works perfectly, thank you!
-
Gasper almost 13 years
Can I use that regular expression in JS as well?
-
stema almost 13 years
@gašper, I don't think so. http://www.regular-expressions.info/javascript.html explains the JavaScript regex flavour and says that it does not support Unicode (unless you give the character explicitly, like ^\d{4},[\sa-zA-Zš]+$).
-
Alan Moore almost 13 years
Even in UTF-8 mode, \w only matches [A-Za-z0-9_]. You have to use Unicode-specific constructs like \p{L} as well as the /u flag.
-
Alan Moore almost 13 years
@jensgram: \w with the u modifier is not enough either; cf. @stema's answer. ;)
-
searlea almost 13 years
@alan Bah... I think I'll skip Monday morning in future...
-
jmz almost 13 years
@Alan: Locale affects what is a letter and what is not. For me, the regex I posted works (fi_FI.UTF-8 locale).
-
llamerr about 12 years
There is a library for Unicode in JS, and much more: xregexp.com