Regex modifier /u in JavaScript?
Solution 1
The /u modifier is for Unicode support.
Support for it was added to JavaScript in ES2015.
See http://stackoverflow.com/questions/280712/javascript-unicode to learn more about Unicode in regexes with JavaScript.
Polish characters:
Ą \u0104
Ć \u0106
Ę \u0118
Ł \u0141
Ń \u0143
Ó \u00D3
Ś \u015A
Ź \u0179
Ż \u017B
ą \u0105
ć \u0107
ę \u0119
ł \u0142
ń \u0144
ó \u00F3
ś \u015B
ź \u017A
ż \u017C
All special Polish characters:
[\u0104\u0106\u0118\u0141\u0143\u00D3\u015A\u0179\u017B\u0105\u0107\u0119\u0142\u0144\u00F3\u015B\u017A\u017C]
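As a quick check of the class above, here is a minimal sketch that compares the escaped form with the same letters typed literally. The variable names are made up for illustration. The escaped form keeps the source file ASCII-only, so it does not depend on the encoding the file is saved in.

```javascript
// Class copied from the answer above, written with \u escapes (ASCII-only source).
const polishEscaped = /^[\u0104\u0106\u0118\u0141\u0143\u00D3\u015A\u0179\u017B\u0105\u0107\u0119\u0142\u0144\u00F3\u015B\u017A\u017C]+$/;

// The same 18 letters typed literally; this relies on the file being saved as UTF-8.
const polishLetters = "ĄĆĘŁŃÓŚŹŻąćęłńóśźż";

console.log(polishEscaped.test(polishLetters)); // true
console.log(polishEscaped.test("Łódź"));        // false: "d" is a plain ASCII letter
```

Note that this class covers only the Polish-specific letters, so ordinary words need the ASCII ranges added as well.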
Solution 2
JavaScript doesn't have any notion of UTF-8 strings, so it's unlikely that you need the /u flag. (Your strings are probably already in the usual JavaScript form, one UTF-16 code unit per "character".)
The bigger problem is that, when this answer was written, JavaScript didn't support \p{L}, nor any equivalent notation; its regexes had no awareness of Unicode character properties. (Since ES2018, \p{L} is supported when the u flag is set.) See the answers to this StackOverflow question for some ways to approximate it.
Edited to add: If you only need to support Polish letters, then you can write /^[\sa-zA-ZĄĆĘŁŃÓŚŹŻąćęłńóśźż]+$/. The a-z and A-Z parts cover the ASCII letters, and then the remaining letters are listed out individually.
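A minimal sketch of how that pattern behaves, assuming the script file is saved as UTF-8. The helper name isPolishText is hypothetical and only used for illustration.

```javascript
// Pattern from the answer: whitespace, ASCII letters, and the 18 Polish-specific letters.
const polishText = /^[\sa-zA-ZĄĆĘŁŃÓŚŹŻąćęłńóśźż]+$/;

// Hypothetical helper name, used here only for illustration.
function isPolishText(input) {
  return polishText.test(input);
}

console.log(isPolishText("Zażółć gęślą jaźń")); // true
console.log(isPolishText("Łódź 2022"));          // false: digits are not allowed
console.log(isPolishText(""));                   // false: + requires at least one character
```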
Solution 3
As of ES2015, /u is supported in JavaScript. See:
- https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/unicode
- https://www.ecma-international.org/ecma-262/6.0/#sec-get-regexp.prototype.unicode
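To illustrate what the flag changes, here is a small sketch. The first part only needs ES2015. The second part assumes an engine that also implements the ES2018 Unicode property escapes, which is what makes the pattern from the question usable as-is.

```javascript
// Without u, "." matches one UTF-16 code unit; with u, it matches one code point.
const emoji = "\u{1F600}"; // stored as two UTF-16 code units (a surrogate pair)
console.log(emoji.length);       // 2
console.log(/^.$/.test(emoji));  // false
console.log(/^.$/u.test(emoji)); // true

// Assumption: the engine supports ES2018 Unicode property escapes (\p{...}).
const lettersAndSpaces = /^[\s\p{L}]+$/u;
console.log(lettersAndSpaces.test("Zażółć gęślą jaźń")); // true
console.log(lettersAndSpaces.test("abc123"));             // false
```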
Scott
Updated on July 12, 2022
Comments
-
Scott almost 2 years
Recently I created a regex for my PHP code that allows only letters (including special characters) and spaces, but now I'm having trouble converting it into a JavaScript-compatible regex. Here it is: /^[\s\p{L}]+$/u. The problem is the /u modifier at the end of the pattern, as JavaScript doesn't allow such a flag. How can I rewrite this so that it works in JavaScript as well?
Is there something to allow only Polish characters: Ł, Ą, Ś, Ć, ...?
-
Scott over 11 years
Bad news... so maybe there is something to allow only those Polish characters: Ł, Ą, Ś, Ć, Ę instead?
-
Rich over 11 years
One might argue that the modifier isn't needed in any language/environment that properly handles Unicode instead of allowing a mishmash of binary data and actual Unicode text in strings, as PHP does.
-
Rich over 11 years
Scott, if you have a small set of characters you want to allow, you can always use a character class.
-
Ωmega over 11 years
@Joey - The PHP preg functions, which are based on PCRE, support Unicode when the /u option is appended to the regular expression.
-
Ωmega over 11 years
@Scott - The Polish language uses the Latin script, so go with ranges: [\u0000-\u007F] = Basic Latin; [\u0080-\u00FF] = Latin-1 Supplement; [\u0100-\u017F] = Latin Extended-A; [\u0180-\u024F] = Latin Extended-B; ... which together give [\u0000-\u024F] to include all Latin characters :)
-
Scott over 11 years
@Joey Yeah, generally I would like to additionally allow only those special characters I mentioned above.
-
Rich over 11 years
Ωmega, I know why the flag is needed in PCRE; fundamentally, the problem is that PHP doesn't have a defined character set for strings, leading to some strings being in some legacy character set, some in UTF-8, and some even storing non-text binary data. Environments such as Java or .NET have it far easier in that regard, given that text is always Unicode.
-
DamirR over 11 years
In JavaScript regexes you can refer to Unicode characters like this: \u0161. For example, this will allow only printable ASCII and ć: var newtxt = txt.replace(/[^\u0107\u0020-\u007e]/g, ''). You can look up the Unicode code points for your characters here, for example: fileformat.info/info/unicode/char/107/index.htm
-
ruakh over 11 years
@DamirR: What a bizarre comment. /\u0107/ is equivalent to /ć/; why on Earth would you prefer the former?
-
DamirR over 11 years
@ruakh: Life is full of bizarre moments. :) For /ć/ to work you MUST save the JS file in UTF-8. Sometimes other people might use, change, and save your code, and they might use another encoding (e.g. ISO-8859-1). Then /ć/ will not be saved correctly and the script will not work. If you use /\u0107/, that kind of bug is avoided.
-
Aaron almost 8 years
This answer is one of the first results on Google when searching for "regex u flag", so you might want to update it with a preface stating that it has been defined in ES2015 and is now supported by most recent browsers :)
-
Poul Bak over 5 years
It's currently not supported by all browsers.
-
Admin over 5 years
@PoulBak The Mozilla docs say it's supported by all major browsers, unless they got it wrong.
-
Poul Bak over 5 years
Some versions of Edge will simply crash if you use it, but I think that has been fixed, so you're probably right (no one uses IE any more).
-
Liggliluff over 4 years
@Ωmega If you only want to catch letters, you could use: [\u0041-\u005A\u0061-\u007A\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02B8], but it certainly doesn't look as neat.
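For comparison, a rough sketch of how the two range-based suggestions from the comments differ in practice. The variable names are made up for illustration, and the patterns are anchored here so that the whole string must match.

```javascript
// Broad range from the comments: Basic Latin through Latin Extended-B.
const latinBlocks = /^[\u0000-\u024F]+$/;
// Letters-only ranges from the comment above.
const latinLetters = /^[\u0041-\u005A\u0061-\u007A\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02B8]+$/;

console.log(latinBlocks.test("Łódź 123!"));  // true: digits, space and "!" are in Basic Latin
console.log(latinLetters.test("Łódź"));      // true
console.log(latinLetters.test("Łódź 123!")); // false
console.log(latinLetters.test("×÷"));        // false: U+00D7 and U+00F7 are skipped
```

So the block-based range is simpler to write but also lets through digits, punctuation and control characters, while the letters-only version is stricter.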