Regex modifier /u in JavaScript?

31,522

Solution 1

The /u modifier enables Unicode support. It was added to JavaScript in ES2015.

Read http://stackoverflow.com/questions/280712/javascript-unicode to learn more about Unicode in JavaScript regexes.
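As a rough illustration of what the flag changes (a minimal sketch, assuming an ES2015-capable engine; the variable names are made up for this example): without /u the dot matches a single UTF-16 code unit, so a character outside the Basic Multilingual Plane counts as two.

```javascript
// U+1F600 lies outside the BMP, so it is stored as two UTF-16 code units.
const emoji = '\u{1F600}';

// Without /u, "." matches one code unit, so a single astral character fails ^.$
const withoutU = /^.$/.test(emoji);          // false

// With /u, "." matches one whole code point.
const withU = /^.$/u.test(emoji);            // true

// Polish letters all sit inside the BMP, so the flag makes no difference for them.
const polishWithoutU = /^.$/.test('\u0141'); // Ł, true
const polishWithU = /^.$/u.test('\u0141');   // Ł, true

console.log(withoutU, withU, polishWithoutU, polishWithU);
```

For the Polish letters discussed here, the flag is therefore optional; it matters once characters beyond U+FFFF are involved.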


Polish characters:

Ą \u0104
Ć \u0106
Ę \u0118
Ł \u0141
Ń \u0143
Ó \u00D3
Ś \u015A
Ź \u0179
Ż \u017B
ą \u0105
ć \u0107
ę \u0119
ł \u0142
ń \u0144
ó \u00F3
ś \u015B
ź \u017A
ż \u017C

All special Polish characters:

[\u0104\u0106\u0118\u0141\u0143\u00D3\u015A\u0179\u017B\u0105\u0107\u0119\u0142\u0144\u00F3\u015B\u017A\u017C]
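A minimal sketch of checking that class, assuming the goal is to confirm it matches each of the eighteen letters listed above and nothing from plain ASCII (the names `polishLetter`, `letters`, `allMatch`, and `asciiMatches` are hypothetical):

```javascript
// The character class from above, written with \u escapes only.
const polishLetter = /[\u0104\u0106\u0118\u0141\u0143\u00D3\u015A\u0179\u017B\u0105\u0107\u0119\u0142\u0144\u00F3\u015B\u017A\u017C]/;

// The same letters typed literally; normalize to NFC in case an editor
// stored them as base letter + combining mark.
const letters = 'ĄĆĘŁŃÓŚŹŻąćęłńóśźż'.normalize('NFC');

// Every listed letter should be matched by the class.
const allMatch = Array.from(letters).every(ch => polishLetter.test(ch));

// Plain ASCII letters are not part of the class.
const asciiMatches = polishLetter.test('a') || polishLetter.test('Z');

console.log(Array.from(letters).length, allMatch, asciiMatches);
```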

Solution 2

JavaScript doesn't have any notion of UTF-8 strings, so it's unlikely that you need the /u flag. (Your strings are probably already in the usual JavaScript form, one UTF-16 code-unit per "character".)

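The bigger problem, at the time this answer was written, was that JavaScript didn't support \p{L} or any equivalent notation; its regexes had no awareness of Unicode character properties. (ES2018 has since added \p{...} property escapes, which work only together with the /u flag.) See the answers to this StackOverflow question for some ways to approximate it in older engines.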
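A minimal sketch of the ES2018 route, assuming an engine that implements Unicode property escapes (they are only recognised under the /u flag); the names `lettersAndSpaces` and the result variables are hypothetical. In such an engine the pattern from the question can be used unchanged:

```javascript
// \p{L} matches any Unicode letter; it requires the u flag.
const lettersAndSpaces = /^[\s\p{L}]+$/u;

const polishOk = lettersAndSpaces.test('Zażółć gęślą jaźń'); // letters and spaces only
const digitsRejected = !lettersAndSpaces.test('abc123');     // digits are not letters
const emptyRejected = !lettersAndSpaces.test('');            // "+" needs at least one character

console.log(polishOk, digitsRejected, emptyRejected);
```

Note that this accepts letters from every script, not just Polish ones, which is broader than an explicit character class.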


Edited to add: If you only need to support Polish letters, then you can write /^[\sa-zA-ZĄĆĘŁŃÓŚŹŻąćęłńóśźż]+$/. The a-z and A-Z parts cover the ASCII letters, and then the remaining letters are listed out individually.
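A sketch of how that pattern might be wrapped for input validation, assuming the source file is saved as UTF-8 so the literal letters survive; the names `polishText` and `isPolishText` are hypothetical:

```javascript
// ASCII letters, whitespace, and the eighteen Polish letters with diacritics.
// All of these are BMP characters, so no /u flag is needed.
const polishText = /^[\sa-zA-ZĄĆĘŁŃÓŚŹŻąćęłńóśźż]+$/;

function isPolishText(value) {
  // Normalize to NFC so "z" + combining dot above is compared as "ż".
  return polishText.test(value.normalize('NFC'));
}

console.log(isPolishText('Łódź'));              // true
console.log(isPolishText('Zażółć gęślą jaźń')); // true
console.log(isPolishText('Müller'));            // false: ü is not in the class
console.log(isPolishText(''));                  // false: "+" needs at least one character
```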

Solution 3

As of ES2015, /u is supported in JavaScript. See:

Author by

Scott

Updated on July 12, 2022

Comments

  • Scott
    Scott almost 2 years

    I recently created a regex for my PHP code that allows only letters (including special characters) and spaces, but I'm having trouble converting it into a JavaScript-compatible regex. Here it is: /^[\s\p{L}]+$/u. The problem is the /u modifier at the end of the pattern, as JavaScript doesn't allow such a flag.

    How can I rewrite this, so it will work in the JavaScript as well?

    Is there something to allow only Polish characters: Ł, Ą, Ś, Ć, ...

  • Scott
    Scott over 11 years
    Bad news... so maybe there is something to allow only those Polish characters: Ł, Ą, Ś, Ć, Ę instead?
  • Rich
    Rich over 11 years
    One might argue that the modifier isn't needed in any language/environment that properly handles Unicode instead of a mishmash of binary data and actual Unicode text in strings such as PHP.
  • Rich
    Rich over 11 years
    Scott, if you have a small set of characters you want to allow you can always use a character class.
  • Ωmega
    Ωmega over 11 years
    @Joey - The PHP preg functions, which are based on PCRE, support Unicode when the /u option is appended to the regular expression.
  • Ωmega
    Ωmega over 11 years
    @Scott - Polish uses the Latin script, so go with ranges: [\u0000-\u007F] = Basic Latin; [\u0080-\u00FF] = Latin-1 Supplement; [\u0100-\u017F] = Latin Extended-A; [\u0180-\u024F] = Latin Extended-B; ... which together give [\u0000-\u024F] to include all Latin characters :)
  • Scott
    Scott over 11 years
    @Joey Yeah, generally I would like to additionally allow only those special characters I mentioned above.
  • Rich
    Rich over 11 years
    Ωmega, I know why the flag is needed in PCRE and fundamentally it's the problem that PHP doesn't have a defined character set for strings, leading to some strings being in some legacy character set, some in UTF-8, some storing even non-text binary data. Environments such as Java or .NET have it far easier in that regard, given that text is always Unicode.
  • DamirR
    DamirR over 11 years
    In a JavaScript regexp you can refer to Unicode characters like this: \u0161. For example, this will allow only printable ASCII and ć: var newtxt = txt.replace(/[^\u0107\u0020-\u007e]/g, ''). You can look up the Unicode code points for your characters here, for example: fileformat.info/info/unicode/char/107/index.htm
  • ruakh
    ruakh over 11 years
    @DamirR: What a bizarre comment. /\u0107/ is equivalent to /ć/; why on Earth would you prefer the former?
  • DamirR
    DamirR over 11 years
    @ruakh: Life is full of bizarre moments. :) For /ć/ to work you MUST save the JS file in UTF-8. Sometimes other people might use, change, and save your code, and they might use another encoding (e.g. ISO-8859-1). Then /ć/ will not be saved correctly and the script will not work. If you use /\u0107/, that kind of bug is avoided.
  • Aaron
    Aaron almost 8 years
    This answer is one of the first results on Google when searching for "regex u flag", so you might want to update it with a preface stating that the flag was defined in ES2015 and is now supported by most recent browsers :)
  • Poul Bak
    Poul Bak over 5 years
    It's currently not supported by all browsers.
  • Admin
    Admin over 5 years
    @PoulBak It says on the Mozilla docs it's supported by all major browsers, unless they got it wrong.
  • Poul Bak
    Poul Bak over 5 years
    Some versions of Edge will simply crash if you use it, but I think that has been fixed, so you're probably right (no one uses IE any more).
  • Liggliluff
    Liggliluff over 4 years
    @Ωmega If you only want to catch letters, you could use: [\u0041-\u005A\u0061-\u007A\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02B8] but it certainly doesn't look as neat.