Concrete JavaScript regular expression for accented characters (diacritics)


Solution 1

The easiest way to accept the common Western European accented letters is a character range:

[A-zÀ-ú] // accepts lowercase and uppercase characters (but also [ \ ] ^ _ ` × ÷)
[A-zÀ-ÿ] // as above, plus û ü ý þ ÿ (still includes [ \ ] ^ _ ` × ÷)
[A-Za-zÀ-ÿ] // as above, but not including [ \ ] ^ _ `
[A-Za-zÀ-ÖØ-öø-ÿ] // as above, but not including × ÷ either

See Unicode Character Table for characters listed in numeric order.
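
A quick way to see why the last variant is the safest of the four is to probe the classes with the symbols that sit inside the A-z and À-ÿ ranges. This is only a sketch: the variable names are made up for illustration, and the classes are written with \u escapes (equivalent to the glyph forms above) so the result does not depend on file encoding.

```javascript
// Same classes as above, written with escapes.
const loose = /^[A-z\u00C0-\u00FF]$/;                               // [A-zÀ-ÿ]
const strict = /^[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]$/; // [A-Za-zÀ-ÖØ-öø-ÿ]

// Non-letters that fall inside A-z (U+005B..U+0060) or À-ÿ (U+00D7, U+00F7).
const symbols = ["[", "\\", "]", "^", "_", "`", "\u00D7", "\u00F7"];
// A few letters both classes are meant to accept: A z À é ñ ÿ
const letters = ["A", "z", "\u00C0", "\u00E9", "\u00F1", "\u00FF"];

const looseSymbolHits = symbols.filter((ch) => loose.test(ch));
const strictSymbolHits = symbols.filter((ch) => strict.test(ch));
const strictLetterHits = letters.filter((ch) => strict.test(ch));

console.log(looseSymbolHits.length);  // 8: every symbol slips through
console.log(strictSymbolHits.length); // 0: none of them match
console.log(strictLetterHits.length); // 6: the letters still match
```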

Solution 2

The accented Latin range \u00C0-\u017F was not quite enough for my database of names, so I extended the regex to

[a-zA-Z\u00C0-\u024F]
[a-zA-Z\u00C0-\u024F\u1E00-\u1EFF] // includes even more Latin chars

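I added these code blocks (\u00C0-\u024F includes three adjacent blocks at once):

\u00C0-\u00FF Latin-1 Supplement
\u0100-\u017F Latin Extended-A
\u0180-\u024F Latin Extended-B
\u1E00-\u1EFF Latin Extended Additional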

Note that \u00C0-\u00FF is actually only a part of Latin-1 Supplement. It skips unprintable control signals and all symbols except for the awkwardly-placed multiply × \u00D7 and divide ÷ \u00F7.

[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F] // exclude ×÷

If you need more code points, you can find more ranges on Wikipedia's List of Unicode characters. For example, you could also add Latin Extended-C, D, and E, but I left them out because only historians seem interested in them now, and the D and E sets don't even render correctly in my browser.

The original regex stopping at \u017F borked on the name "Șenol". According to FontSpace's Unicode Analyzer, that first character is \u0218, LATIN CAPITAL LETTER S WITH COMMA BELOW. (Yeah, it's usually spelled with a cedilla-S \u015E, "Şenol." But I'm not flying to Turkey to go tell him, "You're spelling your name wrong!")
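
The difference between the two ranges can be checked directly against that name. This is a minimal sketch with illustrative variable names; the names are spelled with \u escapes so that the comma-below and cedilla forms cannot be confused.

```javascript
// Range from the question (stops at the end of Latin Extended-A).
const upTo017F = /^[a-zA-Z\u00C0-\u017F]+$/;
// Extended range from this answer (through Latin Extended-B, without × and ÷).
const upTo024F = /^[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]+$/;

const commaBelow = "\u0218enol"; // Ș = U+0218, in Latin Extended-B
const cedilla = "\u015Eenol";    // Ş = U+015E, in Latin Extended-A

console.log(upTo017F.test(commaBelow)); // false
console.log(upTo024F.test(commaBelow)); // true
console.log(upTo017F.test(cedilla));    // true
console.log(upTo024F.test("\u00D7"));   // false: × is excluded
```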

Solution 3

Which of these three approaches is most suited for the task?

Depends on the task :-) To match exactly all Latin characters and their accented versions, the Unicode ranges probably provide the best solution. They might be extended to all non-whitespace characters, which could be done using the \S character class.

I'm forcing a field in a UI to match the format: last_name, first_name (last [comma space] first)

The most basic problem I'm seeing here is not diacritics but whitespace. Some names consist of multiple words, e.g. because of titles. So you should go with the most generic pattern, one that allows everything except the comma that separates the last name from the first:

/[^,]+,\s[^,]+/

But your second solution with the . character class works just as well; you might only need to guard against multiple commas then.
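
To make the multiple-comma point concrete, here is a small sketch comparing the pattern above with an anchored version (the anchors are an addition; the variable names and sample inputs are only for illustration).

```javascript
const unanchored = /[^,]+,\s[^,]+/;
const anchored = /^[^,]+,\s[^,]+$/;

// Multi-word and punctuated names pass, because only the comma is special.
console.log(anchored.test("O'Donnell, Chris"));               // true
console.log(anchored.test("Garc\u00EDa M\u00E1rquez, Gabriel")); // true

// Without anchors the pattern finds "a, b" inside the longer string.
console.log(unanchored.test("a, b, c")); // true
console.log(anchored.test("a, b, c"));   // false

// The whitespace after the comma is required.
console.log(anchored.test("Smith,John")); // false
```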

Solution 4

The XRegExp library has a plugin named Unicode that helps solve tasks like this.

<script src="xregexp.js"></script>
<script src="addons/unicode/unicode-base.js"></script>
<script>
  var unicodeWord = XRegExp("^\\p{L}+$");

  unicodeWord.test("Русский"); // true
  unicodeWord.test("日本語"); // true
  unicodeWord.test("العربية"); // true
</script>
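
In runtimes that support Unicode property escapes (ES2018 and later), the same test can be written without a library by adding the u flag. This is a sketch of the native equivalent, assuming such a runtime is available; it is not part of XRegExp.

```javascript
// \p{L} matches any Unicode letter; it is only recognised with the u flag.
const unicodeWord = /^\p{L}+$/u;

console.log(unicodeWord.test("\u0420\u0443\u0441\u0441\u043A\u0438\u0439")); // true (Русский)
console.log(unicodeWord.test("\u65E5\u672C\u8A9E"));                         // true (日本語)
console.log(unicodeWord.test("\u0218enol"));                                 // true (Șenol)
console.log(unicodeWord.test("abc123"));                                     // false: digits are not letters
```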

Solution 5

You can use this:

/^[a-zA-ZÀ-ÖØ-öø-ÿ]+$/

Author: Chris Cirefice

Updated on July 08, 2022

Comments

  • Chris Cirefice
    Chris Cirefice almost 2 years

    I've looked on Stack Overflow (replacing characters, how JavaScript doesn't follow the Unicode standard concerning RegExp, etc.) and haven't really found a concrete answer to the question "How can JavaScript match accented characters (those with diacritical marks)?"

    I'm forcing a field in a UI to match the format: last_name, first_name (last [comma space] first), and I want to provide support for diacritics, but evidently in JavaScript it's a bit more difficult than other languages/platforms.

    This was my original version, until I wanted to add diacritic support:

    /^[a-zA-Z]+,\s[a-zA-Z]+$/

    Currently I'm debating one of three methods to add support, all of which I have tested and work (at least to some extent, I don't really know what the "extent" is of the second approach). Here they are:

    Explicitly listing all accented characters that I would want to accept as valid (lame and overly-complicated):


    var accentedCharacters = "àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ";
    // Build the full regex
    var regex = "^[a-zA-Z" + accentedCharacters + "]+,\\s[a-zA-Z" + accentedCharacters + "]+$";
    // Create a RegExp from the string version
    var regexCompiled = new RegExp(regex);
    // regexCompiled = /^[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+,\s[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+$/
    
    • This correctly matches a last/first name with any of the supported accented characters in accentedCharacters.

    My other approach was to use the . character class, to have a simpler expression:

    var regex = /^.+,\s.+$/;
    
    • This would match for just about anything, at least in the form of: something, something. That's alright I suppose...

    The last approach, which I just found might be simpler...

    /^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/
    
    • It matches a range of Unicode characters - tested and working, though I didn't try anything crazy, just the normal stuff I see in our language department for faculty member names.

    Here are my concerns:

    1. The first solution is far too limiting, and sloppy and convoluted at that. It would need to be changed if I forgot a character or two, and that's just not very practical.

    2. The second solution is better, concise, but it probably matches far more than it actually should. I couldn't find any real documentation on exactly what . matches, just the generalization of "any character except the newline character" (from a table on the MDN).

    3. The third solution seems to be the most precise, but are there any gotchas? I'm not very familiar with Unicode, at least in practice, but looking at a code table/continuation of that table, \u00C0-\u017F seems to be pretty solid, at least for my expected input.

    • Faculty won't be submitting forms with their names in their native language (e.g., Arabic, Chinese, Japanese, etc.), so I don't have to worry about out-of-Latin-character-set characters

    Which of these three approaches is most suited for the task? Or are there better solutions?
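
    On the second concern, what . matches can be checked directly. The sketch below is not from the original post: without the s flag, . rejects only the four line terminators, and without the u flag it consumes one UTF-16 code unit at a time.

```javascript
const dot = /^.$/;

// Ordinary characters, accented letters and symbols all match.
console.log(dot.test("a"));      // true
console.log(dot.test("\u00E9")); // true (é)
console.log(dot.test("\u00D7")); // true (×)

// The four line terminators are the only exclusions.
const terminators = ["\n", "\r", "\u2028", "\u2029"];
console.log(terminators.some((ch) => dot.test(ch))); // false

// With the s (dotAll) flag, even those match.
console.log(/^.$/s.test("\n")); // true

// A character outside the BMP is two code units unless the u flag is set.
const astral = "\u{1F600}";
console.log(dot.test(astral));    // false
console.log(/^.$/u.test(astral)); // true
```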

    • Jongware
      Jongware over 10 years
      There seems to be no particular reason to use the more complicated regexps. Only thing about the most simple solution is, it will also match "something, something, something". You could use something like regex = /^[^,]+,\s[^,]+$/; to prevent that.
    • Jongware
      Jongware over 10 years
      At a glance, the first one won't match the common name "O'Donnell, Chris" nor compound last names with a hyphen, nor multiple last names (etc.). See Falsehoods Programmers Believe About Names for just about every possible pitfall.
    • Bergi
      Bergi over 10 years
      "the . atom matches anything except newlines" actually is quite exact :-)
    • stema
      stema over 10 years
      If it is possible for you to use an additional library you can have a look at my answer here
    • Chris Cirefice
      Chris Cirefice over 10 years
      Jongware, I actually just read that article while I was browsing SO for an answer to my question - I also completely forgot about hyphens and apostrophes and the like, I was more concerned with making it international first :P I'm glad you brought it up though! And Stema, I actually looked at that library and I avoid incorporating libraries because this is all on Google Apps Script - incorporating external libraries would be a nightmare, and I would only be using it (in this case) for one particular field... kind of overkill :P
  • Chris Cirefice
    Chris Cirefice over 10 years
    Hm, maybe you're right. I probably over-complicated it... Could you explain the regex you provided? I've been working with regex for a little while now, but only basic stuff, and really I don't have a clue what yours actually does! Ha
  • Bergi
    Bergi over 10 years
    It's a negated character class - meaning "anything besides the comma".
  • Chris Cirefice
    Chris Cirefice over 10 years
    Ah, so it reads more like any_character_not_a_comma, any_character_not_a_comma? That's what I thought when I first read it, I got kind of confused when I saw three commas in there.
  • Bergi
    Bergi over 10 years
    Yes exactly. Sorry for the confusion with the missing s for the whitespace…
  • Chris Cirefice
    Chris Cirefice over 10 years
    Yep, I figured that was supposed to be \s, but my first thought was oh, I wonder what `\` does? Haha no big deal, thanks!
  • Chris Cirefice
    Chris Cirefice over 9 years
    Nice, turns out that I didn't actually need to regex on unicode, but rather on the pattern anything, anything. This will be useful for future readers :)
  • Pierre Henry
    Pierre Henry about 8 years
    It works nicely, +1, but could you elaborate why it works ?
  • Angad
    Angad about 8 years
    @PierreHenry the - defines a range, and this technique exploits the ordering of characters in the charset to define a continuous range, making for a super concise solution to the problem
  • Pierre Henry
    Pierre Henry about 8 years
    Thanks. Does it work with Unicode and other Latin charsets (such as iso-8859-1) as well? (Or, is the ordering of the characters the same across different charsets?) I think these additional details should be added to the answer since the solution itself is quite elegant imo.
  • jcuenod
    jcuenod almost 8 years
    won't this match underscores (and the other non-word characters between Z and a)?
  • Nate
    Nate over 7 years
    This matches at least the characters [, ], ^, and \, none of which should be included.
  • Jérémy Pouyet
    Jérémy Pouyet over 7 years
    Not working, few characters in this range are not accented characters (U+00D7 is the multiplication sign for example) see this: unicode-table.com/en
  • Phil
    Phil about 7 years
    \S is way too permissive for names, it will accept hyphens and such
  • Bergi
    Bergi about 7 years
    @fdsfdsfdsfds You've never seen a name with a hyphen? There are many.
  • JLRishe
    JLRishe about 7 years
    This matches [, \, ], ^, _, and `.
  • Mateo Tibaquira
    Mateo Tibaquira almost 7 years
    I needed to select words with special chars, and thanks to this approach I ended up splitting them with /\b[^\s]+/g
  • Bergi
    Bergi almost 7 years
    @MateoTibaquirá You can simplify [^\s] to \S
  • Scott Flack
    Scott Flack over 6 years
    I'm not getting matches for any other characters mentioned above (underscores, slashes, power of sign etc) using value.search(/^[a-zÀ-ÿ \-]+$/i)
  • Illia Ratkevych
    Illia Ratkevych over 6 years
    This still "works" only for Latin-based languages. It does not work for Chinese or Cyrillic languages.
  • cprcrack
    cprcrack almost 6 years
    Having a look at the unicode table latin block, I think you should also include \u1e00-\u1eff, so I'm doing [a-zA-Z\u00c0-\u024f\u1e00-\u1eff]
  • 1.21 gigawatts
    1.21 gigawatts over 5 years
    @IlliaRatkevych There are a lot of language characters that can be added. Do you want to add Cyrillic? Use unicode-table.com/en table to select the ranges and add them to the set.
  • barbsan
    barbsan over 5 years
    But OP wants to allow accented characters.
  • Gajus
    Gajus over 4 years
    Doesn't match Š.
  • Gajus
    Gajus over 4 years
    Doesn't match Šš.
  • bigsee
    bigsee almost 4 years
    I know the OP was asking about regex but this was a solid answer and solved the issue for me. See the current top voted answer question here for a fuller explanation.
  • therobyouknow
    therobyouknow over 3 years
    what do you mean when you say "includes" / "does not include" [ ] ^ \ × ÷ - these are math operations not accented letters.
  • therobyouknow
    therobyouknow over 3 years
    Is it because, when using - as in À-ÿ for example, the math characters [ ] ^ \ × ÷ are defined in the character set within that range, even though they are not themselves accented characters used in words?
  • SunWuKung
    SunWuKung over 3 years
    Doesn't match ŐőŰű .
  • Barnee
    Barnee over 3 years
    This is the same thing but with glyphs: [a-zA-ZÀ-ÖÙ-öù-ÿĀ-žḀ-ỿ0-9].
  • 219CID
    219CID almost 3 years
    this removes Japanese characters - any idea how to include those?
  • pacoverflow
    pacoverflow almost 3 years
    @Gajus Then just put those 2 in the character class!
  • pacoverflow
    pacoverflow almost 3 years
    Reading the comments and seeing all the accented letters that aren't matched, and all the non-letters that are matched, it appears there is no good solution to this problem.
  • Gajus
    Gajus almost 3 years
    @pacoverflow The concern is not whether Šš are matched specifically, but if they are not matched, then the question becomes what else is not matched.
  • Ahmed Fasih
    Ahmed Fasih over 2 years
    This should now work with all JS runtimes supporting Unicode property escapes! But you need to tweak it a bit, adding {} around L and M: /[\p{L}\p{M}\p{Zs}.-]+/gu. This matches Chinese characters as well, so if you want to only match Latin characters with accents, try /[\p{Script=Latin}\p{M}\p{Zs}.-]+/gu. For a large table of many useful character categories, check javascript.info/regexp-unicode
  • TylerH
    TylerH over 2 years
    This is useful for anyone else matching the exact same word in the exact same language, but that's not what this question is about (and it's unlikely anyone else will share this extremely specific requirement of yours). Answers should directly address the question, not be orthogonally related, at best.
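
Pulling the comment thread together, the Unicode property escapes suggested above can be combined with the "last, first" format from the question. This is a sketch under assumptions, not a definitive validator: namePart and nameField are hypothetical names, the apostrophe is added to the class because of the "O'Donnell" comment, and a runtime with ES2018 property escapes is assumed.

```javascript
// Latin-script letters, combining marks, spaces, apostrophe, period, hyphen.
const namePart = "[\\p{Script=Latin}\\p{M}\\p{Zs}'.-]+";
// last [comma whitespace] first, anchored so extra commas are rejected.
const nameField = new RegExp(`^${namePart},\\s${namePart}$`, "u");

console.log(nameField.test("\u0218enol, \u00D6mer")); // true (Șenol, Ömer)
console.log(nameField.test("O'Donnell, Chris"));      // true
console.log(nameField.test("Jose\u0301, Ana"));       // true: e + combining acute
console.log(nameField.test("a, b, c"));               // false: second comma
// Cyrillic letters are not in the Latin script, so this is rejected.
console.log(nameField.test("\u0420\u0443\u0441\u0441\u043A\u0438\u0439, \u0418\u0432\u0430\u043D")); // false
```

Swapping \p{Script=Latin} for \p{L} would accept letters from any script, at the cost of no longer restricting the field to Latin names.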