How to validate non-English (UTF-8) encoded email addresses in JavaScript and PHP?

Solution 1

Attempting to validate email addresses may not be a good idea. The specifications (RFC5321, RFC5322) allow for so much flexibility that validating them with regular expressions is literally impossible, and validating with a function is a great deal of work. The result of this is that most email validation schemes end up rejecting a large number of valid email addresses, much to the inconvenience of the users. (By far the most common example of this is not allowing the + character.)

A user is more likely to enter (accidentally or deliberately) an incorrect email address than an invalid one, so actually validating is a great deal of work for very little benefit, with possible costs if you do it incorrectly.

I would recommend that you just check for the presence of an @ character on the client and then send a confirmation email to verify it; it's the most practical way to validate and it confirms that the address is correct as well.
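As a rough illustration of the client-side part of this advice, here is a minimal JavaScript sketch. The function name `looksLikeEmail` is hypothetical, and requiring at least one character on each side of the `@` is an assumption that goes slightly beyond the bare "contains an @" check described above. Sending the confirmation email is left to the server and is not shown.

```javascript
// Minimal client-side sanity check (a sketch, not a full validator).
// Assumption beyond the original advice: require at least one character
// on each side of the last "@".
function looksLikeEmail(input) {
  const value = String(input).trim();
  const at = value.lastIndexOf('@');
  return at > 0 && at < value.length - 1;
}

console.log(looksLikeEmail('闪闪发光@闪闪发光.com')); // true
console.log(looksLikeEmail('user+tag@example.com')); // true
console.log(looksLikeEmail('no-at-sign'));           // false
```

Because the check never inspects individual characters, it treats ASCII and non-ASCII addresses alike.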

Solution 2

Since version 5.2, PHP has had built-in validation for email addresses, but I'm not sure whether it works for UTF-8 encoded strings:

echo filter_var($email, FILTER_VALIDATE_EMAIL);

In the original PHP source code you will find the regular expression used for validating email addresses; it can be used for manual validation when running PHP < 5.2.

Update

idn_to_ascii() can be used to "Convert domain name to IDNA ASCII form." The converted address can then be validated with filter_var($email, FILTER_VALIDATE_EMAIL);

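// International domains: convert the domain part to its ASCII (Punycode) form.
// strrpos() finds the last "@", so an "@" inside the local part is not mistaken for the separator.
if (function_exists('idn_to_ascii') && ($at = strrpos($email, '@')) !== false) {
    $ascii_domain = idn_to_ascii(substr($email, $at + 1));
    // idn_to_ascii() returns false on failure; keep the original address in that case
    if ($ascii_domain !== false) {
        $email = substr($email, 0, $at + 1) . $ascii_domain;
    }
}
// filter_var() returns the address on success or false on failure
$is_valid = filter_var($email, FILTER_VALIDATE_EMAIL);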
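For the JavaScript side of the question, a comparable domain conversion can be sketched with the standard `URL` class, whose parser converts internationalized host names to their ASCII form. The helper name `toAsciiEmail` is hypothetical, and the list of rejected characters is an assumption chosen to stop the URL parser from reading part of the input as a path, port, or query. Like the PHP snippet, it converts only the domain and leaves the local part untouched.

```javascript
// Sketch: convert the domain part of an address to its ASCII (Punycode) form.
// toAsciiEmail is a hypothetical helper name, not part of any library.
function toAsciiEmail(email) {
  const at = email.lastIndexOf('@');
  if (at <= 0 || at === email.length - 1) return null;

  const local = email.slice(0, at);
  const domain = email.slice(at + 1);

  // Reject characters that the URL parser would treat as delimiters or
  // escapes (path, query, port, credentials) rather than as part of the host.
  if (/[\s\/\\?#:@%\[\]]/.test(domain)) return null;

  try {
    // The WHATWG URL parser applies the IDNA "domain to ASCII" step to the host.
    const asciiDomain = new URL(`http://${domain}`).hostname;
    return asciiDomain ? `${local}@${asciiDomain}` : null;
  } catch {
    return null; // the host could not be parsed
  }
}

console.log(toAsciiEmail('user@example.com'));      // "user@example.com"
console.log(toAsciiEmail('闪闪发光@闪闪发光.com')); // domain becomes an "xn--" label
```

The result can then be passed to whatever ASCII-only check is already in place, keeping in mind that a non-ASCII local part is still present in the output.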

Solution 3

As suggested by Mario, and after playing around a bit, I came up with the following regex to validate non-standard email addresses:

^([\p{L}\_\.\-\d]+)@([\p{L}\-\.\d]+)((\.(\p{L}){2,63})+)$

It would validate any proper email address containing all kinds of Unicode letters, with the TLD limited to between 2 and 63 characters.

Please check it and let me know if there are any flaws.
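The same pattern can be tried in JavaScript on engines that support Unicode property escapes (the `u` flag, ES2018 and later). The sketch below is an adaptation, not the answer's original text: in `u` mode the escape `\_` is a syntax error, so the underscore is written unescaped, and the capturing groups are dropped because only a yes/no result is needed. The names `unicodeEmailPattern` and `matchesUnicodeEmailPattern` are hypothetical.

```javascript
// Solution 3's pattern adapted to JavaScript's `u` flag (ES2018 and later).
// In `u` mode "\_" is not a valid escape, so the underscore is written bare.
const unicodeEmailPattern =
  /^[\p{L}_.\-\d]+@[\p{L}\-.\d]+(?:\.\p{L}{2,63})+$/u;

function matchesUnicodeEmailPattern(email) {
  return unicodeEmailPattern.test(email);
}

console.log(matchesUnicodeEmailPattern('闪闪发光@闪闪发光.com')); // true
console.log(matchesUnicodeEmailPattern('user@example.com'));     // true
console.log(matchesUnicodeEmailPattern('user+tag@example.com')); // false
```

The last line shows one flaw of the pattern: it rejects the `+` character, which Solution 1 points out is valid in email addresses.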

Author: Deepak Shrestha

I am a developer and technology enthusiast. Like to learn about new technologies and frameworks.

Updated on May 23, 2022

Comments

  • Deepak Shrestha
    Deepak Shrestha almost 2 years

    Part of a website I am currently working on contains a registration process where users have to provide their email address. Just recently I became aware that non-ASCII domains are possible (and so are non-ASCII email addresses). My backend is a UTF-8 encoded MySQL database, and I expect users with different locales to be able to enter their email addresses, but I don't know how to validate this kind of email address.

    Currently I am testing out jQuery Tools; it validates English email addresses correctly but fails to validate non-ASCII ones. I also need to do the same on the server side with PHP. Is there a regular expression that can validate this kind of email address?

    I have tried this, but it fails in jQuery Tools (this is just an example for demonstration; I don't understand it either):

    闪闪发光@闪闪发光.com

    Also, what will happen when they type their English email address ([email protected]) with their own IME? Can this be validated with the regular expression we currently have for English email validation? For now I don't have to worry about whether the email address exists or not.

    Thanks

  • Jeremy
    Jeremy about 13 years
    There is nothing limiting TLDs to 2-6 characters, and given ICANN's decision to allow the creation of arbitrary ones it seems reasonable to assume that addresses such as .microsoft will be in use before too long. Also, it is possible for spaces to be included in valid email addresses if they are properly escaped.
  • Deepak Shrestha
    Deepak Shrestha about 13 years
    Thanks for the suggestion. I wanted to know if mailers like sendmail or phpmail can handle this UTF-8 encoded email address right out of the box without any modification on my part.
  • emmanuel honore
    emmanuel honore about 13 years
    No problem, extend the {2,6} to whatever you want. It could also be replaced by [^ ].
  • Deepak Shrestha
    Deepak Shrestha about 13 years
    Thanks for the info. Validation of this kind seems like a herculean task to me.
  • emmanuel honore
    emmanuel honore about 13 years
    It is not a trivial question. Try to cover as much as you can with your regular expression. Check this link to see what the real regular expression would look like in Perl: ex-parrot.com/~pdw/Mail-RFC822-Address.html
  • Deepak Shrestha
    Deepak Shrestha about 13 years
    Thanks. I guess this is one step in the right direction for server-side validation.
  • symcbean
    symcbean about 13 years
    That regex doesn't really do any validation (it will return false positives and false negatives).
  • Edson Medina
    Edson Medina over 11 years
    \w doesn't match . or - (which are valid characters for both domain and email)
  • s.co.tt
    s.co.tt over 10 years
    While technically correct that validating an email with regex is nearly impossible, I couldn't disagree more with this answer as a general solution. In most real world (non-theoretical) applications, you'd be storing the relevant email address in a database, and/or doing some manipulation on it in the future. Allowing any old UTF-8 string to pass unencumbered to the data layer is a terrible idea. I'd rather reject a few "off the wall" valid email addresses than have a 100% chance of a clever injection attack. In the real world, "hi"\ ~e^ery!@myhost won't come up too often.
  • Ilia
    Ilia over 10 years
    No, it doesn't support UTF-8!
  • D.A.H
    D.A.H over 9 years
    It's valid for PHP, not for JavaScript.
  • Ilia
    Ilia over 9 years
    @D.A.H JavaScript does not support Unicode shortcuts. You could use Steven Levithan's XRegExp package with Unicode add-ons - xregexp.com/plugins.
  • The Bndr
    The Bndr over 9 years
    @EdsonMedina >all emails end with .com< That depends. This answer is more of an example. If you build a company-internal web page and need to validate the mail address in order to allow only company-internal addresses, then this could be one way. Of course, a strict mail syntax is needed.
  • Ilia
    Ilia over 5 years
    What a nice email address! :-) Okay, I've updated the regex. Underscores are indeed allowed by many email providers. Thanks.
  • Jeremy
    Jeremy over 5 years
    @IliaRostovtsev Sorry, didn't see your comment until now. Upvoted. Thanks!
  • Jeff Clayton
    Jeff Clayton over 2 years
    Note for 2021: UTF-8 additions in PCRE (tested in preg_replace in PHP 7.3) may prefer \p{Pd} instead of \- for hyphens, and \p{Nd} instead of \d for decimal numbers if your code seems to fail after upgrading.