Regex validation of email addresses according to RFC5321/RFC5322


Nestable comments make the grammar for email addresses non-regular (it is context-free instead). If you preclude comments, however, the resulting grammar is regular. The primary definition allows (folding) whitespace between lexical tokens (e.g. a @ b.com). Removing all folding whitespace yields a canonical form.

This is the regex for canonical email addresses according to RFC 5322 (precluding comments):

([!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*|"([]!#-[^-~ \t]|(\\[\t -~]))+")@([!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*|\[[\t -Z^-~]*])
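As a quick sanity check, the canonical pattern above can be tried out in Python. This is only a sketch: the helper name `is_canonical_address` is made up for illustration, and it relies on Python's `re` module treating a leading `]` inside a character class as a literal, which not every regex flavour does.

```python
import re

# Canonical-form pattern copied verbatim from the answer
# (no comments, no folding whitespace).
CANONICAL = re.compile(
    r"""([!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*|"([]!#-[^-~ \t]|(\\[\t -~]))+")@([!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*|\[[\t -Z^-~]*])"""
)


def is_canonical_address(text: str) -> bool:
    """Return True when the whole string matches the canonical pattern."""
    return CANONICAL.fullmatch(text) is not None


samples = [
    "user@example.com",        # plain dot-atom local part
    "user+tag@example.com",    # '+' is an ordinary atom character
    '"john doe"@example.com',  # quoted string may contain a space
    "user@[192.168.0.1]",      # domain literal
    "a..b@example.com",        # consecutive dots are not a dot-atom
    "a @ b.com",               # whitespace is not allowed in canonical form
]
for sample in samples:
    print(f"{sample!r}: {is_canonical_address(sample)}")
```

`fullmatch` is used instead of `match` so that trailing garbage after a valid address is rejected as well.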

If you need to accept folding whitespace, then this is the regular expression for email addresses according to RFC 5322 (precluding comments):

((([\t ]*\r\n)?[\t ]+)?[-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*(([\t ]*\r\n)?[\t ]+)?|(([\t ]*\r\n)?[\t ]+)?"(((([\t ]*\r\n)?[\t ]+)?([]!#-[^-~]|(\\[\t -~])))+(([\t ]*\r\n)?[\t ]+)?|(([\t ]*\r\n)?[\t ]+)?)"(([\t ]*\r\n)?[\t ]+)?)@((([\t ]*\r\n)?[\t ]+)?[-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*(([\t ]*\r\n)?[\t ]+)?|(([\t ]*\r\n)?[\t ]+)?\[((([\t ]*\r\n)?[\t ]+)?[!-Z^-~])*(([\t ]*\r\n)?[\t ]+)?](([\t ]*\r\n)?[\t ]+)?)
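To see what the extra whitespace handling buys you, here is a small Python sketch that runs the pattern above against a few inputs. The helper name `matches_with_fws` is hypothetical; the pattern itself is copied unchanged.

```python
import re

# Folding-whitespace pattern copied verbatim from the answer (comments still excluded).
WITH_FWS = re.compile(
    r"""((([\t ]*\r\n)?[\t ]+)?[-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*(([\t ]*\r\n)?[\t ]+)?|(([\t ]*\r\n)?[\t ]+)?"(((([\t ]*\r\n)?[\t ]+)?([]!#-[^-~]|(\\[\t -~])))+(([\t ]*\r\n)?[\t ]+)?|(([\t ]*\r\n)?[\t ]+)?)"(([\t ]*\r\n)?[\t ]+)?)@((([\t ]*\r\n)?[\t ]+)?[-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*(([\t ]*\r\n)?[\t ]+)?|(([\t ]*\r\n)?[\t ]+)?\[((([\t ]*\r\n)?[\t ]+)?[!-Z^-~])*(([\t ]*\r\n)?[\t ]+)?](([\t ]*\r\n)?[\t ]+)?)"""
)


def matches_with_fws(text: str) -> bool:
    """Return True when the whole string matches the folding-whitespace pattern."""
    return WITH_FWS.fullmatch(text) is not None


samples = [
    "user@example.com",  # canonical addresses still match
    "a @ b.com",         # whitespace around the @-sign is accepted
    "a\r\n @b.com",      # a line fold: CRLF followed by at least one space or tab
    "a\r\n@b.com",       # CRLF without following whitespace is not a fold
    "a b@c.com",         # whitespace inside an atom is still not allowed
]
for sample in samples:
    print(f"{sample!r}: {matches_with_fws(sample)}")
```

Stripping the folding whitespace from an accepted input gives a string that the canonical pattern accepts, which is the sense in which the first regex describes a canonical form.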

Valid email addresses are further restricted in RFC 5321 (SMTP). It basically leaves alone the part before the @-sign, but accepts only host names or address literals after the @-sign. ("---.---" is a valid dot-atom, but not a valid host name and "[...]" is a valid domain literal, but not a valid address literal.)

The grammar presented in RFC 5321 is too lenient when it comes to both host names and IP addresses. I took the liberty of "correcting" the rules in question, using this draft and RFC 1034 (section 3.5) as guidelines. Here's the resulting regex.

([!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*|"([]!#-[^-~ \t]|(\\[\t -~]))+")@([0-9A-Za-z]([0-9A-Za-z-]{0,61}[0-9A-Za-z])?(\.[0-9A-Za-z]([0-9A-Za-z-]{0,61}[0-9A-Za-z])?)*|\[((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])){3}|IPv6:((((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){6}|::((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){5}|[0-9A-Fa-f]{0,4}::((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){4}|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):)?(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}))?::((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){3}|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){0,2}(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}))?::((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){2}|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){0,3}(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}))?::(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){0,4}(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}))?::)((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3})|(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])){3})|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){0,5}(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}))?::(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3})|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){0,6}(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}))?::)|(?!IPv6:)[0-9A-Za-z-]*[0-9A-Za-z]:[!-Z^-~]+)])
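The difference between a dot-atom and a host name is easier to see on the sub-patterns in isolation. The sketch below excerpts the dot-atom, host-name label and IPv4 octet fragments from the regexes above; the constant names (`DOT_ATOM`, `HOSTNAME`, `IPV4_LITERAL`) are made up for illustration, and the IPv6 and general address-literal branches are left out for brevity.

```python
import re

# Dot-atom as used in the RFC 5322 regex.
DOT_ATOM = re.compile(r"[!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*")

# Host-name label from the RFC 5321 regex: 1 to 63 characters,
# starting and ending with a letter or digit.
LABEL = r"[0-9A-Za-z]([0-9A-Za-z-]{0,61}[0-9A-Za-z])?"
HOSTNAME = re.compile(LABEL + r"(\." + LABEL + r")*")

# IPv4 address literal: four decimal octets in the range 0-255, in square brackets.
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
IPV4_LITERAL = re.compile(r"\[" + OCTET + r"(\." + OCTET + r"){3}]")

checks = [
    (DOT_ATOM, "---.---"),            # valid dot-atom
    (HOSTNAME, "---.---"),            # but not a valid host name
    (HOSTNAME, "example.com"),
    (HOSTNAME, "a" * 64),             # label longer than 63 characters
    (IPV4_LITERAL, "[192.168.0.1]"),
    (IPV4_LITERAL, "[256.1.1.1]"),    # octet out of range
    (IPV4_LITERAL, "[...]"),          # domain literal, not an address literal
]
for pattern, text in checks:
    print(f"{text!r}: {pattern.fullmatch(text) is not None}")
```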

All regexes are POSIX EREs, except for the last one, which uses a negative lookahead (a construct that POSIX ERE itself does not provide). See here for the derivations of the regular expressions.


Author: Rinke

Updated on October 09, 2022

Comments

  • Rinke
    Rinke 3 months

    Does anyone know a regex that validates email addresses according to RFC5321/RFC5322?

    Since (nestable) comments make the grammar irregular, only addresses without comments should be regarded.

    Of course, if you're interested in validating an address that is actually owned by someone then the only real validation is to send an email to the address and check if the owner received it. I am however purely interested in the RFC standards. For a practical approach this question is more relevant.

    On top of comments I am willing to sacrifice folding white space, but apart from that I'm not interested in expressions that reject any addresses that are RFC5321/2-valid. (Arguably it would even make sense in some circumstances to disregard folding white space.)

    Ideally the regex would reject anything that's not RFC-valid, but that's less important. It's not so interesting to include an exhaustive list of top-level domains in the regex, for example. Simply accepting any top-level domain will suffice.

    I'm not sure if address tags (e.g. [email protected]) are part of the RFCs I mentioned, but I would like the regex to validate these.

    IPv6 should definitely be handled correctly (RFC5952).

    As I understand it, internationalized email (RFC6530, RFC6531, RFC6532, RFC6533) is still in the experimental phase, but an expression validating these addresses would also be interesting.

    To make the answers universally interesting it would be nice if any regular expressions were in POSIX format.

    • Bergi
      Bergi about 10 years
      That's impossible with traditional regex flavours. Email addresses can contain comments with arbitrarily deep nesting, and such a structure cannot be parsed by a regular grammar.
  • Rinke
    Rinke almost 10 years
    Thanks. I knew about this regex, but I'm interested in RFC5321/2.
  • Michael Stramel
    Michael Stramel almost 8 years
    RFC822 is outdated; you should be using RFC5322 instead. en.wikipedia.org/wiki/List_of_RFCs
  • Mihail Krivushin
    Mihail Krivushin over 4 years
    These regexps are not compliant with RFC6532, because they restrict the contact part to ASCII.
  • Rinke
    Rinke over 4 years
    @MihailKrivushin Couldn’t agree more. The question was about RFC5321/2 specifically though...
  • mxmlnkn
    mxmlnkn over 3 years
    Why is there no a-z in the character groups in the first regex? And what characters does the ^-~ range include? Is that range intended?
  • Rinke
    Rinke over 3 years
    @mxmlnkn The a-z range is included in ^-~. If you search for an ASCII table you can see which characters are included in the ranges.
  • LonelyCpp
    LonelyCpp about 3 years
    This throws an empty character class warning from eslint - eslint.org/docs/rules/no-empty-character-class
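On the question in the comments about what the `^-~` range covers: instead of looking it up in an ASCII table, the range can be expanded programmatically. A minimal Python sketch (the helper name `expand_range` is made up for illustration):

```python
import string


def expand_range(first: str, last: str) -> str:
    """Return every character from first to last, inclusive, by code point."""
    return "".join(chr(code) for code in range(ord(first), ord(last) + 1))


caret_to_tilde = expand_range("^", "~")
print(caret_to_tilde)       # ^_`abcdefghijklmnopqrstuvwxyz{|}~
print(len(caret_to_tilde))  # 33

# The lowercase letters are a subset of the range, so a separate a-z is redundant.
print(set(string.ascii_lowercase) <= set(caret_to_tilde))  # True
```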