grep valid domain regex

Solution 1

A truly complete solution requires more work, but here's an approximation that may work well enough (note that a @ prefix is assumed and the input string is expected to start with it):

^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$

You can use this with egrep (or grep -E), but also with [[ ... =~ ... ]], bash's regex-matching operator.
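As a minimal sketch of both usages (the helper names is_domain and filter_domains are hypothetical, not part of the original answer), keeping the regex in a variable avoids quoting problems inside [[ ... ]]:

```shell
#!/usr/bin/env bash
# Regex from the answer above, stored in a variable so bash does not re-parse it.
re='^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$'

# Hypothetical helper: succeeds if $1 matches the approximation.
is_domain() {
  [[ $1 =~ $re ]]
}

# Hypothetical helper: print only the matching lines read from stdin.
filter_domains() {
  grep -E "$re"
}

is_domain '@dom.ext' && echo 'match'
printf '%s\n' '@dom.ext' '@subdom..dom.ext' | filter_domains
```

The second command should print only @dom.ext, since the empty label in @subdom..dom.ext cannot match.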

It makes the following assumptions, some of which are more permissive and some more restrictive than actual DNS name constraints:

  • Only ASCII letters are allowed; see below for Internationalized Domain Name (IDN) considerations. The Punycode (ASCII-compatible) forms of IDNs, e.g., xn--bcher-kva.ch for bücher.ch, are not matched either, because the regex rejects the consecutive hyphens in the xn-- prefix; see below.

  • There's no limit on the number of nested subdomains.

  • There's no limit on the length of any label (name component), and no limit on the overall length of the name (actual DNS limits are 63 octets per label and 255 octets for the whole name, i.e., 253 characters in the usual dotted notation).

  • The TLD (last component) is composed of letters only and has a length of at least 2.

  • Both subdomain and domain names must start with a letter; subdomains are allowed to be single-letter.
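If you do want length limits on top of the regex, a rough sketch could look like the following. The helper name within_dns_limits is hypothetical, and the code assumes an ASCII-only name with the leading @ used throughout this answer; it is meant to be combined with the regex check, not to replace it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check the DNS length limits that the regex ignores.
# Assumes an ASCII-only name, so one character equals one octet.
within_dns_limits() {
  local name=${1#@} label labels
  # Overall limit: 253 characters in dotted notation.
  (( ${#name} <= 253 )) || return 1
  # Split on dots and check the per-label limit of 63 characters.
  IFS=. read -r -a labels <<< "$name"
  for label in "${labels[@]}"; do
    (( ${#label} <= 63 )) || return 1
  done
  return 0
}

within_dns_limits '@dom.ext' && echo 'within limits'
```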

Here's a quick test:

# Only the first name (empty label between the two dots) should print NO.
re='^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$'
for d in @subdom..dom.ext @dom.ext @subdom.dom.ext @subsubdom.subdom.dom.ext @subsub-dom.sub-dom.ext @x.org; do
  [[ $d =~ $re ]] && echo "YES $d" || echo "NO $d"
done

Support for Internationalized Domain Names (IDN) with literal Unicode characters - again, a complete solution requires more work:

A simple improvement to also match IDNs is to replace [a-zA-Z] with [[:alpha:]] and [a-zA-Z0-9] with [[:alnum:]] in the above regex; i.e.:

^@(([[:alpha:]](-?[[:alnum:]])*)\.)+[[:alpha:]]{2,}$
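To see what your own platform does with the character classes, a quick check along these lines may help. The function name is_idn_domain is hypothetical, and the result for the non-ASCII name depends on the active locale and platform, so no output is claimed for it here.

```shell
#!/usr/bin/env bash
# Locale-aware variant of the regex from the answer above.
re_idn='^@(([[:alpha:]](-?[[:alnum:]])*)\.)+[[:alpha:]]{2,}$'

# Hypothetical helper: succeeds if $1 matches the locale-aware variant.
is_idn_domain() {
  [[ $1 =~ $re_idn ]]
}

# The ASCII name should match everywhere; whether 'ü' counts as
# [[:alpha:]] depends on the locale (e.g., a UTF-8 locale vs. C).
for d in '@dom.ext' '@bücher.ch'; do
  if is_idn_domain "$d"; then echo "YES $d"; else echo "NO $d"; fi
done
```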

Caveats:

  • No attempt is made to recognize Punycode-encoded versions of IDNs, which use an ASCII-based encoding with prefix xn--, and which would require decoding afterwards.

  • As Patrick Mevzek points out, the above can yield both false negatives and false positives (using his examples):

    • False positive: an invalid Punycode-encoded name such as ab--whatever
    • False positive: invalid cross-language names; e.g., cαfe.fr, which uses a Greek letter in a French domain name - a rule that is impossible to enforce via a regex alone.
    • False negative: emoji-based names such as 💄.ws (xn--jr8h.ws)
    • False negative: பரிட்சை is a valid TLD in the IANA root zone today, but will not match [[:alpha:]]{2,}$
    • ... and many more

  • Not all Unix-like platforms fully support all Unicode letters when matching against [[:alpha:]] or [[:alnum:]]. For instance, using UTF-8-based locales, OS X 10.9.1 apparently only matches Latin diacritics (e.g., ü, á) and Cyrillic characters (in addition to ASCII), whereas Linux 3.2 laudably appears to cover all scripts, including Asian and Arabic ones.

  • I'm unclear on whether names in right-to-left writing scripts are properly matched.

  • For the sake of completeness: even though the regex above makes no attempt to enforce length limits, attempting to do so with IDNs would be much more complex, as the length limits apply to the ASCII encoding of the name (via Punycode), not the original.
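If you only need to let Punycode-encoded labels through syntactically, one possible extension of the ASCII regex is sketched below. This is an assumption-laden approximation of my own, not a validated approach: the variable re_alabel and the function matches_with_alabels are hypothetical names, the xn-- prefix is matched in lowercase only, the TLD rule is left unchanged, and nothing verifies that a label actually decodes to a valid IDN.

```shell
#!/usr/bin/env bash
# Each label is either an xn-- label (alphanumeric runs separated by single
# hyphens after the prefix) or a label as defined by the original regex.
re_alabel='^@((xn--[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*|[a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$'

# Hypothetical helper: succeeds if $1 matches the extended approximation.
matches_with_alabels() {
  [[ $1 =~ $re_alabel ]]
}

for d in '@xn--bcher-kva.ch' '@xn--jr8h.ws' '@ab--whatever.com'; do
  if matches_with_alabels "$d"; then echo "YES $d"; else echo "NO $d"; fi
done
```

The first two names should print YES and the last one NO, because a double hyphen is only accepted as part of the xn-- prefix.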

Tip of the hat to @Alfe for pointing out the problem with IDNs, and to @Arka for offering a simplified version of the regex to replace the lengthier one I had initially crafted under the mistaken assumption that single-letter domain names must be ruled out.

Solution 2

Use the following, which expects exactly three components (subdomain, domain, and TLD):

grep '@[[:alpha:]][[:alnum:]-]*\.[[:alpha:]][[:alnum:]-]*\.[[:alpha:]][[:alnum:]-]*$'
Author by

Arka

Updated on July 04, 2022

Comments

  • Arka
    Arka almost 2 years

    I'm trying to write a regex for grep that matches only valid domains.

    My version works pretty well but also matches the following invalid domain:

    @subdom..dom.ext
    

    Here is my regex :

    echo "@dom.ext" | grep "^@[[:alnum:]]\+[[:alnum:]\-\.]\+[[:alnum:]]\+\.[[:alpha:]]\+\$"
    

    I'm working with bash so I escaped special characters.

    Samples that should match:

    @subdom.dom.ext
    @subsubdom.subdom.dom.ext
    @subsub-dom.sub-dom.ext
    

    Thanks for your help.

  • Alfe
    Alfe over 10 years
    domains can start with numbers?
  • mklement0
    mklement0 over 10 years
    Also, you need to escape the last occurrence of the dot (.).
  • mklement0
    mklement0 over 10 years
    You assume 3 components (subdomains), but the OP also wants to match domains (2 components). Also, aren't TLDs composed of letters only (.com, .info, ...)?
  • Alfe
    Alfe over 10 years
    Yeah, right. But when going into such details we could also consider Unicode domains (things like www.müller.de); then [a-z] would not be enough either, and I fear that grep's [[:alnum:]] might also handle those umlauts incorrectly (depending on encodings etc.). So I guess we can leave it the way it is with your accepted answer; if that works for the OP, it should be enough. If however he wants to have a definitive answer, I think none of ours would be enough yet ;-)
  • mklement0
    mklement0 over 10 years
    Good points, thanks. I've updated my post and provided at least some answers and have also clarified limitations of my solution.
  • Patrick Mevzek
    Patrick Mevzek over 5 years
    @Alfe, yes, 3com.com is a valid domain name. That restriction on not starting with a number was removed long ago from DNS specifications.
  • Patrick Mevzek
    Patrick Mevzek over 5 years
    This has the same problem as @mklement0's answer for IDNs: it will have many false positives and false negatives.