Bash-based regex domain name validation

Solution 1

I find this to be a more comprehensive regex:

(?=^.{4,253}$)(^(?:[a-zA-Z0-9](?:(?:[a-zA-Z0-9\-]){0,61}[a-zA-Z0-9])?\.)+([a-zA-Z]{2,}|xn--[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])$)

  • RFC 1034§3: Allows a total length of 4-253 characters, so the shortest operational domain I'm aware of, "t.co", still matches where the other answers reject it. The maximum length is 255 bytes; subtracting the length octet for each label (TLD and "primary" subdomain) gives us 253: (?=^.{4,253}$)
    • RFC 3696§2: Single-letter TLDs are technically permitted, meaning the minimum length would be 3, but as there are currently no single-letter TLDs a minimum length of 4 is practical.
  • RFC 1034§3: Allows numbers in subdomains, which Conor Clafferty's apparently doesn't. It does so by not distinguishing other subdomains from "primary" subdomains (i.e. the domain you register), a distinction the DNS spec doesn't make either.
  • RFC 1034§3: Restricts individual labels to 63 characters, permitting hyphens in the middle while restricting the beginning and end to alphanumerics: (?:[a-zA-Z0-9](?:(?:[a-zA-Z0-9\-]){0,61}[a-zA-Z0-9])?\.)
  • Requires a two-letter or larger TLD, but may be punycoded ([a-zA-Z]{2,}|xn--[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])
    • RFC 3696§2: The DNS spec technically permits numerics in the TLD, as well as single-letter TLDs; however, there are currently no single-letter TLDs or TLDs with numbers, and all-numeric TLDs are not permitted, so this part of the regex has been simplified to [a-zA-Z]{2,}.

      --OR--

    • RFC 3490§5: an internationalized domain name ccTLD (IDN ccTLD) may be punycoded, as indicated by an "xn--" prefix, after which it may contain letters, numbers, or hyphens. This approximates to xn--[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9]

      Be aware that this pattern does not validate a punycode TLD! Invalid punycode will be tolerated, e.g. "xn--qqqq", because attempting to validate punycode against the appropriate encoding mechanisms is beyond the scope of a regular expression. While punycode itself technically permits an encoded string ending in a hyphen, RFC 3492§5 observes and respects the IDNA limitation that labels may not end in a hyphen.

EDIT 02/2021: Hat tip to user2241415 for pointing out that IDN ccTLDs did not match the previously-specified regex.
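
As a minimal sketch, the pattern above could be wrapped in a small shell function. The names is_valid_domain and DOMAIN_RE are made up for illustration, and the sketch assumes a grep with PCRE support (the -P option, as in GNU grep), since the lookahead is not available in plain POSIX regexes.

```shell
# Hypothetical helper; the function and variable names are assumptions,
# not part of the original answer. Requires a grep that supports -P.

# The Solution 1 pattern, single-quoted so the shell leaves it untouched.
DOMAIN_RE='(?=^.{4,253}$)(^(?:[a-zA-Z0-9](?:(?:[a-zA-Z0-9\-]){0,61}[a-zA-Z0-9])?\.)+([a-zA-Z]{2,}|xn--[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])$)'

# Exit status 0 when the single-line string in $1 matches the pattern,
# non-zero otherwise. -q suppresses output; -- ends option parsing.
is_valid_domain() {
    printf '%s\n' "$1" | grep -Pq -- "$DOMAIN_RE"
}
```

Under those assumptions, is_valid_domain t.co && echo yes should print yes, while the same call with test.-com should print nothing.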

Solution 2

You are missing a question mark in your regex:

(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)

You can test your regex here

You can do what you want with grep:

$ echo test.com | grep -P '(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)'
test.com
$ echo test | grep -P '(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)'
$
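
To try several names at once, the same grep call could be placed in a loop. This is only a sketch: check_domains and SOLUTION2_RE are hypothetical names, and a grep that supports -P is assumed.

```shell
# Hypothetical helper; names are assumptions made for this illustration.
# The pattern is the one from the grep commands above.
SOLUTION2_RE='(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)'

# Reads one candidate per line from stdin and prints a verdict for each.
check_domains() {
    while IFS= read -r name; do
        if printf '%s\n' "$name" | grep -Pq -- "$SOLUTION2_RE"; then
            printf 'accept %s\n' "$name"
        else
            printf 'reject %s\n' "$name"
        fi
    done
}
```

It can be fed from a pipe or a file, for example: printf 'test.com\ntest\n' | check_domains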

Solution 3

No sed implementation I am aware of supports the various Perl extensions you are using in that regex. Try with Perl or grep -P or pcregrep, or simplify the regex to something sed can cope with. Here is a quick and dirty adaptation which splits the regex into a script of three different regexes, and rejects when something fails to match (or matches, in the middlemost case).

echo 'test' | sed -r '/^.{5,254}$/!d
    /^([^.]*\.)*[0-9]+\./d   # Seems incorrect; 112.com is valid
    /^([a-zA-Z0-9_\-]{1,63}\.?)+([a-zA-Z]{2,})$/!d'  # should disallow underscore
    # also, what's with the question mark after the literal dot?

This also completely fails to accept IDNA domains (which can contain dashes and numbers in the TLD, among other things) so I would definitely not recommend this, but hopefully it shows you how to adapt something like this to sed if you wish to.
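
The same splitting idea might also be applied to the stricter rules from Solution 1, which need no Perl-only features once the length lookahead is moved into its own expression. The sketch below is an illustration under that assumption, not the script from this answer; sed_filter_domains is a hypothetical name, and -E is assumed to be available (GNU and BSD sed both accept it).

```shell
# Hypothetical helper; the name is an assumption made for this illustration.
# Reads candidates from stdin, one per line, and prints only those that pass.
sed_filter_domains() {
    # First expression: drop lines whose total length is outside 4..253.
    # Second expression: drop lines that do not match the Solution 1 label
    # and TLD rules, rewritten as an extended regex (no lookahead, no (?:...)).
    sed -E \
        -e '/^.{4,253}$/!d' \
        -e '/^([a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+([a-zA-Z]{2,}|xn--[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9])$/!d'
}
```

Lines that survive both expressions are printed by sed's default behaviour, so no p command is needed.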

Author by

Peter

Updated on August 06, 2022

Comments

  • Peter
    Peter over 1 year

    I want to create a script that will add new domains to our DNS servers. I found the "Fully qualified domain name validation" regex. However, when I use it with sed, it does not work as I would expect:

    echo test | sed  '/(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(:[a-zA-Z]{2,})$)/p'  
    --------
    Output is: 
    test
    echo test.com | sed  '/(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(:[a-zA-Z]{2,})$)/p'  
    --------
    Output is: 
    test.com
    

    I expected the output of the first command to be a blank line. What am I doing wrong?

  • Doktor J
    Doktor J over 9 years
    shouldn't the beginning actually be (?=^.{4,254}$) ? "t.co" is a valid domain (and currently in use!), and is only 4 characters long...
  • tripleee
    tripleee about 9 years
    The "fix" is incorrect. The spurious dot now allows for two consecutive dots before the TLD. A better fix would be to remove the question mark after the literal dot which was already there (but it's technically incorrect; e.g. dk alone is a valid domain name).
  • Seth Holladay
    Seth Holladay about 9 years
    Thank you for being so precise, explaining yourself, and citing sources. Helps a lot in making a quick, informed choice.
  • Dennis
    Dennis almost 8 years
    If I test test.-com, it passes. That's not valid, right?
  • Hudson Santos
    Hudson Santos almost 7 years
    Worked like a charm for me. I've also coded a bash function called isdom, so I can call it with 'isdom string' and it responds yes/no based on this regexp.
  • Hudson Santos
    Hudson Santos almost 7 years
    Didn't work for me. Try it yourself, for example: echo fireb | grep -P '(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)'. It will return: fireb. But it is not a domain name. Another example: echo berif_novp | grep -P '(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)'. Returns: berif_novp, but this is not a domain either. Even on rubular.com it matches strings that are not domains.
  • sgohl
    sgohl almost 6 years
    why no work? echo example.com|grep -P '(?=^.{4,253}$)(^(?:[a-zA-Z](?:(?:[a-zA-Z0-9\-]){,61}[a-zA-Z])?\.)+[a-zA-Z]{2,}$)'
  • Doktor J
    Doktor J over 5 years
    @roothahn please see my edit. Apparently some interpretations of PCRE (heh) don't like implicit lower bounds ({,61}) so I added an explicit lower bound ({0,61}) and it plays much nicer with grep: echo example.com|grep -P '(?=^.{4,253}$)(^(?:[a-zA-Z](?:(?:[a-zA-Z0-9\-]){0,61}[a-zA-Z])?\.)+[a-zA-Z]{2,}$)'
  • Doktor J
    Doktor J over 5 years
    The problem with this regex is that it violates certain rules about domains: 1. domains cannot have underscores; 2. labels may not start or end with a hyphen (first and last characters of each label must be alphanumeric); 3. labels can be entirely numeric (except TLD... maybe), so (?!\d+\.) is inappropriate; 4. the ? quantifier on \. in the main grouping is incorrect, as it allows domains with no periods
  • user2241415
    user2241415 about 3 years
    It doesn't seem to validate new TLDs such as test.xn--kpu716f (per swcs.com.au/tld.htm)
  • Doktor J
    Doktor J about 3 years
    @user2241415 edited in an update that matches IDN ccTLDs!
  • vjwilson
    vjwilson almost 3 years
    hey, this worked for me. Could you explain this regex for me to understand clearly?