grep valid domain regex
Solution 1
A truly complete solution requires more work, but here's an approximation that may work well enough (note that a @
prefix is assumed and the input string is expected to start with it):
^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$
You can use this with egrep (or grep -E), but also with [[ ... =~ ... ]], bash's regex-matching operator.
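As a minimal sketch of the grep -E usage, the following filters a list of candidate names, one per line, and keeps only those that match. The variable names and the sample input are assumptions made for illustration.

```shell
# Domain regex from above, single-quoted so the shell leaves it untouched.
re='^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$'

# Hypothetical sample input, one candidate per line.
matches=$(printf '%s\n' '@dom.ext' '@subdom..dom.ext' '@sub-dom.dom.ext' | grep -E "$re")

# Print whatever survived the filter.
printf '%s\n' "$matches"
```

Keeping the regex in a single-quoted variable avoids having to backslash-escape it for the shell.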
It makes the following assumptions, which do not exactly mirror actual DNS name constraints (some are more permissive, others more restrictive):
Only ASCII (non-foreign) letters are allowed - see below for Internationalized Domain Name (IDN) considerations; also, the Punycode (ASCII-compatible) forms of IDNs - e.g., xn--bcher-kva.ch for bücher.ch - are not matched - see below.
There's no limit on the number of nested subdomains.
There's no limit on the length of any label (name component), and no limit on the overall length of the name (for actual limits, see here).
The TLD (last component) is composed of letters only and has a length of at least 2.
Both subdomain and domain names must start with a letter; subdomains are allowed to be single-letter.
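If the missing length limits matter, they could be layered on top of the regex. The sketch below assumes the commonly cited DNS limits of 63 characters per label and 253 characters for the whole name in text form; the function name check_domain_lengths is hypothetical, and the check is only meaningful for plain ASCII names.

```shell
# Hypothetical helper (name is illustrative): succeeds only if the name passes
# the regex above AND respects the assumed limits of 63 characters per label
# and 253 characters overall. Intended for ASCII names only.
check_domain_lengths() {
  re='^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$'
  printf '%s\n' "$1" | grep -Eq "$re" || return 1
  name=${1#@}                      # drop the assumed @ prefix
  [ "${#name}" -le 253 ] || return 1
  # A run of 64 or more non-dot characters means some label is too long.
  if printf '%s\n' "$name" | grep -Eq '[^.]{64,}'; then
    return 1
  fi
  return 0
}

check_domain_lengths '@sub.dom.ext' && echo 'within limits'
```

Doing the length checks outside the regex keeps the pattern itself readable.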
Here's a quick test:
for d in @subdom..dom.ext @dom.ext @subdom.dom.ext @subsubdom.subdom.dom.ext @subsub-dom.sub-dom.ext @x.org; do
[[ $d =~ \
^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$ \
]] && echo YES || echo NO
done
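A variation on the test above, as a sketch: storing the regex in a variable sidesteps the line continuations and any shell-quoting surprises. This requires bash, and the function name is_domain is hypothetical.

```shell
# Hypothetical bash helper wrapping the regex; the names are illustrative only.
is_domain() {
  local re='^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$'
  # $re must stay unquoted on the right of =~ so bash treats it as a regex.
  [[ $1 =~ $re ]]
}

for d in @subdom..dom.ext @dom.ext @x.org; do
  if is_domain "$d"; then echo "YES $d"; else echo "NO $d"; fi
done
```

Quoting "$re" on the right-hand side of =~ would turn it into a literal string comparison, which is why it is left unquoted here.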
Support for Internationalized Domain Names (IDN) with literal Unicode characters - again, a complete solution requires more work:
A simple improvement to also match IDNs is to replace [a-zA-Z] with [[:alpha:]] and [a-zA-Z0-9] with [[:alnum:]] in the above regex; i.e.:
^@(([[:alpha:]](-?[[:alnum:]])*)\.)+[[:alpha:]]{2,}$
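Because the character classes are locale-dependent, it may be worth probing what a given platform actually matches. The following is only a sketch: the variable name re_idn and the sample names are assumptions, and the result for the non-ASCII name will vary with the active locale and the grep implementation.

```shell
# IDN-oriented variant of the regex, using locale-aware character classes.
re_idn='^@(([[:alpha:]](-?[[:alnum:]])*)\.)+[[:alpha:]]{2,}$'

# Show which locale settings are in effect, since they influence the outcome.
echo "LANG=${LANG:-unset} LC_ALL=${LC_ALL:-unset}"

# Hypothetical sample names; the second one contains a non-ASCII letter.
for d in '@dom.ext' '@bücher.ch'; do
  if printf '%s\n' "$d" | grep -Eq "$re_idn"; then
    echo "match: $d"
  else
    echo "no match: $d"
  fi
done
```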
Caveats:
No attempt is made to recognize Punycode-encoded versions of IDNs, which use an ASCII-based encoding with prefix xn--, and which would require decoding afterwards.
As Patrick Mevzek points out, the above can yield both false negatives and false positives (using his examples):
- False positive: an invalid Punycode-encoded name such as ab--whatever
- False positive: invalid cross-language names; e.g., cαfe.fr, which uses a Greek letter in a French domain name - a rule that is impossible to enforce via a regex alone.
- False negative: emoji-based names such as 💄.ws (xn--jr8h.ws)
- False negative: பரிட்சை is a valid TLD in the IANA root today, but will not match [[:alpha:]]{2,}$
- ... and many more
Not all Unix-like platforms fully support all Unicode letters when matching against [[:alpha:]] or [[:alnum:]]. For instance, using UTF-8-based locales, OS X 10.9.1 apparently only matches Latin diacritics (e.g., ü, á) and Cyrillic characters (in addition to ASCII), whereas Linux 3.2 laudably appears to cover all scripts, including Asian and Arabic ones.
I'm unclear on whether names in right-to-left writing scripts are properly matched.
For the sake of completeness: even though the regex above makes no attempt to enforce length limits, attempting to do so with IDNs would be much more complex, as the length limits apply to the ASCII encoding of the name (via Punycode), not the original.
Tip of the hat to @Alfe for pointing out the problem with IDNs, and to @Arka for offering a simplified version of the regex to replace the lengthier one I had initially crafted under the mistaken assumption that single-letter domain names must be ruled out.
Solution 2
Use
grep '@[[:alpha:]][[:alnum:]\-]*\.[[:alpha:]][[:alnum:]\-]*\.[[:alpha:]][[:alnum:]\-]*$'
Arka
Updated on July 04, 2022
Comments
-
Arka almost 2 years
I'm trying to make a regex for grep that matches only valid domains.
My version works pretty well but matches the following invalid domain:
@subdom..dom.ext
Here is my regex :
echo "@dom.ext" | grep "^@[[:alnum:]]\+[[:alnum:]\-\.]\+[[:alnum:]]\+\.[[:alpha:]]\+\$"
I'm working with bash so I escaped special characters.
Sample that should match :
@subdom.dom.ext @subsubdom.subdom.dom.ext @subsub-dom.sub-dom.ext
Thanks for help
-
Alfe over 10 years
Domains can start with numbers?
-
mklement0 over 10 years
Also, you need to escape the last occurrence of . (the literal dot).
-
mklement0 over 10 years
You assume 3 components (subdomains), but the OP also wants to match domains (2 components). Also, aren't TLDs composed of letters only (.com, .info, ...)?
-
Alfe over 10 years
Yeah, right. But when going into such details we also could consider the unicode domains (things like www.müller.de), then [a-z] also would not be enough and I fear that the [[:alnum:]] of grep would maybe also handle those umlauts incorrectly (depending on codecs etc.). So I guess we can leave it the way it is with your accepted answer; if that works for the OP, it should be enough. If however he wants to have a definitive answer, I think none of ours would be enough yet ;-)
-
mklement0 over 10 years
Good points, thanks. I've updated my post and provided at least some answers and have also clarified limitations of my solution.
-
Patrick Mevzek over 5 years
@Alfe, yes, 3com.com is a valid domain name. That restriction on not starting with a number was removed long ago from DNS specifications.
-
Patrick Mevzek over 5 years
This has the same problem as @mklement0 for IDNs, it will have many false positives and false negatives.