grep to extract lines that contain full domain names from a file
Solution 1
Your question is ambiguous. If your definition of a domain covers only items like the ones you mentioned, you could find them with:
grep -P "^.[^.]+\.[a-zA-Z]{3}$|^.[^.]+\.[a-zA-Z]{2}\.[a-zA-Z]{2}$" FileName
- grep -P: use Perl-compatible regular expressions
- ^.[^.]+: start of line, then any one character followed by one or more characters that are not a dot
- \.[a-zA-Z]{3}$: followed by a dot and exactly 3 letters at the end of the line
- |: OR
- ^.[^.]+: as above
- \.[a-zA-Z]{2}\.[a-zA-Z]{2}$: followed by a dot and 2 letters, twice, at the end of the line
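To see which lines the pattern above accepts, it can be tried out in Python's re module. This is only a sketch: the names PATTERN and filter_domains and the sample list are made up for illustration, and the sample assumes one domain per line, as in the question.

```python
import re

# Same pattern as the grep -P command above, split in two for readability.
PATTERN = re.compile(
    r"^.[^.]+\.[a-zA-Z]{3}$"                # name plus a 3-letter ending
    r"|^.[^.]+\.[a-zA-Z]{2}\.[a-zA-Z]{2}$"  # name plus two 2-letter parts
)

def filter_domains(lines):
    """Return the lines accepted by the pattern, in input order."""
    return [line for line in lines if PATTERN.search(line)]

# Sample input taken from the question, one domain per line.
sample = ["domain.com", "sub.domain.com", "sub.domain.co.uk", "domain.co.uk"]
print(filter_domains(sample))
```

Note that the pattern only knows the two shapes it spells out, so names such as domain.uk or domain.ltd.uk are rejected as well.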
Solution 2
Given the way TLDs and FLDs get doled out by registrars, this is a non-trivial problem that I don't think you'll be able to tackle with simple regexes and CLI tools.
I'd lean on something like the Python module tld. This module has both a get_tld and a get_fld function. The second one returns the first-level domain, which is what you're looking for.
Example
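$ cat fld.py
#!/bin/python
from tld import get_fld

fldList = []
# Read one domain per line from domlist.txt
domList = open("domlist.txt").read().splitlines()

for dom in domList:
    # fix_protocol=True lets get_fld accept names without a leading protocol
    fldList.append(get_fld(dom, fix_protocol=True))

# Print each first-level domain once, sorted
print("\n".join(sorted(set(fldList))))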
Sample run:
$ ./fld.py
domain.co.uk
domain.com
NOTE: The list of domains is in a file called domlist.txt.
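To illustrate why a suffix list helps where a fixed regex does not, here is a minimal standard-library sketch of the lookup idea. It is not how the tld module is implemented. KNOWN_SUFFIXES is a tiny hand-picked set used only for illustration (a real list has thousands of entries), and first_level_domain and main_domains are hypothetical helper names.

```python
# Illustrative subset only; a real public suffix list is far larger.
KNOWN_SUFFIXES = {"com", "net", "uk", "co.uk", "ltd.uk"}

def first_level_domain(host):
    """Return the suffix plus one label, or None if no suffix is known."""
    labels = host.lower().strip(".").split(".")
    # Walk from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in KNOWN_SUFFIXES:
            if i == 0:
                return None  # the host is itself a bare suffix
            return ".".join(labels[i - 1:])
    return None

def main_domains(lines):
    """Keep only the lines that are already first-level domains."""
    return [h for h in lines if first_level_domain(h) == h]

sample = ["domain.com", "sub.domain.com", "sub.domain.co.uk", "domain.co.uk"]
print(main_domains(sample))
print(first_level_domain("domain.ltd.uk"), first_level_domain("subdomain.bbc.uk"))
```

Because the longest known suffix wins, domain.ltd.uk is kept whole while subdomain.bbc.uk is reduced to bbc.uk, a distinction that a pattern based only on label lengths cannot make.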
user9371654
Updated on September 18, 2022
Comments
-
user9371654 over 1 year
I have a large file that contains domain names in the form of:
domain.com
sub.domain.com
sub.domain.co.uk
domain.co.uk
I want to extract the main domain names (no subdomains), ending either in a top-level domain (e.g. .com) or in a country-code top-level domain.
The top-level domain is always 2-3 letters (e.g. .com, .net, .gov).
The country-code top-level domain is always 2 letters (e.g. .uk, .us) and comes at the end of the line.
So if the above list is the input, the output should be:
domain.com
domain.co.uk
I tried this expression:
grep -P '^[^\.]+\.[a-zA-Z]{2,3}\.[a-zA-Z]{2}$'
This is my interpretation:
-P: Perl regex
^: beginning of line
[^\.]: any character except a dot
+: one or more times
\.: dot
[a-zA-Z]{2,3}: two or three alphabetical characters (e.g., .com, .co)
[a-zA-Z]{2}$: two alphabetical characters at the end of the line
My question: the output I get always contains:
domain.co.uk
but not:
domain.com
How can I make my regex extract domain names with or without a country-code top-level domain, like domain.com and domain.co.uk, but without subdomains like sub.domain.co.uk or sub.domain.com?
-
Michael Homer almost 6 years
How are you going to distinguish "domain.ltd.uk" (first-level) and "subdomain.bbc.uk" (second-level)?