grep to extract lines that contain full domain names from a file

grep regular-expression string search

10,728

Solution 1

Your question is ambiguous. If your definition of a domain covers only items like the ones you mentioned, you could find them with:

grep -P "^.[^.]+\.[a-zA-Z]{3}$|^.[^.]+\.[a-zA-Z]{2}\.[a-zA-Z]{2}$" FileName

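grep -P uses Perl-compatible regular expressions
^.[^.]+ matches any single character at the start of the line, followed by one or more characters that are not a dot
\.[a-zA-Z]{3}$ matches a dot followed by exactly three letters at the end of the line
| means OR
^.[^.]+ is the same as above
\.[a-zA-Z]{2}\.[a-zA-Z]{2}$ matches a dot followed by two letters, twice, at the end of the line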
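
To check the expression without running grep, here is a minimal Python sketch that applies the same pattern to the four sample names from the question. It assumes Python's re module handles this particular pattern the same way grep -P does (the pattern uses no PCRE-only features); the names PATTERN, SAMPLE and matches are made up for illustration.

```python
import re

# Same expression as in the grep command above.
PATTERN = re.compile(
    r"^.[^.]+\.[a-zA-Z]{3}$|^.[^.]+\.[a-zA-Z]{2}\.[a-zA-Z]{2}$"
)

# Sample names taken from the question.
SAMPLE = ["domain.com", "sub.domain.com", "sub.domain.co.uk", "domain.co.uk"]

# Keep only the names the pattern accepts.
matches = [name for name in SAMPLE if PATTERN.match(name)]
print("\n".join(matches))
```

On the sample this keeps domain.com and domain.co.uk. Note that the first branch only accepts three-letter endings, so a name such as domain.info or domain.de is not matched.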

Solution 2

Given the way TLDs and FLDs get doled out by registrars, this is a non-trivial problem that I don't think you'll be able to tackle with simple regexes and CLI tools.

I'd lean on something like the Python module tld. It provides both a get_tld and a get_fld function. The second one returns the first-level domain, which is what you're looking for.

Example

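$ cat fld.py
#!/usr/bin/env python3

from tld import get_fld

# Read one domain per line, skipping blank lines.
with open("domlist.txt") as f:
    domList = [line.strip() for line in f if line.strip()]

# fix_protocol=True lets get_fld accept bare names such as "sub.domain.com".
fldList = [get_fld(dom, fix_protocol=True) for dom in domList]

# Print each first-level domain once, in sorted order.
print("\n".join(sorted(set(fldList))))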

Sample run:

$ ./fld.py
domain.co.uk
domain.com

NOTE: The list of domains is in a file called domlist.txt.


user9371654

Updated on September 18, 2022

Comments

user9371654 over 1 year
I have a large file that contains domain names in the form of:
```
domain.com
sub.domain.com
sub.domain.co.uk
domain.co.uk
```
I want to extract the main domain names (no subdomains), ending either in a top-level domain (e.g. .com) or in a top-level domain followed by a country-code top-level domain.

The top-level domain is always 2-3 letters long (e.g. .com, .net, .gov).

The country-code top-level domain is always 2 letters long (e.g. .uk, .us) and comes at the end of the line.

So if the above list is the input, the output should be:
```
domain.com
domain.co.uk
```
I tried this expression:
```
grep -P '^[^\.]+\.[a-zA-Z]{2,3}\.[a-zA-Z]{2}$'
```
This is my interpretation:
- -P: Perl regex
- ^: beginning of line
- [^\.]: any character except a dot
- +: one or more times
- \.: a literal dot
- [a-zA-Z]{2,3}: two or three alphabetical characters (e.g., .com, .co)
- [a-zA-Z]{2}$: two alphabetical characters at the end of the line

My question: the output I get only ever contains:
```
domain.co.uk
```
but not domain.com.

How can I make my regex extract domain names with or without a country-code top-level domain, like domain.com and domain.co.uk, but without subdomains like sub.domain.co.uk or sub.domain.com?
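
A minimal Python sketch of why the attempt misses domain.com: the trailing two-letter part is mandatory in the expression, so a name without it can never match. The variant below, which wraps that part in an optional group, is only one possible adjustment and is an assumption of this sketch, not something taken from the answers; the names ATTEMPT, VARIANT and SAMPLE are made up for illustration.

```python
import re

# The attempt from the question: the trailing two-letter part is mandatory.
ATTEMPT = re.compile(r"^[^.]+\.[a-zA-Z]{2,3}\.[a-zA-Z]{2}$")

# Hypothetical variant: the country-code part sits in an optional group.
VARIANT = re.compile(r"^[^.]+\.[a-zA-Z]{2,3}(\.[a-zA-Z]{2})?$")

SAMPLE = ["domain.com", "sub.domain.com", "sub.domain.co.uk", "domain.co.uk"]

# Names accepted by each expression, in input order.
attempt_hits = [n for n in SAMPLE if ATTEMPT.match(n)]
variant_hits = [n for n in SAMPLE if VARIANT.match(n)]

print(attempt_hits)
print(variant_hits)
```

The attempt keeps only domain.co.uk, while the variant keeps domain.com and domain.co.uk. The variant is still a guess based on label lengths: it would also accept a subdomain such as mail.ab.de, because ab.de has the same shape as co.uk.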
- Michael Homer almost 6 years
  
  How are you going to distinguish "domain.ltd.uk" (first-level) and "subdomain.bbc.uk" (second-level)?
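
The distinction raised in this comment cannot be made from the shape of the name alone; it needs a list of known suffixes. The sketch below illustrates that idea with a tiny hand-picked suffix set, which is an assumption for illustration and not the real Public Suffix List; the function name registered_domain is likewise hypothetical.

```python
# Hand-picked suffixes, assumed for illustration only; a real tool would
# load the full Public Suffix List instead.
SUFFIXES = {"com", "uk", "co.uk", "ltd.uk"}


def registered_domain(name):
    """Return the longest known suffix plus one more label, or None."""
    labels = name.lower().split(".")
    # Walk from the longest candidate suffix to the shortest.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in SUFFIXES:
            return ".".join(labels[i - 1:])
    return None


for name in ["domain.ltd.uk", "subdomain.bbc.uk", "sub.domain.co.uk"]:
    print(name, "->", registered_domain(name))
```

With this suffix set, domain.ltd.uk is returned unchanged, while subdomain.bbc.uk is reduced to bbc.uk and sub.domain.co.uk to domain.co.uk.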