grep to extract lines that contain full domain names from a file


Solution 1

Your question is ambiguous. If your definition of a domain covers only items like the ones you listed, you could find them with:

grep -P "^.[^.]+\.[a-zA-Z]{3}$|^.[^.]+\.[a-zA-Z]{2}\.[a-zA-Z]{2}$" FileName
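  • grep -P uses Perl-compatible regular expressions
  • ^.[^.]+ matches, from the start of the line, any one character followed by one or more characters that are not a dot
  • \.[a-zA-Z]{3}$ matches a dot followed by exactly 3 letters at the end of the line
  • | means OR
  • ^.[^.]+ as above
  • \.[a-zA-Z]{2}\.[a-zA-Z]{2}$ matches a dot followed by 2 letters, twice, at the end of the line (e.g. .co.uk)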
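
To see what the pattern accepts without running grep, here is a minimal Python sketch that applies the same expression with the standard `re` module (for this pattern Python's syntax behaves the same as `grep -P`). The helper name `matching_lines` and the extra `domain.io` input are illustrative assumptions, not part of the original answer.

```python
import re

# Solution 1's pattern, copied unchanged from the grep command.
PATTERN = re.compile(
    r"^.[^.]+\.[a-zA-Z]{3}$|^.[^.]+\.[a-zA-Z]{2}\.[a-zA-Z]{2}$"
)


def matching_lines(lines):
    """Return the lines the pattern accepts, in input order."""
    return [line for line in lines if PATTERN.search(line)]


sample = [
    "domain.com",
    "sub.domain.com",
    "sub.domain.co.uk",
    "domain.co.uk",
    "domain.io",  # hypothetical extra input with a two-letter TLD
]

print(matching_lines(sample))  # ['domain.com', 'domain.co.uk']
```

Note that `domain.io` is rejected: as written, the first alternative requires exactly three letters after the dot, so a name with a single two-letter TLD is not matched.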

Solution 2

Given the way TLDs and FLDs get doled out by registrars, this is a non-trivial problem that I don't think you'll be able to tackle with simple regexes and CLI tools.

I'd lean on something like the Python module tld. It provides both a get_tld and a get_fld function. The second one returns the first-level domain, which is what you're looking for.

Example

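$ cat fld.py
#!/usr/bin/env python3

from tld import get_fld

# Read one domain per line, skipping blank lines.
with open("domlist.txt") as f:
    domList = [line.strip() for line in f if line.strip()]

# fix_protocol=True lets get_fld accept bare names such as "sub.domain.com".
fldList = [get_fld(dom, fix_protocol=True) for dom in domList]

# Print each first-level domain once, in sorted order.
print("\n".join(sorted(set(fldList))))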

Sample run:

$ ./fld.py
domain.co.uk
domain.com

NOTE: The list of domains is in a file called domlist.txt.
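
To illustrate why a suffix list helps, here is a rough standard-library sketch of the idea: find the longest known suffix and keep exactly one label in front of it. The suffix set below is a toy stand-in chosen for this example (libraries such as tld work from a much larger, maintained list), and the function names `first_level_domain` and `main_domains` are hypothetical.

```python
# Toy stand-in for a real suffix list; these entries are assumptions
# chosen for illustration and are far from complete.
TOY_SUFFIXES = {"com", "net", "uk", "co.uk", "ltd.uk"}


def first_level_domain(host, suffixes=TOY_SUFFIXES):
    """Return the longest known suffix plus one label, or None."""
    labels = host.lower().strip(".").split(".")
    # Try the longest candidate suffix first, always leaving at least
    # one label in front of it.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[i - 1:])
    return None


def main_domains(lines, suffixes=TOY_SUFFIXES):
    """Keep only the lines that are themselves first-level domains."""
    return [line for line in lines
            if first_level_domain(line, suffixes) == line]


sample = ["domain.com", "sub.domain.com", "sub.domain.co.uk", "domain.co.uk"]
print(main_domains(sample))  # ['domain.com', 'domain.co.uk']
```

Unlike the script above, which prints the first-level domain of every line (so a subdomain still contributes its parent), `main_domains` only keeps lines that are already first-level domains. Which behaviour is wanted depends on how the question is read.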


Author: user9371654

Updated on September 18, 2022

Comments

  • user9371654 over 1 year

    I have a large file that contains domain names in the form of:

    domain.com
    sub.domain.com
    sub.domain.co.uk
    domain.co.uk
    

    I want to extract the main domain names (no subdomains), ending either in a top-level domain (e.g. .com) or in a country-code top-level domain.

    The top-level domain is always 2 to 3 letters long (e.g. .com, .net, .gov).

    The country-code top-level domain is always 2 letters long (e.g. .uk, .us) and comes at the end of the line.

    So if the above list is the input, the output should be:

    domain.com
    domain.co.uk
    

    I tried this expression:

    grep -P '^[^\.]+\.[a-zA-Z]{2,3}\.[a-zA-Z]{2}$'
    

    This is my interpretation:

      • -P: Perl regex
      • ^: beginning of line
      • [^\.]+: any character except a dot, one or more times
      • \.: a literal dot
      • [a-zA-Z]{2,3}: two or three alphabetical characters (e.g. com, co)
      • [a-zA-Z]{2}$: two alphabetical characters at the end of the line

    My question: the output I get only ever contains:

    domain.co.uk
    

    but never domain.com.

    How can I make my regex extract domain names with or without a country-code top-level domain, like domain.com and domain.co.uk, but without subdomains like sub.domain.co.uk or sub.domain.com?

    • Michael Homer almost 6 years
      How are you going to distinguish "domain.ltd.uk" (first-level) and "subdomain.bbc.uk" (second-level)?
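
The expression in the question misses domain.com because its final `\.[a-zA-Z]{2}` group is mandatory. The sketch below compares it with a variant where that group is optional. The variant is an assumption about what was intended, not a complete solution: as the last lines show, it accepts both names from the comment above, so a pattern based on label lengths alone cannot tell them apart.

```python
import re

# The expression from the question: the final two-letter group is mandatory.
ORIGINAL = re.compile(r"^[^\.]+\.[a-zA-Z]{2,3}\.[a-zA-Z]{2}$")

# Variant with that group made optional (an assumption about the intent).
RELAXED = re.compile(r"^[^.]+\.[a-zA-Z]{2,3}(\.[a-zA-Z]{2})?$")


def matches(pattern, lines):
    """Return the lines accepted by the given compiled pattern."""
    return [line for line in lines if pattern.search(line)]


sample = ["domain.com", "sub.domain.com", "sub.domain.co.uk", "domain.co.uk"]
print(matches(ORIGINAL, sample))  # ['domain.co.uk']
print(matches(RELAXED, sample))   # ['domain.com', 'domain.co.uk']

# Both names from the comment pass, although only one is first-level.
ambiguous = ["domain.ltd.uk", "subdomain.bbc.uk"]
print(matches(RELAXED, ambiguous))  # ['domain.ltd.uk', 'subdomain.bbc.uk']
```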