JavaScript/Regex for finding just the root domain name without subdomains

Solution 1

You can't do this with a regular expression because you don't know how many blocks are in the suffix.

For example google.com has a suffix of com. To get from subdomain.google.com to google.com you'd have to take the last two blocks - one for the suffix and one for google.

If you apply this logic to subdomain.google.co.uk though you would end up with co.uk.

You will actually need to look up the suffix from a list like http://publicsuffix.org/
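As a rough sketch of the lookup idea: the snippet below walks the labels from left to right and returns the first (longest) suffix it recognises plus one extra label. The `SUFFIXES` set is a tiny hand-picked subset and `rootDomain` is a hypothetical helper name, both assumptions for illustration. A real implementation would load the full Public Suffix List and also handle its wildcard and exception rules, which this ignores.

```javascript
// Tiny illustrative subset of public suffixes (an assumption for this sketch;
// a real implementation would load the full list from publicsuffix.org).
const SUFFIXES = new Set(['com', 'net', 'org', 'uk', 'co.uk', 'org.uk']);

// Hypothetical helper: returns the matched suffix plus one label,
// or null if no known suffix matches or the input is itself a suffix.
function rootDomain(hostname) {
  const labels = hostname.toLowerCase().split('.');
  // Longest candidate first, so "co.uk" is found before "uk".
  for (let i = 0; i < labels.length; i++) {
    const suffix = labels.slice(i).join('.');
    if (SUFFIXES.has(suffix)) {
      // i === 0 means the whole hostname is a suffix: there is no root domain.
      return i === 0 ? null : labels.slice(i - 1).join('.');
    }
  }
  return null;
}

console.log(rootDomain('subdomain.google.com'));   // google.com
console.log(rootDomain('subdomain.google.co.uk')); // google.co.uk
```

Checking the longest candidate first is what avoids the co.uk problem described above.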

Solution 2

Don't use regex, use the .split() method and work from there.

var s = domain.split('.');

If your use case is fairly narrow you could then check the TLDs as needed, and then return the last 2 or 3 segments as appropriate:

return s.slice(-2).join('.');

It'll make your eyes bleed less than any regex solution.
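Putting those pieces together, a minimal sketch might look like the following. The `TWO_PART_TLDS` list and the `getRootDomain` name are assumptions for illustration, and the input is assumed to be a host name with an optional path but no scheme such as `http://`.

```javascript
// Two-part suffixes this sketch knows about (an assumed, deliberately short list).
const TWO_PART_TLDS = ['co.uk', 'org.uk', 'com.au'];

// Hypothetical helper; expects "host" or "host/path" without a scheme.
function getRootDomain(domain) {
  // Drop any path, e.g. "sub.domain.google.com/no/thanks".
  const host = domain.split('/')[0];
  const s = host.split('.');
  // Keep three segments when the last two form a known two-part suffix.
  const keep = TWO_PART_TLDS.includes(s.slice(-2).join('.')) ? 3 : 2;
  return s.slice(-keep).join('.');
}

console.log(getRootDomain('sub.domains.are.cool.google.com')); // google.com
console.log(getRootDomain('sub.domain.google.com/no/thanks')); // google.com
console.log(getRootDomain('sub.domain.google.co.uk'));         // google.co.uk
```

Any two-part suffix missing from the list falls back to the last two segments, so this only suits a narrow, known set of domains.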

Solution 3

I've not done a lot of testing on this, but if I understand what you're asking for, this should be a decent starting point...

([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b

EDIT:

To clarify, it's looking for:

one or more alpha-numeric characters or dashes, followed by a literal dot

and then one of three things...

  1. three or more alpha characters (e.g. com/net/mil/coop)
  2. two alpha characters, followed by a literal dot, followed by two more alphas (e.g. co.uk)
  3. two alpha characters (e.g. us/uk/to)

and at the end of that, a word boundary (\b), which matches at the end of the string or just before a space or other non-word character (in regex, word characters are typically alphanumerics and underscore).

As I say, I didn't do much testing, but it seemed a reasonable jumping off point. You'd likely need to try it and tune it some, and even then, it's unlikely that you'll get 100% for all test cases. There are considerations like Unicode domain names and all sorts of technically-valid-but-you'll-likely-not-encounter-in-the-wild things that'll trip up a simple regex like this, but this'll probably get you 90%+ of the way there.
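For anyone wondering how to wire this up in JavaScript, here is one possible sketch. One caveat: a dot is itself a non-word character, so \b is satisfied right after the first two labels and the pattern on its own stops at www.google for www.google.com. The version below swaps \b for a lookahead that requires a slash or the end of the string. That lookahead and the `extractRootDomain` name are additions for this sketch, not part of the pattern above.

```javascript
// Same alternatives as the pattern above, but ending with a lookahead
// (an addition in this sketch) so the match must finish at "/" or end of input.
const ROOT_DOMAIN =
  /([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))(?=\/|$)/;

// Hypothetical helper: returns the matched root domain, or null if nothing matches.
function extractRootDomain(input) {
  const match = ROOT_DOMAIN.exec(input);
  return match ? match[1] : null;
}

console.log(extractRootDomain('www.google.com'));                  // google.com
console.log(extractRootDomain('sub.domain.google.com/no/thanks')); // google.com
console.log(extractRootDomain('sub.domain.google.co.uk'));         // google.co.uk
```

It still gets suffixes like com.au wrong (google.com.au comes back as com.au), which is the kind of case a suffix list handles and a simple pattern does not.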

Solution 4

If you have a limited subset of data, I suggest keeping the regex simple, e.g.

(([a-z\-]+)(?:\.com|\.fr|\.co\.uk))

This will match:

www.google.com --> google.com
www.google.co.uk --> google.co.uk
www.foo-bar.com --> foo-bar.com

In my case, I know that all relevant URLs will be matched using this regex.

Collect a sample dataset and test it against your regex. While prototyping, you can do that using a tool such as https://regex101.com/r/aG9uT0/1. In development, automate it using a test script.
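A test script for this can be very small. The sketch below runs the three examples from this answer through the pattern and collects any mismatches. The `samples` array and the `runSamples` helper are placeholders I made up for illustration, so swap in your own dataset.

```javascript
// Pattern from this answer, with the dot in "co.uk" escaped.
const pattern = /(([a-z\-]+)(?:\.com|\.fr|\.co\.uk))/;

// Sample dataset as [input, expected root domain] pairs (placeholder data).
const samples = [
  ['www.google.com', 'google.com'],
  ['www.google.co.uk', 'google.co.uk'],
  ['www.foo-bar.com', 'foo-bar.com'],
];

// Hypothetical helper: returns one entry per sample whose match differs from the expectation.
function runSamples(regex, cases) {
  const failures = [];
  for (const [input, expected] of cases) {
    const match = regex.exec(input);
    const actual = match ? match[1] : null;
    if (actual !== expected) failures.push({ input, expected, actual });
  }
  return failures;
}

const failures = runSamples(pattern, samples);
console.log(failures.length === 0 ? 'all samples passed' : failures);
```

Returning the failures instead of throwing on the first one makes it easier to see every input the regex gets wrong in a single run.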

Author: jamesmhaley

Updated on April 28, 2022

Comments

  • jamesmhaley
    jamesmhaley almost 2 years

    I had a search and found lots of similar regex examples, but not quite what I need.

    I want to be able to pass in the following urls and return the results:

    • www.google.com returns google.com

    • sub.domains.are.cool.google.com returns google.com

    • doesntmatterhowlongasubdomainis.idont.wantit.google.com returns google.com

    • sub.domain.google.com/no/thanks returns google.com

    Hope that makes sense :) Thanks in advance!-James

    • Pekka
      Pekka over 13 years
      What is the result going to be for sub.domain.google.co.uk?
    • Gumbo
      Gumbo over 13 years
      Those are not URLs but just domain names (except the last that is just a string that can be interpreted as domain name plus a URL path).
    • janmoesen
      janmoesen over 13 years
      Be sure to check out the Public Suffix List at publicsuffix.org.
  • jamesmhaley
    jamesmhaley over 13 years
    Could you explain what it does, please? My understanding of regex is minimal. Also, how would it be implemented?
  • hallvors
    hallvors over 13 years
    90% is generous. Basically, there IS no simple way to do this. The domain name system is way too convoluted and allows a lot of variation.
  • theraccoonbear
    theraccoonbear over 13 years
    Given that the examples provided are "normalish" looking domains, I think you can probably hit a substantial chunk, but sure, maybe not 90%. As I said though (and really to the point) it's unlikely you'll get 100% for all of your test cases.