Remove subdomain from string in Ruby

11,997

Solution 1

I just wrote a library to do this called Domainatrix. You can find it here: http://github.com/pauldix/domainatrix

require 'rubygems'
require 'domainatrix'

url = Domainatrix.parse("http://www.pauldix.net")
url.public_suffix # => "net"
url.domain        # => "pauldix"
url.canonical     # => "net.pauldix"

url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.public_suffix # => "co.uk"
url.domain        # => "pauldix"
url.subdomain     # => "foo.bar"
url.path          # => "/asdf.html?q=arg"
url.canonical     # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"

# Join the two parts to get the host without its subdomain.
"#{url.domain}.#{url.public_suffix}" # => "pauldix.co.uk"

Solution 2

This is a tricky issue. Some top-level domains do not accept registrations at the second level.

Compare example.com and example.co.uk. If you simply stripped everything except the last two labels, you would end up with example.com and co.uk, and the latter is never what you want.

Firefox solves this by filtering on the effective top-level domain, and Mozilla maintains a list of all these domains. More information is available at publicsuffix.org.

You can use this list to filter out everything except the domain right next to the effective TLD. I don't know of any Ruby library that does this, but it would be a great idea to release one!

Update: there are C, Perl and PHP libraries that do this. Given the C version, you could create a Ruby extension. Alternatively, you could port the code to Ruby.
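To make the idea concrete, here is a minimal sketch of the lookup in plain Ruby. It assumes a tiny hand-picked suffix set (SAMPLE_SUFFIXES) instead of the full list from publicsuffix.org, and the method name registrable_domain is made up for this example. It also ignores the wildcard and exception rules that the real list contains, so treat it as an illustration rather than a finished library.

```ruby
require 'set'

# Hypothetical, hand-picked sample of public suffixes. A real implementation
# would load the full list published at publicsuffix.org instead.
SAMPLE_SUFFIXES = Set.new(%w[com net nl uk co.uk org.uk]).freeze

# Returns the label next to the longest known suffix plus that suffix,
# or nil when the host is itself a suffix or no known suffix matches.
def registrable_domain(host)
  labels = host.downcase.split('.')
  return nil if SAMPLE_SUFFIXES.include?(labels.join('.'))

  # Try the longest candidate suffix first so "co.uk" wins over "uk".
  (1...labels.length).each do |start|
    suffix = labels[start..-1].join('.')
    return labels[(start - 1)..-1].join('.') if SAMPLE_SUFFIXES.include?(suffix)
  end
  nil
end

puts registrable_domain("foo.bar.pauldix.co.uk") # pauldix.co.uk
puts registrable_domain("www.example.com")       # example.com
```

Checking the longest suffix first is what keeps example.co.uk from collapsing to co.uk.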

Solution 3

For posterity, here's an update from Oct 2014:

I was looking for a more up-to-date dependency to rely on and found the public_suffix gem (RubyGems) (GitHub). It's being actively maintained and handles all the top-level domain and nested-subdomain issues by maintaining a list of the known public suffixes.

In combination with URI.parse for stripping protocol and paths, it works really well:

require 'uri'
require 'public_suffix'

PublicSuffix.parse(URI.parse('https://subdomain.google.co.uk/path/on/path').host).domain
# => "google.co.uk"

Solution 4

The regular expression you'll need here can be a bit tricky, because hostnames can be arbitrarily complex: you could have multiple subdomains (e.g. foo.bar.baz.com), or the top-level domain (TLD) can have multiple parts (e.g. www.baz.co.uk).

Ready for a complex regular expression? :)

re = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/i
new_url = o_url.host.gsub(re, '\1').strip

Let's break this into two sections. ^(?:(?>[a-z0-9-]*\.)+?|) will collect subdomains by matching one or more groups of characters followed by a dot (lazily, because of the +? quantifier, so it consumes only as many labels as the rest of the pattern requires). The empty alternation is needed in the case of no subdomain (such as foo.com). ([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$ will collect the actual hostname and the TLD. It allows either a one-part TLD (like .info, .com or .museum) or a two-part TLD where the second part is two characters (like .oh.us or .org.uk).

I tested this expression on the following samples:

foo.com => foo.com
www.foo.com => foo.com
bar.foo.com => foo.com
www.foo.ca => foo.ca
www.foo.co.uk => foo.co.uk
a.b.c.d.e.foo.com => foo.com
a.b.c.d.e.foo.co.uk => foo.co.uk

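Note that this regex will not properly match hostnames that have more than two "parts" to the TLD! It also treats any trailing two-letter label as the second part of a two-part TLD, so a.b.bigsense.io yields b.bigsense.io rather than bigsense.io.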
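If you want to check the pattern against your own hostnames, a small script along these lines may help. The regex is copied from above; the constant SUBDOMAIN_RE and the helper strip_subdomains are names invented for this sketch. The last host in the list is the two-letter-TLD case reported in the comments.

```ruby
# Pattern copied verbatim from the answer above.
SUBDOMAIN_RE = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/i

# Hypothetical helper name; it applies the pattern the same way the answer does.
def strip_subdomains(host)
  host.gsub(SUBDOMAIN_RE, '\1').strip
end

hosts = %w[foo.com www.foo.com www.foo.co.uk a.b.c.d.e.foo.co.uk a.b.bigsense.io]
hosts.each { |host| puts "#{host} => #{strip_subdomains(host)}" }
```

The first four hosts come out as listed in the samples, while a.b.bigsense.io comes out as b.bigsense.io, because the lazy prefix stops as soon as "bigsense.io" can be read as a two-part TLD.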

Solution 5

Something like:

def remove_subdomain(host)
  # Not complete: add every root domain you need to the regexp.
  host.sub(/.*?([^.]+(\.com|\.co\.uk|\.uk|\.nl))$/, "\\1")
end

puts remove_subdomain("www.example.com") # -> example.com
puts remove_subdomain("www.company.co.uk") # -> company.co.uk
puts remove_subdomain("www.sub.domain.nl") # -> domain.nl

You still need to add every suffix you consider a root domain. For example, '.uk' might be a root domain, but you probably want to keep the label just before the '.co.uk' part.
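Rather than editing the regexp by hand each time, one option is to build it from an array. This is only a sketch: ROOT_SUFFIXES is a hypothetical, incomplete list you would extend yourself, and remove_subdomain_from is a name made up here so it does not clash with the method above. Regexp.union escapes each string, so the dots match literally.

```ruby
# Hypothetical, incomplete list; extend it with every suffix you treat as a root.
ROOT_SUFFIXES = %w[.com .co.uk .uk .nl].freeze

# Regexp.union escapes the strings and joins them with "|".
SUFFIX_PATTERN = Regexp.union(ROOT_SUFFIXES)

# Same shape as the regexp above: a lazy prefix, then one label plus a known suffix.
HOST_PATTERN = /.*?([^.]+#{SUFFIX_PATTERN})\z/

def remove_subdomain_from(host)
  # Hosts whose suffix is not in the list are returned unchanged.
  host.sub(HOST_PATTERN, '\1')
end

puts remove_subdomain_from("www.company.co.uk") # company.co.uk
puts remove_subdomain_from("www.sub.domain.nl") # domain.nl
```

Because the prefix is lazy, the match starts at the earliest label that works, so '.co.uk' is preferred over '.uk' regardless of the order of the list.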

Author by

Admin

Updated on June 04, 2022

Comments

  • Admin
    Admin almost 2 years

    I'm looping over a series of URLs and want to clean them up. I have the following code:

    # Parse url to remove http, path and check format
    o_url = URI.parse(node.attributes['href'])
    
    # Remove www
    new_url = o_url.host.gsub('www.', '').strip
    

    How can I extend this to remove the subdomains that exist in some URLs?

  • shadowbq
    shadowbq about 11 years
    This ruby gem references the Mozilla data file at publicsuffix.org.
  • alexvicegrab
    alexvicegrab almost 9 years
    Works better than URI, in my experience. For instance, with YouTube video URLs, URI removes the ?v=******** field, leaving only /watch, whereas Domainatrix works perfectly.
  • djsumdog
    djsumdog over 5 years
    This regex breaks if you have a TLD with just two letters. For example: a.b.bigsense.io should give me bigsense.io, but instead it gives me b.bigsense.io
  • murb
    murb over 5 years
    Since this gem hasn't received any updates, see @DarrenCheng's answer for a more up-to-date gem: github.com/weppos/publicsuffix-ruby