Need regex to get domain + subdomain

Solution 1

This regex should match a domain in a string, including any subdomains:

/([a-z0-9-]+\.)*[a-z0-9-]+\.[a-z]+/

Translated into rough English, it works like this: "match the first part of the string that looks like 'sometextornumbers.sometext', and also include any number of 'sometextornumbers.' groups that might precede it."

See it in action here: http://regexr.com?2vppk

Note that the multiline and global flags in that link are only there so the regex can match the entire blob of test text; you don't need them if you're passing only one line to the regex.
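
As a rough illustration of how this pattern behaves in PHP (the language of the question), here is a minimal sketch. The helper name `find_host` and the sample strings are made up for this example. The pattern is written without a `|` inside the character classes, because there it would match a literal pipe, and the `i` flag is an addition so that upper-case hosts match too.

```php
<?php
// Hypothetical helper: returns the first host-like substring, or null.
function find_host(string $text): ?string
{
    // Zero or more "label." groups, followed by "label.tld".
    $pattern = '/([a-z0-9-]+\.)*[a-z0-9-]+\.[a-z]+/i';
    if (preg_match($pattern, $text, $m) === 1) {
        return $m[0]; // $m[0] is the whole match, subdomains included
    }
    return null;
}

var_dump(find_host('myblog.blogger.com'));
// myblog.blogger.com
var_dump(find_host('visit http://myblog.blogger.com/post/1 today'));
// myblog.blogger.com
```

Because the pattern is not anchored, it also picks up things that merely look like hosts (a file name such as `notes.txt`, for instance), so it is probably best applied to a string that is already known to be a host.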

Solution 2

Good luck with the above, since domain names can now contain non-Latin characters. These have to be converted into their equivalent, unique ASCII form before a regex can work reliably. See RFC 3490, Internationalizing Domain Names in Applications (IDNA), at https://www.rfc-editor.org/rfc/rfc3490, which says:

Until now, there has been no standard method for domain names to use
characters outside the ASCII repertoire. This document defines
internationalized domain names (IDNs) and a mechanism called
Internationalizing Domain Names in Applications (IDNA) for handling
them in a standard fashion. IDNs use characters drawn from a large
repertoire (Unicode), but IDNA allows the non-ASCII characters to be
represented using only the ASCII characters already allowed in so-
called host names today. This backward-compatible representation is
required in existing protocols like DNS, so that IDNs can be
introduced with no changes to the existing infrastructure. IDNA is
only meant for processing domain names, not free text.
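
To make that conversion step concrete, here is a minimal PHP sketch. It assumes the `intl` extension is installed, since that is where `idn_to_ascii()` comes from; the helper name `to_ascii_host` and the sample host are made up for illustration.

```php
<?php
// Hypothetical helper: convert a possibly internationalized host to its
// ASCII form so that an ASCII-only regex can be applied afterwards.
function to_ascii_host(string $host): ?string
{
    // Pure-ASCII hosts need no conversion.
    if (!preg_match('/[^\x00-\x7F]/', $host)) {
        return strtolower($host);
    }
    // Assumption: idn_to_ascii() is available (intl extension loaded).
    if (!function_exists('idn_to_ascii')) {
        return null;
    }
    $ascii = idn_to_ascii($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    return $ascii === false ? null : $ascii;
}

var_dump(to_ascii_host('Example.COM'));
// example.com
var_dump(to_ascii_host('bücher.example'));
// xn--bcher-kva.example (when intl is available, otherwise NULL)
```

The returned string contains only letters, digits, hyphens and dots, so an ASCII pattern like the one in Solution 1 can then be applied to it.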

Author: Andreas

Updated on June 04, 2022

Comments

  • Andreas
    Andreas almost 2 years

    So I'm using this function here:

    function get_domain($url)
    {
      $pieces = parse_url($url);
      $domain = isset($pieces['host']) ? $pieces['host'] : '';
      if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
        return $regs['domain'];
      }
      return false;
    }
    
    $referer = get_domain($_SERVER['HTTP_REFERER']);
    

    What I need is another regex for it, if someone would be so kind as to help. Specifically, I need it to return the whole domain, including subdomains.

    Here's a real problem I have now: when bloggers link from, for example, myblog.blogger.com, the referer URL comes out as just blogger.com, which is not ideal.

    So if someone could help me with a regex for the function above that includes the subdomain, I'd appreciate it a lot!

    Thanks!

  • FallDi
    FallDi over 8 years
    Domains and subdomains can also contain a dash (-).
  • liquidki
    liquidki about 7 years
    Per RFC, hostname labels cannot begin or end with a hyphen.
  • Toto
    Toto over 4 years
    Why and how is this better? It matches |||||||a.zzzzzz for example. Please, have a look at these sites: TLD list; valid/invalid addresses; regex for RFC822 email address
  • SeriousM
    SeriousM about 4 years
    This would only be valid for two-character top-level domain names. What about "vienna", "berlin" or "com"?
  • Alexandre Salomé
    Alexandre Salomé about 3 years
    Your expression does not work without a port number, as shown here: regex101.com/r/RZSKc1/1 - you should make it optional. Also, putting a | inside the brackets makes it an allowed character. I created many examples for you here: regex101.com/r/W93YdL/1
  • chris
    chris about 3 years
    Hey, thanks for your feedback! I improved the regex; it handles all of your tests correctly except for your first 'invalid' domain, which doesn't seem wrong to me.
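
Putting the points from this thread together, one possible rewrite of the question's `get_domain()` is sketched below. It assumes that the `host` component from `parse_url()` is what is wanted (it already includes any subdomains), and it checks each label against the rule liquidki mentions: letters, digits and hyphens only, no leading or trailing hyphen, at most 63 characters. The name `get_full_domain` is hypothetical, and the sketch neither consults a TLD list nor handles non-ASCII hosts.

```php
<?php
// Hypothetical variant of get_domain(): returns the whole host, or false.
function get_full_domain(string $url)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false;
    }
    // One label: starts and ends with a letter or digit, hyphens allowed
    // in between, 1 to 63 characters in total.
    $label = '[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?';
    // Require at least two labels separated by dots.
    if (preg_match('/^' . $label . '(?:\.' . $label . ')+$/i', $host)) {
        return strtolower($host);
    }
    return false;
}

var_dump(get_full_domain('http://myblog.blogger.com/2012/01/post.html'));
// myblog.blogger.com
var_dump(get_full_domain('http://-bad-.example.com/'));
// false
```

With this in place, the call from the question would become something like `$referer = get_full_domain($_SERVER['HTTP_REFERER'] ?? '');`. Requiring at least two labels means single-label hosts such as `localhost` are rejected, which may or may not be what you want.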