Get domain from URL in Oracle SQL

10,011

Solution 1

Thanks to the hints in the answers I finally got it working!

The code I am using now looks like this:

REGEXP_REPLACE(website_url, '(http[s]?://)?(www\.)?(.*?)((/|:)(.)*|$)', '\3')

Thanks for the help everybody!

Solution 2

Use this :

WITH tab AS 
 (SELECT 'https://www.example.co.uk/dir/index.html' AS website_url 
    FROM dual)
SELECT REGEXP_SUBSTR(REGEXP_REPLACE(website_url, '^http[s]?://(www\.)?|^www\.', '', 1), '\w+(\.\w+)+') 
  FROM tab;

output:

|REGEXP_SUBSTR(REGEXP_REPLACE(W|
--------------------------------
|example.co.uk                 |

Solution 3

Not sure whether oracle supports the ?: to exclude a group or not.

REGEXP_REPLACE(website_url, '^(?:(?:http[s]?://)?www\.)?(.*?)(?:/.*|$)', '\1')

If it doesn't, then this one:

REGEXP_REPLACE(website_url, '^((http[s]?://)?www\.)?(.*?)(/.*|$)', '\3')
Share:
10,011

Related videos on Youtube

Foaly
Author by

Foaly

Sometimes I do stuff with computers.

Updated on June 04, 2022

Comments

  • Foaly
    Foaly almost 2 years

    I have a database that contains website URL's. From those URL's I'd like to extract the domain name. Here are two (quiet different) examples:

    http://www.example.com       -> example.com
    example.co.uk/dir/index.html -> example.co.uk
    

    In order to do this I am using a regular expression and the functions REGEXP_SUBSTR and REGEXP_REPLACE that Oracle provides. I am using replace to replace the preceding http[s] and the www. with an empty string (deleting it). Then I use substring to get the string between the beginning and the first / or if there is no / the whole string. My code looks like this:

    REGEXP_SUBSTR(REGEXP_REPLACE(website_url, '^http[s]?://(www\.)?|^www\.', '', 1), '(.+?)(/|$)')
    

    Everything works as expected, except the fact that my regex fails to exclude the /:

    example.com/dir/index.html -> example.com/
    

    I would like to get rid of the /. How do I do that?

  • Foaly
    Foaly over 10 years
    This works very nice! Thank you very much. But sadly it doesn't work for URL's that include a - for example the URl www.top.i-am-a-example.com gives top.i I tried but I can't fix it. Do you know how?
  • San
    San over 10 years
    Adding permissible range could be one solution to this. REGEXP_SUBSTR(REGEXP_REPLACE(website_url, '^http[s]?://(www\.)?|^www\.', '', 1), '[a-z,A-Z,0-9,-]+(\.\w+)+')
  • Foaly
    Foaly over 10 years
    Yes adding a range seems to be the only option. Using the your code I still get top.i. I am not an expert on regex, so I don't know why... Looks correct to me
  • Foaly
    Foaly over 10 years
    As far as I can see it Oracle does not support ?: the second works as expected, but somehow it does not work for urls like: www.example.com/dir/index.html it returns: example.comdir/index.html