Question about URL Validation with Regex

12,463

Solution 1

The below regex is from the wonderful Mastering Regular Expressions book. If you are not familiar with the free spacing/comments mode, I suggest you get familiar with it.

\b
# Match the leading part (proto://hostname, or just hostname)
(
    # ftp://, http://, or https:// leading part
    (ftp|https?)://[-\w]+(\.\w[-\w]*)+
  |
    # or, try to find a hostname with our more specific sub-expression
    (?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+ # sub domains
    # Now ending .com, etc. For these, require lowercase
    (?-i: com\b
        | edu\b
        | biz\b
        | gov\b
        | in(?:t|fo)\b # .int or .info
        | mil\b
        | net\b
        | org\b
        | name\b
        | coop\b
        | aero\b
        | museum\b
        | [a-z][a-z]\b # two-letter country codes
    )
)

# Allow an optional port number
( : \d+ )?

# The rest of the URL is optional, and begins with / . . . 
(
     /
     # The rest are heuristics for what seems to work well
     [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]*
     (?:
        [.!,?]+  [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]+
     )*
)?

To explain this regex briefly (for a full explanation get the book) - URLs have one or more dot separated parts ending with either a limited list of final bits, or a two letter country code (.uk .fr ...). In addition the parts may have any alphanumeric characters or hyphens '-', but hyphens may not be the first or last character of the parts. Then there may be a port number, and then the rest of it.

To extract this from the website, go to http://regex.info/listing.cgi?ed=3&p=207 It is from page 207 of the 3rd edition.

And the page says "Copyright © 2008 Jeffrey Friedl" so I'm not sure what the conditions for use are exactly, but I would expect that if you own the book you could use it so ... I'm hoping I'm not breaking the rules putting it here.

Solution 2

If you read section 5 of the URL specification (http://www.isi.edu/in-notes/rfc1738.txt) you'll see that the syntax of a URL is at a minimum:

scheme ':' schemepart

where scheme is 1 or more characters and schemepart is 0 or more characters. Therefore if you don't have a colon, you don't have a URL.

That said, /users/ don't care if they've given you a url, to them it looks like one. So here's what I do:

BEFORE validation, if there isn't a colon in it, prepend http://, then run it through whatever validator you want. This turns any legitimate hostname (which may not include domain info, after all) into something that looks like a URL.

frob  ->  http://frob

(Nearly) the only rule for the host part is that it can't begin with a digit if it contains no dots. Now, there are specific validations that should be performed for specific schemes, which none of the regexes given thus far accomplish. But, spec compliance is probably not what you want to 'validate'. Therefore a dns query on the hostname portion may be useful, but unless you're using the same resolver in the same context as your user, it isn't going to work in all cases.

Share:
12,463
Anders
Author by

Anders

Currently freelancing my IT skills while working part-time as an IT consultant for a design firm. SOreadytohelp

Updated on June 04, 2022

Comments

  • Anders
    Anders almost 2 years

    I have the following regex that does a great job matching urls:

    ((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)`
    

    However, it does not handle urls without a prefix, ie. stackoverflow.com or www.google.com do not match. Anyone know how I can modify this regex to not care if there is a prefix or not?


    EDIT: Does my question too vague? Does it need more details?


    (((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\)))?[\w\d:#@%/;$()~_?\+-=\\\.&]*)
    

    I added a ()? around the protocols like Vinko Vrsalovic suggested, but now the regex will match nearly any string, as long as it has valid URL characters.

    My implementation of this is I have a database that I manage the contents, and it has a field that either has plain text, a phone number, a URL or an email address. I was looking for an easy way to validate the input so I can have it properly formatted, ie. creating anchor tags for the url/email, and formatting the phone number how I have the other numbers formatted throughout the site. Any suggestions?

  • Anders
    Anders over 15 years
    are you saying to replace the contents of the square brackets with .*?
  • Douglas Leeder
    Douglas Leeder over 15 years
    no replace the entire regular expression. Or better just remove the regular expression and treat everything as a url.