PHP validation/regex for URL

Solution 1

I've used this on a few projects. I don't believe I've run into issues, but I'm sure it's not exhaustive:

$text = preg_replace(
  '#((https?|ftp)://(\S*?\.\S*?))([\s)\[\]{},;"\':<]|\.\s|$)#i',
  '<a href="$1" target="_blank">$3</a>$4',
  $text
);

Most of the random junk at the end is to deal with situations like http://domain.com. in a sentence (to avoid matching the trailing period). I'm sure it could be cleaned up, but since it worked, I've more or less just copied it from project to project.
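As a hedged sketch of how the same pattern could be used with output escaping (several comments on this page raise XSS concerns), the replacement can be done in a callback. The function name linkify() is hypothetical and not part of the original answer; the sketch assumes the surrounding text is escaped or trusted elsewhere, and only the URL parts are escaped here.

```php
<?php
// linkify() is a hypothetical helper name used for illustration only.
function linkify(string $text): string
{
    // Same pattern as in the answer: group 1 is the full URL, group 3 the part
    // after the scheme, group 4 the delimiter that ended the match.
    $pattern = '#((https?|ftp)://(\S*?\.\S*?))([\s)\[\]{},;"\':<]|\.\s|$)#i';

    $result = preg_replace_callback(
        $pattern,
        function (array $m): string {
            // Escape quotes and angle brackets so the URL cannot break out of
            // the href attribute; the final "false" avoids double-encoding
            // entities that are already present (e.g. &amp;).
            $href  = htmlspecialchars($m[1], ENT_QUOTES, 'UTF-8', false);
            $label = htmlspecialchars($m[3], ENT_QUOTES, 'UTF-8', false);
            return '<a href="' . $href . '" target="_blank">' . $label . '</a>' . $m[4];
        },
        $text
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $text;
}
```

The pattern itself is unchanged, so its known limitations (for example, stopping at a colon before a port number) still apply.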

Solution 2

Use the filter_var() function to validate whether a string is a URL or not:

var_dump(filter_var('example.com', FILTER_VALIDATE_URL));

It is bad practice to use regular expressions when not necessary.

EDIT: Be careful, this solution is not unicode-safe and not XSS-safe. If you need a complex validation, maybe it's better to look somewhere else.
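A minimal sketch of adding a further check on top of FILTER_VALIDATE_URL, assuming only http and https URLs are wanted. The function name is_http_url() and the scheme whitelist are assumptions for illustration, not part of the original answer.

```php
<?php
// is_http_url() is a hypothetical helper; the allowed schemes are an assumption.
function is_http_url(string $url): bool
{
    // First let PHP reject strings that do not look like a URL at all.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    // FILTER_VALIDATE_URL accepts many schemes, so restrict them explicitly.
    $scheme = parse_url($url, PHP_URL_SCHEME);
    return is_string($scheme)
        && in_array(strtolower($scheme), ['http', 'https'], true);
}
```

This does not make the value safe to print; when the URL is written into HTML it would still need htmlspecialchars() or an equivalent.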

Solution 3

As per the PHP manual - parse_url should not be used to validate a URL.

Unfortunately, it seems that filter_var('example.com', FILTER_VALIDATE_URL) does not perform any better.

Both parse_url() and filter_var() will pass malformed URLs such as http://...

Therefore in this case - regex is the better method.
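As a hedged illustration of the kind of check that catches hosts such as "..." or a bare "www", the host can be split into labels and each label tested. The function name has_plausible_host() and its rules (at least two labels, letters, digits and inner hyphens only) are assumptions made for this sketch; they deliberately reject things some sites use, such as single-label hosts like localhost, underscores, and non-ASCII domain names.

```php
<?php
// has_plausible_host() is a hypothetical helper; its rules are assumptions.
function has_plausible_host(string $url): bool
{
    // parse_url() returns null or false when no host can be extracted.
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false;
    }

    // Require at least two dot-separated labels (rejects "www", "localhost").
    $labels = explode('.', $host);
    if (count($labels) < 2) {
        return false;
    }

    // Each label: 1 to 63 characters, alphanumeric at both ends, hyphens inside.
    foreach ($labels as $label) {
        if (preg_match('/^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\z/i', $label) !== 1) {
            return false;
        }
    }
    return true;
}
```

A check like this would typically be combined with scheme validation rather than used alone.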

Solution 4

As per John Gruber (Daring Fireball):

Regex:

(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))

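Using it in preg_match() (with ~ as the delimiter, because the pattern contains forward slashes, and the u modifier, because the final character class contains multibyte characters):

preg_match("~(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))~u", $url)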

Here is the extended regex pattern (with comments):

(?xi)
\b
(                       # Capture 1: entire matched URL
  (?:
    https?://               # http or https protocol
    |                       #   or
    www\d{0,3}[.]           # "www.", "www1.", "www2." … "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                       # One or more:
    [^\s()<>]+                  # Run of non-space, non-()<>
    |                           #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                       # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                               #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct chars
  )
)

For more details please look at: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
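A short sketch of pulling every URL out of a block of text with this pattern. The helper name extract_urls() is hypothetical. A nowdoc is used so that the backslashes and quotes in the pattern need no extra PHP escaping, and the u modifier assumes the input is valid UTF-8.

```php
<?php
// extract_urls() is a hypothetical helper name used for illustration only.
function extract_urls(string $text): array
{
    // Nowdoc keeps the pattern literal; "~" is the delimiter so the forward
    // slashes in the pattern do not need escaping.
    $pattern = <<<'REGEX'
~(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))~u
REGEX;

    preg_match_all($pattern, $text, $matches);

    // Capture group 1 holds each complete URL; empty array if nothing matched
    // or the regex failed (e.g. invalid UTF-8 input).
    return $matches[1] ?? [];
}

$found = extract_urls('Visit http://example.com/foo_(bar) and www.example.org.');
```

Note that the pattern is meant for finding URL-like strings in prose, not for strict validation, so anything it returns may still need checking.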

Solution 5

Just in case you want to know if the url really exists:

function url_exist($url){//if this passes, the URL exists
    $c=curl_init();
    curl_setopt($c,CURLOPT_URL,$url);
    curl_setopt($c,CURLOPT_HEADER,1);//get the header
    curl_setopt($c,CURLOPT_NOBODY,1);//and *only* get the header
    curl_setopt($c,CURLOPT_RETURNTRANSFER,1);//get the response as a string from curl_exec(), rather than echoing it
    curl_setopt($c,CURLOPT_FRESH_CONNECT,1);//force a new connection instead of reusing a cached one
    if(curl_exec($c)===false){
        //echo $url.' does not exist';
        return false;
    }
    //echo $url.' exists';
    //to also treat HTTP error statuses as "does not exist", return this instead:
    //$httpcode=curl_getinfo($c,CURLINFO_HTTP_CODE);
    //return ($httpcode<400);
    return true;
}
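
A hedged variant of the same idea that validates the scheme before making any request, sets timeouts, and checks the HTTP status code. The name url_responds() and the $timeout parameter are hypothetical. The sketch narrows what gets requested but does not by itself make it safe to fetch arbitrary user-supplied URLs (internal hosts are not filtered, for instance).

```php
<?php
// url_responds() and $timeout are hypothetical names used for illustration.
function url_responds(string $url, int $timeout = 5): bool
{
    // Cheap check first: only probe http(s) URLs, never file://, ftp://, etc.
    $scheme = parse_url($url, PHP_URL_SCHEME);
    if (!is_string($scheme) || !in_array(strtolower($scheme), ['http', 'https'], true)) {
        return false;
    }

    $c = curl_init($url);
    curl_setopt_array($c, [
        CURLOPT_NOBODY         => true,     // request headers only
        CURLOPT_RETURNTRANSFER => true,     // return the response instead of echoing it
        CURLOPT_CONNECTTIMEOUT => $timeout, // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeout, // seconds allowed for the whole request
    ]);

    // curl_exec() returns false on transport errors (DNS, refused connection, timeout).
    if (curl_exec($c) === false) {
        return false;
    }

    // Redirects are not followed here, so 3xx responses count as "responds".
    $status = curl_getinfo($c, CURLINFO_HTTP_CODE);
    return $status >= 200 && $status < 400;
}
```

Because each call makes a network request, it is usually worth running cheaper syntactic checks first and calling this only for URLs that pass them.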

Author: AndreLiem

Updated on June 28, 2020

Comments

  • AndreLiem
    AndreLiem almost 4 years

    I've been looking for a simple regex for URLs. Does anybody have one handy that works well? I didn't find one in the Zend Framework validation classes, and I have seen several implementations.

  • Owen
    Owen over 15 years
    this is definitely a great alternative, unfortunately it's php 5.2+ (unless you install the PECL version)
  • Alan Moore
    Alan Moore almost 15 years
    Some things that jump out at me: use of alternation where character classes are called for (every alternative matches exactly one character); and the replacement shouldn't have needed the outer double-quotes (they were only needed because of the pointless /e modifier on the regex).
  • andrewbadera
    andrewbadera over 14 years
    ^(http://|https://)?(([a-z0-9]?([-a-z0-9]*[a-z0-9]+)?){1,63}\.)+[a-z]{2,6} (may be too greedy, not sure yet, but it's more flexible on protocol and leading www)
  • Admin
    Admin over 14 years
    Eregi will be removed in PHP 6.0.0. And domains with "öäåø" will not validate with your function. You probably should convert the URL to punycode first?
  • Frankie
    Frankie over 14 years
    @incidence absolutely agree. I wrote this in March and PHP 5.3 only came out late June setting eregi as DEPRECATED. Thank you. Gonna edit and update.
  • Gumbo
    Gumbo over 14 years
    @John Scipione: google.com is only a valid relative URL path but not a valid absolute URL. And I think that’s what he’s looking for.
  • vamin
    vamin almost 14 years
    There's a bug in 5.2.13 (and I think 5.3.2) that prevents urls with dashes in them from validating using this method.
  • Kzqai
    Kzqai almost 14 years
    This argument doesn't follow. If FILTER_VALIDATE_URL is a little more permissive than you want, tack on some additional checks to deal with those edge cases. Reinventing the wheel with your own attempt at a regex against urls is only going to get you further from a complete check.
  • Kzqai
    Kzqai almost 14 years
    See all the shot-down regexes on this page for examples of why -not- to write your own.
  • catchdave
    catchdave almost 14 years
    You make a fair point Tchalvak. Regexes for something like URLs can (as per other responses) be very hard to get right. Regex is not always the answer. Conversely regex is also not always the wrong answer either. The important point is to pick the right tool (regex or otherwise) for the job and not be specifically "anti" or "pro" regex. In hindsight, your answer of using filter_var in combination with constraints on its edge-cases, looks like the better answer (particularly when regex answers start to get to greater than 100 chars or so - making maintenance of said regex a nightmare)
  • Hawkins Entrekin
    Hawkins Entrekin over 13 years
    filter_var will reject test-site.com. I have domain names with dashes, whether they are valid or not. I don't think filter_var is the best way to validate a URL. It will allow a URL like http://www
  • Apostolos Tsakpinis
    Apostolos Tsakpinis over 13 years
    > It will allow a url like 'www'
    That is OK, just as it is for a URL like 'localhost'.
  • Softy
    Softy over 13 years
    This doesn't work in this case - it includes the trailing ": 3 cantari noi in albumul <a href="audio.resursecrestine.ro/cantece/index-autori/andrei-rosu/…>
  • liviucmg
    liviucmg about 13 years
    One particular problem: This validates URLs according to RFC 2396 which does not allow underscores in subdomains, but some websites do have underscores in subdomains.
  • Benji XVI
    Benji XVI about 13 years
    The other problem with this method is it is not unicode-safe.
  • Stephen P
    Stephen P almost 13 years
    @Softy something like http://example.com/somedir/... is a perfectly legitimate URL, asking for the file named ... - which is a legitimate file name.
  • Yzmir Ramirez
    Yzmir Ramirez almost 13 years
    I would still do some kind of validation on $url before actually verifying the URL is real, because the above operation is expensive - perhaps as much as 200 milliseconds depending on file size. In some cases the URL may not actually have a resource at its location available yet (e.g. creating a URL to an image that has yet to be uploaded). Additionally, you're not using a cached version, so it's not like file_exists(), which will cache a stat on a file and return nearly instantly. The solution you provided is still useful though. Why not just use fopen($url, 'r')?
  • Yzmir Ramirez
    Yzmir Ramirez almost 13 years
    Correct me if I'm wrong, but can we still assume TLDs will have a minimum of 2 characters and maximum of 6 characters?
  • Zack Zatkin-Gold
    Zack Zatkin-Gold over 12 years
    The filter_var function has since been updated and now it's possible to validate URLs effectively with dashes included, rendering your comment incorrect, @vamin (see bug report here).
  • vamin
    vamin over 12 years
    @zzatkin, the bug report states that the fix is incorporated into the later 5.2.14 and 5.3.3 versions (it came too late for 5.2.13 and 5.3.2), though I agree it's not really an issue anymore so long as you keep PHP up to date.
  • PJ Brunet
    PJ Brunet about 12 years
    Thanks, just what I was looking for. However, I made a mistake trying to use it. The function is "url_exist" not "url_exists" oops ;-)
  • siliconpi
    siliconpi about 12 years
    Is there any security risk in directly accessing the user entered URL?
  • andrewsi
    andrewsi over 11 years
    Will that match URLs that begin with ftp: ?
  • Bretticus
    Bretticus over 11 years
    It also will validate onedomain.com<br>http://www.anotherone.com<br>http:/… I'm finding out today. Not what I had in mind! Going back to a regular expression alternative (PHP Version => 5.4.4)
  • Sawny
    Sawny over 11 years
    Doesn't accept UTF-8 characters. Will return false for http://wiki.com/öva/mä/åäö.
  • Shahbaz
    Shahbaz over 10 years
    /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
  • mic
    mic over 10 years
    filter_var appears to accept all kinds of URL formats whether they are valid or not, so it seems that a regex is the way to correctly validate URLs.
  • Joko Wandiro
    Joko Wandiro over 10 years
    I'm using Zend\Validator\Regex to validate URLs using your pattern, but it still detects http://www.example as valid
  • pgee70
    pgee70 over 9 years
    Hi this solution is good, and i upvoted it, but it doesn't take into account the standard port for https: -- suggest you just replace 80 with '' where it works out the port
  • bhaskarc
    bhaskarc about 9 years
    yet another issue is that it does not validate against newer tlds like .me, .cm .guru etc
  • RisingSun
    RisingSun about 9 years
    This is a bad solution which should not have so many up votes. Highly XSS vulnerable.
  • suspectus
    suspectus almost 9 years
    This answer duplicates one of the answers from 2008!
  • Raz0rwire
    Raz0rwire almost 8 years
    I ended up implementing a variation on this, because my domain cares whether a URL actually exists or not :)
  • Nick Rice
    Nick Rice over 7 years
    @YzmirRamirez (All these years later...) If there was any doubt when you wrote your comment there certainly isn't now, with TLDs these days such as .photography
  • Nick Rice
    Nick Rice over 7 years
    Downvoted as dangerous. Read the comments about it in the online PHP manual!
  • Yzmir Ramirez
    Yzmir Ramirez over 7 years
    @NickRice you are correct...how much the web changes in 5 years. Now I can't wait until someone makes the TLD .supercalifragilisticexpialidocious
  • Jeff Puckett
    Jeff Puckett over 7 years
    There are a lot more top level domains.
  • user3396065
    user3396065 over 7 years
    Doesn't work with links like: 'www.w3schools.com/home/3/?a=l'
  • user3396065
    user3396065 over 7 years
    Throws ErrorException: Undefined index: scheme if the protocol is not specified. I suggest checking whether it is set first.
  • Tim Groeneveld
    Tim Groeneveld over 7 years
    @user3396065, can you please provide an example input that throws this?
  • S. Imp
    S. Imp about 7 years
    FILTER_VALIDATE_URL has a lot of problems that need fixing. Also, the docs describing the flags do not reflect the actual source code where references to some flags have been removed entirely. More info here: news.php.net/php.internals/99018
  • Camaleo
    Camaleo about 6 years
    You might want to add a check for whether a 404 was returned:
    $httpCode = curl_getinfo( $c, CURLINFO_HTTP_CODE );
    //echo $url . ' ' . $httpCode . '<br>';
    if( $httpCode == 404 ) { echo $url.' 404'; }
  • Nic3500
    Nic3500 almost 6 years
    Did you notice how old this question is? Please explain your regex, users who do not know already will have a hard time understanding it without details.
  • thespacecamel
    thespacecamel over 5 years
    Here's another article explaining the problems with this: d-mueller.de/blog/…
  • dmmd
    dmmd over 4 years
    Isn't safe at all.. any input URL would be actively accessed.
  • Arris
    Arris almost 4 years
    it is a bad solution, 'cause a://site.com is valid for FILTER_VALIDATE_URL (PHP 7.2 and older versions)
  • Ben Birney
    Ben Birney over 3 years
    To work, the pattern needs to escape the forward slashes with backslashes in three points: preg_match("/(?i)\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))/", $url)
  • mickmackusa
    mickmackusa over 2 years
    Your ., ?, +, ^, {, }, =, |, $, backtick, and [ do not need escaping in your character classes. + is even repeated in one of your character classes. : does not need to be escaped.
  • mickmackusa
    mickmackusa over 2 years
    (http|https) is more simply https?. The excessive use of pipes in this pattern negatively impacts readability and brevity. Many of the escaped characters in your pattern do not need escaping.