PHP validation/regex for URL
Solution 1
I've used this on a few projects and don't believe I've run into issues, but I'm sure it's not exhaustive:
$text = preg_replace(
    '#((https?|ftp)://(\S*?\.\S*?))([\s)\[\]{},;"\':<]|\.\s|$)#i',
    '<a href="$1" target="_blank">$3</a>$4',
    $text
);
Most of the random junk at the end is to deal with situations like http://domain.com. in a sentence (to avoid matching the trailing period). I'm sure it could be cleaned up, but since it worked, I've more or less just copied it over from project to project.
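As a rough illustration of the trailing-period handling, here is a minimal sketch that wraps the same pattern in a helper. The name linkify_urls is hypothetical, the replacement string is written without any extra outer quotes, and no HTML escaping is done, so treat it as a sketch for trusted text only:

```php
<?php
// Hypothetical helper (not part of the original answer): same pattern as above.
// It does no HTML escaping, so it is only suitable for trusted input.
function linkify_urls(string $text): string
{
    $result = preg_replace(
        '#((https?|ftp)://(\S*?\.\S*?))([\s)\[\]{},;"\':<]|\.\s|$)#i',
        '<a href="$1" target="_blank">$3</a>$4',
        $text
    );

    // preg_replace() returns null on a regex error; fall back to the input.
    return $result ?? $text;
}

echo linkify_urls('Visit http://domain.com. Thanks'), "\n";
// Visit <a href="http://domain.com" target="_blank">domain.com</a>. Thanks
```

The period after domain.com is consumed by the final group ($4) and written back outside the link, which is what keeps it out of the href.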
Solution 2
Use the filter_var() function to validate whether a string is a URL or not:
var_dump(filter_var('example.com', FILTER_VALIDATE_URL));
It is bad practice to use regular expressions when not necessary.
EDIT: Be careful, this solution is not unicode-safe and not XSS-safe. If you need a complex validation, maybe it's better to look somewhere else.
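Given the caveat above, one possible way to tighten the check is to accept only an explicit list of schemes on top of filter_var(). This is a minimal sketch, not a complete validator: the helper name is_http_url and the choice of http/https as the only allowed schemes are assumptions made for illustration.

```php
<?php
// Hypothetical helper: filter_var() plus a scheme allow-list.
// The allow-list (http, https) is an assumption; adjust it to your needs.
function is_http_url(string $candidate): bool
{
    if (filter_var($candidate, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    // parse_url() is used here only to read the scheme, not to validate.
    $scheme = parse_url($candidate, PHP_URL_SCHEME);

    return is_string($scheme)
        && in_array(strtolower($scheme), ['http', 'https'], true);
}

var_dump(is_http_url('http://example.com')); // bool(true)
var_dump(is_http_url('example.com'));        // bool(false), no scheme
```

The output still needs escaping (for example with htmlspecialchars()) before it is written into HTML; the scheme check does not replace that.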
Solution 3
As per the PHP manual - parse_url should not be used to validate a URL.
Unfortunately, it seems that filter_var('example.com', FILTER_VALIDATE_URL) does not perform any better.
Both parse_url() and filter_var() will pass malformed URLs such as http://...
Therefore in this case - regex is the better method.
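One way to catch inputs such as http://... is a small regular expression applied to the host part only. The sketch below is an assumption about what such a check could look like, not the check this answer's author had in mind. The helper name has_plausible_host and the host rule are illustrative: it accepts only dot-separated ASCII labels ending in an alphabetic TLD, so localhost, IP literals, and internationalized names that have not been converted to punycode are all rejected.

```php
<?php
// Hypothetical helper: checks only the shape of the host name.
// Assumed rule: one or more ASCII labels, then an alphabetic TLD of 2-63 chars.
function has_plausible_host(string $url): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no host could be extracted at all
    }

    $hostPattern = '/^(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$/i';

    return preg_match($hostPattern, $host) === 1;
}

var_dump(has_plausible_host('http://example.com/path')); // bool(true)
var_dump(has_plausible_host('http://...'));              // bool(false)
```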
Solution 4
As per John Gruber (Daring Fireball):
Regex:
(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))
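Using it in preg_match() (with ~ as the delimiter, because the pattern contains unescaped forward slashes that would otherwise end a /-delimited pattern early):
preg_match("~(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))~", $url)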
Here is the extended regex pattern (with comments):
(?xi)
\b
( # Capture 1: entire matched URL
(?:
https?:// # http or https protocol
| # or
www\d{0,3}[.] # "www.", "www1.", "www2." … "www999."
| # or
[a-z0-9.\-]+[.][a-z]{2,4}/ # looks like domain name followed by a slash
)
(?: # One or more:
[^\s()<>]+ # Run of non-space, non-()<>
| # or
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
)+
(?: # End with:
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
)
)
For more details please look at: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
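As a usage sketch, the commented pattern can be kept readable in PHP by storing it in a nowdoc and pulling every match out of a block of text with preg_match_all(). The helper name extract_urls, the ~ delimiter, and the u modifier (added because the last character class contains non-ASCII quote characters) are assumptions, not part of Gruber's article:

```php
<?php
// Hypothetical helper: returns every URL-like substring found in $text.
function extract_urls(string $text): array
{
    // Nowdoc keeps the pattern verbatim; "~" is used as the delimiter
    // because it does not occur anywhere in the pattern or its comments.
    $pattern = <<<'REGEX'
~(?xi)
\b
(                                       # Capture 1: entire matched URL
  (?:
    https?://                           # http or https protocol
    |                                   # or
    www\d{0,3}[.]                       # "www.", "www1.", "www2." ... "www999."
    |                                   # or
    [a-z0-9.\-]+[.][a-z]{2,4}/          # looks like domain name followed by a slash
  )
  (?:                                   # One or more:
    [^\s()<>]+                          # Run of non-space, non-()<>
    |                                   # or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                                   # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   # or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]      # not a space or one of these punct chars
  )
)
~u
REGEX;

    preg_match_all($pattern, $text, $matches);

    // Group 1 holds the whole URL; empty array if nothing matched or on error.
    return $matches[1] ?? [];
}

print_r(extract_urls('See http://example.com/foo_(bar). Also www.example.org, ok?'));
// Finds: http://example.com/foo_(bar) and www.example.org
```

Note that with the u modifier the input must be valid UTF-8, otherwise preg_match_all() fails and the function returns an empty array.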
Solution 5
Just in case you want to know if the url really exists:
function url_exist($url) { // returns true if a request to the URL succeeds
    $c = curl_init();
    curl_setopt($c, CURLOPT_URL, $url);
    curl_setopt($c, CURLOPT_HEADER, 1); // get the header
    curl_setopt($c, CURLOPT_NOBODY, 1); // and *only* get the header
    curl_setopt($c, CURLOPT_RETURNTRANSFER, 1); // get the response as a string from curl_exec(), rather than echoing it
    curl_setopt($c, CURLOPT_FRESH_CONNECT, 1); // don't use a cached version of the url
    if (!curl_exec($c)) {
        // echo $url . ' does not exist';
        return false;
    } else {
        // echo $url . ' exists';
        return true;
    }
    // Alternative (unreachable as written): decide from the HTTP status code instead
    // $httpcode = curl_getinfo($c, CURLINFO_HTTP_CODE);
    // return ($httpcode < 400);
}
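Note that a server answering with a 404 still produces a response, so the function above reports such a URL as existing. The following is a minimal sketch of the status-code variant hinted at in the commented-out lines. The name url_responds_ok, the 5-second timeout, the 200-399 success range, and the restriction to HTTP and HTTPS are all assumptions chosen for illustration; the protocol restriction in particular is only a small precaution when the URL comes from user input, not a full defence.

```php
<?php
// Hypothetical variant: treats the URL as "existing" only if the server
// answers a HEAD request with a status code in the 200-399 range.
function url_responds_ok(string $url, int $timeoutSeconds = 5): bool
{
    $c = curl_init();
    curl_setopt($c, CURLOPT_URL, $url);
    curl_setopt($c, CURLOPT_NOBODY, true);          // HEAD request, no body
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);  // do not echo the response
    // Refuse file://, ftp://, etc.; only plain web requests are allowed.
    curl_setopt($c, CURLOPT_PROTOCOLS, CURLPROTO_HTTP | CURLPROTO_HTTPS);
    curl_setopt($c, CURLOPT_CONNECTTIMEOUT, $timeoutSeconds);
    curl_setopt($c, CURLOPT_TIMEOUT, $timeoutSeconds);

    // With NOBODY set, a successful call returns an empty string,
    // so the comparison has to be strict.
    if (curl_exec($c) === false) {
        return false;
    }

    $httpCode = curl_getinfo($c, CURLINFO_HTTP_CODE);

    return $httpCode >= 200 && $httpCode < 400;
}
```

Redirects are not followed here, so a 301 or 302 counts as success; whether that is the right call depends on what "exists" should mean in your application.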
AndreLiem
Updated on June 28, 2020
Comments
-
AndreLiem almost 4 years
I've been looking for a simple regex for URLs, does anybody have one handy that works well? I didn't find one with the zend framework validation classes and have seen several implementations.
-
Owen over 15 years
this is definitely a great alternative, unfortunately it's php 5.2+ (unless you install the PECL version)
-
Alan Moore almost 15 years
Some things that jump out at me: use of alternation where character classes are called for (every alternative matches exactly one character); and the replacement shouldn't have needed the outer double-quotes (they were only needed because of the pointless /e modifier on the regex).
-
andrewbadera over 14 years
^(http://|https://)?(([a-z0-9]?([-a-z0-9]*[a-z0-9]+)?){1,63}\.)+[a-z]{2,6} (may be too greedy, not sure yet, but it's more flexible on protocol and leading www)
-
Admin over 14 years
Eregi will be removed in PHP 6.0.0. And domains with "öäåø" will not validate with your function. You probably should convert the URL to punycode first?
-
Frankie over 14 years
@incidence absolutely agree. I wrote this in March and PHP 5.3 only came out late June setting eregi as DEPRECATED. Thank you. Gonna edit and update.
-
Gumbo over 14 years
@John Scipione: google.com is only a valid relative URL path but not a valid absolute URL. And I think that’s what he’s looking for.
-
vamin almost 14 years
There's a bug in 5.2.13 (and I think 5.3.2) that prevents urls with dashes in them from validating using this method.
-
Kzqai almost 14 years
This argument doesn't follow. If FILTER_VALIDATE_URL is a little more permissive than you want, tack on some additional checks to deal with those edge cases. Reinventing the wheel with your own attempt at a regex against urls is only going to get you further from a complete check.
-
Kzqai almost 14 years
See all the shot-down regexes on this page for examples of why -not- to write your own.
-
catchdave almost 14 years
You make a fair point Tchalvak. Regexes for something like URLs can (as per other responses) be very hard to get right. Regex is not always the answer. Conversely regex is also not always the wrong answer either. The important point is to pick the right tool (regex or otherwise) for the job and not be specifically "anti" or "pro" regex. In hindsight, your answer of using filter_var in combination with constraints on its edge-cases, looks like the better answer (particularly when regex answers start to get to greater than 100 chars or so - making maintenance of said regex a nightmare)
-
Hawkins Entrekin over 13 years
filter_var will reject test-site.com; I have domain names with dashes, whether they are valid or not. I don't think filter_var is the best way to validate a url. It will allow a url like http://www
-
Apostolos Tsakpinis over 13 years
-
Softy over 13 years
This doesn't work in this case - it includes the trailing ": 3 cantari noi in albumul <a href="audio.resursecrestine.ro/cantece/index-autori/andrei-rosu/…>
-
liviucmg about 13 years
One particular problem: This validates URLs according to RFC 2396 which does not allow underscores in subdomains, but some websites do have underscores in subdomains.
-
Benji XVI about 13 years
The other problem with this method is it is not unicode-safe.
-
Stephen P almost 13 years
@Softy something like http://example.com/somedir/... is a perfectly legitimate URL, asking for the file named ... - which is a legitimate file name.
-
Yzmir Ramirez almost 13 years
I would still do some kind of validation on $url before actually verifying the url is real because the above operation is expensive - perhaps as much as 200 milliseconds depending on file size. In some cases the url may not actually have a resource at its location available yet (e.g. creating a url to an image that has yet to be uploaded). Additionally you're not using a cached version so it's not like file_exists() that will cache a stat on a file and return nearly instantly. The solution you provided is still useful though. Why not just use fopen($url, 'r')?
-
Yzmir Ramirez almost 13 years
Correct me if I'm wrong, but can we still assume TLDs will have a minimum of 2 characters and maximum of 6 characters?
-
Zack Zatkin-Gold over 12 years
The filter_var function has since been updated and now it's possible to validate URLs effectively with dashes included, rendering your comment incorrect, @vamin (see bug report here).
-
vamin over 12 years
@zzatkin, the bug report states that the fix is incorporated into the later 5.2.14 and 5.3.3 versions (it came too late for 5.2.13 and 5.3.2), though I agree it's not really an issue anymore so long as you keep PHP up to date.
-
PJ Brunet about 12 years
Thanks, just what I was looking for. However, I made a mistake trying to use it. The function is "url_exist" not "url_exists" oops ;-)
-
siliconpi about 12 years
Is there any security risk in directly accessing the user entered URL?
-
andrewsi over 11 years
Will that match URLs that begin with ftp:?
-
Bretticus over 11 years
It also will validate onedomain.com<br>http://www.anotherone.com<br>http:/… I'm finding out today. Not what I had in mind! Going back to a regular expression alternative (PHP Version => 5.4.4)
-
Sawny over 11 years
Doesn't accept UTF-8 characters. Will return false for http://wiki.com/öva/mä/åäö.
-
Shahbaz over 10 years
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
-
mic over 10 years
The filter_var appears to validate all different kinds of URL formats whether they are valid or not; it seems that regex is the way to correctly validate URLs
-
Joko Wandiro over 10 years
I'm using Zend\Validator\Regex to validate url using your pattern, but it still detects http://www.example as valid
-
pgee70 over 9 years
Hi this solution is good, and i upvoted it, but it doesn't take into account the standard port for https: -- suggest you just replace 80 with '' where it works out the port
-
bhaskarc about 9 years
yet another issue is that it does not validate against newer tlds like .me, .cm .guru etc
-
RisingSun about 9 years
This is a bad solution which should not have so many up votes. Highly XSS vulnerable.
-
suspectus almost 9 years
This answer duplicates one of the answers from 2008!
-
Raz0rwire almost 8 years
I ended up implementing a variation on this, because my domain cares whether an URL actually exists or not :)
-
Nick Rice over 7 years
@YzmirRamirez (All these years later...) If there was any doubt when you wrote your comment there certainly isn't now, with TLDs these days such as .photography
-
Nick Rice over 7 years
Downvoted as dangerous. Read the comments about it in the online PHP manual!
-
Yzmir Ramirez over 7 years
@NickRice you are correct...how much the web changes in 5 years. Now I can't wait until someone makes the TLD .supercalifragilisticexpialidocious
-
Jeff Puckett over 7 years
There are a lot more top level domains.
-
user3396065 over 7 years
Doesn't work with links like: 'www.w3schools.com/home/3/?a=l'
-
user3396065 over 7 years
Throws: ErrorException: Undefined index: scheme if the protocol is not specified. I suggest checking whether it is set first.
-
Tim Groeneveld over 7 years
@user3396065, can you please provide an example input that throws this?
-
S. Imp about 7 years
FILTER_VALIDATE_URL has a lot of problems that need fixing. Also, the docs describing the flags do not reflect the actual source code where references to some flags have been removed entirely. More info here: news.php.net/php.internals/99018
-
Camaleo about 6 years
You might want to add a check for whether a 404 was returned:
$httpCode = curl_getinfo($c, CURLINFO_HTTP_CODE);
// echo $url . ' ' . $httpCode . '<br>';
if ($httpCode == 404) { echo $url . ' 404'; }
-
Nic3500 almost 6 years
Did you notice how old this question is? Please explain your regex, users who do not know already will have a hard time understanding it without details.
-
thespacecamel over 5 years
Here's another article explaining the problems with this: d-mueller.de/blog/…
-
dmmd over 4 years
Isn't safe at all.. any input URL would be actively accessed.
-
Arris almost 4 years
it is a bad solution, 'cause a://site.com is valid for FILTER_VALIDATE_URL (PHP 7.2 and older versions)
-
Ben Birney over 3 years
To work, the pattern needs to escape the forward slashes with backslashes in three points: preg_match("/(?i)\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))/", $url)
-
mickmackusa over 2 years
Your ., ?, +, ^, {, }, =, |, $, backtick, and [ do not need escaping in your character classes. + is even repeated in one of your character classes. : does not need to be escaped.
-
mickmackusa over 2 years
(http|https) is more simply https?. The excessive use of pipes in this pattern negatively impacts readability and brevity. Many of the escaped characters in your pattern do not need escaping.