Perfect URL validation regex in Java

12,172

Use this version:

"(?i)^(?:(?:https?|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?!(?:10|127)(?:\\.\\d{1,3}){3})(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}))\\.?)(?::\\d{2,5})?(?:[/?#]\\S*)?$"

You did not have to double the slashes, add regex delimiters, modifiers at the end of the pattern, and turn \u to \x notation.

See IDEONE demo:

String[] URLs = new String[] { "http://foo.com/blah_blah", "http://foo.com/blah_blah/", "http://foo.com/blah_blah_(wikipedia)", "http://foo.bar?q=Spaces should be encoded" };
Pattern REGEX = Pattern.compile("(?i)^(?:(?:https?|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?!(?:10|127)(?:\\.\\d{1,3}){3})(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}))\\.?)(?::\\d{2,5})?(?:[/?#]\\S*)?$");
for (String url : URLs) {
    Matcher matcher = REGEX.matcher(url);
    if (matcher.find()) {
       System.out.println(matcher.group());
    }
}

Output:

http://foo.com/blah_blah
http://foo.com/blah_blah/
http://foo.com/blah_blah_(wikipedia)

UPDATE

To match URLs in larger texts, you need to replace ^ and $ with \\b:

Pattern REGEX = Pattern.compile("(?i)\\b(?:(?:https?|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?!(?:10|127)(?:\\.\\d{1,3}){3})(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}))\\.?)(?::\\d{2,5})?(?:[/?#]\\S*)?\\b");

See another demo

Share:
12,172
Danny Lo
Author by

Danny Lo

Updated on June 04, 2022

Comments

  • Danny Lo
    Danny Lo almost 2 years

    I've found that page: https://mathiasbynens.be/demo/url-regex where different regular expressions for URL validation and their possibilities are nicely listed. Diego Perini's regex is the most powerful one and I would like to use it in Java. However it doesn't work if I use it that way:

    public class URLValidation {
        // "\" replaced by "\\"
        private static Pattern REGEX = Pattern.compile("_^(?:(?:https?|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?!10(?:\\.\\d{1,3}){3})(?!127(?:\\.\\d{1,3}){3})(?!169\\.254(?:\\.\\d{1,3}){2})(?!192\\.168(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\x{00a1}-\\x{ffff}0-9]+-?)*[a-z\\x{00a1}-\\x{ffff}0-9]+)(?:\\.(?:[a-z\\x{00a1}-\\x{ffff}0-9]+-?)*[a-z\\x{00a1}-\\x{ffff}0-9]+)*(?:\\.(?:[a-z\\x{00a1}-\\x{ffff}]{2,})))(?::\\d{2,5})?(?:/[^\\s]*)?$_iuS");
    
        private static String[] URLs = new String[] { "http://foo.com/blah_blah", "http://foo.com/blah_blah/", "http://foo.com/blah_blah_(wikipedia)", "http://foo.bar?q=Spaces should be encoded" };
    
        public static void main(String[] args) throws Exception {
            for (String url : URLs) {
                Matcher matcher = REGEX.matcher(url);
                if (matcher.find()) {
                    System.out.println(matcher.group());
                }}}}
    

    This code outputs nothing, however it should output the first three URLs in the array. How to compile the regex properly to get the code working?

    upd: Thanks for the proposals. I tested your regexes in the real application. What I do there is iterate through log files and look for URL in each line. A log files have timestamps and usernames enclosed in [] and <> respectively and sometimes can contain special insivible characters responsible for formatting (color, boldness, etc) like \u0003. The regex seems to have problem with that type of strings: http://ideone.com/WEcgBY

    upd2: And how about a regex finding all URLs in a line if it contains several? For example to use it like this:

    String[] urlsFromLine = REGEX.split(line);
    for (String url : urlsFromLine) {
        System.out.println(url);
    }