Detect and extract url from a string?

java regex url

64,604

Solution 1

m.group(1) returns the first capturing group, that is, whatever the first pair of capturing parentheses matched. Here that is (https?|ftp|file), which is why you only get http.

m.group(0) holds the entire match, so check that instead, or surround your whole pattern with parentheses and keep using m.group(1).

To reach the next URL, call m.find() again; the groups then refer to the new match.
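
A minimal sketch of that advice, using the pattern from the question unchanged. The class and method names (GroupZeroDemo, findAll, firstScheme) are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupZeroDemo {
    // The question's pattern: its only capturing group is the scheme.
    private static final Pattern URL = Pattern.compile(
            "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]",
            Pattern.CASE_INSENSITIVE);

    // Collects every full match; each find() call advances to the next one.
    static List<String> findAll(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group(0)); // group(0) is the entire match
        }
        return urls;
    }

    // Returns group(1) of the first match, i.e. only the scheme.
    static String firstScheme(String text) {
        Matcher m = URL.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "See http://example.com/a and ftp://example.org/b.";
        System.out.println(findAll(text));     // full URLs
        System.out.println(firstScheme(text)); // just the scheme
    }
}
```

With this pattern the final character class excludes the dot, so the sentence-ending period after the second URL is not part of the match.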

Solution 2

Let me preface this by saying that I'm not a huge advocate of regex for complex cases. Writing the perfect expression for something like this is very difficult. That said, I do happen to have one for detecting URLs, and it's backed by a 350-line unit test class that passes. Someone started with a simple regex, and over the years we've grown the expression and the test cases to handle the issues we've found. It's definitely not trivial:

// Pattern for recognizing a URL, based off RFC 3986
private static final Pattern urlPattern = Pattern.compile(
        "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
                + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
                + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

Here's an example of using it:

Matcher matcher = urlPattern.matcher("foo bar http://example.com baz");
while (matcher.find()) {
    int matchStart = matcher.start(1);
    int matchEnd = matcher.end();
    // now you have the offsets of a URL match
}
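
As a rough sketch of how those offsets could be turned into strings: the expression is copied verbatim from above, while the wrapper class and method (UrlOffsets, extract) are hypothetical names added for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlOffsets {
    // Same expression as in the answer, copied verbatim.
    private static final Pattern urlPattern = Pattern.compile(
            "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
                    + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
                    + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = urlPattern.matcher(text);
        while (matcher.find()) {
            // start(1) skips the leading non-word character that the
            // non-capturing prefix (?:^|[\W]) may have consumed.
            urls.add(text.substring(matcher.start(1), matcher.end()));
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(extract("foo bar http://example.com baz"));
    }
}
```

Using start(1) rather than start() matters here because the overall match can begin one character before the URL itself.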

Solution 3

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Returns a list with all links contained in the input
 */
public static List<String> extractUrls(String text)
{
    List<String> containedUrls = new ArrayList<String>();
    String urlRegex = "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern pattern = Pattern.compile(urlRegex, Pattern.CASE_INSENSITIVE);
    Matcher urlMatcher = pattern.matcher(text);

    // Each find() advances to the next match; group 0 is the entire match
    while (urlMatcher.find())
    {
        containedUrls.add(text.substring(urlMatcher.start(0),
                urlMatcher.end(0)));
    }

    return containedUrls;
}

Example:

List<String> extractedUrls = extractUrls("Welcome to https://stackoverflow.com/ and here is another link http://www.google.com/ \n which is a great search engine");

for (String url : extractedUrls)
{
    System.out.println(url);
}

Prints:

https://stackoverflow.com/
http://www.google.com/

Solution 4

Detecting URLs is not an easy task. If it's enough for you to get a string that starts with https?|ftp|file, then this pattern could be fine. Your problem here is that the only capturing group, the (), surrounds just the first part, the scheme (http...), so m.group(1) returns only that.

I would make this part a non-capturing group using (?:) and put parentheses around the whole thing, so that group 1 is the full URL.

"\\b((?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])"

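Since the goal in the question is to replace each URL with a shortened one, here is a hedged sketch of how the Solution 4 pattern might be combined with Matcher.appendReplacement. The names UrlReplacer and replaceUrls are hypothetical, and the shortener is passed in as a function because the question's shorten method is not shown:

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlReplacer {
    // Solution 4's pattern: the scheme is non-capturing, group 1 is the whole URL.
    private static final Pattern URL = Pattern.compile(
            "\\b((?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])",
            Pattern.CASE_INSENSITIVE);

    // Replaces every matched URL with whatever the supplied function returns.
    static String replaceUrls(String text, Function<String, String> shorten) {
        Matcher m = URL.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = shorten.apply(m.group(1));
            // quoteReplacement keeps '$' and '\' in the replacement literal
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy the text after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in shortener, for illustration only.
        System.out.println(replaceUrls(
                "Visit http://example.com/long/path, then reply.",
                url -> "[short]"));
    }
}
```

Building the result with appendReplacement avoids calling str.replace inside the loop, which would modify the string while the matcher is still scanning the original.
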
Solution 5

With an extra pair of parentheses around the whole expression (everything except the word boundary at the start), m.group(1) should return the whole domain name:

"\\b((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])"

I don't think that regex matches the whole URL, though.


Author by

Shisoft

Studies at Donghua University. Founder of Shisoft, a web service integration team.

Updated on March 25, 2021

Comments

Shisoft about 3 years

This is an easy question, but I just don't get it. I want to detect URLs in a string and replace them with shortened ones.

I found this expression on Stack Overflow, but the result is just http:

Pattern p = Pattern.compile("\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(str);
boolean result = m.find();
while (result) {
    for (int i = 1; i <= m.groupCount(); i++) {
        String url = m.group(i);
        str = str.replace(url, shorten(url));
    }
    result = m.find();
}
return html;

Is there any better idea?

Christian Brüggemann over 8 years

This one unfortunately also matches a dot following the URL.

Thomas Wana about 8 years

Doesn't handle URLs in text correctly. Preceding whitespace is incorrectly handled (newlines swallowed), and it accepts colons, dots etc. after the URL.

Abdullah Khan over 7 years

This works even for trailing commas and whitespace. Great.

4gus71n over 7 years

Doesn't work with something like <a href="www.google.com">google link</a>. It returns "www.google.com

Steve Waring about 7 years

Downvoted because there should be eight backslashes, not four. Putting them inside double quotes reduces the number of backslashes to four in the string. The regex interpretation of \\ to match a single \ reduces the number to two, which is what you are trying to match. Also, you can use non-capturing groups, so (?://|\\\\)

Steve Waring about 7 years

I just made the same mistake; I meant (?://|\\\\\\\\)

ed22 over 6 years

Doesn't work if the URL is in parentheses (www.myurl.com): it returns "www.myurl.com)"

BullyWiiPlaza over 5 years

Updates in regards to what?

Jonathan Morales Vélez over 5 years

It doesn't work when the string contains \n: Sources:\nhttps://sites.google.com/view/kgssourcesbeauty/startseite\n is not recognized as a link.

parsecer over 4 years

Big thank you. Your answer is a life-saver for regex newbies like myself.

mwarren over 2 years

The LinkedIn url-detector code did not work for me. It extracted a URL from a bunch of Russian words where a full stop at the end of a sentence was followed immediately by the beginning of the next sentence, with no space after the dot. It found "дней.Не" and then added http:// in front, which wasn't there in the text. At the very least it should exclude non-Latin characters, right?