Detect and extract url from a string?

64,604

Solution 1

m.group(1) gives you the first matching group, that is, the contents of the first pair of capturing parentheses. Here that is (https?|ftp|file).

Look at m.group(0), which holds the entire match, or wrap the whole pattern in parentheses and use m.group(1) again.

To get the next match, call find() again and read the groups from the new match.
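The difference between group(0) and group(1) can be seen with a small sketch that reuses the pattern from the question. The class name GroupDemo, the helper firstMatchGroups and the sample sentence are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Pattern from the question: only the scheme sits inside capturing parentheses.
    static final Pattern URL = Pattern.compile(
            "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns {group(0), group(1)} for the first match, or null if none.
    static String[] firstMatchGroups(String text) {
        Matcher m = URL.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] groups = firstMatchGroups("see http://example.com/a?b=1 for details");
        System.out.println(groups[0]); // the whole match: http://example.com/a?b=1
        System.out.println(groups[1]); // only the first capturing group: http
    }
}
```

So with the pattern unchanged, reading group(0) instead of group(1) is enough to get the full URL.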

Solution 2

Let me go ahead and preface this by saying that I'm not a huge advocate of regex for complex cases. Trying to write the perfect expression for something like this is very difficult. That said, I do happen to have one for detecting URLs, and it's backed by a 350-line unit test class that passes. Someone started with a simple regex, and over the years we've grown the expression and test cases to handle the issues we've found. It's definitely not trivial:

// Pattern for recognizing a URL, based off RFC 3986
private static final Pattern urlPattern = Pattern.compile(
        "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
                + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
                + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

Here's an example of using it:

Matcher matcher = urlPattern.matcher("foo bar http://example.com baz");
while (matcher.find()) {
    int matchStart = matcher.start(1);
    int matchEnd = matcher.end();
    // now you have the offsets of a URL match
}
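To turn those offsets into the matched text, one option is to take the substring between them. This is only a sketch: the class name UrlOffsets and the method findUrls are invented here, and the expression is copied from the answer as-is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlOffsets {
    // Same expression as in the answer, copied verbatim.
    private static final Pattern urlPattern = Pattern.compile(
            "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
                    + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
                    + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

    // Hypothetical helper: collects the text of every match.
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = urlPattern.matcher(text);
        while (matcher.find()) {
            // start(1) skips the leading non-word character consumed by (?:^|[\W])
            urls.add(text.substring(matcher.start(1), matcher.end()));
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(findUrls("foo bar http://example.com baz"));
    }
}
```

Using start(1) rather than start() matters because the overall match may begin one character early, on the whitespace or punctuation in front of the URL.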

Solution 3

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Returns a list with all links contained in the input
 */
public static List<String> extractUrls(String text)
{
    List<String> containedUrls = new ArrayList<String>();
    String urlRegex = "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern pattern = Pattern.compile(urlRegex, Pattern.CASE_INSENSITIVE);
    Matcher urlMatcher = pattern.matcher(text);

    // Each successful find() advances to the next match; group 0 spans the whole match
    while (urlMatcher.find())
    {
        containedUrls.add(text.substring(urlMatcher.start(0),
                urlMatcher.end(0)));
    }

    return containedUrls;
}

Example:

List<String> extractedUrls = extractUrls("Welcome to https://stackoverflow.com/ and here is another link http://www.google.com/ \n which is a great search engine");

for (String url : extractedUrls)
{
    System.out.println(url);
}

Prints:

https://stackoverflow.com/
http://www.google.com/

Solution 4

Detecting URLs is not an easy task. If it's enough for you to get a string that starts with https?|ftp|file, then this could be fine. Your problem here is that you have a capturing group, the (), and it only surrounds the first part, http...

I would make this part a non-capturing group using (?:) and put parentheses around the whole thing.

"\\b((?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])"
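With group 1 now covering the whole URL, the replacement the asker wanted could be sketched roughly as follows. The shorten method here is a made-up placeholder (the question does not show the real one), and the short.example host and the class name UrlShortener are likewise invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlShortener {
    // Pattern from this answer: the scheme group is non-capturing, so group 1 is the whole URL.
    private static final Pattern URL = Pattern.compile(
            "\\b((?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])",
            Pattern.CASE_INSENSITIVE);

    // Placeholder for the asker's shorten(); a real version would call a shortening service.
    static String shorten(String url) {
        return "http://short.example/" + url.length();
    }

    // Rebuilds the text, swapping each matched URL for its shortened form.
    static String shortenAll(String text) {
        Matcher m = URL.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement stops '$' and '\' in the new URL from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(shorten(m.group(1))));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(shortenAll("go to http://example.com/page now"));
    }
}
```

Building the result with appendReplacement and appendTail avoids calling str.replace inside the loop, which would modify the string while the matcher is still scanning the old copy.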

Solution 5

With an extra pair of parentheses around the whole pattern (everything except the word boundary at the start), group 1 should match the whole domain name:

"\\b((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])"

I don't think that regex matches the whole URL, though.
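One way to check what each group holds is to print them for a sample input. In this sketch the class name NestedGroups, the helper groups and the sample text are assumptions; the pattern is the one shown above. Groups are numbered by their opening parenthesis, so the outer pair is group 1 and the scheme becomes group 2.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedGroups {
    // Pattern from this answer: an outer capturing group wraps the original scheme group.
    static final Pattern URL = Pattern.compile(
            "\\b((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns {group(1), group(2)} for the first match, or null if none.
    static String[] groups(String text) {
        Matcher m = URL.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] g = groups("mirror at ftp://files.example.org/pub.");
        System.out.println(g[0]); // outer group: ftp://files.example.org/pub
        System.out.println(g[1]); // inner group: ftp
    }
}
```

For this input the sentence-ending dot is left out, because the last character class in the pattern does not include a dot.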

Author: Shisoft

Studies at Donghua University. Founder of Shisoft, a web service integration team.

Updated on March 25, 2021

Comments

  • Shisoft
    Shisoft about 3 years

    This is an easy question, but I just don't get it. I want to detect URLs in a string and replace them with shortened ones.

    I found this expression on Stack Overflow, but the result is just http

    Pattern p = Pattern.compile("\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]",Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(str);
    boolean result = m.find();
    while (result) {
        for (int i = 1; i <= m.groupCount(); i++) {
            String url=m.group(i);
            str = str.replace(url, shorten(url));
        }
        result = m.find();
    }
    return html;

    Is there any better idea?

  • Christian Brüggemann
    Christian Brüggemann over 8 years
    This one unfortunately also matches a dot following the URL.
  • Thomas Wana
    Thomas Wana about 8 years
    Doesn't handle URLs in text correctly. Preceding whitespace is incorrectly handled (newlines swallowed), and accepts colons, dots etc after the URL.
  • Abdullah Khan
    Abdullah Khan over 7 years
    This works even for trailing commas and whitespace. Great.
  • 4gus71n
    4gus71n over 7 years
    Doesn't work with something like <a href="www.google.com">google link</a>. It returns "www.google.com
  • Steve Waring
    Steve Waring about 7 years
    Downvoted because there should be eight backslashes, not four. Putting them inside double quotes reduces the number of backslashes to four in the string. The regex interpretation of \\ as a match for a single \ then reduces the number to two, which is what you are trying to match. Also, you can use non-capturing groups, so (?://|\\\\)
  • Steve Waring
    Steve Waring about 7 years
    I just made the same mistake; I meant (?://|\\\\\\\\)
  • ed22
    ed22 over 6 years
    Doesn't work if the URL is in parentheses (www.myurl.com): it returns "www.myurl.com)"
  • BullyWiiPlaza
    BullyWiiPlaza over 5 years
    Updates in regards to what?
  • Jonathan Morales Vélez
    Jonathan Morales Vélez over 5 years
    It doesn't work when the string contains \n: Sources:\nhttps://sites.google.com/view/kgssourcesbeauty/startseite\n is not recognized as a link
  • parsecer
    parsecer over 4 years
    Big thank you. Your answer is a life-saver for regex newbies like myself.
  • mwarren
    mwarren over 2 years
    The LinkedIn url-detector code did not work for me. It extracted a URL from a bunch of Russian words where a full stop at the end of a sentence was followed immediately by the beginning of the next sentence, with no space after the dot. Here is what it found: дней.Не. It then added http:// in front, which wasn't there in the text. At the very least it should exclude non-Latin characters, right?