Detect and extract URLs from a string?
Solution 1
m.group(1) returns the first capturing group, that is, whatever the first pair of capturing parentheses matched. In your pattern that is (https?|ftp|file), which is why you only get the scheme.
Check m.group(0), which holds the entire match, or wrap the whole pattern in parentheses and keep using m.group(1).
To get the next URL, call find() again and read the groups of the new match.
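As a rough illustration of the difference, here is a minimal sketch that runs the pattern from the question over a sample string and collects group(0) and group(1) on each find(). The class name, method names, and sample text are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Pattern from the question: the only capturing group is the scheme.
    static final Pattern URL = Pattern.compile(
        "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]",
        Pattern.CASE_INSENSITIVE);

    // Collects group(0), the entire match, for every successive find().
    static List<String> wholeMatches(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            result.add(m.group(0));
        }
        return result;
    }

    // Collects group(1), which for this pattern is only the scheme.
    static List<String> firstGroups(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "see http://example.com/a and ftp://example.org/b";
        System.out.println(wholeMatches(text)); // [http://example.com/a, ftp://example.org/b]
        System.out.println(firstGroups(text));  // [http, ftp]
    }
}
```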
Solution 2
Let me preface this by saying that I'm not a huge advocate of regex for complex cases. Writing the perfect expression for something like this is very difficult. That said, I do happen to have one for detecting URLs, and it's backed by a 350-line unit test class that passes. Someone started with a simple regex, and over the years we've grown the expression and the test cases to handle the issues we've found. It's definitely not trivial:
// Pattern for recognizing a URL, based off RFC 3986
private static final Pattern urlPattern = Pattern.compile(
"(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
+ "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
+ "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
Here's an example of using it:
Matcher matcher = urlPattern.matcher("foo bar http://example.com baz");
while (matcher.find()) {
int matchStart = matcher.start(1);
int matchEnd = matcher.end();
// now you have the offsets of a URL match
}
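To turn those offsets into the URL text, one option is to take the substring from start(1) to end(). The sketch below assumes that approach; it starts at group 1 (the scheme or www. prefix) so the non-word character matched just before the URL is left out. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetDemo {
    // Same pattern and flags as above.
    static final Pattern URL_PATTERN = Pattern.compile(
        "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
            + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
            + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

    // Returns the text between start(1) and end() for every match.
    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            // start(1) skips the leading non-word character, if any.
            urls.add(text.substring(matcher.start(1), matcher.end()));
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(extract("foo bar http://example.com baz")); // [http://example.com]
    }
}
```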
Solution 3
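import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Returns a list of all links contained in the input.
 */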
public static List<String> extractUrls(String text)
{
List<String> containedUrls = new ArrayList<String>();
String urlRegex = "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
Pattern pattern = Pattern.compile(urlRegex, Pattern.CASE_INSENSITIVE);
Matcher urlMatcher = pattern.matcher(text);
while (urlMatcher.find())
{
containedUrls.add(urlMatcher.group());
}
return containedUrls;
}
Example:
List<String> extractedUrls = extractUrls("Welcome to https://stackoverflow.com/ and here is another link http://www.google.com/ \n which is a great search engine");
for (String url : extractedUrls)
{
System.out.println(url);
}
Prints:
https://stackoverflow.com/
http://www.google.com/
Solution 4
Detecting URLs is not an easy task. If it's enough for you to get a string that starts with https?|ftp|file, then this could be fine. Your problem here is that you have a capturing group, the (), and it only surrounds the first part, http...
I would make this part a non-capturing group using (?:) and put parentheses around the whole thing.
"\\b((?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])"
Solution 5
With some extra parentheses around the whole thing (except the word boundary at the start), it should match the whole domain name:
"\\b((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])"
I don't think that regex matches the whole URL though.
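One way to check what the outer group captures, and to do the replacement the question asks for, is sketched below. It uses the pattern above, where group(1) is the outer group and group(2) is the scheme, and rebuilds the string with appendReplacement instead of calling str.replace in a loop. The shortener parameter is a hypothetical stand-in for the asker's shorten() method, and the class and method names are invented for this example.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeUrlGroupDemo {
    // Group 1 wraps everything after the word boundary; group 2 is the scheme.
    static final Pattern URL = Pattern.compile(
        "\\b((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])",
        Pattern.CASE_INSENSITIVE);

    // Replaces every match with shortener.apply(group(1)).
    static String replaceUrls(String text, UnaryOperator<String> shortener) {
        Matcher m = URL.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = shortener.apply(m.group(1));
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "go to http://example.com/page, then ftp://example.org.";
        // Wrap each URL in angle brackets to make group(1) visible.
        System.out.println(replaceUrls(text, u -> "<" + u + ">"));
        // go to <http://example.com/page>, then <ftp://example.org>.
    }
}
```

In this sample the trailing comma and full stop stay outside the match, because the last character class in the pattern excludes them.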
Shisoft
Updated on March 25, 2021
Comments
-
Shisoft about 3 years
This is an easy question, but I just don't get it. I want to detect URLs in a string and replace them with shortened ones.
I found this expression on Stack Overflow, but the result is just
http
Pattern p = Pattern.compile("\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(str);
boolean result = m.find();
while (result) {
    for (int i = 1; i <= m.groupCount(); i++) {
        String url = m.group(i);
        str = str.replace(url, shorten(url));
    }
    result = m.find();
}
return html;
Is there any better idea?
-
Christian Brüggemann over 8 years
This one unfortunately also matches a dot following the URL.
-
Thomas Wana about 8 years
Doesn't handle URLs in text correctly. Preceding whitespace is incorrectly handled (newlines swallowed), and it accepts colons, dots, etc. after the URL.
-
Abdullah Khan over 7 years
This works even for trailing commas and whitespace. Great.
-
4gus71n over 7 years
Doesn't work with something like
<a href="www.google.com">google link</a>
It returns "www.google.com
-
Steve Waring about 7 years
Downvoted because there should be eight backslashes, not four. Putting them inside double quotes reduces the number of backslashes to four in the string. The regex interpretation of \\ as a single \ reduces the number to two, which is what you are trying to match. Also, you can use non-capturing groups, so
(?://|\\\\)
-
Steve Waring about 7 years
I just made the same mistake; I meant
(?://|\\\\\\\\)
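The backslash counting described here can be checked with a small sketch: a Java string literal halves the backslashes once, and the regex engine halves them again. The class and method names are invented for this example.

```java
import java.util.regex.Pattern;

class BackslashDemo {
    // Source "\\\\" is the regex \\ (two characters), which matches ONE backslash.
    static final Pattern ONE = Pattern.compile("\\\\");
    // Source "\\\\\\\\" is the regex \\\\ (four characters), which matches TWO backslashes.
    static final Pattern TWO = Pattern.compile("\\\\\\\\");

    // True if the whole input is exactly one backslash.
    static boolean matchesOne(String s) {
        return ONE.matcher(s).matches();
    }

    // True if the whole input is exactly two backslashes.
    static boolean matchesTwo(String s) {
        return TWO.matcher(s).matches();
    }

    public static void main(String[] args) {
        String single = "\\";     // one backslash character
        String doubled = "\\\\";  // two backslash characters
        System.out.println(matchesOne(single));  // true
        System.out.println(matchesOne(doubled)); // false
        System.out.println(matchesTwo(doubled)); // true
    }
}
```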
-
ed22 over 6 years
Doesn't work if the URL is in parentheses, e.g. (www.myurl.com): it returns "www.myurl.com)"
-
BullyWiiPlaza over 5 years
Updates in regards to what?
-
Jonathan Morales Vélez over 5 years
It doesn't work when the string contains \n. The URL in
Sources:\nhttps://sites.google.com/view/kgssourcesbeauty/startseite\n
is not recognized as a link.
-
parsecer over 4 years
Big thank you. Your answer is a life-saver for regex newbies like myself.
-
mwarren over 2 years
The LinkedIn url-detector code did not work for me. It extracted a URL from a run of Russian words where a full stop at the end of one sentence was followed immediately by the start of the next, with no space after the dot. Here is what it found: дней.Не. It then added http:// in front, which wasn't there in the text. At the very least it should exclude non-Latin characters, right?