JavaScript: extract URLs from string (inc. querystring) and return array

14,515

Solution 1

I just use URI.js -- makes it easy.

var source = "Hello www.example.com,\n"
    + "http://google.com is a search engine, like http://www.bing.com\n"
    + "http://exämple.org/foo.html?baz=la#bumm is an IDN URL,\n"
    + "http://123.123.123.123/foo.html is IPv4 and "
    + "http://fe80:0000:0000:0000:0204:61ff:fe9d:f156/foobar.html is IPv6.\n"
    + "links can also be in parens (http://example.org) "
    + "or quotes »http://example.org«.";

var result = URI.withinString(source, function(url) {
    return "<a>" + url + "</a>";
});

/* result is:
Hello <a>www.example.com</a>,
<a>http://google.com</a> is a search engine, like <a>http://www.bing.com</a>
<a>http://exämple.org/foo.html?baz=la#bumm</a> is an IDN URL,
<a>http://123.123.123.123/foo.html</a> is IPv4 and <a>http://fe80:0000:0000:0000:0204:61ff:fe9d:f156/foobar.html</a> is IPv6.
links can also be in parens (<a>http://example.org</a>) or quotes »<a>http://example.org</a>«.
*/

Solution 2

You could use the regex from URI.js:

// gruber revised expression - http://rodneyrehm.de/t/url-regex.html
var uri_pattern = /\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/ig;

String#match and/or String#replace may help…
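As a minimal sketch of the String#match route, the pattern above can be wrapped in a function that always returns an array. The name findUrls is borrowed from the question; returning an empty array instead of false is an assumption made here for convenience. The pattern can backtrack heavily on some inputs, so treat this as an illustration, not a hardened implementation.

```javascript
// gruber revised expression, copied unchanged from the answer above
var uri_pattern = /\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/ig;

// With the g flag, String#match returns every full match, or null when none is found.
function findUrls(searchText) {
    return searchText.match(uri_pattern) || [];
}

console.log(findUrls("see http://google.com and www.example.com, ok"));
```

The trailing comma after www.example.com is left out of the match because the pattern refuses to end a URL on common punctuation.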

Solution 3

The following regular expression extracts URLs from a string (including the query string) and returns them as an array:

var url = "asdasdla hakjsdh aaskjdh https://www.google.com/search?q=add+a+element+to+dom+tree&oq=add+a+element+to+dom+tree&aqs=chrome..69i57.7462j1j1&sourceid=chrome&ie=UTF-8 askndajk nakjsdn aksjdnakjsdnkjsn";

// A single pattern with an optional second colon covers both http:// and http:://
var matches = url.match(/\bhttps?::?\/\/\S+/gi);

Output:

["https://www.google.com/search?q=format+to+6+digir&…s=chrome..69i57.5983j1j1&sourceid=chrome&ie=UTF-8"]

Note: the pattern matches both http:// (single colon) and the malformed http::// (double colon), and likewise for https.
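A small sketch of the same idea as a reusable function. The name extractHttpUrls is hypothetical, and the single pattern with an optional second colon is one way to cover both spellings. It only finds URLs that start with http or https.

```javascript
// Hypothetical helper: returns every http(s) URL in the text, query string included.
function extractHttpUrls(text) {
    // "::?" accepts one or two colons; "\S+" runs to the next whitespace character.
    return text.match(/\bhttps?::?\/\/\S+/gi) || [];
}

var sample = "a https://www.google.com/search?q=add+a+element&ie=UTF-8 b http:://example.org/x?y=1 c";
console.log(extractHttpUrls(sample));
```

Because \S+ stops only at whitespace, punctuation that directly follows a URL (for example a sentence-ending period) ends up inside the match, so some post-processing may be needed.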

Solution 4

Try this:

var expression = /[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?/gi;

You can use this website to test the regexp: http://gskinner.com/RegExr/
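For illustration, here is how that expression might be used to pull out URLs with or without a scheme. The function name findLooseUrls is hypothetical. As the question expects, loose matching also picks up things like will.i.am.

```javascript
// Expression copied unchanged from the answer above
var expression = /[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?/gi;

// Hypothetical helper: returns all matches as an array (empty when nothing matches).
function findLooseUrls(text) {
    return text.match(expression) || [];
}

console.log(findLooseUrls("visit google.com or http://www.google.com/search?q=a now, says will.i.am"));
```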

Author by

SW4

Updated on June 05, 2022

Comments

  • SW4
    SW4 almost 2 years

    I know this has been asked a thousand times before (apologies), but searching SO/Google etc I am yet to get a conclusive answer.

    Basically, I need a JS function which when passed a string, identifies & extracts all URLs based on a regex, returning an array of all found. e.g:

    function findUrls(searchText){
        var regex=???
        var result = searchText.match(regex);
        return result || false;
    }
    

    The function should be able to detect and return any potential URLs. I am aware of the inherent difficulties/issues with this (closing parentheses etc.), so I have a feeling the process needs to be:

    Split the string (searchText) into distinct sections, each starting and ending with either nothing, a space or a carriage return, resulting in distinct content chunks, e.g. do a split.

    For each content chunk that results from the split, see whether it fits the logic for a URL of any construction, namely, does it contain a period immediately followed by text (the one constant rule for qualifying a potential URL).

    The regex should see whether the period is immediately followed by other text, of the type allowable for a tld, directory structure & query string, and preceded by text of the allowable type for a URL.

    I am aware false positives may result, however any returned values will then be checked with a call to the URL itself, so this can be ignored. The other functions I have found often don't return the URL's query string too, if present.

    From a block of text, the function should thus be able to return any type of URL, even if it means identifying will.i.am as a valid one!

    eg. http://www.google.com, google.com, www.google.com, http://google.com, ftp.google.com, https:// etc...and any derivation thereof with a query string should be returned...
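The split-then-test process described above could be sketched roughly as follows. The function name findUrlCandidates, the punctuation that gets stripped, and the "period followed by at least two letters" test are all assumptions chosen for illustration; false positives are expected, as the question allows.

```javascript
// Hypothetical sketch: split on whitespace, trim wrapping punctuation,
// keep chunks that contain a period preceded by text and followed by letters.
function findUrlCandidates(searchText) {
    return searchText
        .split(/\s+/)
        .map(function (chunk) {
            // Drop opening brackets/quotes at the start and closing punctuation at the end.
            return chunk.replace(/^[("'<\[]+|[)"'>\],.;:!?]+$/g, "");
        })
        .filter(function (chunk) {
            return /[a-z0-9]\.[a-z]{2,}/i.test(chunk);
        });
}

console.log(findUrlCandidates("Try google.com, (http://www.google.com/search?q=js) or will.i.am today."));
```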

    Many thanks, apologies again if this exists elsewhere on SO but my searches haven't returned it.

  • rodneyrehm
    rodneyrehm over 9 years
    Note that using a regex - this one in particular - can cause problems ("catastrophic backtracking") - github.com/medialize/URI.js/issues/131 - I'd go with @chovy's answer and use URI.withinString()
  • Martijn Hols
    Martijn Hols almost 6 years
    The regex in this answer is vulnerable to ReDoS from strings such as "[https://stackoverflow.com/questions/11209016/javascript-extract-urls-from-string-inc-querystring-and-return-array/11209098#11209098](https://stackoverflow.com/questions/11209016/javascript-extract-urls-from-string-inc-querystring-and-return-array/11209098#11209098)"