How can I extract/parse a complete URL from a semi random string?
Solution 1
Did you try:
egrep -o 'https?://[^ ]+' foo_output
instead?
Note that anything within a character class is taken literally, so saying `[\w]` doesn't match a word character. Moreover, you don't need to escape a regex metacharacter within a character class, i.e., saying `[\.]` isn't quite the same as `[.]`.
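As a quick check, here is that command run over a hypothetical `foo_output` built from the question's own examples (the file name and contents are just the question's samples):

```shell
# Create a sample input file from the question's examples
printf '%s\n' 'bob, the address is http://www.google.com' \
              'Stats are up: https://foo1234.net/report.jpg' > foo_output

# -o prints only the matching part of each line
egrep -o 'https?://[^ ]+' foo_output
# http://www.google.com
# https://foo1234.net/report.jpg
```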
Solution 2
URIs aren't well-suited for regular expression matching when embedded in natural language. However, the current state of the art is John Gruber's Improved Liberal, Accurate Regex Pattern for Matching URLs. As currently posted, the one-line version is as follows:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
John also appears to maintain a gist here, although his blog entry does a much better job of explaining his test corpus and the limitations of the regular expression pattern.
If you want to implement the expression from the command line, you may find yourself limited by the regular expression engine you're using or by shell quoting issues. I've found a Ruby script to be the best option, but your mileage may vary.
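One shell-side workaround for the quoting problem (a sketch of mine, not something from Gruber's post) is to keep the pattern in a file and hand it to `grep -P` with `-f`. The short pattern below is a simplified stand-in; you would replace it with Gruber's full one-liner:

```shell
# A quoted heredoc ('EOF') writes the pattern to the file verbatim: the
# shell performs no expansion, so no regex metacharacter needs escaping.
cat > url.pat <<'EOF'
(?i)\bhttps?://[^\s()<>"]+
EOF

# -P selects PCRE, -f reads the pattern from the file, -o prints matches only
grep -oP -f url.pat input.txt
```

Note that GNU grep's `-P` accepts only a single pattern, so `url.pat` must contain exactly one line.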
Solution 3
The problem with matching URLs is that just about anything can be in a URL:
https://encrypted.google.com/search?hl=en&q=foo#hl=en&q=foo&tbs=qdr:w,sbd:1
As you can see, the (valid) URL above contains `?`, `#`, `&`, `,`, `.` and `:`. Basically, the only thing you can be sure a URL does not contain is a blank space. With that in mind, you could extract your URLs with as simple a pattern as:
$ grep -oP 'https?://\S+' file
http://www.google.com
https://foo.com/category/example.html
http://bit.ly/~1223456677878
https://foo1234.net/report.jpg
The `\S` matches any non-space character in Perl-compatible regular expressions (PCREs), `-P` activates PCREs for `grep`, and `-o` makes it print only the matched segment of the line.
Mike B
Updated on September 18, 2022

Comments
- Mike B almost 2 years: I'd like to have bash parse/extract a full URL (and only the URL) from a random short string. Examples:
  - bob, the address is http://www.google.com
  - https://foo.com/category/example.html is up
  - Error 123 occurred at http://bit.ly/~1223456677878
  - Stats are up: https://foo1234.net/report.jpg

  I tried using `cat foo_output | egrep -o "https?://[\w'-\.]*\s"` but that didn't seem to work.
- Stéphane Chazelas over 10 years: `[^ ]` is too wide; you'll want to exclude other blanks, `(`, `)`, possibly commas, and all the characters that are not allowed in URLs.
- devnull over 10 years: @StephaneChazelas You're right. However, I assumed that the URL is preceded and followed by a space unless at the beginning or the end of the line.
- terdon over 10 years: Please include the regex in your answer instead of linking to it.
- vonbrand over 10 years: @terdon, the full regexp is some 60 lines.
- terdon over 10 years: @vonbrand I know, I saw it. We just tend to avoid linking to external resources. The whole point of the SE sites is to be a wiki. What if the blog you linked to goes offline? Your answer will become useless. Anyway, 60 lines is not that much and it is only 60 lines for readability.
- Jeff Schaller about 8 years: If you have an improvement over an existing answer, you can refer back to it via the "share" link under that answer. See also the help pages.
- chovy over 2 years: this shows `"` at the end of URLs.
- chovy over 2 years: need to add `"` to the ignore list.
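Per the comments above, one way to keep a trailing `"` out of the match (the exact character class here is my own assumption, not part of the original answer) is to add it to the excluded characters:

```shell
# Sample line where the URL sits inside an HTML attribute
printf '%s\n' 'click <a href="https://foo.com/x">here</a>' > file

# Excluding double quotes (and < >) as well as whitespace stops the match
# before the attribute's closing quote
grep -oP 'https?://[^\s"<>]+' file
# https://foo.com/x
```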