How can I extract/parse a complete URL from a semi random string?


Solution 1

Did you try:

egrep -o 'https?://[^ ]+' foo_output

instead?

Note that anything inside a character class is taken literally, so saying [\w] doesn't match a word character. Moreover, you don't need to escape a regex metacharacter within a character class, i.e., saying [\.] isn't quite the same as [.].
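For instance, a quick check against two of the sample strings from the question (the foo_output filename comes from the question; the setup lines here are just for the demo):

```shell
# Build a sample foo_output from two of the question's example strings
printf '%s\n' \
  'bob, the address is http://www.google.com' \
  'Stats are up: https://foo1234.net/report.jpg' > foo_output

# Extract everything from http:// or https:// up to the next space
egrep -o 'https?://[^ ]+' foo_output
# http://www.google.com
# https://foo1234.net/report.jpg
```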

Solution 2

URIs aren't well-suited for regular expression matching when embedded in natural language. However, the current state of the art is John Gruber's Improved Liberal, Accurate Regex Pattern for Matching URLs. As currently posted, the one-line version is as follows:

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

John also appears to maintain a gist with the pattern, although his blog entry does a much better job of explaining his test corpus and the limitations of the regular expression pattern.

If you want to implement the expression from the command line, you may find yourself limited by the regular expression engine you're using or by shell quoting issues. I've found a Ruby script to be the best option, but your mileage may vary.
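One way around the quoting problem, assuming GNU grep with PCRE support, is to keep the pattern in a file and pass it with -f, so the shell never parses the quotes and backslashes inside it. A minimal sketch, using a deliberately simplified pattern rather than Gruber's full expression:

```shell
# Store the pattern in a file so the shell never touches its quoting.
# This is a deliberately simplified pattern, not Gruber's full one.
cat > url_pattern.txt <<'EOF'
https?://[^\s"<>]+
EOF

# GNU grep: -P selects PCRE, -o prints only the match, -f reads the pattern from a file
grep -oPf url_pattern.txt foo_output
```

Note that GNU grep's -P only accepts a single pattern, so the pattern file must contain exactly one line.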

Solution 3

The problem with matching URLs is that just about anything can be in a URL:

https://encrypted.google.com/search?hl=en&q=foo#hl=en&q=foo&tbs=qdr:w,sbd:1

As you can see, the (valid) URL above contains ?, #, &, commas, dots and colons. Basically, the only thing you can be sure a URL does not contain is a blank space. With that in mind, you could extract your URLs with as simple a pattern as:

$ grep -oP 'https?://\S+' file
http://www.google.com
https://foo.com/category/example.html
http://bit.ly/~1223456677878
https://foo1234.net/report.jpg

In Perl-compatible regular expressions (PCRE), \S matches any non-space character; the -P flag activates PCRE for grep and -o makes it print only the matched segment of the line.
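As the comments below point out, \S+ is greedy enough to drag along trailing punctuation such as quotes, commas, or closing parentheses. A rough post-filter (a heuristic, not a complete fix) strips the usual offenders:

```shell
# Extract URLs, then trim trailing quotes, commas, dots and closing parens
printf 'go to (http://example.com/x), "http://example.com/y" now\n' |
  grep -oP 'https?://\S+' |
  sed 's/[",.)]*$//'
# http://example.com/x
# http://example.com/y
```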

Mike B

Updated on September 18, 2022

Comments

  • Mike B
    Mike B almost 2 years

    I'd like to have bash parse/extract a full URL (and only the url) from a random short string.

    Examples:

    bob, the address is http://www.google.com
    

    or

    https://foo.com/category/example.html is up
    

    or

    Error 123 occurred at http://bit.ly/~1223456677878
    

    or

    Stats are up: https://foo1234.net/report.jpg
    

    I tried using cat foo_output | egrep -o "https?://[\w'-\.]*\s" but that didn't seem to work.

  • Stéphane Chazelas
    Stéphane Chazelas over 10 years
[^ ] is too wide; you'll want to exclude other blanks, (, ), possibly commas, and all the characters that are not allowed in URLs.
  • devnull
    devnull over 10 years
    @StephaneChazelas You're right. However, I assumed that the URL is preceded and followed by a space unless at the beginning or the end of line.
  • terdon
    terdon over 10 years
    Please include the regex in your answer instead of linking to it.
  • vonbrand
    vonbrand over 10 years
    @terdon, the full regexp is some 60 lines.
  • terdon
    terdon over 10 years
    @vonbrand I know, I saw it. We just tend to avoid linking to external resources. The whole point of the SE sites is to be a wiki. What if the blog you linked to goes offline? Your answer will become useless. Anyway, 60 lines is not that much and it is only 60 lines for readability.
  • Jeff Schaller
    Jeff Schaller about 8 years
If you have an improvement over an existing answer, you can refer back to it via the "share" link under that answer. See also the help pages.
  • chovy
    chovy over 2 years
    this shows " at end of urls
  • chovy
    chovy over 2 years
    need to add " to ignore list.