How can I extract/parse a complete URL from a semi random string?


Solution 1

Did you try:

egrep -o 'https?://[^ ]+' foo_output

instead?

Note that anything inside a character class is taken literally, so saying [\w] doesn't match a word character. Moreover, you don't need to escape a regex metacharacter within a character class, i.e., saying [\.] isn't quite the same as [.].
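For instance, a quick check against two of the sample strings from the question (the foo_output filename comes from the question; the setup lines here are just for the demo):

```shell
# Build a sample foo_output from two of the question's example strings
printf '%s\n' \
  'bob, the address is http://www.google.com' \
  'Stats are up: https://foo1234.net/report.jpg' > foo_output

# Extract everything from http:// or https:// up to the next space
egrep -o 'https?://[^ ]+' foo_output
# http://www.google.com
# https://foo1234.net/report.jpg
```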

Solution 2

URIs aren't well-suited for regular expression matching when embedded in natural language. However, the current state of the art is John Gruber's Improved Liberal, Accurate Regex Pattern for Matching URLs. As currently posted, the one-line version is as follows:

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

John also appears to maintain a gist with the pattern, although his blog entry does a much better job of explaining his test corpus and the limitations of the regular expression pattern.

If you want to implement the expression from the command line, you may find yourself limited by the regular expression engine you're using or by shell quoting issues. I've found a Ruby script to be the best option, but your mileage may vary.
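One way around the quoting problem, assuming GNU grep with PCRE support, is to keep the pattern in a file and pass it with -f, so the shell never parses the quotes and backslashes inside it. A minimal sketch, using a deliberately simplified pattern rather than Gruber's full expression:

```shell
# Store the pattern in a file so the shell never touches its quoting.
# This is a deliberately simplified pattern, not Gruber's full one.
cat > url_pattern.txt <<'EOF'
https?://[^\s"<>]+
EOF

# GNU grep: -P selects PCRE, -o prints only the match, -f reads the pattern from a file
grep -oPf url_pattern.txt foo_output
```

Note that GNU grep's -P only accepts a single pattern, so the pattern file must contain exactly one line.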

Solution 3

The problem with matching URLs is that just about anything can be in a URL:

https://encrypted.google.com/search?hl=en&q=foo#hl=en&q=foo&tbs=qdr:w,sbd:1

As you can see, the (valid) URL above contains ?, #, &, commas, dots and colons. Basically, the only thing you can be sure a URL does not contain is a blank space. With that in mind, you could extract your URLs with as simple a pattern as:

$ grep -oP 'https?://\S+' file
http://www.google.com
https://foo.com/category/example.html
http://bit.ly/~1223456677878
https://foo1234.net/report.jpg

In Perl-compatible regular expressions (PCRE), \S matches any non-space character; the -P flag activates PCRE for grep and -o makes it print only the matched segment of the line.
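As the comments below point out, \S+ is greedy enough to drag along trailing punctuation such as quotes, commas, or closing parentheses. A rough post-filter (a heuristic, not a complete fix) strips the usual offenders:

```shell
# Extract URLs, then trim trailing quotes, commas, dots and closing parens
printf 'go to (http://example.com/x), "http://example.com/y" now\n' |
  grep -oP 'https?://\S+' |
  sed 's/[",.)]*$//'
# http://example.com/x
# http://example.com/y
```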

Mike B

Updated on September 18, 2022

Comments

  • Mike B
    Mike B almost 2 years

    I'd like to have bash parse/extract a full URL (and only the url) from a random short string.

    Examples:

    bob, the address is http://www.google.com
    

    or

    https://foo.com/category/example.html is up
    

    or

    Error 123 occurred at http://bit.ly/~1223456677878
    

    or

    Stats are up: https://foo1234.net/report.jpg
    

    I tried using cat foo_output | egrep -o "https?://[\w'-\.]*\s" but that didn't seem to work.

  • Stéphane Chazelas
    Stéphane Chazelas over 10 years
[^ ] is too wide; you'll want to exclude other blanks, (, ), possibly commas, and all the characters that are not allowed in URLs.
  • devnull
    devnull over 10 years
    @StephaneChazelas You're right. However, I assumed that the URL is preceded and followed by a space unless at the beginning or the end of line.
  • terdon
    terdon over 10 years
    Please include the regex in your answer instead of linking to it.
  • vonbrand
    vonbrand over 10 years
    @terdon, the full regexp is some 60 lines.
  • terdon
    terdon over 10 years
    @vonbrand I know, I saw it. We just tend to avoid linking to external resources. The whole point of the SE sites is to be a wiki. What if the blog you linked to goes offline? Your answer will become useless. Anyway, 60 lines is not that much and it is only 60 lines for readability.
  • Jeff Schaller
    Jeff Schaller about 8 years
If you have an improvement over an existing answer, you can refer back to it via the "share" link under that answer. See also the help pages.
  • chovy
    chovy over 2 years
    this shows " at end of urls
  • chovy
    chovy over 2 years
    need to add " to ignore list.