Extracting URLs using regular expressions (re) - string matching - Python

re.findall(r'https?://[^\s<>"]+|www\.[^\s<>"]+', str(STRING))

The [^\s<>"]+ part matches one or more characters that are not whitespace, a double quote, or an angle bracket, so the match stops before any surrounding markup in strings like:

<a href="http://www.example.com/stuff">
http://www.example.com/stuff</br>
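
As a minimal sketch of how the pattern above behaves, the snippet below runs it over a made-up sample string that combines the answer's first example with the problem strings from the question. The names `URL_PATTERN`, `extract_urls`, and `sample` are illustrative assumptions, not part of the original answer.

```python
import re

# Pattern from the answer: a URL starts with http://, https://, or www.
# and continues until whitespace, a double quote, or an angle bracket.
URL_PATTERN = re.compile(r'https?://[^\s<>"]+|www\.[^\s<>"]+')


def extract_urls(text):
    """Return every URL-like substring found in text (hypothetical helper)."""
    return URL_PATTERN.findall(text)


# Sample input assembled for illustration only.
sample = (
    '<a href="http://www.example.com/stuff">link</a> '
    'http://www.website.com/science/</span></a><o:p></o:p> '
    'www.website.com/library/</span></a></span></i><span '
    'http://awebsite.com/Groups</a><div>'
)

urls = extract_urls(sample)
for url in urls:
    print(url)
```

Each match ends at the first `"` or `<`, so the trailing tags are dropped. Note that punctuation such as a sentence-ending period directly after a URL would still be included in the match, since `.` is not in the excluded set.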
Author by

Eternity

Developer

Updated on June 26, 2022

Comments

  • Eternity
    Eternity almost 2 years

    I've been trying to extract URLs from a text file using the re module: any link that starts with http://, https://, or www.

    The file contains plain text as well as HTML source code. The HTML part is easy because I can extract the links using BeautifulSoup, but the plain text is more challenging. I found the pattern below online, and it seems to be the best implementation of URL extraction, but it fails around certain tags: it can't handle them and includes them in the URL. Any help is appreciated, because I'm not familiar with string matching at all.

    Here is the code:

    sp1=re.findall("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", str(STRING))
    sp2=re.findall('www.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', str(STRING))
    

    examples:

    http://www.website.com/science/</span></a><o:p></o:p></span></div><div
    www.website.com/library/</span></a></span></i><span
    http://awebsite.com/Groups</a><div>
    
  • Eternity
    Eternity about 12 years
    Awesome, works like a champ :). Thanks, mate.
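
A likely reason the pattern in the question swallows tags is the character class `[$-_@.&+]`: inside a class, `$-_` is read as a range from `$` (0x24) to `_` (0x5F), which happens to include `<`, `>`, `/`, and `:`. The sketch below checks this directly; `QUESTION_CLASS`, `LITERAL_CLASS`, and `matched` are names made up for illustration.

```python
import re

# Character class copied from the question's pattern.
# "$-_" is interpreted as a range from "$" (0x24) to "_" (0x5F).
QUESTION_CLASS = re.compile(r'[$-_@.&+]')

# Same class with the hyphen escaped, so it is a literal "-" (illustrative).
LITERAL_CLASS = re.compile(r'[$\-_@.&+]')

# Check which markup-related characters each class accepts.
matched = {}
for ch in '<>/"':
    matched[ch] = (
        bool(QUESTION_CLASS.fullmatch(ch)),
        bool(LITERAL_CLASS.fullmatch(ch)),
    )
    print(repr(ch), matched[ch])
```

With the range in place, `<`, `>`, and `/` are all accepted, so `</span>` is consumed as part of the URL. Escaping the hyphen stops the tags from matching, but it also stops `/` and `:` from matching, which the original pattern relied on for paths. Excluding the unwanted characters, as the accepted pattern does, avoids that problem. Separately, the unescaped `.` in `www.` matches any character, not just a dot.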