Get all URLs in a string with PHP

    /**
     * Get URLs from a string (the string may itself be a URL).
     *
     * @param string $string
     * @return array
     */
    function getUrls($string) {
        // Match http:// or https://, then everything up to the next double quote or space.
        $regex = '/https?:\/\/[^" ]+/i';
        preg_match_all($regex, $string, $matches);
        return $matches[0];
    }
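
A minimal usage sketch for the function above, repeated here so the snippet runs on its own. The sample text is an assumption: it reuses one link from the question and adds a hypothetical quoted `https://example.com/page` link to show that matching stops at a double quote.

```php
<?php
// Same logic as getUrls() above, repeated so this snippet is self-contained.
function getUrls($string) {
    $regex = '/https?:\/\/[^" ]+/i';
    preg_match_all($regex, $string, $matches);
    return $matches[0];
}

// Hypothetical sample text: one link from the question, one quoted example link.
$text = 'Some random text up here '
      . 'http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tickets-for-disrupt-sf/ '
      . 'and a quoted link "https://example.com/page" at the end';

$urls = getUrls($text);
print_r($urls);
```

Each match ends at the first space or double quote, so trailing punctuation such as a comma or period would still be captured as part of the URL.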
Author by

Bill

Updated on June 15, 2022

Comments

  • Bill
    Bill almost 2 years

    I'm trying to figure out a way to get an array of URLs from a string of text. The text will be somewhat formatted like this:

    Some random text up here

    http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphones-bezel-a-massive-notification-light/?grcc=88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2=835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033fdeed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~

    http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tickets-for-disrupt-sf/

    Obviously, those links can be anything, and there can be many of them; those are just the ones I'm testing with now. If I use a simple URL, my regex works fine.

    I am using:

    preg_match_all('((https?|ftp|gopher|telnet|file|notes|ms-help):'.
        '((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)',
        $bodyMessage, $matches, PREG_PATTERN_ORDER);
    

    When I run print_r($matches), the result I get is:

    Array ( [0] => Array (
        [0] => http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphon=
        [1] => http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tick= 
        [2] => http://techcrunch.co=
        [3] => http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-ip= 
        [4] => http://techcrunch.com/2012/07/20/last-day-to-purc=
        [5] => http://tec=
    )
    ...
    

    None of the items in that array is a full link from the text above.

    Anyone know of a good way to get what I need? I've found a bunch of regex stuff to get links for PHP, but none of it works.

    Thanks!

    Edit:

    OK, so I'm pulling these links from an email. The script parses the email, grabs the body of the message, and then tries to grab the links from that. After investigating the email, it appears that something is adding a space in the middle of each URL. Here is the body of the message as seen by my PHP script:

     --00248c711bb99ca36d04c54ba5c6 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphon= es-bezel-a-massive-notification-light/?grcc=3D88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2= =3D835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033f= deed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~ http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tick= ets-for-disrupt-sf/ --00248c711bb99ca36d04c54ba5c6 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable 
    

    Any suggestions on how to keep it from breaking the URLs?

    EDIT 2

    As per Laurnet's suggestion, I ran this code:

     $bodyMessage = str_replace("= ", "", $bodyMessage);
    

    However, when I echo that out, it doesn't seem to replace the "= ":

     --00248c711bb99ca36d04c54ba5c6 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphon= es-bezel-a-massive-notification-light/?grcc=3D88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2= =3D835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033f= deed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~ http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tick= ets-for-disrupt-sf/ --00248c711bb99ca36d04c54ba5c6 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable 
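
The `Content-Transfer-Encoding: quoted-printable` header in that output suggests a likely cause: in quoted-printable text, a trailing `=` marks a soft line break and `=3D` stands for a literal `=`. What is echoed as "= " is then probably `=` followed by a CRLF, which would explain why `str_replace("= ", ...)` finds nothing. Below is a minimal sketch, assuming the script has already isolated the text/plain part; `$rawBody` is a hypothetical, shortened stand-in for that part, built from the sample above with explicit `\r\n` line endings.

```php
<?php
// Hypothetical, shortened copy of the quoted-printable part shown above.
// A trailing "=" before CRLF is a soft line break; "=3D" encodes "=".
$rawBody = "http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphon=\r\n"
         . "es-bezel-a-massive-notification-light/?grcc=3D88888Z0ZwdgtZ0Z0Z0Z0Z0\r\n"
         . "http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tick=\r\n"
         . "ets-for-disrupt-sf/\r\n";

// Undo the transfer encoding first: soft breaks are removed, "=3D" becomes "=".
$decoded = quoted_printable_decode($rawBody);

// Then extract the URLs, stopping at a double quote or any whitespace.
preg_match_all('/https?:\/\/[^"\s]+/i', $decoded, $matches);
$urls = $matches[0];

print_r($urls);
```

In a real multipart message, only the body of a part that declares quoted-printable encoding should be decoded this way, not the MIME headers or parts using another encoding.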
    
  • unloco
    unloco about 9 years
    You should also add the newline to the negated character class: $regex = '/https?\:\/\/[^\" \n]+/i';
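
A small sketch of the difference this makes, assuming two hypothetical `example.com` links separated only by a newline. Without `\n` in the negated class, the first match runs across the line break and swallows the second link.

```php
<?php
// Hypothetical input: two links on consecutive lines, no spaces between them.
$text = "http://example.com/a\nhttp://example.com/b";

// Class without the newline: the match continues past the line break.
preg_match_all('/https?:\/\/[^" ]+/i', $text, $loose);

// Class with the newline added, as suggested in the comment above.
preg_match_all('/https?:\/\/[^" \n]+/i', $text, $strict);

print_r($loose[0]);   // one match spanning both lines
print_r($strict[0]);  // two separate matches
```

Using `\s` in place of the space and `\n` would also exclude tabs and carriage returns, which may matter for text with CRLF line endings.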