Regexp for extracting all links and anchor texts from HTML

Solution 1

<?php

$dom = new DOMDocument();
// Parse the page; each returned <a> element exposes getAttribute('href') and textContent.
$dom->loadHTML($html);
$urls = $dom->getElementsByTagName('a');
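
To show what the DOM approach yields for the markup in the question, here is a minimal sketch. The helper name extractLinks and the array keys 'href' and 'text' are made up for illustration and are not part of the original answer; the libxml calls only keep warnings about sloppy markup from being printed.

```php
<?php

/**
 * Hypothetical helper (not from the original answer): returns one
 * ['href' => ..., 'text' => ...] entry per <a> element that has an href.
 */
function extractLinks(string $html): array
{
    $dom = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $links[] = [
                'href' => $a->getAttribute('href'),
                'text' => trim($a->textContent),
            ];
        }
    }

    return $links;
}

// Sample markup taken from the question.
$html = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';

print_r(extractLinks($html));
```

Because the parser handles attribute order and quoting, the position of href inside the tag does not matter here.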

Solution 2

Take a look at lookahead and lookbehind assertions. The pattern below uses a lookahead to find the href wherever it sits inside the tag.

<?php

$string = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';

if (preg_match_all("|<a.*(?=href=\"([^\"]*)\")[^>]*>([^<]*)</a>|i", $string, $matches)) {
    /*** $matches[1] holds the URLs, $matches[2] the anchor texts ***/
    echo 'Found a match';
    print_r($matches);
} else {
    /*** if no match is found ***/
    echo 'No match found';
}
?>

Solution 3

<?php
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if (preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // $match[2] = link address
        // $match[3] = link text
    }
}
?>

This will extract both the link and the anchor text.
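
As a rough usage sketch, the same pattern can be run against the markup from the question. The pattern is written here as a single-quoted string so the backslashes reach PCRE unchanged; the $links array and its keys are illustrative names, not part of the original answer.

```php
<?php

// Same pattern as above, as a single-quoted PHP string.
// With the U modifier, greediness is inverted: (.*) stops at the first </a>.
$pattern = '/<a\s[^>]*href=("??)([^" >]*?)\1[^>]*>(.*)<\/a>/siU';

// Sample markup taken from the question.
$html = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';

$links = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // $match[2] = link address, $match[3] = link text
        $links[] = ['href' => $match[2], 'text' => $match[3]];
    }
}

print_r($links);
```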

Solution 4

Try something like this:

//not tested
$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";

Solution 5

/<a[^>]+href\s*=\s*["']([^"']+)["'][^>]*>(.*?)<\/a>/mis
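
A minimal sketch of how this pattern might be used from PHP, assuming the markup from the question as input. The variable names are illustrative only. Since the pattern contains both quote characters, the single quotes are escaped inside the PHP string.

```php
<?php

// The pattern above; \' is only PHP string escaping for the single quote.
$pattern = '/<a[^>]+href\s*=\s*["\']([^"\']+)["\'][^>]*>(.*?)<\/a>/mis';

// Sample markup taken from the question.
$html = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';

$links = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // $match[1] = href value, $match[2] = anchor text
        $links[] = ['href' => $match[1], 'text' => $match[2]];
    }
}

print_r($links);
```

As noted in the comments, this pattern expects a quoted href value and does not handle a value that mixes both quote types.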
Author: gregory boero.teyssier

https://ali.actor

Updated on June 14, 2022

Comments

  • gregory boero.teyssier
    gregory boero.teyssier almost 2 years

    I'd like one or more regexes that can:

    1) Take the html of a large page.

    2) Find the urls contained in all links, for example:

    <a href="http://example1.com">Test 1</a>
    <a class="foo" id="bar" href="http://example2.com">Test 2</a>
    <a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>
    

    And so on. It should extract the URL contained in the 'href' attribute regardless of what comes before or after the href.

    3) Extract the anchor text of all links. For the examples above, it should return 'http://example1.com' with the anchor text 'Test 1', then 'http://example2.com' with 'Test 2', and so on.

  • Gordon
    Gordon over 13 years
    This will break when the attribute value is enclosed in double quotes and contains single quotes. It will also break when quotes are omitted, which would be permissible for an href value like next_page.htm. See w3.org/TR/html401/intro/sgmltut.html#h-3.2.2
  • Gordon
    Gordon over 13 years
    This wouldn't match the second and third links in the OP's example markup.
  • Sergi
    Sergi over 13 years
    And of course, the correct way to do this is with the DOM parser, but it's also possible with regex.
  • Gordon
    Gordon over 13 years
    See my comment below GameBit's solution. It applies to your Regex as well.
  • Sergi
    Sergi over 13 years
    No, it won't break if there are single quotes inside the attributes, just try it. In fact, if you use this regex #<a.*(?=href=\"([^\"]*)\")[^>]*>([^<]*)</a>|<a.*(?=href='([^']*)')[^>]*>([^<]*)</a> |<a.*(?=href=([^\s]*)\s)[^>]*>([^<]*)</a>#i or something like that and discard empty result sets afterwards, it won't even break if you use single quotes or no quotes at all. The only way to break it is to use < in the anchor text, as I cannot use a lookbehind with unlimited characters (a PHP regex limitation) to check whether it marks the end of the link or is a single character inside the text.
  • Oliver O'Neill
    Oliver O'Neill over 13 years
    A lot of people just throw out "Just use a DOM parser!" but none ever show a quick example of what it can do. php.net/manual/en/book.dom.php It does a lot more than my example. Worth learning about.
  • giorgio79
    giorgio79 over 11 years
    This answer is incomplete, here is one that works stackoverflow.com/questions/4423272/…
  • d7samurai
    d7samurai over 10 years
    this one is pretty robust (test it here martinwardener.com/regex): \b(((src|href|action|url) *(=|:) *(?<mh>"|'|))(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mh>|url *\( *(?<mc>"|'|)(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mc>\))
  • KoalaBear
    KoalaBear over 7 years
    I use this one, because it only takes 54ms for 4MB file instead of 10-30 seconds with real parsers :)
  • kanudo
    kanudo about 7 years
    Really great work: just one regex and all the work is done. Learnt a new way today.