Regexp for extracting all links and anchor texts from HTML
Solution 1
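<?php
// $html is assumed to hold the page markup as a string.
$dom = new DOMDocument();
$dom->loadHTML($html);
// DOMNodeList of every <a> element; read each node's href attribute and textContent.
$urls = $dom->getElementsByTagName('a');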
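The snippet above stops at the node list. As a minimal sketch of the remaining step, the helper below walks that list and collects each URL with its anchor text. The function name `extractLinks` and the array keys are hypothetical, and the sample markup is taken from the question plus one unquoted href to show that the parser copes with it.

```php
<?php
// Hypothetical helper: returns [['href' => ..., 'text' => ...], ...]
// for every <a> element that carries an href attribute.
function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        if (!$a->hasAttribute('href')) {
            continue; // e.g. <a name="top"> has no URL to report
        }
        $links[] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        ];
    }
    return $links;
}

$html = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>
<a href=next_page.htm>Next</a>';

foreach (extractLinks($html) as $link) {
    echo $link['href'], ' => ', $link['text'], PHP_EOL;
}
```

Because the parser handles attribute order and quoting, the same loop works whether href is quoted with double quotes, single quotes, or not at all.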
Solution 2
Take a look at lookahead and lookbehind assertions. The pattern below uses a lookahead so that href can appear anywhere among the tag's attributes.
<?php
$string = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';
// $matches[1] holds the href values, $matches[2] the anchor texts
if (preg_match_all("|<a.*(?=href=\"([^\"]*)\")[^>]*>([^<]*)</a>|i", $string, $matches))
{
    /*** at least one link matched ***/
    echo 'Found a match';
    print_r($matches);
}
else
{
    /*** if no match is found ***/
    echo 'No match found';
}
?>
Solution 3
<?php
// $html is assumed to hold the page markup as a string.
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if (preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER))
{
    foreach ($matches as $match)
    {
        // $match[2] = link address
        // $match[3] = link text
    }
}
?>
This will extract both the link and the anchor text.
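To see what the pattern captures, here is a minimal sketch that runs it against the sample links from the question. The pattern is the same one, only written in single quotes so the backslashes need no doubling. The fourth, unquoted link and the `$pairs` array are assumptions added for illustration.

```php
<?php
// Same pattern as in the answer. The U flag inverts greediness, so (.*) is lazy here
// and ("??) first tries to consume an opening quote.
$pattern = '/<a\s[^>]*href=("??)([^" >]*?)\1[^>]*>(.*)<\/a>/siU';

$html = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>
<a href=next_page.htm>Next</a>';

$pairs = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // $match[1] is the optional quote, $match[2] the URL, $match[3] the anchor text
        $pairs[] = ['href' => $match[2], 'text' => $match[3]];
    }
}

foreach ($pairs as $pair) {
    echo $pair['href'], ' => ', $pair['text'], PHP_EOL;
}
```

The backreference `\1` is what lets one pattern cover both double-quoted and unquoted values: whatever the first group matched (a quote or nothing) must appear again after the URL.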
Solution 4
Try something like this:
// Not tested. Only matches anchors whose sole attribute is href.
$regex_pattern = "/<a href=\"([^\"]*)\">(.*?)<\/a>/";
Solution 5
/<a[^>]+href\s*=\s*["']([^"']+)["'][^>]*>(.*?)<\/a>/mis
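The pattern above is given without any calling code. As a rough sketch of how it might be used, the helper below feeds it to `preg_match_all` and strips nested markup from the captured anchor text. The function name `extractLinksRegex`, the use of `strip_tags`, and the sample markup are assumptions, not part of the original answer.

```php
<?php
// Hypothetical wrapper around the pattern from the answer.
function extractLinksRegex(string $html): array
{
    $pattern = '/<a[^>]+href\s*=\s*["\']([^"\']+)["\'][^>]*>(.*?)<\/a>/mis';
    $links = [];
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $links[] = [
                'href' => $match[1],
                // (.*?) captures inner HTML, so remove tags such as <b> from the text
                'text' => trim(strip_tags($match[2])),
            ];
        }
    }
    return $links;
}

$html = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href=\'http://example2.com\'><b>Test</b> 2</a>';

foreach (extractLinksRegex($html) as $link) {
    echo $link['href'], ' => ', $link['text'], PHP_EOL;
}
```

Note that the pattern requires a quote character around the href value, so an unquoted `href=next_page.htm` is not matched by this sketch.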
Comments
-
gregory boero.teyssier (almost 2 years): I'd like one or more regexes that can:
1) Take the HTML of a large page.
2) Find the URLs contained in all links, for example:
<a href="http://example1.com">Test 1</a> <a class="foo" id="bar" href="http://example2.com">Test 2</a> <a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>
And so on. It should extract the URL contained in the 'href' attribute regardless of what comes before or after the href.
3) Extract the anchor text of all links. For the examples above, it should return 'http://example1.com' and the anchor text 'Test 1', then 'http://example2.com' and 'Test 2', and so on.
-
Gordon (over 13 years): This will break when the attribute value is enclosed in double quotes and contains single quotes. It will also break when quotes are omitted, which would be permissible for an href value like next_page.htm. See w3.org/TR/html401/intro/sgmltut.html#h-3.2.2
-
Gordon (over 13 years): This wouldn't match the second and third links in the OP's example markup.
-
Sergi (over 13 years): And of course, the correct way to do this is with the DOM parser, but it's also possible with regex.
-
Gordon (over 13 years): See my comment below GameBit's solution. It applies to your regex as well.
-
Sergi (over 13 years): No, it won't break if there are single quotes inside the attributes; just try it. In fact, if you use this regex #<a.*(?=href=\"([^\"]*)\")[^>]*>([^<]*)</a>|<a.*(?=href='([^']*)')[^>]*>([^<]*)</a> |<a.*(?=href=([^\s]*)\s)[^>]*>([^<]*)</a>#i or something like that and discard empty result sets afterwards, it won't even break if you use single quotes or no quotes at all. The only way to break it is to use < in the anchor text, as I cannot use a lookbehind with unlimited characters (a PHP regex limitation) to check whether it marks the end of the link or is a single character inside the text.
-
Oliver O'Neill (over 13 years): A lot of people just throw out "Just use a DOM parser!", but no one ever shows a quick example of what it can do. php.net/manual/en/book.dom.php It does a lot more than my example. Worth learning about.
-
giorgio79 (over 11 years): This answer is incomplete; here is one that works: stackoverflow.com/questions/4423272/…
-
d7samurai (over 10 years): This one is pretty robust (test it here: martinwardener.com/regex):
\b(((src|href|action|url) *(=|:) *(?<mh>"|'|))(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mh>|url *\( *(?<mc>"|'|)(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mc>\))
-
KoalaBear (over 7 years): I use this one, because it takes only 54 ms for a 4 MB file instead of 10-30 seconds with real parsers :)
-
kanudo (about 7 years): Really great work, just one regex and all the work is done. Learnt a new way today.