Regexp for extracting all links and anchor texts from HTML

php regex string html-parsing

18,545

Solution 1

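<?php

// Parse the markup with the DOM extension instead of a regex
$dom = new DOMDocument();
$dom->loadHTML($html);
// DOMNodeList of all <a> elements; read getAttribute('href') and textContent from each
$urls = $dom->getElementsByTagName('a');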
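
A minimal sketch of how the DOM approach above could be carried through to the URLs and anchor texts the question asks for. The helper name `extract_links` and the array shape it returns are assumptions for illustration, not part of the original answer; the sample markup is copied from the question.

```php
<?php

/**
 * Hypothetical helper: returns [['href' => ..., 'text' => ...], ...]
 * for every <a> element that has an href attribute.
 */
function extract_links(string $html): array
{
    $dom = new DOMDocument();

    // Real-world HTML is often malformed; collect libxml warnings
    // internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        if (!$a->hasAttribute('href')) {
            continue; // named anchors such as <a name="top"> carry no URL
        }
        $links[] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        ];
    }
    return $links;
}

// Sample markup taken from the question.
$html = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';

foreach (extract_links($html) as $link) {
    echo $link['href'], ' => ', $link['text'], "\n";
}
```

Because the parser handles attribute order and quoting itself, the same loop should work whether `href` comes first or last in the tag.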

Solution 2

You need to look into lookahead and lookbehind assertions.

<?php

$string = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';

if (preg_match_all("|<a.*(?=href=\"([^\"]*)\")[^>]*>([^<]*)</a>|i", $string, $matches)) {
    /*** $matches[1] holds the href values, $matches[2] the anchor texts ***/
    echo 'Found a match';
    print_r($matches);
} else {
    /*** if no match is found ***/
    echo 'No match found';
}
?>

Solution 3

<?php
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER))
{
    foreach ($matches as $match) {
        // $match[2] = link address
        // $match[3] = link text
    }
}
?>

This will extract both the link and the anchor text.

Solution 4

Try something like this:

//not tested
$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";

Solution 5

/<a[^>]+href\s*=\s*["']([^"']+)["'][^>]*>(.*?)<\/a>/mis
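
The pattern above can be dropped into `preg_match_all` roughly as follows. This is only a sketch: the variable names and the use of `strip_tags` are assumptions, the sample markup is copied from the question, and the pattern still handles only quoted `href` values.

```php
<?php

// Solution 5's pattern: group 1 = href value, group 2 = anchor text.
$pattern = '/<a[^>]+href\s*=\s*["\']([^"\']+)["\'][^>]*>(.*?)<\/a>/mis';

$html = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';

$links = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // strip_tags() drops nested markup such as <b> inside the anchor text
        $links[] = [
            'href' => $match[1],
            'text' => trim(strip_tags($match[2])),
        ];
    }
}

print_r($links);
```

`PREG_SET_ORDER` groups each link's URL and text into one `$match` entry, which is usually easier to loop over than the default per-group arrays.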


Author by

gregory boero.teyssier

https://ali.actor

Updated on June 14, 2022

Comments

gregory boero.teyssier almost 2 years
I'd like one or more regexes that can:

1) Take the HTML of a large page.

2) Find the URLs contained in all links, for example:
```
<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>
```
And so on. It should extract the URL contained in the 'href' attribute regardless of what comes before or after the href.

3) Extract the anchor text of all links. For the examples above, it should return 'http://example1.com' with the anchor text 'Test 1', then 'http://example2.com' with 'Test 2', and so on.
Gordon over 13 years

This will break when the attribute value is enclosed in double quotes and contains single quotes. It will also break when quotes are omitted, which would be permissible for an href value like next_page.htm. See w3.org/TR/html401/intro/sgmltut.html#h-3.2.2
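
The two failure cases described here can be reproduced with a short sketch. It assumes the comment is about a pattern shaped like Solution 5's, which accepts either quote character on both sides of the value; the test strings are made up for illustration.

```php
<?php

// Pattern in the style of Solution 5 (assumed to be the one under discussion).
$pattern = '/<a[^>]+href\s*=\s*["\']([^"\']+)["\'][^>]*>(.*?)<\/a>/mis';

// Double-quoted value containing a single quote: the URL is cut at the apostrophe.
preg_match($pattern, '<a href="it\'s.html">Quote</a>', $mixed);
echo $mixed[1], "\n"; // prints "it", not "it's.html"

// Unquoted value: permitted by HTML 4.01, but the pattern requires a quote.
$found = preg_match($pattern, '<a href=next_page.htm>Next</a>', $unquoted);
var_dump($found); // int(0)
```

The first case is the more dangerous one, because the pattern still reports a match and silently returns a truncated URL.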
Gordon over 13 years

This wouldn't match the second and third links in the OP's example markup.
Sergi over 13 years

And of course, the correct way to do this is with the DOM parser, but it's also possible with regex.
Gordon over 13 years

See my comment below GameBit's solution. It applies to your Regex as well.
Sergi over 13 years

No, it won't break if there are single quotes inside the attributes, just try it. In fact, if you use this regex #<a.*(?=href=\"([^\"]*)\")[^>]*>([^<]*)</a>|<a.*(?=href='([^']*)')[^>]*>([^<]*)</a> |<a.*(?=href=([^\s]*)\s)[^>]*>([^<]*)</a>#i or something like it and discard empty result sets afterwards, it won't even break with single quotes or no quotes at all. The only way to break it is to use < in the anchor text, because I cannot use a lookbehind with unlimited length (a PHP regex limitation) to check whether it marks the end of the link or is just a character inside the text.
Oliver O'Neill over 13 years

A lot of people just throw out "Just use a DOM parser!" but no one ever shows a quick example of what it can do. See php.net/manual/en/book.dom.php. It does a lot more than my example and is worth learning about.
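
As one small illustration of what else the DOM extension offers, the sketch below uses `DOMXPath` to select only the links whose `href` starts with `http`. The helper name `absolute_links` and the sample markup are assumptions for illustration, not something from the thread.

```php
<?php

/**
 * Hypothetical helper: maps href => anchor text for links
 * whose href attribute begins with "http".
 */
function absolute_links(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence malformed-HTML warnings
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    // XPath 1.0: every <a> whose href attribute begins with "http"
    foreach ($xpath->query('//a[starts-with(@href, "http")]') as $a) {
        // textContent flattens nested markup such as <b> inside the link
        $links[$a->getAttribute('href')] = trim($a->textContent);
    }
    return $links;
}

$html = '<p><a href="http://example1.com">Test <b>1</b></a> '
      . '<a href="/local">Local</a> <a name="top">Top</a></p>';

print_r(absolute_links($html));
```

Keying the result by URL collapses duplicate links; collecting pairs into a plain list instead would keep every occurrence.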
giorgio79 over 11 years

This answer is incomplete; here is one that works: stackoverflow.com/questions/4423272/…
d7samurai over 10 years

this one is pretty robust (test it here martinwardener.com/regex): \b(((src|href|action|url) *(=|:) *(?<mh>"|'|))(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mh>|url *$ *(?<mc>"|'|)(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mc>$)
KoalaBear over 7 years

I use this one, because it only takes 54ms for 4MB file instead of 10-30 seconds with real parsers :)
kanudo about 7 years

Really great work: just one regex and everything is done. Learnt a new way today.