Getting all attributes from an <a> HTML tag with regex

Solution 1

You can build on that regex. Have a look:

'/<a(?:\s+(?:href=["\'](?P<href>[^"\'<>]+)["\']|title=["\'](?P<title>[^"\'<>]+)["\']|\w+=["\'][^"\'<>]+["\']))+/i'

...or in human-readable form:

preg_match_all(
    '/<a
    (?:\s+
      (?:
         href=["\'](?P<href>[^"\'<>]+)["\']
        |
         title=["\'](?P<title>[^"\'<>]+)["\']
        |
         \w+=["\'][^"\'<>]+["\']
      )
    )+/ix', 
    $subject, $result, PREG_PATTERN_ORDER);
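To see what ends up in $result, here is a small sketch that runs the pattern above against some made-up markup (the sample tags and the $hrefs/$titles variable names are assumptions for illustration). With PREG_PATTERN_ORDER, each named group becomes a list with one entry per matched tag, and a tag that lacks the attribute gets an empty string in that slot:

```php
<?php
// Sample markup invented for illustration; attribute order varies on purpose.
$subject = '<a href="one.html" title="First">1</a>'
         . '<a title="Second" class="nav" href="two.html">2</a>'
         . '<a class="plain" href="three.html">3</a>';

// Same pattern as in the answer above.
$pattern = '/<a
    (?:\s+
      (?:
         href=["\'](?P<href>[^"\'<>]+)["\']
        |
         title=["\'](?P<title>[^"\'<>]+)["\']
        |
         \w+=["\'][^"\'<>]+["\']
      )
    )+/ix';

preg_match_all($pattern, $subject, $result, PREG_PATTERN_ORDER);

// One entry per matched <a> tag, in document order.
$hrefs  = $result['href'];   // one.html, two.html, three.html
$titles = $result['title'];  // First, Second, and an empty string for the third tag
```

The two lists share indexes, so $hrefs[1] and $titles[1] belong to the same tag.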

Pretty self-explanatory, I think. Note that your original regex has the same problem with regard to order of appearance: it requires href to be the first attribute. For example, it would fail to match this tag:

<a class="someclass" href="somepage.html">link text</a>

Unless you're absolutely sure there will be no other attributes, you can't reasonably expect href to be listed first. You can use the same gimmick as above, where the last branch of the alternation silently consumes and discards the attributes that don't interest you:

    '/<a
    (?:\s+
      (?:
         href=["\'](?P<href>[^"\'<>]+)["\']
        |
         \w+=["\'][^"\'<>]+["\']
      )
    )+/ix', 
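As a rough sketch, this is how the href-only pattern might be dropped into the getLinks function from the question. The array_filter and array_values calls are my additions, not part of the original function: they drop the empty entries produced by tags that have attributes but no href, and reindex the result.

```php
<?php
function getLinks($src) {
    // href-only pattern from above; the second branch skips other attributes.
    $pattern = '/<a
        (?:\s+
          (?:
             href=["\'](?P<href>[^"\'<>]+)["\']
            |
             \w+=["\'][^"\'<>]+["\']
          )
        )+/ix';

    if (preg_match_all($pattern, $src, $links, PREG_PATTERN_ORDER)) {
        // Assumption: tags without an href are not wanted, so drop their empty slots.
        $hrefs = array_filter($links['href'], 'strlen');
        return array_values(array_unique($hrefs));
    }
    return false;
}
```

Called on the tag shown earlier, `<a class="someclass" href="somepage.html">link text</a>`, this should return an array containing only somepage.html.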

Solution 2

Try this regextrainer I made a while back.

The sample contains a pattern like this: <([^ ]+) ?([^>]*)>([^<]*)< ?/ ?\1> which captures the tag name, the raw attribute text, and the inner text of an HTML element.

I see now that it doesn't extract the attribute name and value, just the whole attribute text itself. Use this to extract the attribute details: ((([^=]+)=((?:"|'))([^"']+)\4) ?)+
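One way to combine the two patterns is a two-step pass: grab the raw attribute text with the tag pattern, then pull out name/value pairs one at a time. The sketch below is an adaptation, not the exact attribute pattern above: it drops the outer repeated group (which would keep only the last attribute) and excludes whitespace from the name. The sample markup and variable names are assumptions.

```php
<?php
// Sample tag taken from the question.
$html = '<a title="My Page" href="somepage.html">link text</a>';

// Step 1: tag pattern from this answer; group 2 holds the raw attribute text.
$tagPattern = '~<([^ ]+) ?([^>]*)>([^<]*)< ?/ ?\1>~';
preg_match($tagPattern, $html, $tag);

// Step 2: adapted attribute pattern, applied once per attribute.
// Group 1 = name, group 2 = opening quote, group 3 = value.
$attrPattern = '~([^=\s]+)=(["\'])([^"\']+)\2~';
preg_match_all($attrPattern, $tag[2], $pairs, PREG_SET_ORDER);

// Collect the pairs into a name => value map.
$attributes = [];
foreach ($pairs as $pair) {
    $attributes[$pair[1]] = $pair[3];
}
```

For the sample tag, $attributes should hold title and href keyed by name, in whatever order they appear in the markup.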

Author by

SISYN

Updated on June 05, 2022

Comments

  • SISYN
    SISYN almost 2 years

    I already have a function that retrieves the href attribute from all of the a tags on a given page of markup. However, I would also like to retrieve other attributes, namely the title attribute.

    I have a feeling it's a simple modification of the regular expression that I'm already using, but my only concern is the order of appearance in the markup. If I have a link with this code:

    <a href="somepage.html" title="My Page">link text</a>
    

    I want it to be parsed the same and not cause any errors even if it appears like this:

    <a title="My Page" href="somepage.html">link text</a>
    

    Here is my processing function:

    function getLinks($src) {
        if(preg_match_all('/<a\s+href=["\']([^"\']+)["\']/i', $src, $links, PREG_PATTERN_ORDER))
            return array_unique($links[1]);
        return false;
    }
    

    Would I have to use another regex altogether, or would it be possible to modify this one so that the title attribute is stored in the same array of returned data as the href attribute?
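For the "same array" part of the question, one possible sketch is to switch to PREG_SET_ORDER and key the result by href, using the one-line pattern from Solution 1. The function name getLinksWithTitles and the href => title return shape are assumptions for illustration, not something given in the answers. In set order a trailing group that did not match is left out of the match array, hence the ?? fallbacks.

```php
<?php
// Hypothetical helper: returns an array mapping each href to its title ('' if none).
function getLinksWithTitles($src) {
    $pattern = '/<a(?:\s+(?:href=["\'](?P<href>[^"\'<>]+)["\']|title=["\'](?P<title>[^"\'<>]+)["\']|\w+=["\'][^"\'<>]+["\']))+/i';

    $links = [];
    if (preg_match_all($pattern, $src, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $href = $m['href'] ?? '';
            if ($href === '') {
                continue; // tag had attributes but no href
            }
            // Keying by href de-duplicates; a later duplicate overwrites the title.
            $links[$href] = $m['title'] ?? '';
        }
    }
    return $links;
}
```

Both orderings from the question, href first or title first, should then produce the same entry for somepage.html.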