XPath Node to String


Solution 1

$xml = '<foo>
<span class="url">
 word
 <b class=" ">test</b>
</span>

<span class="url">
 word
 <b class=" ">test2</b>
 more words
</span>
</foo>';
$dom = new DOMDocument();
$dom->loadXML($xml); //or load an HTML document with loadHTML()
$x= new DOMXpath($dom);
foreach($x->query("//span[@class='url']") as $node) echo $node->textContent;
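
The loop above prints each span's textContent with its original line breaks and indentation. If one clean string per span is wanted, a minimal sketch is to evaluate XPath's normalize-space() once per node. It has to run per node because XPath 1.0 string functions only use the first node of a node-set, which is why string(//span) returns a single result. The sample markup and the $strings variable are illustrative assumptions, not part of the original answer.

```php
<?php
// Sample input modelled on the markup in the question (assumed for illustration).
$xml = '<foo><span class="url"> word <b class=" ">test</b></span>'
     . '<span class="url"> word <b class=" ">test2</b> more words</span></foo>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$strings = array();
foreach ($xpath->query("//span[@class='url']") as $node) {
    // Evaluate relative to the current span; "." is that span's string value,
    // and normalize-space() trims it and collapses runs of whitespace.
    $strings[] = $xpath->evaluate('normalize-space(.)', $node);
}

print_r($strings);
```

With this input, $strings holds 'word test' and 'word test2 more words'.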

Solution 2

You don't even need XPath for this:

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('span') as $span) {
    if(in_array('url', explode(' ', $span->getAttribute('class')))) {
        $span->nodeValue = $span->textContent;
    }
}
echo $dom->saveHTML();

EDIT after comment below

If you just want to fetch the string, you can echo $span->textContent instead of replacing the nodeValue. I understood that you wanted a single string for the span instead of the nested structure. In that case, you should also consider whether simply running strip_tags on the span snippet would be the faster and easier alternative.
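
A rough sketch of the strip_tags route, assuming the span is first serialized with DOMDocument::saveHTML($node) (available from PHP 5.3.6). The sample $html and the whitespace clean-up with preg_replace are assumptions added for illustration. Note that strip_tags does not decode entities, so textContent may still be the safer choice for arbitrary input.

```php
<?php
// Hypothetical sample input mirroring the markup from the question.
$html = '<div><span class="url"> word <b class=" ">test2</b> more words</span></div>';

$dom = new DOMDocument;
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('span') as $span) {
    // saveHTML($node) serializes only that node, tags included.
    $snippet = $dom->saveHTML($span);
    // strip_tags() removes the markup; preg_replace collapses whitespace runs.
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($snippet)));
    echo $text, PHP_EOL;
}
```

For the sample above this prints "word test2 more words".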


With PHP 5.3 you can also register arbitrary PHP functions for use as callbacks in XPath queries. The following fetches the text content of all span elements, including their child nodes, and returns it as a single string.

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPHPFunctions();
echo $xp->evaluate('php:function("nodeTextJoin", //span)');

// Custom Callback function
function nodeTextJoin($nodes)
{
    $text = '';
    foreach($nodes as $node) {
        $text .= $node->textContent;
    }
    return $text;
}

Solution 3

Using XMLReader:

$xmlr = new XMLReader;
$xmlr->xml($doc);
while ($xmlr->read()) {
    if (($xmlr->nodeType == XMLReader::ELEMENT) && ($xmlr->name == 'span')) {
        echo $xmlr->readString();
    }
}

Output:

word
test

word
test2
more words

Solution 4

SimpleXML doesn't like mixing text nodes with other elements, which is why you're losing some content there. The DOM extension, however, handles that just fine. Luckily, DOM and SimpleXML are two sides of the same coin (libxml), so it's very easy to switch between them. For instance:

foreach ($yourSimpleXMLElement->xpath('//span') as $span)
{
    // will not work as expected
    echo $span;

    // will work as expected
    echo textContent($span);
}

function textContent(SimpleXMLElement $node)
{
    return dom_import_simplexml($node)->textContent;
}
Author by

spyderman4g63

I'm not a real programmer I just hack stuff together.

Updated on June 13, 2022

Comments

  • spyderman4g63
    spyderman4g63 almost 2 years

    How can I select the string contents of the following nodes:

    <span class="url">
     word
     <b class=" ">test</b>
    </span>
    
    <span class="url">
     word
     <b class=" ">test2</b>
     more words
    </span>
    

    I have tried a few things

    //span/text()
    

    Doesn't get the bold tag

    //span/string(.)
    

    is invalid

    string(//span)
    

    only selects 1 node

    I am using SimpleXML in PHP, and I think the only other option is to use //span, which returns:

    Array
    (
        [0] => SimpleXMLElement Object
            (
                [@attributes] => Array
                    (
                        [class] => url
                    )
    
                [b] => test
            )
    
        [1] => SimpleXMLElement Object
            (
                [@attributes] => Array
                    (
                        [class] => url
                    )
    
                [b] => test2
            )
    
    )
    

    *Note that it is also dropping the "more words" text from the second span.

    So I guess I could then flatten each item in the array using PHP somehow? XPath is preferred, but any other ideas would help too.