How to return result as HTML with HtmlXPathSelector (Scrapy)

Solution 1

Call .extract() on your XPathSelectorList. It returns a list of unicode strings containing the HTML content you want.

hxs.select('//div[@id="leexample"]/*').extract()

Update

# This is wrong
hxs.select('//div[@id="leexample"]/html()').extract()

/html() is not a valid XPath step, so Scrapy rejects it. To extract all children, use '//div[@id="leexample"]/*' or '//div[@id="leexample"]/node()'. Note that /* selects only element children, while node() also returns text nodes, so the result looks something like:

[u'\n   ',
 u'<a href="image1.html">Name: My image 1 
' ]

Solution 2

Use:

//span[@class="title"]/node()

This selects all nodes (elements, text nodes, processing instructions and comments) that are children of any span element in the document whose class attribute has the value "title".

If you want to get only the children-nodes of the first such span in the document, use:

(//span[@class="title"])[1]/node()

Solution 3

Though late, I leave this here for the record.

What I do is:

html = ''.join(hxs.select('//span[@class="title"]/node()').extract())

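Or, if several elements match and we want the HTML of each one separately:

elements = hxs.select('//span[@class="title"]')
# Join the serialized child nodes of each matched element into one string per element
html = [''.join(e.select('./node()').extract()) for e in elements]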
Author by

mirandalol

Updated on August 19, 2022

Comments

  • mirandalol
    mirandalol over 1 year

    How do I retrieve all the HTML contained inside a tag?

    hxs = HtmlXPathSelector(response)
    element = hxs.select('//span[@class="title"]/')
    

    Perhaps something like:

    hxs.select('//span[@class="title"]/html()')
    

    EDIT: Looking at the documentation, I see only methods that return a new XPathSelectorList or just the raw text inside a tag. I want neither a new list nor just the text, but the HTML source inside a tag. For example:

    <html>
        <head>
            <title></title>
        </head>
        <body>
            <div id="leexample">
                justtext
                <p class="ihatelookingforfeatures">
                    sometext
                </p>
                <p class="yahc">
                    sometext
                </p>
            </div>
            <div id="lenot">
                blabla
            </div>
        an awfully long example for this.
        </body>
    </html>
    

    I want a method such as hxs.select('//div[@id="leexample"]/html()') that returns the HTML inside the element, like this:

    justtext
    <p class="ihatelookingforfeatures">
        sometext
    </p>
    <p class="yahc">
        sometext
    </p>
    

    I hope this clears up the ambiguity in my question.

    How do I get the HTML from an HtmlXPathSelector in Scrapy? (Perhaps with a solution outside Scrapy's scope?)

  • mirandalol
    mirandalol almost 12 years
    It's nice, but not what I asked. It returns a list of elements, and I need the HTML behind those elements: not nodes, but plain HTML.
  • warvariuc
    warvariuc almost 12 years
    If xiaowl's answer was helpful, please accept/upvote his answer.
  • Dimitre Novatchev
    Dimitre Novatchev almost 12 years
    @Saga: This cannot be done with XPath alone. Within the programming language that hosts XPath, you need to use a particular DOM method or property (such as OuterXml or InnerXml, which may be named OuterHtml / InnerHtml, or in other DOMs, node.Save()).
  • Sjaak Trekhaak
    Sjaak Trekhaak almost 12 years
    /html() is not supported, and I'm not even sure it's valid XPath. Scrapy will throw: ValueError: Invalid XPath: //h1/html()
  • Kangur
    Kangur about 8 years
    Watch out: //span[@class="title"]/node() will fail if the element has multiple classes, because @class="title" compares the whole attribute value. Combine it with a CSS selector to select elements with the given class: parent.css('.title').xpath('node()')
  • Dimitre Novatchev
    Dimitre Novatchev about 8 years
    @Kangur, CSS is not needed. See this answer explaining how to determine an element has a class, that may appear with other class-names: stackoverflow.com/a/35354908/36305