How to return result as HTML with HtmlXPathSelector (Scrapy)
Solution 1
Call .extract() on your XPathSelectorList. It returns a list of unicode strings containing the HTML content you want.
hxs.select('//div[@id="leexample"]/*').extract()
Update
# This is wrong
hxs.select('//div[@id="leexample"]/html()').extract()
/html() is not a valid Scrapy selector. To extract all children, use '//div[@id="leexample"]/*' or '//div[@id="leexample"]/node()'. Note that node() also returns text nodes, so the result looks something like:
[u'\n ', u'<a href="image1.html">Name: My image 1
' ]
Solution 2
Use:
//span[@class="title"]/node()
This selects all nodes (elements, text nodes, processing instructions and comments) that are children of any span element in the XML document whose class attribute has the value "title".
If you want only the child nodes of the first such span in the document, use:
(//span[@class="title"])[1]/node()
Solution 3
Though late, I leave this for the record.
What I do is:
html = ''.join(hxs.select('//span[@class="title"]/node()').extract())
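Or, if we want to match several elements and get the inner HTML of each one:
elements = hxs.select('//span[@class="title"]')
# extract() must be called per element before joining; joining selectors directly fails
html = [''.join(e.select('./node()').extract()) for e in elements]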
mirandalol
Updated on August 19, 2022
Comments
-
mirandalol over 1 year
How do I retrieve all the HTML contained inside a tag?
hxs = HtmlXPathSelector(response)
element = hxs.select('//span[@class="title"]/')
Perhaps something like:
hxs.select('//span[@class="title"]/html()')
EDIT: Looking at the documentation, I see only methods that return a new XPathSelectorList or just the raw text inside a tag. I want to retrieve neither a new list nor just text, but the HTML source inside a tag, e.g.:
<html>
  <head>
    <title></title>
  </head>
  <body>
    <div id="leexample">
      justtext
      <p class="ihatelookingforfeatures"> sometext </p>
      <p class="yahc"> sometext </p>
    </div>
    <div id="lenot"> blabla </div>
    an awfully long example for this.
  </body>
</html>
I want a method such as
hxs.select('//div[@id="leexample"]/html()')
that returns the HTML inside the element, like this:
justtext <p class="ihatelookingforfeatures"> sometext </p> <p class="yahc"> sometext </p>
I hope this clears up the ambiguity in my question.
How do I get the HTML from an HtmlXPathSelector in Scrapy? (Perhaps with a solution outside Scrapy's scope?)
-
mirandalol almost 12 years
It's nice, but not what I asked. It returns a list of elements, and I need the HTML behind those elements: not nodes, but plain HTML.
-
warvariuc almost 12 years
If xiaowl's answer was helpful, please accept/upvote his answer.
-
Dimitre Novatchev almost 12 years
@Saga: This cannot be done with XPath alone. Within the programming language that hosts XPath, you need to use a particular DOM method or property, such as OuterXML or InnerXml (these may be named OuterHtml / InnerHtml, or, in other DOMs, node.Save()).
-
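As a rough illustration of the host-language approach described in the comment above, here is a sketch using only Python's standard library instead of Scrapy. The helper name inner_xml is hypothetical, the sample document is invented, and xml.etree.ElementTree only accepts well-formed XML, so real-world HTML would need a more forgiving parser.

```python
import xml.etree.ElementTree as ET


def inner_xml(element):
    """Serialize everything inside `element`, without its own start/end tag."""
    parts = [element.text or ""]
    for child in element:
        # tostring() also emits the child's tail text, so nothing is lost.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


doc = ET.fromstring(
    '<body><div id="leexample">justtext<p class="yahc">sometext</p>tail</div>'
    '<div id="lenot">blabla</div></body>'
)
div = doc.find(".//div[@id='leexample']")
inner = inner_xml(div)
print(inner)
```
-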
Sjaak Trekhaak almost 12 years
/html() is not supported; I'm not even sure it's valid. Scrapy will throw:
ValueError: Invalid XPath: //h1/html()
-
Kangur about 8 years
Watch out: //span[@class="title"]/node() will fail if the element has multiple classes. Combine it with a CSS selector to select elements with the given class: parent.css('.title').xpath('node()')
-
Dimitre Novatchev about 8 years
@Kangur, CSS is not needed. See this answer explaining how to determine whether an element has a class that may appear alongside other class names: stackoverflow.com/a/35354908/36305
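For reference, a widely used XPath 1.0 idiom for this kind of class test pads the class attribute with spaces and looks for the padded name. It may or may not be exactly what the linked answer proposes. The sketch below builds such an expression and mirrors its logic in plain Python; the helper names has_class and matches are hypothetical.

```python
def has_class(name):
    """Build an XPath 1.0 predicate matching `name` among several classes."""
    return "contains(concat(' ', normalize-space(@class), ' '), ' {} ')".format(name)


def matches(class_attr, name):
    """Plain-Python mirror of the predicate, to show what it tests."""
    # normalize-space() trims and collapses whitespace; split/join does the same.
    padded = " " + " ".join(class_attr.split()) + " "
    return " " + name + " " in padded


xpath = "//span[{}]/node()".format(has_class("title"))
print(xpath)
print(matches("big title", "title"), matches("subtitle", "title"))
```

The padding is what keeps "subtitle" from matching "title", while still accepting "title" anywhere in a multi-class attribute.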