How to get innerHTML of a node using scrapy Selector?
Solution 1
Here's what I managed to do:
from scrapy.selector import Selector

sel = Selector(text=html_string)
# 'a *::text' selects every text node nested anywhere inside an <a> element
for node in sel.css('a *::text'):
    print(node.extract())
Assuming that html_string is a variable holding the HTML in your question, this code produces the following output:
text in a
text in b
text in c
text in b
text in a
text in c
The selector a *::text matches all the text nodes that are descendants of a elements.
Solution 2
You can use XPath's string() function on the elements you select:
$ python
>>> import scrapy
>>> selector = scrapy.Selector(text="""<a>
... text in a
... <b>text in b</b>
... <c>text in c</c>
... </a>
... <a>
... <b>text in b</b>
... text in a
... <c>text in c</c>
... </a>""", type="html")
>>> for link in selector.css('a'):
...     print(link.xpath('string(.)').extract())
...
[u'\n text in a\n text in b\n text in c\n']
[u'\n text in b\n text in a\n text in c\n']
>>>
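The strings returned above still carry the newlines and indentation of the source markup. A minimal sketch of collapsing that whitespace in plain Python, assuming raw holds one string taken from the extract() list (collapse_whitespace is a hypothetical helper name, not part of scrapy):

```python
def collapse_whitespace(raw):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    # str.split() with no arguments splits on any run of whitespace
    # and discards the empty pieces at the start and end.
    return " ".join(raw.split())


# Assumed sample input, shaped like the string(.) output shown above.
raw = "\n text in a\n text in b\n text in c\n"
print(collapse_whitespace(raw))
```

XPath's normalize-space(.) does a similar job on the XPath side, so link.xpath('normalize-space(.)').extract() may be enough if no further processing is needed.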
Solution 3
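Try this. It returns every child node of each a element as a string, with the tags of nested elements left in place: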
response.xpath('//a/node()').extract()
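Note that //a/node() puts the children of every a element into one flat list. A minimal sketch of rebuilding the inner HTML one element at a time, assuming node is a single scrapy Selector such as one item of response.css('a') (inner_html is a hypothetical helper name, not a scrapy function):

```python
def inner_html(node):
    """Return the serialized children of one selector, joined into a string."""
    # node() selects every child node (elements, text, comments);
    # extract() serializes each one, keeping the tags of element children.
    return "".join(node.xpath("node()").extract())


# Usage inside a spider callback, where response is scrapy's response object:
# for a in response.css("a"):
#     print(inner_html(a))
```

Joining with an empty string keeps the markup exactly as serialized, which is closest to what innerHTML means in a browser.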
Author: kuixiong
Updated on July 03, 2022
Suppose there are some HTML fragments like:
<a>
  text in a
  <b>text in b</b>
  <c>text in c</c>
</a>
<a>
  <b>text in b</b>
  text in a
  <c>text in c</c>
</a>
I want to extract the text inside each a tag, dropping the nested tags but keeping their text. For the fragments above, the content I want would be "text in a text in b text in c" and "text in b text in a text in c". I can already get the nodes using the scrapy Selector css() function, but how should I process these nodes to get what I want? Any ideas would be appreciated, thank you!