How to get innerHTML of a node using scrapy Selector?

python html xpath css-selectors scrapy

10,060

Solution 1

Here's what I managed to do:

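from scrapy.selector import Selector

# html_string holds the HTML fragment from the question
sel = Selector(text=html_string)

# 'a *::text' selects every text node below each <a> element
for node in sel.css('a *::text'):
    print(node.extract())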

Assuming that html_string is a variable holding the HTML from your question, this code produces the following output (the whitespace-only text nodes between the tags print as blank lines):

   text in a

text in b


text in c




text in b

   text in a

text in c

The selector a *::text matches all the text nodes that are descendants of a nodes. Note that ::text is a Scrapy-specific pseudo-element and takes no parentheses.
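
To get from those separate text nodes to the single string the question asks for, the fragments still have to be joined and their whitespace collapsed. A minimal sketch in plain Python, assuming the list below is what the selector yields for the first a element (`join_text_nodes` is a hypothetical helper name, not part of Scrapy):

```python
def join_text_nodes(fragments):
    """Join text fragments and collapse every whitespace run to one space."""
    # str.split() with no argument splits on any whitespace and drops
    # empty pieces, so whitespace-only fragments vanish on their own.
    words = " ".join(fragments).split()
    return " ".join(words)


# Assumed sample input: the text nodes below the first <a> element.
first_a = ["\n   text in a\n   ", "text in b", "\n   ", "text in c", "\n"]
print(join_text_nodes(first_a))  # text in a text in b text in c
```

In a real spider this would probably be called once per element, for example with `a.xpath('.//text()').extract()` for each `a` in `sel.css('a')`, so that the text of different a elements is not merged.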

Solution 2

You can use XPath's string() function on the elements you select:

$ python
>>> import scrapy
>>> selector = scrapy.Selector(text="""<a>
...    text in a
...    <b>text in b</b>
...    <c>text in c</c>
... </a>
... <a>
...    <b>text in b</b>
...    text in a
...    <c>text in c</c>
... </a>""", type="html")
>>> for link in selector.css('a'):
...     print(link.xpath('string(.)').extract())
... 
['\n   text in a\n   text in b\n   text in c\n']
['\n   text in b\n   text in a\n   text in c\n']
>>>

Solution 3

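Try this. It returns every child node of each a element, with text nodes as plain strings and child elements serialized together with their tags: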

response.xpath('//a/node()').extract()


Author by

kuixiong

Updated on July 03, 2022

Comments

kuixiong almost 2 years
Suppose there are some HTML fragments like:
```
<a>
   text in a
   <b>text in b</b>
   <c>text in c</c>
</a>
<a>
   <b>text in b</b>
   text in a
   <c>text in c</c>
</a>
```
I want to extract the text inside each a tag, dropping the child tags but keeping their text. For the fragments above, the extracted content would be "text in a text in b text in c" and "text in b text in a text in c". I can already get the nodes with the scrapy Selector css() function, so how should I process these nodes to get what I want? Any ideas would be appreciated, thank you!