Scrapy: extracting data from an html tag that uses an "id" Selector instead of a "class"

web-scraping scrapy

Solution 1

One way is to use an attribute selector that matches on the "id" attribute:

>>> HTML = '''
... <span id="id_A">Hello, Earth</span>
... <span id="id_B">Hello, Universe</span>
... '''
>>> from scrapy.selector import Selector
>>> selector = Selector(text=HTML)
>>> selector.css('[id="id_A"]::text').extract()
['Hello, Earth']

Alternatively, use an id selector, written with "#":

>>> HTML = '''
... <span id="id_A">Hello, Earth</span>
... <span id="id_B">Hello, Universe</span>
... '''
>>> from scrapy.selector import Selector
>>> selector = Selector(text=HTML)
>>> selector.css('span#id_A::text').extract()
['Hello, Earth']

Scrapy uses cssselect, which follows the W3C Selectors Level 3 specification.

Solution 2

The problem is that you're using a "class selector": ".id_A" matches elements with class="id_A", not elements with id="id_A". You should use an "id selector" instead. This should work:

response.css('#id_A::text').extract()

Author by

RF_956

Updated on June 09, 2022

Comments

RF_956 almost 2 years
I am new to web scraping and Scrapy. I hope you can help me.

I am trying to extract data from a web page that uses a span tag. Usually, if the span tag uses a class, for example:
```
<span class="class_A">Hello, World!</span>
```
I would use the following code to retrieve the text.
```
response.css('span.class_A::text').extract()
```
However, when the HTML uses an "id" instead of a "class", for example,
```
<span id="id_A">Hello, Universe!</span>
```
the code below does not work anymore.
```
response.css('span.id_A::text').extract()
```
Please help! What's the correct way of extracting data using an "id"?

Thank you for your help!