Scrapy: extracting data from an HTML tag that uses an "id" selector instead of a "class"

Solution 1

One way is to use an attribute selector that matches on the id:

>>> HTML = '''
... <span id="id_A">Hello, Earth</span>
... <span id="id_B">Hello, Universe</span>
... '''
>>> from scrapy.selector import Selector
>>> selector = Selector(text=HTML)
>>> selector.css('[id="id_A"]::text').extract()
['Hello, Earth']

Alternatively, use the id selector syntax, which is written with a #:

>>> HTML = '''
... <span id="id_A">Hello, Earth</span>
... <span id="id_B">Hello, Universe</span>
... '''
>>> from scrapy.selector import Selector
>>> selector = Selector(text=HTML)
>>> selector.css('span#id_A::text').extract()
['Hello, Earth']

Scrapy uses cssselect, which follows the W3C Selectors Level 3 specification.

Solution 2

The problem is that you're using a class selector: the dot in span.id_A matches the class attribute, not the id. For an element identified by its id, use an id selector, which is written with a #. This should work:

response.css('#id_A::text').extract()
Author by

RF_956

Updated on June 09, 2022

Comments

  • RF_956
    RF_956 almost 2 years

    I am new to web scraping and Scrapy. I hope you can help me.

    I am trying to extract data from a web page that uses <span> tags. Usually, if the span tag uses a class, for example:

    <span class="class_A">Hello, World!</span>
    

    I would use the following code to retrieve the text.

    response.css('span.class_A::text').extract()
    

    However, when the HTML uses an "id" instead of a "class", for example,

    <span id="id_A">Hello, Universe!</span>
    

    the code below does not work anymore.

    response.css('span.id_A::text').extract()
    

    Please help! What's the correct way of extracting data using an "id"?

    Thank you for your help!