Extract class name in scrapy

python web-scraping css-selectors scrapy

11,029

Solution 1

You could combine the CSS selector with XPath and a regular expression somewhere in your code:

import re

classes = response.css('.star-rating').xpath("@class").extract()
for cls in classes:
    match = re.search(r'\bcount-\d+\b', cls)
    if match:
        print("Class = {}".format(match.group(0)))
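
As a variation on the loop above, here is a minimal sketch that works on the raw class string alone, so it can be tried without a Scrapy response. It splits the attribute into whitespace-separated tokens and accepts only a token that is exactly `count-N`. The helper name `find_count_class` is hypothetical and not part of Scrapy.

```python
import re

# Pattern for a complete class token such as "count-2".
COUNT_CLASS = re.compile(r"count-\d+")


def find_count_class(class_attr):
    """Return the first class token of the form count-N, or None if absent."""
    # A class attribute is a whitespace-separated list of tokens.
    for token in class_attr.split():
        # fullmatch rejects tokens that merely contain the pattern,
        # e.g. "discount-5".
        if COUNT_CLASS.fullmatch(token):
            return token
    return None


print(find_count_class("star-rating count-2 size-medium clearfix"))  # count-2
```

Matching whole tokens is slightly stricter than `\b`, which also treats a hyphen as a word boundary and would therefore accept the tail of a class like `x-count-2`.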

Solution 2

You can extract the rating directly using re() (or re_first() when you only need the first match):

for rating in response.xpath('//div[contains(@class, "star-rating")]/@class').re(r'count-(\d+)'):
    print(rating)
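
On a page without a rating, re() returns an empty list, so indexing it with [0] raises an exception. In Scrapy, re_first() avoids this by returning a default value (None unless you pass `default=`). The sketch below mimics that behaviour with only the standard library, operating on a plain list of class strings such as the one `.extract()` returns. The function name `first_rating` is an assumption made for illustration.

```python
import re

# The capture group keeps only the digits, as in the .re() call above.
RATING = re.compile(r"count-(\d+)")


def first_rating(class_attrs, default=None):
    """Return the first rating found in a list of class strings as an int.

    Falls back to `default` when no string contains a count-N class,
    instead of raising on an empty result.
    """
    for class_attr in class_attrs:
        match = RATING.search(class_attr)
        if match:
            return int(match.group(1))
    return default


print(first_rating(["star-rating count-4 size-medium clearfix"]))  # 4
print(first_rating([]))  # None
```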


Author by

Dan

I am a data scientist working with Python in London. I previously worked in quantitative finance in South Africa working in MATLAB.

Updated on June 15, 2022

Comments

Dan almost 2 years
I am trying to scrape rating off of trustpilot.com.

Is it possible to extract a class name using scrapy? I am trying to scrape a rating that is made up of five individual images, but the images sit inside an element whose class name encodes the rating. For example, if the rating is 2 stars, then:
```
<div class="star-rating count-2 size-medium clearfix">...
```
If it is 3 stars, then:
```
<div class="star-rating count-3 size-medium clearfix">...
```
So is there a way I can scrape the class count-2 or count-3 assuming a selector like .css('.star-rating')?
- Jan over 6 years
  
  You could combine it with an xpath like response.css('.star-rating').xpath("@class").extract() (not tested).
- Dan over 6 years
  
  Thanks, that returns ['star-rating count-4 size-medium clearfix'] which is close enough to get something working. But do you know if I can use xpath to only get the classes starting with count-?
- Jan over 6 years
  
  You could try: response.css('.star-rating').xpath(".//[contains(@class, 'count-')]/@class").extract()
- Dan over 6 years
  
  That errored, but this sort of hack works: response.css('.star-rating').xpath('./@class').extract()[0].split(' ')[1][-1]
- Jan over 6 years
  
  Otherwise please give a demo link.
- RabidCicada over 6 years
  
  Dan, I'm fairly certain that XPath 1.0 only operates on nodes in the DOM. Scrapy uses lxml, which only implements XPath 1.0. XPath 2.0 has some nifty functions like matches, tokenize, and replace that you could use to get what you want directly. Otherwise, Jan's answer is the best you will get.
Dan about 6 years

Thanks, ended up combining the two answers to response.css('.star-rating').xpath("@class").re(r'count-(\d)')[0]
gangabass about 6 years

@Dan You'll get an exception on pages without a rating ([0] raises an IndexError on the empty list that re() returns)
Dan about 6 years

Thanks, but it looks like 1 star is the lowest allowed.