Extract class name in scrapy

11,029

Solution 1

You could combine a CSS selector with an XPath expression to get the class attribute, then pick out the count-N class with a regular expression:

import re

classes = response.css('.star-rating').xpath("@class").extract()
for cls in classes:
    match = re.search(r'\bcount-\d+\b', cls)
    if match:
        print("Class = {}".format(match.group(0)))
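
If you want the rating as a number rather than the class name, the same regex idea can be wrapped in a small helper. This is only a sketch: the name rating_from_class is made up for illustration, and it assumes the rating is always encoded as a separate count-N class token.

```python
import re

# Matches a whole class token such as "count-3" and captures the digits.
# The lookarounds require whitespace (or the string edge) on both sides,
# so a class like "discount-5" is not mistaken for a rating.
COUNT_CLASS = re.compile(r'(?<!\S)count-(\d+)(?!\S)')


def rating_from_class(class_attr):
    """Return the rating encoded in a class attribute as an int, or None."""
    match = COUNT_CLASS.search(class_attr or "")
    return int(match.group(1)) if match else None
```

In a spider you would feed it each string returned by response.css('.star-rating').xpath("@class").extract().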

Solution 2

You can extract the rating directly with the selector's re() method (or re_first() if you only need the first match):

for rating in response.xpath('//div[contains(@class, "star-rating")]/@class').re(r'count-(\d+)'):
    print(rating)
Author by

Dan

I am a data scientist working with Python in London. I previously worked in quantitative finance in South Africa working in MATLAB.

Updated on June 15, 2022

Comments

  • Dan
    Dan almost 2 years

    I am trying to scrape ratings from trustpilot.com.

    Is it possible to extract a class name using Scrapy? The rating is rendered as five individual images, but the images sit inside an element whose class name encodes the rating. For example, if the rating is 2 stars:

    <div class="star-rating count-2 size-medium clearfix">...
    

    if it is 3 stars then:

    <div class="star-rating count-3 size-medium clearfix">...
    

    So is there a way I can scrape the class count-2 or count-3 assuming a selector like .css('.star-rating')?

    • Jan
      Jan over 6 years
      You could combine it with an xpath like response.css('.star-rating').xpath("@class").extract() (not tested).
    • Dan
      Dan over 6 years
      Thanks, that returns ['star-rating count-4 size-medium clearfix'] which is close enough to get something working. But do you know if I can use xpath to only get the classes starting with count-?
    • Jan
      Jan over 6 years
      You could try: response.css('.star-rating').xpath(".//[contains(@class, 'count-')]/@class").extract()
    • Dan
      Dan over 6 years
      That errored, but this sort of hack works: response.css('.star-rating').xpath('./@class').extract()[0].split(' ')[1][-1]
    • Jan
      Jan over 6 years
      Otherwise please give a demo link.
    • RabidCicada
      RabidCicada over 6 years
      Dan I'm fairly certain that xpath1 only operates on nodes in the dom. scrapy uses lxml which only implements xpath1. xpath2 has some nifty functions like matches, tokenize, and replace that you could use to directly get what you want. Otherwise Jan's answer is the best you will get
  • Dan
    Dan about 6 years
    Thanks, ended up combining the two answers to response.css('.star-rating').xpath("@class").re(r'count-(\d)')[0]
  • gangabass
    gangabass about 6 years
    @Dan You'll get an exception on pages without rating ([0] will not work for None)
  • Dan
    Dan about 6 years
    Thanks, but it looks like 1 star is the lowest allowed.