Extract class name in scrapy
Solution 1
You could use a combination of both somewhere in your code:
import re

classes = response.css('.star-rating').xpath("@class").extract()
for cls in classes:
    # Look for a class token such as "count-2"
    match = re.search(r'\bcount-\d+\b', cls)
    if match:
        print("Class = {}".format(match.group(0)))
Solution 2
You can extract the rating directly using re_first() and re():
for rating in response.xpath('//div[contains(@class, "star-rating")]/@class').re(r'count-(\d+)'):
    print(rating)
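Only re() is shown above. As a minimal sketch of the re_first() variant: it returns the first match, or a default (None unless specified) when nothing matches, so a page without a rating does not raise. The function name parse_rating is hypothetical, and response is assumed to be a Scrapy response object.

```python
def parse_rating(response):
    """Return the star count as an int, or None when no rating is found.

    `response` is assumed to be a Scrapy response; `parse_rating` is a
    hypothetical helper name used only for illustration.
    """
    # re_first() yields the first captured group, or None if nothing matches.
    raw = response.css('.star-rating').xpath('@class').re_first(r'\bcount-(\d+)\b')
    return int(raw) if raw is not None else None
```

Inside a spider callback this could be called as parse_rating(response), with a None result treated as "no rating on this page".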
Author by
Dan
I am a data scientist working with Python in London. I previously worked in quantitative finance in South Africa working in MATLAB.
Updated on June 15, 2022
Comments
-
Dan almost 2 years
I am trying to scrape ratings from trustpilot.com.
Is it possible to extract a class name using Scrapy? I am trying to scrape a rating that is made up of five individual images, but the images sit inside an element whose class includes the rating. For example, if the rating is 2 stars:
<div class="star-rating count-2 size-medium clearfix">...
if it is 3 stars then:
<div class="star-rating count-3 size-medium clearfix">...
So is there a way I can scrape the class count-2 or count-3, assuming a selector like .css('.star-rating')?
-
Jan over 6 years
You could combine it with an XPath like
response.css('.star-rating').xpath("@class").extract()
(not tested).
-
Dan over 6 years
Thanks, that returns
['star-rating count-4 size-medium clearfix']
which is close enough to get something working. But do you know if I can use XPath to get only the classes starting with count-?
-
Jan over 6 years
You could try:
response.css('.star-rating').xpath(".//[contains(@class, 'count-')]/@class").extract()
-
Dan over 6 years
That errored, but this sort of hack works:
response.css('.star-rating').xpath('./@class').extract()[0].split(' ')[1][-1]
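The positional hack above depends on count-N being the second class and on the rating being a single digit. A minimal sketch of a position-independent alternative, assuming the class attribute has already been extracted as a string; the helper name star_count is hypothetical:

```python
def star_count(class_attr):
    """Return the number from a 'count-N' class token, or None if absent."""
    prefix = 'count-'
    # Split on whitespace so the order of the classes does not matter.
    for token in class_attr.split():
        suffix = token[len(prefix):]
        if token.startswith(prefix) and suffix.isdecimal():
            return int(suffix)
    return None


print(star_count('star-rating count-4 size-medium clearfix'))  # 4
```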
-
Jan over 6 years
Otherwise please give a demo link.
-
RabidCicada over 6 years
Dan, I'm fairly certain that XPath 1.0 only operates on nodes in the DOM. Scrapy uses lxml, which only implements XPath 1.0. XPath 2.0 has some nifty functions like matches, tokenize, and replace that you could use to directly get what you want. Otherwise Jan's answer is the best you will get.
-
Dan about 6 years
Thanks, ended up combining the two answers to
response.css('.star-rating').xpath("@class").re(r'count-(\d)')[0]
-
gangabass about 6 years
@Dan You'll get an exception on pages without a rating ([0] will not work for None).
-
Dan about 6 years
Thanks, but it looks like 1 star is the lowest allowed.