Creating a loop to parse table data in Scrapy/Python


Solution 1

I think this is what you are looking for:

from scrapy.selector import HtmlXPathSelector  # legacy selector API (pre-Scrapy 1.0)

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    rows = hxs.select('//tr[@class="someclass"]')
    for row in rows:
        item = TestBotItem()  # your item class, defined in your project's items.py
        item['var1'] = row.select('./td[2]/p/span[2]/text()').extract()
        item['var2'] = row.select('./td[3]/p/span[2]/text()').extract()
        item['var3'] = row.select('./td[4]/p/text()').extract()

        yield item

You loop over the tr rows and use relative XPath expressions (./td...) so that each expression matches only the cells of the current row, then yield one item per iteration.
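To see why the ./ prefix matters, here is a minimal standalone sketch using Scrapy's Selector class directly (the HTML markup and values are invented for illustration):

from scrapy.selector import Selector

html = '''
<table><tbody>
  <tr class="someclass"><td><p>r1</p></td><td><p><span>a</span><span>10%</span></p></td></tr>
  <tr class="someclass"><td><p>r2</p></td><td><p><span>a</span><span>20%</span></p></td></tr>
</tbody></table>
'''

sel = Selector(text=html)
for row in sel.xpath('//tr[@class="someclass"]'):
    # Relative: matches only this row's cell -> ['10%'], then ['20%']
    print(row.xpath('./td[2]/p/span[2]/text()').extract())
    # Absolute: searches the whole document on every pass -> ['10%', '20%'] both times
    print(row.xpath('//td[2]/p/span[2]/text()').extract())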

You can also append each item to a list and return the list after the loop, like this (it is equivalent to the code above):

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    rows = hxs.select('//tr[@class="someclass"]')
    items = []

    for row in rows:
        item = TestBotItem()
        item['var1'] = row.select('./td[2]/p/span[2]/text()').extract()
        item['var2'] = row.select('./td[3]/p/span[2]/text()').extract()
        item['var3'] = row.select('./td[4]/p/text()').extract()

        items.append(item)

    return items

Solution 2

You don't need HtmlXPathSelector. Scrapy responses already have a built-in XPath selector. Try this:

def parse(self, response):
    rows = response.xpath('//tr[@class="someclass"]')
    for row in rows:
        item = TestBotItem()
        # Keep the paths relative to the current row; extract() returns a
        # list, so take the first match
        item['var1'] = row.xpath('./td[2]/p/span[2]/text()').extract()[0]
        item['var2'] = row.xpath('./td[3]/p/span[2]/text()').extract()[0]
        item['var3'] = row.xpath('./td[4]/p/text()').extract()[0]
        yield item
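
Note that .extract()[0] raises an IndexError whenever an XPath matches nothing (an empty cell, say). Recent Scrapy versions provide .get() (or the older .extract_first()), which returns None or a supplied default instead. A safer variant of the same loop:

def parse(self, response):
    for row in response.xpath('//tr[@class="someclass"]'):
        item = TestBotItem()
        # .get() never raises on a missing match; empty cells become '' here
        item['var1'] = row.xpath('./td[2]/p/span[2]/text()').get(default='')
        item['var2'] = row.xpath('./td[3]/p/span[2]/text()').get(default='')
        item['var3'] = row.xpath('./td[4]/p/text()').get(default='')
        yield item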

Comments

  • Admin
    Admin almost 2 years

    I have a Python script using Scrapy which scrapes data from a website, allocates it to three fields and then generates a .csv. It works OK, but with one major problem: every field contains all of the data, rather than the data being separated out per table row. I'm sure this is because my loop isn't working: when it finds the XPath it grabs all the data for every row before moving on to fill the other two fields, instead of creating separate rows.

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        divs = hxs.select('//tr[@class="someclass"]')
        for div in divs:
            item = TestBotItem()
            item['var1'] = div.select('//table/tbody/tr[*]/td[2]/p/span[2]/text()').extract()
            item['var2'] = div.select('//table/tbody/tr[*]/td[3]/p/span[2]/text()').extract() 
            item['var3'] = div.select('//table/tbody/tr[*]/td[4]/p/text()').extract()
            return item
    

    The tr with the * increases in number with each entry on the website I need to crawl, and the other two paths slot in below. How do I edit this so it grabs the first set of data for, say, //table/tbody/tr[3] only, stores it for all three fields, and then moves on to //table/tbody/tr[4], etc.?

    Update

    This works correctly; however, I'm now trying to add some validation to the pipelines.py file to drop any records where var1 is more than 100%. I'm certain my code below is wrong. Also, does using "yield" instead of "return" stop the pipeline being used?

    from scrapy.exceptions import DropItem

    class TestbotPipeline(object):
        def process_item(self, item, spider):
            if item('var1') > 100%:
                return item
            else:
                raise Dropitem(item)
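
    For what it's worth, "yield" in the spider does not bypass item pipelines; every yielded item is passed to process_item(). A working version of the pipeline might look like this (a sketch that assumes var1 arrives as a single string such as '95%'):

    from scrapy.exceptions import DropItem

    class TestbotPipeline(object):
        def process_item(self, item, spider):
            # Items are indexed like dicts: item['var1'], not item('var1')
            value = float(item['var1'].rstrip('%'))
            if value <= 100:
                return item
            # Note the capitalisation: DropItem, not Dropitem
            raise DropItem('var1 over 100%%: %s' % value)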
    
  • Admin
    Admin over 10 years
    If you could show me how to append to a list outside the loop, that would help, as I'm now trying to gather more data. I've also tried to write some validation into pipelines.py, but it seems to be ignored and all the data is output, even records that don't match the condition.
  • paul trmbrth
    paul trmbrth over 10 years
    I updated my answer with a list of items returned outside of the loop. You should probably share your spider and pipeline code on some pastebin service for us to check. Also, inside the loop on divs, make sure you're using relative XPath expressions, e.g. ./td[2]/p/span[2]/text(), and not absolute ones like //td[2]/p/span[2]/text()