Python data scraping with Scrapy

Solution 1

Basically, you have plenty of tools to choose from, such as Scrapy, lxml, BeautifulSoup, mechanize, and Selenium.

These tools have different purposes but they can be mixed together depending on the task.

Scrapy is a powerful and very smart tool for crawling websites and extracting data. But when it comes to manipulating the page - clicking buttons, filling forms - it becomes more complicated:

  • sometimes it's easy to simulate filling/submitting a form by making the underlying form request directly in Scrapy (see the sketch after this list)
  • sometimes you have to use other tools to help Scrapy, like mechanize or Selenium
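
For illustration, here is a minimal sketch of the first option, using FormRequest.from_response; the URL and the 'query' field name are made up, not taken from any real site:

from scrapy.http import FormRequest
from scrapy.spider import BaseSpider


class FormSubmitSpider(BaseSpider):
    name = "form_submit"
    start_urls = ['http://example.com/search']  # placeholder URL

    def parse(self, response):
        # fill the form found on the page and POST it, all within Scrapy
        yield FormRequest.from_response(response,
                                        formdata={'query': 'some value'},
                                        callback=self.parse_results)

    def parse_results(self, response):
        # extract the items from the results page here
        pass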

If you make your question more specific, it will be easier to understand which tools you should use or choose from.

Take a look at an example of an interesting Scrapy & Selenium mix. Here, Selenium's task is to click the button and provide data for the Scrapy items:

import time

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider

from selenium import webdriver


class ElyseAvenueItem(Item):
    name = Field()


class ElyseAvenueSpider(BaseSpider):
    name = "elyse"
    allowed_domains = ["ehealthinsurance.com"]
    start_urls = [
    'http://www.ehealthinsurance.com/individual-family-health-insurance?action=changeCensus&census.zipCode=48341&census.primary.gender=MALE&census.requestEffectiveDate=06/01/2013&census.primary.month=12&census.primary.day=01&census.primary.year=1971']

    def __init__(self):
        # start a real Firefox browser controlled by Selenium
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # load the page in the browser and click the "go" button
        self.driver.get(response.url)
        el = self.driver.find_element_by_xpath("//input[contains(@class,'btn go-btn')]")
        if el:
            el.click()

        # crude wait for the results to render (see the explicit-wait sketch below)
        time.sleep(10)

        # hand the rendered data over to Scrapy items
        plans = self.driver.find_elements_by_class_name("plan-info")
        for plan in plans:
            item = ElyseAvenueItem()
            item['name'] = plan.find_element_by_class_name('primary').text
            yield item

        self.driver.close()
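
As a side note, the fixed time.sleep(10) above can usually be replaced with an explicit wait. A minimal sketch, assuming the same "plan-info" class as in the spider above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def wait_for_plans(driver, timeout=10):
    # block until the plan blocks are present instead of always sleeping 10 seconds
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "plan-info")))

Inside parse(), plans = wait_for_plans(self.driver) would then replace both the time.sleep(10) call and the find_elements_by_class_name() call.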

UPDATE:

Here's an example of how to use Scrapy in your case:

from scrapy.http import FormRequest
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class AcrisItem(Item):
    borough = Field()
    block = Field()
    doc_type_name = Field()


class AcrisSpider(BaseSpider):
    name = "acris"
    allowed_domains = ["a836-acris.nyc.gov"]
    start_urls = ['http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType']


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # one <option> per entry in the "Document Class" combo box
        document_classes = hxs.select('//select[@name="combox_doc_doctype"]/option')

        # anti-forgery token that has to be sent back with every form POST
        form_token = hxs.select('//input[@name="__RequestVerificationToken"]/@value').extract()[0]
        for document_class in document_classes:
            if document_class:
                doc_type = document_class.select('.//@value').extract()[0]
                doc_type_name = document_class.select('.//text()').extract()[0]
                formdata = {'__RequestVerificationToken': form_token,
                            'hid_selectdate': '7',
                            'hid_doctype': doc_type,
                            'hid_doctype_name': doc_type_name,
                            'hid_max_rows': '10',
                            'hid_ISIntranet': 'N',
                            'hid_SearchType': 'DOCTYPE',
                            'hid_page': '1',
                            'hid_borough': '0',
                            'hid_borough_name': 'ALL BOROUGHS',
                            'hid_ReqID': '',
                            'hid_sort': '',
                            'hid_datefromm': '',
                            'hid_datefromd': '',
                            'hid_datefromy': '',
                            'hid_datetom': '',
                            'hid_datetod': '',
                            'hid_datetoy': '', }
                yield FormRequest(url="http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentTypeResult",
                                  method="POST",
                                  formdata=formdata,
                                  callback=self.parse_page,
                                  meta={'doc_type_name': doc_type_name})

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)

        # one row per search result in the results table
        rows = hxs.select('//form[@name="DATA"]/table/tbody/tr[2]/td/table/tr')
        for row in rows:
            item = AcrisItem()
            borough = row.select('.//td[2]/div/font/text()').extract()
            block = row.select('.//td[3]/div/font/text()').extract()

            if borough and block:
                item['borough'] = borough[0]
                item['block'] = block[0]
                item['doc_type_name'] = response.meta['doc_type_name']

                yield item

Save it as spider.py, run it via scrapy runspider spider.py -o output.json, and in output.json you will see:

{"doc_type_name": "CONDEMNATION PROCEEDINGS ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFICATE OF REDUCTION ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "COLLATERAL MORTGAGE ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFIED COPY OF WILL ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CONFIRMATORY DEED ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERT NONATTCHMENT FED TAX LIEN ", "borough": "Borough", "block": "Block"}
...

Hope that helps.

Solution 2

If you simply want to submit the form and extract data from the resulting page, I'd go for a plain HTTP request plus an HTML parser such as lxml or BeautifulSoup.

Scrapy's added value really lies in its ability to follow links and crawl a website; I don't think it is the right tool for the job if you know precisely what you are searching for.
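
As an illustration, here is a minimal sketch of that lighter-weight approach against the same ACRIS endpoint used in Solution 1. It assumes the requests library as the HTTP client; the document-type code 'AGMT' is made up, only a subset of the form fields is filled in, and in practice the __RequestVerificationToken from the search page would also need to be included, as in the Scrapy example:

import requests
from lxml import html

SEARCH_URL = 'http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentTypeResult'

# subset of the form fields used in the Scrapy example above
formdata = {
    'hid_selectdate': '7',
    'hid_doctype': 'AGMT',            # made-up document-type code
    'hid_doctype_name': 'AGREEMENT',  # made-up document-type name
    'hid_max_rows': '10',
    'hid_ISIntranet': 'N',
    'hid_SearchType': 'DOCTYPE',
    'hid_page': '1',
    'hid_borough': '0',
    'hid_borough_name': 'ALL BOROUGHS',
}

response = requests.post(SEARCH_URL, data=formdata)
tree = html.fromstring(response.content)

# same row and column XPaths as in parse_page() above
for row in tree.xpath('//form[@name="DATA"]/table/tbody/tr[2]/td/table/tr'):
    borough = row.xpath('.//td[2]/div/font/text()')
    block = row.xpath('.//td[3]/div/font/text()')
    if borough and block:
        print(borough[0], block[0])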

Solution 3

I would personally use mechanize, as I do not have any experience with Scrapy. However, a library named Scrapy, purpose-built for screen scraping, should be up to the task. I would just have a go with both of them and see which does the job best/most easily.
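
For completeness, here is a minimal mechanize sketch of a form submission. The URL and the 'query' field name are placeholders, not taken from the question:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)           # ignore robots.txt for this illustration
br.open('http://example.com/search')  # placeholder URL

br.select_form(nr=0)            # pick the first form on the page
br['query'] = 'some value'      # 'query' is a made-up field name
result = br.submit()

print(result.read())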

Comments

  • Sibtain Norain
    Sibtain Norain almost 2 years

    I want to scrape data from a website which has text fields, buttons, etc. My requirement is to fill in the text fields, submit the form to get the results, and then scrape the data points from the results page.

    I want to know whether Scrapy has this feature, or whether anyone can recommend another Python library to accomplish this task?

    (edited)
    I want to scrape the data from the following website:
    http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType

    My requirement is to select values from the combo boxes, hit the search button and scrape the data points from the results page.

    P.S. I'm using the Selenium Firefox driver to scrape data from some other website, but that solution is not good because the Selenium Firefox driver depends on Firefox's executable, i.e. Firefox must be installed before running the scraper.

    The Selenium Firefox driver consumes around 100 MB of memory per instance, and my requirement is to run a lot of instances at a time to make the scraping process quick, so there is a memory limitation as well.

    Firefox sometimes crashes during the execution of the scraper, I don't know why. Also, I need windowless scraping, which is not possible with the Selenium Firefox driver.

    My ultimate goal is to run the scrapers on Heroku, where I have a Linux environment, so the Selenium Firefox driver won't work there. Thanks

  • alecxe
    alecxe almost 11 years
    Ok, looks like a simple HTML form that sends a POST request. Just using Scrapy should be enough. In theory, it should go like this: scrape the main page, get the choices from the select fields, start Requests with a callback, and collect the data into Scrapy items in that callback. If you want, I can provide an example.
  • Sibtain Norain
    Sibtain Norain almost 11 years
    Yes please, it'll be really helpful for me if you can give a small example of the implementation.
  • alecxe
    alecxe almost 11 years
    I've added an example. Reading Document Type values one by one, making FormRequests and parsing borough and block fields from search results. Consider accepting the answer if it was helpful. Happy scraping!
  • Sibtain Norain
    Sibtain Norain almost 11 years
    Thank you so much! :) The example you provided was really helpful for me.
  • Travis Leleu
    Travis Leleu about 9 years
    It's just my opinion, but I hugely prefer lxml over BeautifulSoup. It's significantly faster, and it has a fallback to a BeautifulSoup-based parser: if you try to use lxml on malformed HTML, simply catch the exception and feed the markup into lxml's built-in soup parser (see the sketch after these comments).
  • csabinho
    csabinho over 4 years
    But on the other hand it will work with the final DOM and not with the plain source code!
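
A minimal sketch of the lxml fallback described in the comment above; it assumes BeautifulSoup is installed, which lxml's soupparser module requires:

from lxml import etree
from lxml.html import soupparser


def parse_html(text):
    try:
        # fast lxml parse; recover=False makes it raise on badly broken markup
        return etree.fromstring(text, parser=etree.HTMLParser(recover=False))
    except etree.XMLSyntaxError:
        # fall back to the BeautifulSoup-based parser bundled with lxml
        return soupparser.fromstring(text)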