Crawling LinkedIn while authenticated with Scrapy

class LinkedPySpider(BaseSpider):

should be:

class LinkedPySpider(InitSpider):

Also, you shouldn't override the parse function, as I mentioned in my answer here: https://stackoverflow.com/a/5857202/crawling-with-an-authenticated-session-in-scrapy

If you don't understand how to define the rules for extracting links, just have a proper read through the documentation:
http://readthedocs.org/docs/scrapy/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
http://readthedocs.org/docs/scrapy/en/latest/topics/link-extractors.html#topics-link-extractors
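
As a rough illustration of what a rule's allow pattern does: a link extractor keeps only the links whose URL matches the regular expression, searching anywhere in the URL. The sketch below uses only Python's re module, not Scrapy itself. The pattern is the one quoted in the comments below; the sample URLs and the helper name filter_links are made up for illustration.

```python
import re

# Pattern quoted in the comments below; in a Scrapy Rule it would be
# passed to the link extractor as allow=...
ALLOW_PATTERN = re.compile(r'-\w+.html$')


def filter_links(urls, pattern=ALLOW_PATTERN):
    """Keep only URLs that the pattern matches anywhere (search, not full match)."""
    return [url for url in urls if pattern.search(url)]


# Hypothetical URLs, for illustration only.
sample_urls = [
    'http://example.com/page-about.html',        # hyphen, word, ".html" at the end
    'http://example.com/index.html',             # no hyphen before the final word
    'http://example.com/page-about.html?ref=1',  # does not end in ".html"
    'http://example.com/item-42xhtml',           # unescaped "." matches the "x"
]

kept = filter_links(sample_urls)
print(kept)
```

Note that the dot in the pattern is unescaped, so it matches any character, not only a literal "."; writing \.html would be stricter.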

Author: Gates

Find Me: @MProgano

Updated on June 05, 2022

Comments

  • Gates
    Gates about 2 years

    I've read through "Crawling with an authenticated session in Scrapy" and I am getting hung up. I am 99% sure that my parse code is correct; I just don't believe the login is redirecting and succeeding.

    I am also having an issue with check_login_response(): I am not sure what page it is checking, though "Sign Out" would make sense.




    ====== UPDATED ======

    from scrapy.contrib.spiders.init import InitSpider
    from scrapy.http import Request, FormRequest
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.spiders import Rule
    
    from scrapy.selector import HtmlXPathSelector
    
    from linkedpy.items import LinkedPyItem
    
    class LinkedPySpider(InitSpider):
        name = 'LinkedPy'
        allowed_domains = ['linkedin.com']
        login_page = 'https://www.linkedin.com/uas/login'
        start_urls = ["http://www.linkedin.com/csearch/results?type=companies&keywords=&pplSearchOrigin=GLHD&pageKey=member-home&search=Search#facets=pplSearchOrigin%3DFCTD%26keywords%3D%26search%3DSubmit%26facet_CS%3DC%26facet_I%3D80%26openFacets%3DJO%252CN%252CCS%252CNFR%252CF%252CCCR%252CI"]
    
        def init_request(self):
            """This function is called before crawling starts."""
            return Request(url=self.login_page, callback=self.login)
    
        def login(self, response):
            """Generate a login request."""
            return FormRequest.from_response(response,
                        formdata={'session_key': '[email protected]', 'session_password': 'somepassword'},
                        callback=self.check_login_response)
    
        def check_login_response(self, response):
            """Check the response returned by a login request to see if we are successfully logged in."""
            if "Sign Out" in response.body:
                self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
                # Now the crawling can begin..
    
                return self.initialized() # ****THIS LINE FIXED THE LAST PROBLEM*****
    
            else:
                self.log("\n\n\nFailed, Bad times :(\n\n\n")
                # Something went wrong, we couldn't log in, so nothing happens.
    
        def parse(self, response):
            self.log("\n\n\n We got data! \n\n\n")
            hxs = HtmlXPathSelector(response)
            sites = hxs.select("//ol[@id='result-set']/li")
            items = []
            for site in sites:
                item = LinkedPyItem()
                item['title'] = site.select('h2/a/text()').extract()
                item['link'] = site.select('h2/a/@href').extract()
                items.append(item)
            return items
    



    The issue was resolved by adding 'return' in front of self.initialized().

    Thanks Again! -Mark

  • Gates
    Gates about 12 years
    That did help; I now see the success message in the log. But I am not sure that parse(self, response) is actually running: I put a self.log() call in there and nothing was logged.
  • Gates
    Gates about 12 years
    It seems parse() should be parse_item()
  • Gates
    Gates about 12 years
    There is a GOOD chance the problem has to do with the above and with allow=r'-\w+.html$', as I do not know what this does.
  • Gates
    Gates about 12 years
    (Updated based off these changes)
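
On the check_login_response() question above: the response it receives is whatever page the site returns after the login form is submitted (after any redirects have been followed), so the check only needs some text that appears on logged-in pages and nowhere else. A minimal sketch of that check as a standalone function follows. The name is_logged_in and the sample page bodies are hypothetical, and the bytes handling assumes a newer Scrapy running on Python 3, where response.body is bytes rather than str.

```python
def is_logged_in(body, marker="Sign Out", encoding="utf-8"):
    """Return True if the marker text appears in the page body.

    Accepts either bytes or str, so it can be called with response.body
    (bytes on Python 3) or with already-decoded text.
    """
    if isinstance(body, bytes):
        # Replace undecodable bytes instead of raising, since only a
        # plain-ASCII marker is being searched for.
        body = body.decode(encoding, errors="replace")
    return marker in body


# Hypothetical page bodies, for illustration only.
logged_in_page = b"<html><a href='/logout'>Sign Out</a></html>"
login_page = b"<html><form id='login'>Sign In</form></html>"

print(is_logged_in(logged_in_page))
print(is_logged_in(login_page))
```

Inside the spider, the condition in check_login_response() could then be written as is_logged_in(response.body), with the return self.initialized() call left exactly as in the updated code.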