Crawling LinkedIn while authenticated with Scrapy
class LinkedPySpider(BaseSpider):
should be:
class LinkedPySpider(InitSpider):
Also, you shouldn't override the parse function, as I mentioned in my answer here: https://stackoverflow.com/a/5857202/crawling-with-an-authenticated-session-in-scrapy
If you don't understand how to define the rules for extracting links, just have a proper read through the documentation:
http://readthedocs.org/docs/scrapy/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
http://readthedocs.org/docs/scrapy/en/latest/topics/link-extractors.html#topics-link-extractors
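As a rough illustration of what a rule's link extractor does (plain Python, not Scrapy's actual API; the function name, pattern, and URLs below are made up for the sketch): it collects the hrefs found on a page and keeps only those matching the rule's `allow` regex, then schedules the survivors with the rule's callback.

```python
import re

def extract_links(hrefs, allow):
    """Keep only URLs matching the allow pattern, roughly what a
    link extractor configured with allow=... does inside a Rule."""
    pattern = re.compile(allow)
    return [url for url in hrefs if pattern.search(url)]

# Hypothetical hrefs scraped from a results page:
hrefs = [
    "http://www.linkedin.com/companies/acme-corp.html",
    "http://www.linkedin.com/csearch/results?page=2",
]

# Keep only detail pages ending in "-something.html":
print(extract_links(hrefs, r'-\w+\.html$'))
# → ['http://www.linkedin.com/companies/acme-corp.html']
```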
Comments
-
Gates about 2 years
So I've read through Crawling with an authenticated session in Scrapy and I am getting hung up. I am 99% sure that my parse code is correct; I just don't believe the login is redirecting and succeeding.
I'm also having an issue with check_login_response(): I'm not sure what page it is checking, though "Sign Out" would make sense.
====== UPDATED ======
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from linkedpy.items import LinkedPyItem

class LinkedPySpider(InitSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/csearch/results?type=companies&keywords=&pplSearchOrigin=GLHD&pageKey=member-home&search=Search#facets=pplSearchOrigin%3DFCTD%26keywords%3D%26search%3DSubmit%26facet_CS%3DC%26facet_I%3D80%26openFacets%3DJO%252CN%252CCS%252CNFR%252CF%252CCCR%252CI"]

    def init_request(self):
        # This function is called before crawling starts.
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # Generate a login request.
        return FormRequest.from_response(response,
            formdata={'session_key': '[email protected]',
                      'session_password': 'somepassword'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        # Check the response returned by a login request to see if we are
        # successfully logged in.
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            return self.initialized()  # ****THIS LINE FIXED THE LAST PROBLEM*****
        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//ol[@id='result-set']/li")
        items = []
        for site in sites:
            item = LinkedPyItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            items.append(item)
        return items
The issue was resolved by adding 'return' in front of self.initialized()
Thanks Again! -Mark
-
Gates about 12 years
That did help. I see a log of Success, but I am not sure the
def parse(self, response):
is actually running. I tried putting a self.log() in there and nothing returned. -
Gates about 12 years
It seems
parse()
should be
parse_item()
-
Gates about 12 years
There is a GOOD chance the problem has to do with the above and
allow=r'-\w+.html$'
as I do not know what this is. -
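For what it's worth, `allow=r'-\w+.html$'` is just a regular expression the link extractor applies to each candidate URL: it keeps links ending in a hyphen, one or more word characters, and `.html`. Note the unescaped `.` matches any character, so `-foo.html` and `-foozhtml` both pass; `\.html` would be stricter. A quick check with Python's `re` module (the example URLs are made up):

```python
import re

pattern = re.compile(r'-\w+.html$')  # the allow pattern from the rule

print(bool(pattern.search('http://example.com/some-page.html')))  # True: "-page" before ".html"
print(bool(pattern.search('http://example.com/page.html')))       # False: no "-word" before ".html"
```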
Gates about 12 years
(Updated based off these changes)