Scraping data out of Facebook using Scrapy

The problem is that the search results (specifically the div initial_browse_result) are loaded dynamically via JavaScript. Scrapy receives the page before that JavaScript runs, so the results are not in the response yet.

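Basically, you have two options here: reproduce the request that fetches the data yourself, or use a tool that runs the JavaScript and hands you the rendered page.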

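If you go with the first option, you should analyze all the requests made during the page load (the network tab of your browser's developer tools lists them) and figure out which one is responsible for fetching the data you want to scrape. You can then issue that request from Scrapy directly.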
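As a rough illustration of the first option, here is a minimal sketch of what a callback could do once you have found the request that carries the data. It assumes, purely for illustration, that the endpoint returns JSON with an HTML fragment under a key named "html"; that key, the sample body and the helper names are hypothetical, and only the "_4_yl" class comes from the question. It uses the standard library only, so it runs without Scrapy.

```python
import json
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text inside every <div> whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0       # nesting level inside a matching <div>, 0 when outside
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1  # a nested <div> inside a match
        elif self.target_class in classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract_results(body, target_class="_4_yl"):
    """Parse a JSON body whose hypothetical "html" key holds an HTML fragment."""
    fragment = json.loads(body)["html"]
    parser = ClassTextCollector(target_class)
    parser.feed(fragment)
    parser.close()
    return [text.strip() for text in parser.results]


# Hypothetical response body, standing in for what the real request would return.
sample_body = json.dumps(
    {"html": '<div class="_4_yl">Alice</div><div class="_4_yl">Bob</div>'}
)
print(extract_results(sample_body))  # ['Alice', 'Bob']
```

In a spider, a function like this would be called from the callback of a request sent to whatever URL you identified, with the same parameters and cookies the browser sent.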

The second option is pretty straightforward and will definitely work: use another tool to fetch the page with the JavaScript-loaded data in it, then parse that into Scrapy items.
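The second option can be sketched as a small pipeline: something renders the page, and a parser turns the rendered HTML into items. In the sketch below the renderer is just a callable passed in as `render`; `fake_render`, the example URL and the item fields are hypothetical stand-ins so the code runs on its own. In practice the renderer might be a browser automation tool (Selenium's `driver.page_source`, for instance), and inside a spider the rendered HTML could instead be wrapped in Scrapy's `Selector(text=html)`.

```python
from html.parser import HTMLParser


class ResultDivParser(HTMLParser):
    """Collect the text of every <div class="_4_yl"> (the class from the question)."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting level inside a matching <div>, 0 when outside
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1
        elif "_4_yl" in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def scrape(url, render):
    """Fetch url through render (any callable returning rendered HTML) and build items."""
    parser = ResultDivParser()
    parser.feed(render(url))
    parser.close()
    return [{"source_url": url, "text": text.strip()} for text in parser.texts]


def fake_render(url):
    # Stand-in for a real JavaScript-capable tool; returns canned "rendered" HTML.
    return (
        "<html><body><div>"
        '<div class="_4_yl">Alice</div><div class="_4_yl">Bob</div>'
        "</div></body></html>"
    )


items = scrape("http://example.com/search", fake_render)
print(items)
```

Keeping the renderer behind a plain callable means the parsing step can be tested with canned HTML, and the real tool can be swapped in later without touching the parser.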

Hope that helps.

Author by

Aryabhatt

By Day: All around nerd, ML/AI enthusiast and learner for life By Night: All around nerd, ML/AI enthusiast and learner for life Some days: Artist, traveler, volleyball

Updated on June 14, 2022

Comments

  • Aryabhatt
    Aryabhatt almost 2 years

    The new Graph Search on Facebook lets you search for the current employees of a company using a query token, for example "Current Google employees".

    I want to scrape the results page (http://www.facebook.com/search/104958162837/employees/present) with Scrapy.

    The initial problem was that Facebook only lets a logged-in user access this information, so I was redirected to login.php. So, before requesting this URL, I log in via Scrapy and then request the results page. But even though the HTTP response for this page is 200, nothing gets scraped. The code is as follows:

    from scrapy import log
    from scrapy.spider import BaseSpider
    from scrapy.http import FormRequest, Request
    from scrapy.selector import HtmlXPathSelector
    
    # Results page to scrape once logged in (the URL given above).
    query = 'http://www.facebook.com/search/104958162837/employees/present'
    
    class DmozSpider(BaseSpider):
        name = "test"
        start_urls = ['https://www.facebook.com/login.php']
        task_urls = [query]
    
        def parse(self, response):
            # Fill in and submit the login form found on login.php.
            return [FormRequest.from_response(
                response,
                formname='login_form',
                formdata={'email': 'myemailid', 'pass': 'myfbpassword'},
                callback=self.after_login)]
    
        def after_login(self, response):
            if "authentication failed" in response.body:
                self.log("Login failed", level=log.ERROR)
                return
            # Logged in: request the search results page.
            return Request(query, callback=self.page_parse)
    
        def page_parse(self, response):
            hxs = HtmlXPathSelector(response)
            print(hxs)
            items = hxs.select('//div[@class="_4_yl"]')
            print(items)
    

    What could I have missed or done incorrectly?