Scraping Data from Facebook with Python

49,755

Solution 1

The right way to do this is to use the Facebook API. For various business, security, and privacy reasons, Facebook goes out of its way to make scraping data tricky.
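
As a rough illustration of the API route, a Graph API request is an HTTPS GET against `graph.facebook.com` with the node, the requested fields, and an access token. The sketch below only builds such a URL with the standard library. The version string `v19.0`, the helper name `build_graph_url`, and the placeholder token are assumptions for illustration, and which fields a token may read depends on the permissions the user granted.

```python
from urllib.parse import urlencode

GRAPH_BASE = "https://graph.facebook.com"  # Graph API host


def build_graph_url(node_id, fields, access_token, version="v19.0"):
    """Return a Graph API URL requesting the given fields of one node.

    The default version string is an assumption; use whichever version
    your app is registered against.
    """
    query = urlencode({"fields": ",".join(fields), "access_token": access_token})
    return f"{GRAPH_BASE}/{version}/{node_id}?{query}"


# Placeholder token: a real one comes from Facebook's OAuth flow.
url = build_graph_url("me", ["id", "name"], "YOUR_ACCESS_TOKEN")
print(url)
```

The resulting URL can be fetched with any HTTP client; the response is JSON.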

If you insist on scraping, I would try to log in first, using mechanize to submit the login form. I've never tried this with Facebook, but a lot of websites have easier-to-parse versions intended for mobile users at m.site.com.
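
For the mobile-site idea, a minimal sketch of rewriting a `www.` URL to its `m.` counterpart with the standard library might look like this. The helper name `to_mobile_url` is hypothetical, and whether a given site actually serves a mobile version at that host is an assumption you would need to check.

```python
from urllib.parse import urlsplit, urlunsplit


def to_mobile_url(url):
    """Swap a leading 'www.' host label for 'm.'; leave other hosts unchanged."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith("www."):
        host = "m." + host[len("www."):]
    return urlunsplit(parts._replace(netloc=host))


print(to_mobile_url("http://www.facebook.com/100004210542493"))
```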

Solution 2

You should look into using facepy by Johannes Gorset. He has done a brilliant job. I used it when I worked on a small Facebook app for a personal project.

Solution 3

I think scraping data from Facebook is illegal; it is covered in Facebook's terms of use. Every activity is recorded against your login details, even when you use a bot to scrape. If you are caught, Facebook can ban you for life, and if they consider you a potential threat to any of their assets, they can penalize you further.

Solution 4

You can try using Selenium together with the Facebook API. I also had to scrape similar data from a list of test Facebook accounts, and Selenium WebDriver helped me emulate a real user and scrape the required data.

Author by

cscanlin

Updated on June 11, 2020

Comments

  • cscanlin
    cscanlin almost 4 years

    I've been trying for several days now (unsuccessfully) to scrape cities from about 500 Facebook URLs. However, Facebook handles its data in a very strange way and I can't figure out what's going on under the hood to understand what I need to do.

    Essentially the problem is that Facebook displays very different amounts of data depending on who is logged in, and what the privacy settings of the account are. For instance, try opening the following three links, both in a browser where you are logged into Facebook, and one where you are not:

    [REDACTED LINKS DUE TO PRIVACY CONCERNS]

    As you can see, Facebook loads the data in both cases for the first link, but only returns data for the second link if you are logged in (to ANY account). The third link displays the city when you are logged in, but only displays other information when you are not.

    The reason this is extremely problematic (and related to Python) is that when trying to scrape the page with Beautiful Soup or Mechanize, I cannot figure out how to get the program to "pretend" that I am logged into an account. This means that I can easily grab data from the first type of link (of which there are fewer than 10), but I cannot get the city from the second or third type. So far I've tried a number of solutions with little success.

    Here's some sample code that works correctly for the first type, but not for other types:

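    import re

    import mechanize

    fb_url = 'http://www.facebook.com/100004210542493'

    # Fetch the page anonymously (no login), ignoring robots.txt
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.open(fb_url)
    all_html = br.response().get_data().decode('utf-8', errors='replace')
    print(all_html)

    # The city sits between these class-name fragments in the public markup
    match = re.search(r'fsl fwb fcb">(.+?)</a></div><div class="aboutSubtitle fsm fwn fcg', all_html)
    city = match.group(1) if match else None  # None when the page hides the city

    user_info = [fb_url, city]
    print(user_info)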
    

    I also have a version that uses Beautiful Soup. If anyone has any ideas on how to get around this, I would be extremely grateful. Thank you!
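
One possible way to scale the snippet above to a list of URLs, sketched with the standard library only: pull the regex into a function that returns `None` when the markup is missing, then write one CSV row per page. The helper names (`extract_city`, `write_rows`) and the sample HTML fragments are hypothetical stand-ins, not real Facebook responses, and the marker strings are the ones from the question, which Facebook may have changed since.

```python
import csv
import io
import re

# Same marker strings as the regex in the question.
CITY_RE = re.compile(r'fsl fwb fcb">(.+?)</a></div><div class="aboutSubtitle fsm fwn fcg')


def extract_city(html):
    """Return the city captured by CITY_RE, or None when the markup is absent."""
    match = CITY_RE.search(html)
    return match.group(1) if match else None


def write_rows(pages, out):
    """Write one (url, city) CSV row per page; city is blank when not found."""
    writer = csv.writer(out)
    writer.writerow(["url", "city"])
    for url, html in pages:
        writer.writerow([url, extract_city(html) or ""])


# Hypothetical HTML fragments standing in for downloaded pages.
sample_pages = [
    ("http://example.com/a",
     '<a class="fsl fwb fcb">Springfield</a></div>'
     '<div class="aboutSubtitle fsm fwn fcg">Current City</div>'),
    ("http://example.com/b", "<html><body>Log in to continue</body></html>"),
]

buffer = io.StringIO()
write_rows(sample_pages, buffer)
print(buffer.getvalue())
```

Pages where the city is hidden simply produce an empty cell instead of raising `AttributeError`, so a long run is not aborted by one restricted profile.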

  • cscanlin
    cscanlin over 10 years
    As far as I can tell, Facebook does not allow you to access a user's city unless the user specifically allows you to. Do you know of any way around this, or would I have to request access from all of them?
  • CBroe
    CBroe over 10 years
    “facebook does not allow you to access a user's city unless the user specifically allows you to.” – yes, it’s called privacy, please look it up. “Do you know of any way around this” – so you are explicitly asking how to violate a user’s privacy …?
  • Nabin
    Nabin almost 9 years
    Any luck on scraping from pages as well as groups in facebook? :-)
  • cscanlin
    cscanlin about 7 years
    It's hilarious and somewhat painful looking back at this question years later. Anyways I just realized I never accepted an answer, and this is the best answer to the question. So thanks for helping a misguided noobie!
  • Tennessee Leeuwenburg
    Tennessee Leeuwenburg almost 7 years
    Did you compare facepy vs facebook-sdk? I'm trying to choose the best library at the moment..
  • shinzou
    shinzou over 6 years
    @CBroe no, the user chose to expose this info so it's public. As a user you can see this so what's the difference in letting the api expose it? Facebook and privacy don't mix in case you haven't noticed.
  • CBroe
    CBroe over 6 years
    @shinzou public via the website means available to an actual human being, that can click through a few sites and look at the info for a handful of people - no big deal. Publicly available via API would mean shady app developer from shithole country of your choice could gather this data en masse, and (ab-)use it for whatever evil purpose they have in mind.
  • shinzou
    shinzou over 6 years
    And that actual human can't abuse it? How is this different from a public forum? How is this different from keeping a log of an irc channel? Logging a twitter account? and so on. Also, how are they going to abuse PUBLIC information that users explicitly post on the internet??? if you write things you might regret maybe don't post them? it's only the poster's fault. @CBroe
  • CBroe
    CBroe over 6 years
    @shinzou of course the human could do it, too - but it would take them much more time to gather that info for, say, a hundred thousand users, whereas (assuming user ids are known) an app doing this via API could do that in virtually no time at all ...
  • taga
    taga almost 5 years
    For Instagram there is a very good library called 'instaloader'. Is there something like that for Facebook? I came across this, but I could not find more information or more functions: pypi.org/project/facebook-scraper/#description