
License: Apache License 2.0


fb_scraper's Introduction

FBLYZE: a Facebook page and group scraping and analysis system.


Getting started tutorial on Medium.

The goal of this project is to implement a Facebook scraping and extraction engine. This project is originally based on the scraper from minimaxir, which you can find here. However, our project aims to take this one step further and create a continuous scraping and processing system which can easily be deployed into production. Specifically, for our purposes we want to extract information about upcoming paddling meetups, event information, flow info, and other river-related reports. However, this project should be useful for anyone who needs regular scraping of FB pages or groups.

Instructions

To get the ID of a Facebook group, go here and input the URL of the group you are trying to scrape. For pages, you can just use the part of the URL after the slash (i.e. http://facebook.com/paddlesoft would be paddlesoft).
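Pulling the page name out of a URL can also be done programmatically; below is a minimal sketch (the helper name is ours for illustration, not part of fb_scraper):

```python
from urllib.parse import urlparse

def page_name_from_url(url):
    """Return the page slug after the slash, e.g. 'paddlesoft'.

    Hypothetical helper for illustration; not part of this repo.
    """
    path = urlparse(url).path
    return path.strip("/").split("/")[0]

print(page_name_from_url("http://facebook.com/paddlesoft"))  # paddlesoft
```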

Update: we have switched to using a database for recording information. Please see the documentation for revised instructions.

Docker

We recommend you use our Docker images, as they contain everything you need. For instructions on how to use our Dockerfile, please see the wiki page. Our Dockerfile is tested regularly on Codefresh, so you can easily see whether the build is passing above.

Running Locally

You will need to have Python 3.5+. If you want to use the examples (located in /data), you will also need Jupyter Notebook and Spark.

  1. Create a file called app.txt and place your app_id in it along with your app_secret. Alternatively, you can set these up in your system environment variables, similar to the way you would for Docker.
  2. Use get_posts.py to pull data from a FB group. So far we have provided five basic functions. Basically, you can either do a full scrape or scrape from the last timestamp. You can also choose whether you want to write to a CSV or send the posts as Kafka messages. See get_posts.py for more details. Example:

      from get_posts import scrape_comments_from_last_scrape, scrape_posts_from_last_scrape

      group_id = "115285708497149"
      scrape_posts_from_last_scrape(group_id)
      scrape_comments_from_last_scrape(group_id)
  1. Note that our Kafka messaging system currently only works with the basic JSON data (comparable to the CSV). We are working on adding a new schema for the more complex data; see issue 11. Plans to add authentication for Kafka are in progress.

  2. Currently, the majority of the examples of actual analysis are contained in the Examining data using Spark.ipynb notebook located in the data folder. You can open the notebook and specify the name of your CSV.

  3. ElasticSearch occasionally throws an authentication error when trying to save posts. If you get an authentication error when using ES, please add it to issue 15. The ability to connect to Bonsai and elastic.co is in the works.

  4. There are some other use case examples on my main GitHub page which you can look at as well. However, I have omitted them from this repo since they are mainly in Java and require Apache Flink.

  5. We are also working on automating scraping with Apache Airflow. The DAGs we have created so far are in the dags folder. It is recommended that you use the DAGs in conjunction with our Docker image; this will avoid directory errors.

Scrape away!
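The README does not spell out the exact layout of app.txt, so the sketch below assumes one value per line (app_id first, then app_secret) and falls back to the FB_ID/FB_KEY environment variables shown in the Docker env-file example further down this page; treat both the layout and the fallback as our assumptions, not project behavior:

```python
import os

def load_credentials(path="app.txt"):
    """Load the Facebook app_id and app_secret.

    Assumes app.txt holds the app_id on the first line and the
    app_secret on the second (the exact layout is not documented);
    falls back to the FB_ID/FB_KEY environment variables used by
    the Docker image.
    """
    if os.path.exists(path):
        with open(path) as f:
            lines = [line.strip() for line in f if line.strip()]
        return lines[0], lines[1]
    return os.environ["FB_ID"], os.environ["FB_KEY"]
```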

fb_scraper's People

Contributors

gitter-badger, isaacmg


fb_scraper's Issues

Facebook API access

Any help on how to get permission to access their API? They are requesting that I upload a privacy policy, a picture of my app logo, and a bunch of other stuff. I am just doing this for research, not creating an app. Any help navigating this complicated space would be appreciated.

Fix Kafka tests

Need to fix the Kafka tests. They keep throwing a threading error related to a timeout, i.e.:
timeout value is too large

Shelve corruption

Occasionally when testing, an error occurs when opening the shelve file.

Need new Avro Schema for following format

We need an Avro schema that can deal with something like the following:
{'from': {'name': 'Elizabeth Austen', 'id': '10212551456802570'},
 'link': 'https://www.facebook.com/events/259740027809063/permalink/276873696095696/',
 'created_time': '2017-04-21T16:35:29+0000',
 'type': 'link',
 'name': 'Daniel',
 'id': '1043983518950523_1649940175021518',
 'shares': {'count': 1},
 'reactions': {'data': [], 'summary': {'total_count': 2, 'viewer_reaction': 'NONE'}},
 'comments': {'data': [], 'summary': {'order': 'chronological', 'total_count': 0, 'can_comment': False}},
 'group_id': '1043983518950523',
 'reacts': {'data': [{'id': '10155178339184493', 'type': 'LIKE'},
                     {'id': '612725696831', 'type': 'LIKE'}],
            'paging': {'cursors': {'before': 'TlRVNE1EVTVORGt5T2pFME9USTRNREl3TlRZANk1qVTBNRGsyTVRZAeE13PT0ZD',
                                   'after': 'TVRJd01EQXdNVEF5T2pFME9USTNPVE13TlRnNk1qVTBNRGsyTVRZAeE13PT0ZD'}}}}
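One possible Avro record for this payload is sketched below as a plain Python dict. The field selection and the flattening of nested reaction/comment objects into counts are our choices for illustration, not the schema the project ships:

```python
import json

# Sketch of an Avro record schema covering the top-level fields of
# the post above; nested objects like reactions/comments are reduced
# to simple counts. Illustration only, not the project's schema.
POST_SCHEMA = {
    "type": "record",
    "name": "FbPost",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "group_id", "type": "string"},
        {"name": "created_time", "type": "string"},
        {"name": "type", "type": "string"},
        {"name": "link", "type": ["null", "string"], "default": None},
        {"name": "from_name", "type": "string"},
        {"name": "from_id", "type": "string"},
        {"name": "share_count", "type": "int", "default": 0},
        {"name": "reaction_count", "type": "int", "default": 0},
        {"name": "comment_count", "type": "int", "default": 0},
    ],
}

print(json.dumps(POST_SCHEMA, indent=2))
```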

ids not consistent

IDs are not consistent between the document index and the inverted index. This is causing the wrong text to be returned when performing a simple search. This needs to be fixed immediately.

Clean up sloppy code

Clean up sloppy code: move more things into functions (defs) and combine multiple maps into one line. Better comments also.

Replace shelve method

This is a large task, but in the end shelve is just not working the way it should. It is causing the following error:

scrape(page_id, from_time, useKafka, useES)
  File "/fb_scraper/fb_scrapper.py", line 60, in scrape
    pageStamp = get_tstamp(page_id, tstamp, "save_times")
  File "/fb_scraper/fb_scrapper.py", line 19, in get_tstamp
    with shelve.open(path) as d:
  File "/opt/conda/lib/python3.6/shelve.py", line 243, in open
    return DbfilenameShelf(filename, flag, protocol, writeback)
  File "/opt/conda/lib/python3.6/shelve.py", line 227, in __init__
    Shelf.__init__(self, dbm.open(filename, flag), protocol, writeback)
  File "/opt/conda/lib/python3.6/dbm/__init__.py", line 94, in open
    return mod.open(file, flag, mode)
AttributeError: module 'dbm.gnu' has no attribute 'open'

Moreover, it is making it impossible to scale containers without rescraping. Rescraping is likely to happen whenever the image is repulled and the shelve file is destroyed.
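One way to drop shelve entirely is a small key-value table in stdlib sqlite3, which also survives container restarts if the database file lives on a mounted volume. The sketch below is our illustration of the idea (the table and function names mirror the save_times/get_tstamp usage in the traceback but are not the project's actual fix):

```python
import sqlite3

_DDL = "CREATE TABLE IF NOT EXISTS save_times (page_id TEXT PRIMARY KEY, tstamp INTEGER)"

def get_tstamp(db_path, page_id, default=0):
    """Return the last scrape timestamp recorded for page_id, or default."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(_DDL)
        row = conn.execute(
            "SELECT tstamp FROM save_times WHERE page_id = ?", (page_id,)
        ).fetchone()
    return row[0] if row else default

def set_tstamp(db_path, page_id, tstamp):
    """Record the latest scrape timestamp for page_id."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(_DDL)
        conn.execute(
            "INSERT OR REPLACE INTO save_times VALUES (?, ?)", (page_id, tstamp)
        )
```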

Docker Error KeyError

I have run Docker for the first time and I get a KeyError. It seems the code is trying to get the Postgres user and database, so does this need to be created on the base system?
There were no instructions to set up the DB on https://github.com/isaacmg/fb_scraper/wiki/Docker-Image

variables.list:

FB_ID=myappid
FB_KEY=mysecreate
IDS=cnn,paddlesoft,msnbc
# Include only if you want to scrape comments
COMMENTS=1
# Include below ONLY if you want to use Kafka.
USE_KAFKA=1
KAFKA_PORT=localhost:9092

Error:
docker run --env-file variables.list paddlesoft/fb_scraper

Traceback (most recent call last):
  File "threaded_proc.py", line 6, in <module>
    from fb_scrapper import scrape_groups_pages
  File "/fb_scraper/fb_scrapper.py", line 2, in <module>
    from fb_posts import FB_SCRAPE
  File "/fb_scraper/fb_posts.py", line 11, in <module>
    from save_pg import save_post_pg
  File "/fb_scraper/save_pg.py", line 3, in <module>
    db = Database(os.environ['db'], user=os.environ['pg_user'], password=os.environ['pg_password'], host=os.environ['pg_host'], database=os.environ['pg_db'])
  File "/opt/conda/lib/python3.6/os.py", line 669, in __getitem__
    raise KeyError(key) from None
KeyError: 'db'
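The KeyError happens because save_pg.py reads os.environ['db'] (and the pg_* variables) unconditionally at import time. A defensive pattern, sketched here as an illustration rather than the project's actual code, is to read the Postgres variables with os.environ.get and skip the database connection when they are absent:

```python
import os

def pg_settings():
    """Return Postgres connection settings, or None if not configured.

    The variable names mirror those save_pg.py reads in the traceback
    above; using .get avoids a KeyError when Postgres is not in use.
    """
    required = ["db", "pg_user", "pg_password", "pg_host", "pg_db"]
    values = {name: os.environ.get(name) for name in required}
    if any(v is None for v in values.values()):
        return None  # Postgres not configured; caller should skip DB writes
    return values
```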

Increase code coverage

Need unit tests for fb_posts.py and fb_posts_realtime.py; fb_scrapper.py could also use one or two more.

Create tests and get Travis working

Travis is currently trying to run the Kafka tests and as a result is failing. We either need to get Kafka running on Travis so the tests pass, or remove them from the build check. This should be completed ASAP to avoid more failing builds.
