
License: Apache License 2.0


fb_scraper's Introduction

FBLYZE: a Facebook page and group scraping and analysis system.


Getting started tutorial on Medium.

The goal of this project is to implement a Facebook scraping and extraction engine. This project is originally based on the scraper from minimaxir, which you can find here. However, our project aims to take this one step further and create a continuous scraping and processing system which can easily be deployed into production. Specifically, for our purposes we want to extract information about upcoming paddling meetups, event information, flow info, and other river-related reports. However, this project should be useful for anyone who needs regular scraping of FB pages or groups.

Instructions

To get the ID of a Facebook group, go here and input the URL of the group you are trying to scrape. For pages, you can just use the part of the URL after the slash (i.e. http://facebook.com/paddlesoft would be paddlesoft).
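Pulling the page name out of a URL can also be done programmatically; below is a minimal sketch (the helper name is ours for illustration, not part of fb_scraper):

```python
from urllib.parse import urlparse

def page_name_from_url(url):
    """Return the page slug after the slash, e.g. 'paddlesoft'.

    Hypothetical helper for illustration; not part of this repo.
    """
    path = urlparse(url).path
    return path.strip("/").split("/")[0]

print(page_name_from_url("http://facebook.com/paddlesoft"))  # paddlesoft
```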

Update: we have switched to using a database for recording information. Please see the documentation for revised instructions.

Docker

We recommend you use our Docker images, as they contain everything you need. For instructions on how to use our Dockerfile, please see the wiki page. Our Dockerfile is tested regularly on Codefresh, so you can easily see whether the build is passing above.

Running Locally

You will need to have Python 3.5+. If you want to use the examples (located in /data), you will also need Jupyter Notebook and Spark.

  1. Create a file called app.txt and place your app_id in it along with your app_secret. Alternatively, you can set these up in your system environment variables, similar to the way you would for Docker.
  2. Use get_posts.py to pull data from a FB group. So far we have provided five basic functions. Basically, you can either do a full scrape or scrape from the last timestamp. You can also choose whether you want to write to a CSV or send the posts as Kafka messages. See get_posts.py for more details. Example:

      from get_posts import scrape_comments_from_last_scrape, scrape_posts_from_last_scrape

      group_id = "115285708497149"
      scrape_posts_from_last_scrape(group_id)
      scrape_comments_from_last_scrape(group_id)
  1. Note that our Kafka messaging system currently only works with the basic JSON data (comparable to the CSV). We are working on adding a new schema for the more complex data; see issue 11. Plans to add authentication for Kafka are in progress.

  2. Currently, the majority of the examples of actual analysis are contained in the Examining data using Spark.ipynb notebook located in the data folder. You can open the notebook and specify the name of your CSV.

  3. ElasticSearch occasionally throws an authentication error when trying to save posts. If you get an authentication error when using ES, please add it to issue 15. The ability to connect to Bonsai and elastic.co is in the works.

  4. There are some other use case examples on my main GitHub page which you can look at as well. However, I have omitted them from this repo since they are mainly in Java and require Apache Flink.

  5. We are also working on automating scraping with Apache Airflow. The DAGs we have created so far are in the dags folder. It is recommended that you use the DAGs in conjunction with our Docker image; this will avoid directory errors.

Scrape away!
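The README does not spell out the exact layout of app.txt, so the sketch below assumes one value per line (app_id first, then app_secret) and falls back to the FB_ID/FB_KEY environment variables shown in the Docker env-file example further down this page; treat both the layout and the fallback as our assumptions, not project behavior:

```python
import os

def load_credentials(path="app.txt"):
    """Load the Facebook app_id and app_secret.

    Assumes app.txt holds the app_id on the first line and the
    app_secret on the second (the exact layout is not documented);
    falls back to the FB_ID/FB_KEY environment variables used by
    the Docker image.
    """
    if os.path.exists(path):
        with open(path) as f:
            lines = [line.strip() for line in f if line.strip()]
        return lines[0], lines[1]
    return os.environ["FB_ID"], os.environ["FB_KEY"]
```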

fb_scraper's People

Contributors

gitter-badger, isaacmg


fb_scraper's Issues

Facebook API access

Any help on how to get permission to access their API? They are requesting that I upload a privacy policy, a picture of my app logo, and a bunch of other stuff. I am just doing this for research, not creating an app. Any help navigating this complicated space would be appreciated.

Fix Kafka tests

Need to fix the Kafka tests. They keep throwing a threading error related to a timeout, i.e.:
timeout value is too large

Shelve corruption

Occasionally when testing, an error occurs when opening the shelve file.

Need new Avro Schema for following format

We need an Avro schema that can deal with something like the following:
{'from': {'name': 'Elizabeth Austen', 'id': '10212551456802570'},
 'link': 'https://www.facebook.com/events/259740027809063/permalink/276873696095696/',
 'created_time': '2017-04-21T16:35:29+0000',
 'type': 'link',
 'name': 'Daniel',
 'id': '1043983518950523_1649940175021518',
 'shares': {'count': 1},
 'reactions': {'data': [], 'summary': {'total_count': 2, 'viewer_reaction': 'NONE'}},
 'comments': {'data': [], 'summary': {'order': 'chronological', 'total_count': 0, 'can_comment': False}},
 'group_id': '1043983518950523',
 'reacts': {'data': [{'id': '10155178339184493', 'type': 'LIKE'},
                     {'id': '612725696831', 'type': 'LIKE'}],
            'paging': {'cursors': {'before': 'TlRVNE1EVTVORGt5T2pFME9USTRNREl3TlRZANk1qVTBNRGsyTVRZAeE13PT0ZD',
                                   'after': 'TVRJd01EQXdNVEF5T2pFME9USTNPVE13TlRnNk1qVTBNRGsyTVRZAeE13PT0ZD'}}}}
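One possible Avro record for this payload is sketched below as a plain Python dict. The field selection and the flattening of nested reaction/comment objects into counts are our choices for illustration, not the schema the project ships:

```python
import json

# Sketch of an Avro record schema covering the top-level fields of
# the post above; nested objects like reactions/comments are reduced
# to simple counts. Illustration only, not the project's schema.
POST_SCHEMA = {
    "type": "record",
    "name": "FbPost",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "group_id", "type": "string"},
        {"name": "created_time", "type": "string"},
        {"name": "type", "type": "string"},
        {"name": "link", "type": ["null", "string"], "default": None},
        {"name": "from_name", "type": "string"},
        {"name": "from_id", "type": "string"},
        {"name": "share_count", "type": "int", "default": 0},
        {"name": "reaction_count", "type": "int", "default": 0},
        {"name": "comment_count", "type": "int", "default": 0},
    ],
}

print(json.dumps(POST_SCHEMA, indent=2))
```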

ids not consistent

IDs are not consistent between the document index and the inverted index. This is causing the wrong text to be returned when performing a simple search. This needs to be fixed immediately.

Clean up sloppy code

Clean up sloppy code: move more things into functions (defs) and combine multiple maps into one line. Better comments also.

Replace shelve method

This is a large task, but in the end shelve is just not working the way it should. It is causing the following error:

scrape(page_id, from_time, useKafka, useES)
  File "/fb_scraper/fb_scrapper.py", line 60, in scrape
    pageStamp = get_tstamp(page_id, tstamp, "save_times")
  File "/fb_scraper/fb_scrapper.py", line 19, in get_tstamp
    with shelve.open(path) as d:
  File "/opt/conda/lib/python3.6/shelve.py", line 243, in open
    return DbfilenameShelf(filename, flag, protocol, writeback)
  File "/opt/conda/lib/python3.6/shelve.py", line 227, in __init__
    Shelf.__init__(self, dbm.open(filename, flag), protocol, writeback)
  File "/opt/conda/lib/python3.6/dbm/__init__.py", line 94, in open
    return mod.open(file, flag, mode)
AttributeError: module 'dbm.gnu' has no attribute 'open'

Moreover, it is making it impossible to scale containers without rescraping. Rescraping is likely to happen whenever the image is repulled and the shelve file is destroyed.
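One way to drop shelve entirely is a small key-value table in stdlib sqlite3, which also survives container restarts if the database file lives on a mounted volume. The sketch below is our illustration of the idea (the table and function names mirror the save_times/get_tstamp usage in the traceback but are not the project's actual fix):

```python
import sqlite3

_DDL = "CREATE TABLE IF NOT EXISTS save_times (page_id TEXT PRIMARY KEY, tstamp INTEGER)"

def get_tstamp(db_path, page_id, default=0):
    """Return the last scrape timestamp recorded for page_id, or default."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(_DDL)
        row = conn.execute(
            "SELECT tstamp FROM save_times WHERE page_id = ?", (page_id,)
        ).fetchone()
    return row[0] if row else default

def set_tstamp(db_path, page_id, tstamp):
    """Record the latest scrape timestamp for page_id."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(_DDL)
        conn.execute(
            "INSERT OR REPLACE INTO save_times VALUES (?, ?)", (page_id, tstamp)
        )
```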

Docker Error KeyError

I have run Docker for the first time and I get a KeyError. It seems the code is trying to get the Postgres user and database, so does this need to be created on the base system?
There were no instructions to set up the DB on https://github.com/isaacmg/fb_scraper/wiki/Docker-Image

variables.list:

FB_ID=myappid
FB_KEY=mysecreate
IDS=cnn,paddlesoft,msnbc
# Include only if you want to scrape comments
COMMENTS=1
# Include below ONLY if you want to use Kafka.
USE_KAFKA=1
KAFKA_PORT=localhost:9092

Error:
docker run --env-file variables.list paddlesoft/fb_scraper

Traceback (most recent call last):
  File "threaded_proc.py", line 6, in <module>
    from fb_scrapper import scrape_groups_pages
  File "/fb_scraper/fb_scrapper.py", line 2, in <module>
    from fb_posts import FB_SCRAPE
  File "/fb_scraper/fb_posts.py", line 11, in <module>
    from save_pg import save_post_pg
  File "/fb_scraper/save_pg.py", line 3, in <module>
    db = Database(os.environ['db'], user=os.environ['pg_user'], password=os.environ['pg_password'], host=os.environ['pg_host'], database=os.environ['pg_db'])
  File "/opt/conda/lib/python3.6/os.py", line 669, in __getitem__
    raise KeyError(key) from None
KeyError: 'db'
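The KeyError happens because save_pg.py reads os.environ['db'] (and the pg_* variables) unconditionally at import time. A defensive pattern, sketched here as an illustration rather than the project's actual code, is to read the Postgres variables with os.environ.get and skip the database connection when they are absent:

```python
import os

def pg_settings():
    """Return Postgres connection settings, or None if not configured.

    The variable names mirror those save_pg.py reads in the traceback
    above; using .get avoids a KeyError when Postgres is not in use.
    """
    required = ["db", "pg_user", "pg_password", "pg_host", "pg_db"]
    values = {name: os.environ.get(name) for name in required}
    if any(v is None for v in values.values()):
        return None  # Postgres not configured; caller should skip DB writes
    return values
```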

Increase code coverage

Need unit tests for fb_posts.py and fb_posts_realtime.py; fb_scrapper.py could also use one or two more.

Create tests and get Travis working

Travis is currently trying to run the Kafka tests and as a result is failing. We either need to get Kafka running on Travis so the tests pass, or remove them from the build check. This should be completed ASAP to avoid more failing builds.
