Knesset data scrapers and data sync
Uses the datapackage pipelines framework to scrape Knesset data and aggregate it into different data stores (PostgreSQL, Elasticsearch, files)
- public endpoints:
- https://next.oknesset.org/pipelines/ - pipelines dashboard
- Metabase dashboards for quick friendly visualizations of the data in DB:
- Grafana dashboards for metrics / analytics:
- internal admin interfaces - password required
- https://next.oknesset.org/metabase/ - user friendly DB queries and dashboards
- https://minio.oknesset.org/ - object storage
- https://next.oknesset.org/adminer/ - for admin DB access
- in adminer UI login screen, you should choose:
- System: PostgreSQL
- Server: db
- Username, Password, Database: secret
- https://next.oknesset.org/flower/ - celery tasks management
- https://next.oknesset.org/grafana/ - Web UI for graphing metrics (via InfluxDB)
- deployment of this environment was done using Kubernetes (K8S) on Google Container Engine (GKE)
Looking to contribute? Check out the Help Wanted Issues or the Noob Friendly Issues for some ideas.
Using Windows with our docker environment is not currently recommended or supported. The build process seems to fail on numerous issues. We suggest that Windows users either dual-boot to Linux, or run Linux in VirtualBox. The best supported version is Ubuntu 17.04. If you wish to use Windows, do so at your own risk, and please update this README file with instructions if you succeed.
- Install Docker
- Ubuntu - Docker Official Docs - Ubuntu installation (the recommended method is "Install using the repository")
- Mac - https://store.docker.com/editions/community/docker-ce-desktop-mac
- Install docker-compose
- Ubuntu -
sudo apt install docker-compose
- Mac - should be installed as part of the toolbox
- Make sure docker-compose is at version 1.13.0 or higher:
docker-compose --version
- If not, upgrade docker-compose (refer to Docker-compose Official Docs)
- fork & clone the repo
- change directory to the repo's directory
sudo bin/start.sh
- verify all containers started correctly:
sudo docker ps
(should show 3 containers running: app, db, redis)
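The two checks in the steps above (docker-compose at version 1.13.0 or higher, and the app, db and redis containers running) can be sketched in Python. This is only a sketch: the function names are hypothetical, and matching a service by substring of the container name is an assumption about how docker-compose names containers, not something this repo defines.

```python
import re
import subprocess

MIN_COMPOSE_VERSION = (1, 13, 0)             # minimum stated in this README
EXPECTED_SERVICES = ("app", "db", "redis")   # containers this README expects


def parse_compose_version(version_output):
    """Extract (major, minor, patch) from `docker-compose --version` output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no version number found in: %r" % version_output)
    return tuple(int(part) for part in match.groups())


def compose_is_recent_enough(version_output):
    """Compare as integer tuples, so that 1.8.0 sorts below 1.13.0."""
    return parse_compose_version(version_output) >= MIN_COMPOSE_VERSION


def missing_services(container_names, expected=EXPECTED_SERVICES):
    """Return expected services that match no running container name.

    Assumption: the service name appears somewhere in the container name.
    This is a loose match and may need tightening.
    """
    return [service for service in expected
            if not any(service in name for name in container_names)]


def running_container_names():
    """Names of running containers, read from `docker ps`."""
    output = subprocess.check_output(
        ["docker", "ps", "--format", "{{.Names}}"], universal_newlines=True)
    return output.split()
```

After bin/start.sh, calling missing_services(running_container_names()) should return an empty list if all three containers are up. It may need sudo, as the docker commands above do.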
This will provide:
- Pipelines dashboard: http://localhost:5000/
- PostgreSQL server, pre-populated with data: postgresql://postgres:123456@localhost:15432/postgres
- Minio object storage: http://localhost:9000/
- Access Key = admin
- Secret = 12345678
- Adminer - DB Web UI: http://localhost:18080/
- Database Type = PostgreSQL
- Host = db
- Port = 5432
- Database = postgres
- User = postgres
- Password = 123456
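The PostgreSQL URL above carries the same credentials as the Adminer fields, but with a different host and port. A plausible reading is that Adminer reaches the database from inside the docker-compose network (db:5432), while tools on your machine use the published port (localhost:15432). A minimal sketch that splits the URL into its fields, using only the standard library (the function name is hypothetical):

```python
from urllib.parse import urlsplit

# Connection URL for the local environment, as given in this README.
LOCAL_DB_URL = "postgresql://postgres:123456@localhost:15432/postgres"


def connection_fields(url):
    """Split a PostgreSQL URL into the fields a DB client asks for."""
    parts = urlsplit(url)
    return {
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "user": parts.username,
        "password": parts.password,
    }


fields = connection_fields(LOCAL_DB_URL)
print(fields)
```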
After every change in the code you should run sudo bin/build.sh && sudo bin/start.sh
You should have an activated Python 3.6 virtualenv. The following procedure works on Ubuntu 17.04:
curl -kL https://raw.github.com/saghul/pythonz/master/pythonz-install | bash
echo '[[ -s $HOME/.pythonz/etc/bashrc ]] && source $HOME/.pythonz/etc/bashrc' >> ~/.bashrc
source ~/.bashrc
sudo apt-get install build-essential zlib1g-dev libbz2-dev libssl-dev libreadline-dev libncurses5-dev libsqlite3-dev libgdbm-dev libdb-dev libexpat-dev libpcap-dev liblzma-dev libpcre3-dev
pythonz install 3.6.2
sudo pip install virtualenvwrapper
echo 'export WORKON_HOME=$HOME/.virtualenvs; export PROJECT_HOME=$HOME/Devel; source /usr/local/bin/virtualenvwrapper.sh' >> ~/.bashrc
source ~/.bashrc
cd knesset-data-pipelines
mkvirtualenv -a `pwd` -p $HOME/.pythonz/pythons/CPython-3.6.2/bin/python3.6 knesset-data-pipelines
Before running any knesset-data-pipelines script, be sure to activate the virtualenv
You can do that by running workon knesset-data-pipelines
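To confirm that the virtualenv is active before running any script, a small check such as the following may help. It is a sketch, not part of this repo: the function names are hypothetical, and treating "3.6 or newer" as acceptable is an assumption (this README only names Python 3.6).

```python
import sys

REQUIRED = (3, 6)  # Python version this README asks for


def python_is_supported(version_info=sys.version_info):
    """True if the interpreter is at least REQUIRED (assumption: newer is fine)."""
    return tuple(version_info[:2]) >= REQUIRED


def in_virtualenv():
    """True if the running interpreter belongs to a virtual environment.

    Older virtualenv releases set sys.real_prefix; venv and newer
    virtualenv releases make sys.prefix differ from sys.base_prefix.
    """
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


if not (python_is_supported() and in_virtualenv()):
    print("warning: expected an activated Python 3.6 virtualenv")
```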
Once you are inside a Python 3.6 virtualenv, you can run the following:
bin/install.sh
bin/test.sh
You can set some environment variables to modify behaviors; see a reference at .env.example
- using docker:
bin/dpp.sh
- locally (from an activated virtualenv):
dpp
Warning: this might seriously overload your CPU, use with caution.
docker-compose up -d redis db minio
source .env.example
for PIPELINE in `dpp | tail -n+2 | cut -d" " -f2 -`; do
dpp run "${PIPELINE}" &
done
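The shell loop above starts every pipeline in the background at once, which is what can overload the CPU. One way to limit that is to run the same `dpp run` commands through a bounded worker pool. The sketch below makes assumptions: the function names are hypothetical, the default of 2 workers is arbitrary, and the parser only mirrors the `tail -n+2 | cut -d" " -f2` step of the loop rather than any documented `dpp` output format.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def parse_pipeline_ids(listing):
    """Mirror `tail -n+2 | cut -d" " -f2`: skip line 1, take the 2nd field."""
    ids = []
    for line in listing.splitlines()[1:]:
        fields = line.split(" ")
        # cut prints the whole line when it contains no delimiter
        ids.append(fields[1] if len(fields) > 1 else line)
    return ids


def run_pipeline(pipeline_id):
    """Run one pipeline with `dpp run` and return its exit code."""
    return subprocess.call(["dpp", "run", pipeline_id])


def run_all(pipeline_ids, max_workers=2, runner=run_pipeline):
    """Run pipelines with at most `max_workers` running at the same time."""
    pipeline_ids = list(pipeline_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(pipeline_ids, pool.map(runner, pipeline_ids)))


# Example usage (needs the same environment as the shell loop above):
# listing = subprocess.check_output(["dpp"], universal_newlines=True)
# exit_codes = run_all(parse_pipeline_ids(listing))
```

The returned dictionary maps each pipeline id to its exit code, so failed pipelines can be found by looking for non-zero values.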
You should have the committee id and session id of the meeting that you want to investigate. In this example the session id is 284231 and the committee id is 196.
- Ensure an empty DB
sudo rm -rf .data-docker/postgresql
- Start the docker compose environment
bin/start.sh
- Wait ~20 seconds to ensure environment started properly
- Parse the protocols of the specific meeting / session id
OVERRIDE_COMMITTEE_MEETING_IDS=284231 bin/dpp.sh run ./committees/committee-meeting-protocols
- Check the parsed files in minio
- minio default username=admin, password=12345678
- original downloaded .doc: 284231.doc
- http://localhost:9000/minio/committees/protocols/original/196/
- parsed files: 284231.txt, 284231.csv
- http://localhost:9000/minio/committees/protocols/parsed/196/
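The layout of the files listed above appears to follow a pattern: a directory per committee id, with files named after the session id. The sketch below builds those locations for any pair of ids. The pattern is inferred from this single example (committee 196, session 284231) and the function name is hypothetical, so treat it as an assumption rather than a documented layout.

```python
# Base URL of the protocols in the local minio browser, from this README.
MINIO_BASE = "http://localhost:9000/minio/committees/protocols"


def protocol_locations(committee_id, meeting_id):
    """Return the minio directories and file names for one meeting."""
    return {
        "original_dir": "%s/original/%s/" % (MINIO_BASE, committee_id),
        "original_file": "%s.doc" % meeting_id,
        "parsed_dir": "%s/parsed/%s/" % (MINIO_BASE, committee_id),
        "parsed_files": ["%s.txt" % meeting_id, "%s.csv" % meeting_id],
    }


locations = protocol_locations(196, 284231)
for key, value in sorted(locations.items()):
    print(key, value)
```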