We use the Django framework to build a data-center website that schedules crawlers and ETL functions with django-Q, a multiprocessing task queue for Django that performs well and is easy to control from the django-admin site. To monitor ETL processes, we store each hook result (log) in the django_admin database; we can then use that data to push and analyze information for our website or third parties.
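The task-plus-hook pattern described above can be sketched without Django at all. In django-Q the hook is a function the worker calls with the finished task; here we imitate that with plain functions so the sketch is self-contained. The function names (`crawl_prices`, `log_hook`, `run_with_hook`) are illustrative, not part of the repo or of django-Q's API.

```python
# Framework-free sketch of the task + hook pattern used by django-Q.
# Names are illustrative only; django-Q would call the hook with a Task object.

def crawl_prices(date):
    """Stand-in for a crawler/ETL task."""
    return {"date": date, "rows": 120}

def log_hook(task_name, result, log):
    """Stand-in for the hook that records the task result (the 'log' table)."""
    log.append({"task": task_name, "success": result is not None, "result": result})

def run_with_hook(func, arg, hook, log):
    """Run a task, then hand its result to the hook, as a django-Q worker does."""
    result = func(arg)
    hook(func.__name__, result, log)
    return result

log = []
run_with_hook(crawl_prices, "2021-01-04", log_hook, log)
print(log[0]["task"])  # crawl_prices
```

With the real library, the equivalent enqueue call is `async_task("app.tasks.crawl_prices", "2021-01-04", hook="app.tasks.log_hook")`, which requires a configured Django project.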
mysql -u root
CREATE USER 'finlab'@'localhost' IDENTIFIED BY 'finlab';
GRANT ALL PRIVILEGES ON * . * TO 'finlab'@'localhost';
# change password
ALTER USER 'finlab'@'localhost' IDENTIFIED BY 'password';
git clone https://gitlab.com/finlab_company_class/tw_stock.git
cp -r tw_stock/data finlab_data_center/tw_stock_class/
Add your IPv4 address (search for "what is my IP") to the GCP SQL connection allowlist.
If the django-admin CSS does not show under the gunicorn server, run this command once locally:
python manage.py collectstatic
virtualenv venv
source venv/bin/activate
# primary modules
pip install -r requirements-to-freeze.txt
# pinned dependencies go to requirements.txt
pip freeze > requirements.txt
Only the admin_db router can be migrated, because the stock database needs to stay flexible. Use raw SQL to define tables in the stock database, and remember to record the raw SQL files in the sql-records directory.
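A minimal illustration of the raw-SQL approach above. The project targets MySQL; `sqlite3` is used here only so the sketch runs standalone, and the table and column names are hypothetical, not the repo's schema.

```python
import sqlite3

# Define a table with raw SQL instead of a Django migration. In the project
# this statement would live in the sql-records directory and be run against
# the stock database; sqlite3 here is only for a self-contained demo.
RAW_SQL = """
CREATE TABLE IF NOT EXISTS daily_price (
    stock_id   TEXT NOT NULL,
    trade_date TEXT NOT NULL,
    close      REAL,
    PRIMARY KEY (stock_id, trade_date)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(RAW_SQL)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)  # ['daily_price']
```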
# default setting is dev
python manage.py migrate --database=admin_db
python manage.py migrate --settings=finlab_data_center.settings.pro --database=admin_db
python manage.py createsuperuser --database=admin_db
python manage.py shell_plus --notebook
# inside the notebook: make the project importable, then load Django
import sys, os
sys.path.append("..")
import django
django.setup()
python manage.py shell
# only test
python manage.py runserver
# use gunicorn in production for security
gunicorn finlab_data_center.wsgi:application
python manage.py startapp us_data
Use the legacy db to export Django-model definitions into the model file:
python manage.py inspectdb --database=us_db > us_data/models.py
- Define raw SQL, and paste the raw SQL in the records dir.
- Use inspectdb to print model code in Django, e.g.:
  # show on terminal
  python manage.py inspectdb --database=us_db
  # export to model file
  python manage.py inspectdb --database=us_db > us_data/models.py
- Write crawlers and use CrawlerProcess.specified_date_crawl or SqlImporter.add_to_sql to create and test the initial data; don't use CrawlerProcess.auto_update_crawl at first.
- Use etl/import_sql.py to control the crawlers that import data.
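The init-then-update flow above can be sketched as follows. The method names come from this repo's description, but the internals here are assumptions: a list stands in for the SQL importer, and dates are plain integers to keep the sketch runnable.

```python
# Framework-free sketch of the crawler interface named above.
# Method names follow the repo; bodies are illustrative assumptions.

class CrawlerProcess:
    def __init__(self, crawl_func, store):
        self.crawl_func = crawl_func
        self.store = store            # stand-in for SqlImporter.add_to_sql

    def specified_date_crawl(self, date):
        """Backfill one explicitly chosen date (use this to init and test)."""
        self.store.append(self.crawl_func(date))

    def auto_update_crawl(self, last_date, today):
        """Crawl every date after the last stored one (use only after init)."""
        for d in range(last_date + 1, today + 1):
            self.store.append(self.crawl_func(d))

store = []
proc = CrawlerProcess(lambda d: {"date": d}, store)
proc.specified_date_crawl(1)       # first: init/test a single known date
proc.auto_update_crawl(1, 3)       # then: catch up automatically
print([r["date"] for r in store])  # [1, 2, 3]
```

Running `auto_update_crawl` first would try to continue from data that does not exist yet, which is why the note above says to start with `specified_date_crawl`.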
python manage.py qcluster
We have two task functions (one for time-series data, one for non-time-series data) and one hook function.
Task args only accept str or number.
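Since queued task args must survive serialization, the str-or-number restriction above can be enforced before enqueuing. This guard function is an illustration, not part of the repo or of django-Q.

```python
# Hypothetical pre-enqueue guard for the "args only accept str or number" rule.

def check_task_args(*args):
    """Raise if any arg is not a plain string or number."""
    for a in args:
        if not isinstance(a, (str, int, float)):
            raise TypeError(f"task arg {a!r} must be str or number")
    return args

check_task_args("2021-01-04", 30)      # fine: a date string and a number
try:
    check_task_args(["2021-01-04"])    # a list is rejected
except TypeError as e:
    print(e)
```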
- crawlers.py: write crawlers
- tasks.py: write schedule task functions
- models.py: define data format.
Use email to send the scheduled daily report.
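A sketch of building that daily report mail with the standard library. The addresses, subject format, and function name are placeholders; actually sending the message needs an SMTP server, e.g. `smtplib.SMTP(host).send_message(msg)`.

```python
from email.message import EmailMessage

# Build (not send) a daily schedule report. All names/addresses are placeholders.
def build_daily_report(success, failed):
    msg = EmailMessage()
    msg["Subject"] = f"ETL daily report: {success} ok / {failed} failed"
    msg["From"] = "[email protected]"
    msg["To"] = "[email protected]"
    msg.set_content(f"{success} tasks succeeded, {failed} failed today.")
    return msg

msg = build_daily_report(12, 1)
print(msg["Subject"])  # ETL daily report: 12 ok / 1 failed
```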
docker build -t finlab-data-center -f Dockerfile .
docker tag finlab-data-center asia.gcr.io/rare-mender-288209/finlab-data-center:v0.1.3
docker push asia.gcr.io/rare-mender-288209/finlab-data-center:v0.1.3
docker run --env PORT=8001 --env DJANGO_ENV=pro -it -p 8001:8001 finlab-data-center
docker run --env PORT=8001 --env DJANGO_ENV=pro -it -p 8001:8001 -v "$(pwd):/app" finlab-data-center /bin/bash
# init
gcloud config set project rare-mender-288209
gcloud config set compute/zone asia-east1-a
gcloud container clusters create finlab-micro-service --num-nodes=1
gcloud container clusters get-credentials finlab-micro-service
kubectl create deployment finlab-data-center --image=asia.gcr.io/rare-mender-288209/finlab-data-center:v0.1.3
kubectl expose deployment finlab-data-center --type LoadBalancer \
--port 80 --target-port 8001
kubectl delete service finlab-data-center
gcloud container clusters delete finlab-data-center
kubectl create secret generic cloudsql-db-credentials --from-literal=DBACCOUNT=finlab_admin --from-literal=DBPASSWORD=qetuadgj --from-literal=DBHOST=127.0.0.1
gcloud iam service-accounts keys create ~/key.json \
--iam-account [email protected]
kubectl create secret generic cloudsql-instance-credentials --from-file=service_account.json="/Users/benbilly3/key.json"
kubectl create -f k8s_deploy.yaml
kubectl set image deployment/finlab-data-center finlab-data-center=asia.gcr.io/rare-mender-288209/finlab-data-center:v0.1.3
- config_json: notes env args such as db connections and api_key
- db_router.py: controls the multiple db connections
- settings dir: the default env setting is dev; both envs inherit from base.py. The django-Q controller is configured in base.py.
- CHANGELOG.md: records work comments for every version
- /scripts: records the command .sh files
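The settings-inheritance idea above (shared base, dev as default, pro overriding) can be shown in miniature. In the real repo this is three modules (base.py, dev.py, pro.py, the latter two starting with `from .base import *`); the dict form and the `TIME_ZONE` value here are assumptions used only so the sketch runs standalone.

```python
import os

# Miniature of the settings layout: base shared by both envs, dev the default,
# pro overriding only what differs. Values here are illustrative assumptions.
BASE = {"TIME_ZONE": "Asia/Taipei", "DEBUG": True}   # base.py (dev defaults)
PRO_OVERRIDES = {"DEBUG": False}                      # pro.py overrides

def load_settings():
    """Pick the env from DJANGO_ENV, defaulting to dev, as the docker run does."""
    settings = dict(BASE)
    if os.environ.get("DJANGO_ENV", "dev") == "pro":
        settings.update(PRO_OVERRIDES)
    return settings

print(load_settings()["DEBUG"])  # True while DJANGO_ENV is unset (dev default)
```

This mirrors the `--env DJANGO_ENV=pro` flag passed to `docker run` above.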