The chinese-firewall-bridge from stepandel

This is a Spark data streaming application that serves as a bridge across the Great Firewall of China.

Architecture

Lambda(China) -> Kinesis(China) -> Spark Cluster(US) -> Kinesis(US) -> Firehose(US) -> S3(US)
    |                                     |
   \/                                    \/
Firehose(China)                  **Spark Application**
    |                                     |
    \/                                    \/
S3 Backup(China)                       Spark UI



Lambda(China) - kk-analytics-firehose-prod-webhook
Kinesis(China) - prod-kinesis-data-streaming
Firehose(China) - prod-analytics-backup-firehose
S3 Backup(China) - prodanalytics-firehose-bucket/backup
Kinesis(US) - prod-china-kinesis-data-streaming
Firehose(US) - prod-china-analytics-firehose-parquet
S3(US) - prodanalytics-firehose-bucket/parquet

Set up

Local Machine (macOS)

Install Java: brew cask install java
Install Scala: brew install scala (https://www.scala-lang.org/)
Install SBT: brew install sbt (https://www.scala-sbt.org/)
Install Spark: brew install apache-spark (https://spark.apache.org/)
Install Flintrock: brew install flintrock (https://github.com/nchammas/flintrock)

Remote Machine (Linux)

Install Java: sudo yum install java-1.8.0-openjdk-devel
Follow this guide to install Spark

First Time Deployment

Build fat jar: sbt assembly
Configure cluster: flintrock configure
Launch cluster: flintrock launch prod-spark-cluster
If not already enabled, open inbound ports 7077, 8080, and 22 on the cluster's security group in AWS
Find the master node's public DNS: flintrock describe
Check that the cluster is running by opening http://master_public_dns:8080
Copy the application jar to the cluster nodes (update the local file path): flintrock copy-file prod-spark-cluster /Users/stepanarsentjev/Development/chinaDataStreaming/target/scala-2.11/china-data-streaming.jar /home/ec2-user/
SSH into the master and every worker instance and set the default AWS profile to the Chinese credentials (found in application.conf): aws configure (set the region to cn-northwest-1)
Copy the jar file to /home/ec2-user/ on a separate instance (ec2-54-196-74-76.compute-1.amazonaws.com); if that instance doesn't exist, create a new one and install all Spark dependencies: scp -i "pem-file.pem" /file/path ec2-user@machine-dns:/remote/path/to/file
SSH into that instance
Deploy the Spark app by running the following from ~ (replace the master URL with the one shown in the web UI): /opt/spark/bin/spark-submit --deploy-mode cluster --master spark://ec2-3-91-11-48.compute-1.amazonaws.com:7077 --driver-memory 10g /home/ec2-user/china-data-streaming.jar
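
Because the master URL changes between deployments, it can help to build the submit command from the master's public DNS instead of editing it by hand. The following is a minimal sketch, not part of the repository: build_submit_cmd is a hypothetical helper that only prints the command, reusing the port, deploy mode, and driver memory from the step above. The jar path is passed in rather than hard-coded.

```shell
# Hypothetical helper (an assumption, not part of this repository):
# prints the spark-submit command for a given master DNS name and jar path.
# It does not run anything.
build_submit_cmd() {
  master_dns="$1"   # public DNS of the master, as reported by flintrock describe
  jar_path="$2"     # path to the application jar on the submitting instance
  if [ -z "$master_dns" ] || [ -z "$jar_path" ]; then
    echo "usage: build_submit_cmd <master_dns> <jar_path>" >&2
    return 1
  fi
  printf '%s\n' "/opt/spark/bin/spark-submit --deploy-mode cluster --master spark://${master_dns}:7077 --driver-memory 10g ${jar_path}"
}
```

For example, build_submit_cmd master_public_dns /home/ec2-user/china-data-streaming.jar prints the command so it can be reviewed before being pasted into the shell on the submitting instance.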

Maintenance

To make changes to the application running in production:

Restart the cluster (all logs will be removed, so save what you need before restarting): flintrock stop prod-spark-cluster, then flintrock start prod-spark-cluster

Upload the updated jar file (china-data-streaming.jar) from /Users/stepanarsentjev/Development/chinaDataStreaming/target/scala-2.11/china-data-streaming.jar to /home/ec2-user/, as in First Time Deployment

Submit the Spark job to the cluster from the external instance (the master URL will be different!): /opt/spark/bin/spark-submit --deploy-mode cluster --master spark://ec2-54-234-92-94.compute-1.amazonaws.com:7077 --driver-memory 10g --executor-memory 10g /home/ec2-user/china-data-streaming.jar
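
The restart and upload steps run from the local machine and could be scripted roughly as follows. This is a sketch under assumptions: run and redeploy_cluster are hypothetical helpers, and the upload is assumed to use flintrock copy-file as in First Time Deployment. With DRY_RUN=1 the commands are only printed, which allows checking the sequence without touching the cluster. The final spark-submit is left out because it runs on the external instance.

```shell
# Hypothetical wrapper (an assumption): when DRY_RUN is 1, print the command
# instead of executing it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Restart the cluster, then re-upload the jar. Stops at the first failing step.
redeploy_cluster() {
  cluster="$1"      # e.g. prod-spark-cluster
  local_jar="$2"    # path to the freshly built fat jar
  remote_dir="$3"   # e.g. /home/ec2-user/
  run flintrock stop "$cluster" &&
  run flintrock start "$cluster" &&
  run flintrock copy-file "$cluster" "$local_jar" "$remote_dir"
}
```

Setting DRY_RUN=1 and then calling redeploy_cluster prod-spark-cluster target/scala-2.11/china-data-streaming.jar /home/ec2-user/ prints the three flintrock commands in order. Remember to save any logs you need first, since the restart removes them.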

INFO: Records on the Kinesis Data Stream are retained for 24 hours. To avoid data loss, redeploy the Spark cluster within 24 hours of stopping it.
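
To keep an eye on that window during a redeploy, the remaining time can be computed from the moment the application stopped reading. This is a minimal sketch assuming the 24-hour retention stated above. seconds_until_expiry is a hypothetical helper, and both timestamps are epoch seconds.

```shell
# Retention window stated above: 24 hours, expressed in seconds.
RETENTION_SECONDS=$((24 * 60 * 60))

# Hypothetical helper (an assumption): prints how many seconds remain before
# a record written at stop time ages out of the stream.
# A negative result means the window has already passed.
seconds_until_expiry() {
  stopped_at="$1"   # epoch seconds when the Spark app stopped reading
  now="$2"          # current epoch seconds, e.g. $(date +%s)
  echo $(( stopped_at + RETENTION_SECONDS - now ))
}
```

For example, record stopped_at=$(date +%s) just before stopping the cluster, then call seconds_until_expiry "$stopped_at" "$(date +%s)" while redeploying.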

Debugging

To increase the Spark deploy response timeout, set spark.rpc.askTimeout and spark.network.timeout to 800 in /opt/spark-2.4.3-bin-hadoop2.6/conf/spark-defaults.conf
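
A spark-defaults.conf file holds one property per line, with the key and value separated by whitespace. The sketch below appends the two timeout properties, spark.rpc.askTimeout and spark.network.timeout, to a given file. set_spark_timeouts is a hypothetical helper, and the default of 800 is taken from the text above. It appends rather than replaces, so check the file for existing entries first.

```shell
# Hypothetical helper (an assumption): appends both timeout settings to a
# spark-defaults.conf file. Existing entries are not modified or removed.
set_spark_timeouts() {
  conf_file="$1"        # e.g. /opt/spark-2.4.3-bin-hadoop2.6/conf/spark-defaults.conf
  timeout="${2:-800}"   # value from the text above unless overridden
  {
    printf 'spark.rpc.askTimeout %s\n' "$timeout"
    printf 'spark.network.timeout %s\n' "$timeout"
  } >> "$conf_file"
}
```

Running set_spark_timeouts /opt/spark-2.4.3-bin-hadoop2.6/conf/spark-defaults.conf on each node would add both lines. Jobs submitted afterwards should pick up the new values.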
