This project is a fork of andreiarion/dsbox.


Data Science box: Spark, Jupyter, R+RStudio, Zeppelin, Python 2 & 3, Java, Scala.

License: GNU General Public License v2.0



Data Science box (dsbox)

This is a Linux (Ubuntu) box, deployed with Vagrant, that includes the following Data Science apps: Spark, Jupyter, R + RStudio, Zeppelin, Python 2 & 3, Java, and Scala.

It has been successfully tested on both ubuntu/trusty32 and ubuntu/trusty64 boxes.

Pre-deployment steps

To install the box, follow these steps:

  1. Install VirtualBox (if you use a different provider, change the provider parameter in the Vagrantfile).
  2. Install Vagrant.
  3. Install Git.
  4. Clone this repository to a specific folder:
$ git clone https://github.com/mcolebrook/dsbox.git <YOUR_BOX_FOLDER>
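Before running `vagrant up`, it can help to confirm the three prerequisites are on your PATH. A minimal sketch (VBoxManage is VirtualBox's standard CLI entry point; vagrant and git are the usual binaries):

```shell
# Check that the prerequisites are installed and on the PATH.
for tool in VBoxManage vagrant git; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING -- install it before continuing"
  fi
done
```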

Config parameters

Go to <YOUR_BOX_FOLDER>, and edit the Vagrantfile to change the parameters:

| Parameter | Description | Default value |
|---|---|---|
| provider | VM provider | "virtualbox" |
| boxMaster | OS in the master node | "ubuntu/trusty32" |
| boxSlave | OS in the slave nodes | "ubuntu/trusty32" |
| masterRAM | Master's RAM in MB | 3072 |
| masterCPU | Master's CPU cores | 2 |
| masterName | Name of the master node, used in scripts/spark-env.sh | "spark-master" |
| masterIP | Private IP of the master node | "10.20.30.100" |
| slaves | Number of slaves (max. 9) | 2 |
| slaveRAM | Slave's RAM in MB | 1024 |
| slaveCPU | Slave's CPU cores | 2 |
| slaveName | Base name for slave nodes | "spark-slave" |
| slavesIP | Base private IP for slave nodes | "10.20.30.10" |
| IPythonPort | IPython/Jupyter port to forward (set in the Jupyter/IPython config file) | 8001 |
| SparkMasterPort | SPARK_MASTER_WEBUI_PORT | 8080 |
| SparkWorkerPort | SPARK_WORKER_WEBUI_PORT | 8081 |
| SparkAppPort | Spark application web UI port | 4040 |
| RStudioPort | RStudio Server port | 8787 |
| ZeppelinPort | Zeppelin port (its default, 8080, conflicts with Spark) | 8888 |
| SlidesPort | Port for `jupyter-nbconvert <file.ipynb> --to slides --post serve` | 8000 |
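As an illustration of how slavesIP is likely used: a common Vagrant multi-machine pattern is to append the slave index to the base IP, so with the defaults slave 1 would get 10.20.30.101. This is a sketch of that assumption; check the Vagrantfile for the exact rule the box applies:

```shell
# Sketch (assumption): derive each slave's name and private IP by appending
# its index to the slaveName/slavesIP bases from the table above.
slavesIP="10.20.30.10"   # base private IP
slaveName="spark-slave"  # base node name
slaves=2                 # number of slaves (max. 9, hence single-digit suffixes)
i=1
while [ "$i" -le "$slaves" ]; do
  echo "${slaveName}${i}: ${slavesIP}${i}"
  i=$((i + 1))
done
# prints:
#   spark-slave1: 10.20.30.101
#   spark-slave2: 10.20.30.102
```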

Starting up and shutting down the cluster

You have several ways to start up the cluster.

Deploy the master and all the slaves

To deploy the cluster with one master node and two slave nodes by default:

$ vagrant up

Bear in mind that the whole process (bringing the master and slaves up, plus provisioning) may take several minutes!

Deploy only the master

In case you only want to deploy the master node:

$ vagrant up spark-master

Halt the cluster

To shut down the whole cluster:

$ vagrant halt

Halt only the master node

If you only want to halt the master node:

$ vagrant halt spark-master

Delete the whole cluster (master + slaves)

In case you want to delete the whole cluster:

$ vagrant destroy

Start/Stop Spark

To start up the Spark cluster (master + slaves):

$ vagrant ssh spark-master
...
$ $SPARK_HOME/sbin/start-all.sh

You can also start the cluster up from the host machine by typing:

$ vagrant ssh spark-master -c "bash /opt/spark/sbin/start-all.sh"
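If you run commands on the master node often, a small wrapper saves typing. The helper below (on_master is a hypothetical name, not part of the repo's scripts) simply forwards its argument to vagrant ssh -c:

```shell
# Hypothetical convenience wrapper: run any command on the master node
# from the host machine via vagrant ssh.
on_master() {
  vagrant ssh spark-master -c "$1"
}

# Usage (from <YOUR_BOX_FOLDER> on the host):
#   on_master "bash /opt/spark/sbin/start-all.sh"
#   on_master "bash /opt/spark/sbin/stop-all.sh"
```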

To halt the cluster, just run stop-all.sh in the same way. Remember that you can access the Spark web UIs on the forwarded ports listed in the configuration table: 8080 (master), 8081 (workers), and 4040 (application UI).

Starting Jupyter

The best way to start the Jupyter notebook is the following:

$ vagrant ssh spark-master
...
$ cd /vagrant/jupyter-notebooks
$ jupyter-notebook

Inside the folder jupyter-notebooks you will find some sample notebooks. Then go to your favorite browser and open localhost:8001. You can also start the Jupyter notebook with pyspark as the default interpreter by using the script scripts/start-pyspark-notebook.sh.

To stop the notebook, just press Ctrl+C.

Starting RStudio

The RStudio Server daemon should already be running in the background, so you only have to type localhost:8787 in your browser. In order to work with Spark, run the commands inside the config.R script. You may find this RStudio cheat sheet helpful.

Installing Zeppelin

I recommend building Zeppelin separately from the provisioning of the master node, since the compilation takes a long time to complete. Run the following lines and wait until all modules are built:

$ vagrant ssh spark-master
$ cd /vagrant/scripts
$ sudo ./60-zeppelin.sh

Once all the modules are compiled inside the spark-master node, you can start Zeppelin by typing:

$ sudo env "PATH=$PATH" /opt/zeppelin/bin/zeppelin-daemon.sh start

Remember to use the same command with 'stop' to halt the daemon. Alternatively, you can run the script directly from the host machine:

$ vagrant ssh spark-master -c "bash /opt/zeppelin/bin/zeppelin-daemon.sh start"

Finally, to start working with Zeppelin you may use the notebooks inside the folder /vagrant/zeppelin_notebooks.

License

GNU General Public License v2.0. Please refer to the LICENSE file in this repository.

Acknowledgements (in alphabetical order)

Thanks to the following people for sharing their projects: Adobe Research, Damián Avila, Dan Koch, Felix Cheung, Francisco Javier Pulido, Gustavo Arjones, IBM Cloud Emerging Technologies, Jee Vang, Jeffrey Thompson, José A. Dianes, Maloy Manna, NGUYEN Trong Khoa, and Peng Cheng.

Thanks also to the following person for pointing out some bugs: Christos Iraklis Tsatsoulis.
