
This project is forked from aksl20/platform-ds.


This is a quick-and-dirty data analytics platform based on Spark, Hadoop and JupyterHub. All of these tools are deployed automatically with Docker and docker-compose.

Shell 29.63% Makefile 2.34% Dockerfile 21.64% Jupyter Notebook 46.39%


platform-ds

Data science environment managed by Docker and docker-compose. This stack creates a standalone Spark cluster with 2 workers, 1 master and 1 driver (jupyterlab). It also creates a Hadoop YARN cluster with 1 resource manager, 1 node manager, 1 namenode, 1 datanode and a history server.

Prerequisites

Launch the platform

$ git clone <repo_url>
$ cd platform-ds
$ make

Then, to start a standalone Spark cluster:

$ docker-compose -f spark-cluster.yml up -d

Or, to run Spark in local mode inside the jupyter container:

$ docker-compose -f spark-local.yml up -d

You can open a shell in the namenode container by running the following command:

$ docker exec -it namenode bash

To obtain the Jupyter token, open a shell in the jupyter container and list the running notebook servers:

$ docker exec -it jupyter bash
$ jupyter notebook list
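The listing printed by `jupyter notebook list` includes each server URL with the token as a query parameter. As a minimal sketch, assuming lines of the form `http://<host>:<port>/?token=<token> :: <directory>` (the exact format may vary between Jupyter versions), the token can be extracted with the Python standard library. The helper name `extract_tokens` is hypothetical and not part of this project.

```python
import re
from typing import List

# Assumed line format (may vary by Jupyter version):
#   http://0.0.0.0:8888/?token=abc123 :: /home/jovyan
TOKEN_RE = re.compile(r"[?&]token=([0-9A-Za-z]+)")


def extract_tokens(listing: str) -> List[str]:
    """Return every token found in the text printed by `jupyter notebook list`."""
    return TOKEN_RE.findall(listing)


if __name__ == "__main__":
    # Hypothetical sample output, used only to illustrate the parsing.
    sample = (
        "Currently running servers:\n"
        "http://0.0.0.0:8888/?token=abc123 :: /home/jovyan\n"
    )
    print(extract_tokens(sample))
```

The output of `docker exec jupyter jupyter notebook list` could be piped into such a script from the host, assuming the container is named jupyter as in the commands above.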

Spark and Hadoop in jupyter

If you launch the spark cluster, you can connect to it in the jupyter notebook by running the following code:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('test').setMaster('spark://spark-master:7077')
sc = SparkContext(conf=conf)

You can then read files from HDFS as follows:

lines = sc.textFile("hdfs://namenode:9000/<your_path_to_the_file>")
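To avoid retyping the namenode address, the URI can be built by a small helper. This is only a sketch: `hdfs_uri` and `count_lines` are hypothetical names, the defaults `namenode` and `9000` are taken from the URI above, and `sc` is assumed to be the SparkContext created earlier.

```python
def hdfs_uri(path: str, host: str = "namenode", port: int = 9000) -> str:
    """Build an HDFS URI such as hdfs://namenode:9000/data/file.csv."""
    # Strip leading slashes so the result never contains a double slash.
    return f"hdfs://{host}:{port}/{path.lstrip('/')}"


def count_lines(sc, path: str) -> int:
    """Count the lines of a text file stored in HDFS.

    `sc` must be an existing pyspark SparkContext; this function only
    calls SparkContext.textFile and RDD.count.
    """
    return sc.textFile(hdfs_uri(path)).count()
```

For example, `count_lines(sc, "/data/cardio_base.csv")` would read `hdfs://namenode:9000/data/cardio_base.csv`, provided that file has already been uploaded to HDFS.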

Connect to the platform

  • JupyterLab session: http://<ip_or_hostname_server>:10000
  • Hadoop namenode: http://<ip_or_hostname_server>:9870
  • Hadoop datanode: http://<ip_or_hostname_server>:9864
  • Resource Manager: http://<ip_or_hostname_server>:8088

Spark cluster

  • Spark master: http://<ip_or_hostname_server>:8585 (webui) or spark://<ip_or_hostname_server>:7077 (jobs)
  • Spark worker-[x]: http://<ip_or_hostname_server>:808[x]

Spark local

  • Spark webui: http://<ip_or_hostname_server>:4040
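A quick way to see which of the services listed above are reachable is to try a TCP connection to each port. The following is a minimal sketch using only the Python standard library. The port numbers come from the lists above; the service labels and the function names `is_open` and `check_platform` are hypothetical. Worker web UIs (808[x]) are left out because x depends on the worker.

```python
import socket

# Ports copied from the lists above; the labels are illustrative only.
PORTS = {
    "jupyterlab": 10000,
    "hadoop-namenode": 9870,
    "hadoop-datanode": 9864,
    "resource-manager": 8088,
    "spark-master-webui": 8585,
    "spark-master-jobs": 7077,
    "spark-local-webui": 4040,
}


def is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_platform(host: str) -> dict:
    """Map each service label to whether its port accepts connections."""
    return {name: is_open(host, port) for name, port in PORTS.items()}
```

Calling `check_platform("<ip_or_hostname_server>")` returns a dictionary of booleans. Since the Spark cluster and Spark local setups are started separately, some entries are expected to be False depending on which compose file is running.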

TODO LIST

  • Add linked folder between jupyter container and host machine (handle permission issues)

Quick Start HDFS:

  • Copy cardio_base.csv to the namenode container: docker cp data/1/cardio_base.csv namenode:cardio_base.csv
  • Open a bash shell in the namenode container: docker exec -it namenode bash
  • Create the HDFS directory /data/: hdfs dfs -mkdir -p /data/
  • Copy cardio_base.csv to HDFS: hdfs dfs -put cardio_base.csv /data/cardio_base.csv
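The steps above can also be scripted from the host. The sketch below only builds the `docker` command lines and runs them with `subprocess`. It assumes the container is named namenode, that the `hdfs` binary is on the PATH for non-interactive `docker exec` calls, and that the file is copied to the container root (which is where `docker cp` resolves a relative container path). The function names are hypothetical.

```python
import posixpath
import subprocess
from typing import List


def hdfs_upload_commands(
    local_file: str, hdfs_dir: str = "/data", container: str = "namenode"
) -> List[List[str]]:
    """Build the docker commands that copy a local file into HDFS."""
    name = posixpath.basename(local_file)
    in_container = "/" + name  # absolute path inside the container
    target = posixpath.join(hdfs_dir, name)
    return [
        # 1. Copy the file from the host into the container.
        ["docker", "cp", local_file, f"{container}:{in_container}"],
        # 2. Create the HDFS directory (no error if it already exists).
        ["docker", "exec", container, "hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        # 3. Put the file into HDFS.
        ["docker", "exec", container, "hdfs", "dfs", "-put", in_container, target],
    ]


def run_all(commands: List[List[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For example, `run_all(hdfs_upload_commands("data/1/cardio_base.csv"))` would place the file at /data/cardio_base.csv in HDFS, assuming the containers are already running.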

Contributors

aksl20, pivettamarcos
