trivadispf / platys-modern-data-platform

Support for generating modern data platforms dynamically, with services such as Kafka, Spark, StreamSets, HDFS, and more.

License: Apache License 2.0

HTML 2.25% Dockerfile 0.42% Shell 4.03% Python 0.71% Scala 0.48% Java 0.06% Jinja 92.04% Jupyter Notebook 0.01%
docker kafka hadoop spark analytics-stack

platys-modern-data-platform's People

Contributors

dependabot[bot], gschmutz, lucafurrer, lucasjellema, ufasoli


platys-modern-data-platform's Issues

Add dependencies to the header of the template

Currently, dependencies between containers are handled inside the template when generating each container. For example, for Spark we make sure that Hadoop is also added.

It would be better to have a separate section in the header which just overwrites the XXXX_enabled flags of dependencies, so that the service declarations are cleaner (each one only checks the XXXX_enabled flag for its own service, not for its dependencies as well).
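The proposed header-level resolution could be sketched roughly as follows (the dependency map, flag names and function are hypothetical, not the generator's actual code): flags are fixed up once, before any service template is rendered.

```python
# Hypothetical sketch: resolve service dependencies once, up front,
# so each Jinja service template only checks its own *_enabled flag.
DEPENDENCIES = {
    "SPARK": ["HADOOP"],
    "HUE": ["HADOOP"],
}

def resolve_dependencies(config):
    """Enable every (transitive) dependency of the enabled services."""
    changed = True
    while changed:
        changed = False
        for service, deps in DEPENDENCIES.items():
            if config.get(f"{service}_enabled"):
                for dep in deps:
                    if not config.get(f"{dep}_enabled"):
                        config[f"{dep}_enabled"] = True
                        changed = True
    return config

config = resolve_dependencies({"SPARK_enabled": True})
print(config)  # HADOOP_enabled has been switched on as well
```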

Hue throws errors upon login

The Hue service shows 2 errors after login:

Solr server could not be contacted properly: HTTPConnectionPool(host='localhost', port=8983): Max retries exceeded with url: /solr/admin/info/system?user.name=hue&doAs=hue&wt=json (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 111] Connection refused',))

and

database is locked

Hive and the HDFS browser seem to work, but it is not clear whether these errors cause any side effects.
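If the Solr error simply comes from Hue's Search app being enabled while no Solr service is part of the generated stack, one workaround (a sketch, assuming the stack mounts a custom hue.ini) is to blacklist the unused app:

```ini
# hue.ini -- hide apps whose backing services are not in the stack
[desktop]
  # "search" depends on Solr; blacklisting it suppresses the
  # "Solr server could not be contacted" check on login
  app_blacklist=search
```

The `database is locked` error is typical for Hue's default SQLite backend; pointing Hue's `[desktop]` `[[database]]` section at the stack's PostgreSQL service would likely avoid it.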

Replace Postgres image with official one

Currently the stack is using the mujz/pagila image for PostgreSQL. Replace it with the official one to also get ARM support.

The reason the pagila image was used is to get some data pre-loaded. This should be possible with the official image as well and needs to be documented in the how-to.
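The official postgres image executes any *.sql or *.sh scripts mounted into /docker-entrypoint-initdb.d on first start, so the Pagila data could be pre-loaded roughly like this (a sketch; the local script path, tag and credentials are assumptions):

```yaml
  postgresql:
    image: postgres:13
    container_name: postgresql
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: "${POSTGRES_PASSWORD}"
    volumes:
      # scripts in this folder run once, when the data dir is empty
      - ./init/pagila.sql:/docker-entrypoint-initdb.d/pagila.sql
    ports:
      - "5432:5432"
```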

Generate a documentation page with a list of UI's and API's provided by the stack

Generate a documentation page with a table for the API's and another table for the UI's offered by the stack.

The information can be retrieved from the labels in the generated docker-compose.yml file. These will be added by #36.

Metadata for a web UI will look like shown below

  kafka-manager:
    image: trivadis/kafka-manager:latest
    container_name: kafka-manager
    hostname: kafka-manager
    labels:
      com.mdps.service.webui.url: http://${PUBLIC_IP}:28038

The documentation could be part of a Markdown file or a web application started as a separate container.
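Generating the Markdown variant could be as simple as the following sketch (the function is hypothetical; YAML parsing, e.g. with PyYAML, is left out so the example stays self-contained and works on an already-parsed services dict):

```python
# Sketch: render the com.mdps.service.webui.* labels of a parsed
# docker-compose.yml into a Markdown table of Web UI links.
services = {
    "kafka-manager": {
        "com.mdps.service.webui.name": "Kafka Manager",
        "com.mdps.service.webui.url": "http://${PUBLIC_IP}:28038",
    },
}

def webui_table(services):
    rows = ["| Service | Web UI | URL |", "|---|---|---|"]
    for svc, labels in sorted(services.items()):
        url = labels.get("com.mdps.service.webui.url")
        if url:  # only services that actually declare a web UI
            name = labels.get("com.mdps.service.webui.name", svc)
            rows.append(f"| {svc} | {name} | {url} |")
    return "\n".join(rows)

print(webui_table(services))
```

An analogous table for the com.mdps.service.restapi.* labels would cover the APIs.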

Link update needed in README

Hi Guido.

The URLs in the README of the docker subdirectory need some refactoring:

Download the code of the docker-compose.yml file from GitHub using wget or curl, i.e.

(wget https://raw.githubusercontent.com/TrivadisBDS/modern-data-analytics-stack/master/docker-compose.yml)

The docker-compose.yml file was moved to:
(https://raw.githubusercontent.com/TrivadisBDS/modern-data-analytics-stack/master/docker/docker-compose.yml)

Also the configuration files are needed in the conf folder. How about:

#!/bin/bash
mkdir conf
cd conf
base="https://raw.githubusercontent.com/TrivadisBDS/modern-data-analytics-stack/master/docker/conf/"
for file in hadoop-hive.env hadoop.env hive-site.xml hue.ini
do
    wget "$base$file"
done

Best Regards,
Philip

Add MariaDB service

Add MariaDB as another RDBMS service. Use this image: https://hub.docker.com/_/mariadb.

In contrast to MySQL, the MariaDB image is also supported on ARM hardware.

  mariadb:
    image: mariadb/server:10.3
    container_name: mariadb
    restart: unless-stopped
    environment:
      MYSQL_ROOT_PASSWORD: "${MYSQL_ROOT_PASSWORD}"
      MYSQL_DATABASE: ha_db
      MYSQL_USER: homeassistant
      MYSQL_PASSWORD: "${HA_MYSQL_PASSWORD}"
    user: "${LOCAL_USER}:${LOCAL_USER}"
    volumes:
      # Local path where the database will be stored.
      - <local db path>:/var/lib/mysql
    ports:
      - "3306:3306"

Add Azure Storage Support

The same as for AWS S3 should also be possible with Azure Storage.

Add the necessary dependencies to spark-defaults.conf and install the Azure CLI into the Zeppelin container.
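For the wasbs:// protocol this would amount to something like the following spark-defaults.conf sketch (the artifact versions, storage account and key are placeholders, not tested values):

```properties
# spark-defaults.conf -- sketch of Azure Blob Storage (wasbs://) support
spark.jars.packages    org.apache.hadoop:hadoop-azure:2.7.7,com.microsoft.azure:azure-storage:8.6.6
spark.hadoop.fs.azure.account.key.<storage-account>.blob.core.windows.net    <access-key>
```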

Add metadata to services in docker-compose stack

We can use the labels section of the docker-compose.yml to add metadata. One use case is to specify the URLs for Web UI or REST APIs as labels, so that some documentation can be generated with hyperlinks to the service UIs and APIs.

The following metadata should be added

For APIs

com.mdps.service.restapi.url - the URL of the REST API provided by the service
com.mdps.service.restapi.name - the name of the service

For UIs

com.mdps.service.webui.url - the URL of the WEB UI provided by the service
com.mdps.service.webui.name - the name of the WEB UI provided by the service

A service can also specify both, if it offers both a REST API and a web UI
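For such a service the labels could look like the following sketch (service name, image and port values are hypothetical):

```yaml
  schema-registry:
    image: confluentinc/cp-schema-registry:latest
    container_name: schema-registry
    labels:
      com.mdps.service.restapi.name: Schema Registry API
      com.mdps.service.restapi.url: http://${PUBLIC_IP}:8081
      com.mdps.service.webui.name: Schema Registry UI
      com.mdps.service.webui.url: http://${PUBLIC_IP}:28039
```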

Amazon s3a returns 400 Bad Request with Spark and Zeppelin

If you want to use a region that only supports Signature V4 with Spark, you can pass the flag -Dcom.amazonaws.services.s3.enableV4 in the driver options and executor options at runtime. For example:

spark-submit --conf spark.driver.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
    --conf spark.executor.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
    ... (other spark options)

With these settings Spark is able to write to Frankfurt (and other V4-only regions) even with a not-so-fresh AWS SDK version (com.amazonaws:aws-java-sdk:1.7.4 in my case)
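Instead of passing the flags on every spark-submit, the same options can be set once for all jobs in spark-defaults.conf:

```properties
# spark-defaults.conf -- enable AWS Signature V4 for all Spark jobs
spark.driver.extraJavaOptions      -Dcom.amazonaws.services.s3.enableV4
spark.executor.extraJavaOptions    -Dcom.amazonaws.services.s3.enableV4
```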

Spark Interpreter timeout when downloading dependencies

The Spark interpreter throws an exception when dependencies have to be downloaded upon initializing the session because the spark.jars.packages property is used. This happens when the dependencies for the AWS integration are added.

Add support for AMD vs ARM based stacks

We should be able to specify the OS Architecture when generating the docker-compose.yml, so that the correct containers are being used for the given platform.

We should support x86-64 and ARM (for Raspberry Pis)
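In the Jinja templates this could be driven by a platform variable, along these lines (the variable name and the ARM image are assumptions, not the actual implementation):

```jinja
  zookeeper:
    {% if platform == 'arm' %}
    image: arm64v8/zookeeper:latest
    {% else %}
    image: confluentinc/cp-zookeeper:latest
    {% endif %}
    hostname: zookeeper
```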

Make Spark Log Dir configurable for Minio or HDFS

Currently spark.history.fs.logDirectory and spark.eventLog.dir in spark-defaults.conf are hardcoded to use HDFS. This makes Spark dependent on a full Hadoop stack.

It would be better if for the log we could choose:

  • "none" - no history is written
  • MINIO - use minio for the log
  • HDFS - use Hadoop HDFS for the log
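For the MINIO option the relevant spark-defaults.conf settings could look like this sketch (bucket name, endpoint and credentials are placeholders):

```properties
# spark-defaults.conf -- event log on MinIO via the s3a connector
spark.eventLog.enabled                   true
spark.eventLog.dir                       s3a://spark-logs/
spark.history.fs.logDirectory            s3a://spark-logs/
spark.hadoop.fs.s3a.endpoint             http://minio:9000
spark.hadoop.fs.s3a.path.style.access    true
spark.hadoop.fs.s3a.access.key           <access-key>
spark.hadoop.fs.s3a.secret.key           <secret-key>
```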

Support a URL to the custom.yml file as a parameter to the mdps_generate script

At the moment the mdps_generate script assumes that the 1st parameter references the custom.yml using an absolute path.
For automated, scripted deployment scenarios it would be nice if a URL could be passed as well. That way the config.yml could be hosted anywhere on the internet (e.g. as a simple gist) and would be downloaded by the mdps_generate script, or inside the Docker container which is run by the script.

This would be helpful to setup a simple stack on Lightsail.
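The argument handling could be sketched like this (the function name is hypothetical; the real mdps_generate script may differ):

```shell
#!/bin/bash
# Sketch: resolve the first argument of mdps_generate, accepting
# either a URL or a local path to the custom.yml.
resolve_config() {
  local config="$1"
  case "$config" in
    http://*|https://*)
      # download the remote custom.yml to a local working copy
      curl -sSfL "$config" -o custom.yml
      echo "custom.yml"
      ;;
    *)
      # local path: use it as-is
      echo "$config"
      ;;
  esac
}

resolve_config "/home/user/custom.yml"   # prints the path unchanged
```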
