trivadispf / platys-modern-data-platform
Support for generating modern platforms dynamically with services such as Kafka, Spark, Streamsets, HDFS, ...
License: Apache License 2.0
Add the https://hub.docker.com/r/prom/prometheus/ image to the stack.
Add the InfluxData TICK Stack with the following services: Telegraf, InfluxDB, Chronograf and Kapacitor.
It would be nice to have the dfs and/or hadoop fs commands available through the %sh interpreter of Zeppelin.
Event Sim (https://github.com/josephadler/eventsim) is a data simulator which generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.
The visualization GUI Superset should be added to the stack.
Use this service as a base (https://github.com/johannestang/bigdata_stack/blob/master/docker-compose.yml):
superset:
  image: amancevice/superset:0.28.1
  restart: always
  depends_on:
    - superset-postgres
    - superset-redis
  environment:
    MAPBOX_API_KEY: ${MAPBOX_API_KEY}
  ports:
    - "8088:8088"
  volumes:
    - ./config/superset_config.py:/etc/superset/superset_config.py
The current Minio container only runs on x86 hardware. It would be interesting to have Minio support on ARM hardware as well.
Add information on the supported OS architecture to the Configuration file.
Currently, dependencies between containers are handled in the template when generating the container; for example, for Spark we make sure that Hadoop is also added.
It would be better to have a separate section in the header which just overwrites the setting of an XXXX_enabled flag, so that the service declaration is cleaner (it only checks the XXXX_enabled flag for the given service and not for dependencies as well).
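A minimal sketch (not the actual generator code) of that idea: resolve dependencies up front by forcing the XXXX_enabled flags, so each service template only has to check its own flag. The dependency map below is an assumption for illustration.

```python
# Hypothetical dependency map: Spark needs Hadoop (assumption for illustration).
DEPENDENCIES = {"SPARK": ["HADOOP"]}

def resolve_flags(config):
    """Force-enable the dependencies of every enabled service."""
    for service, deps in DEPENDENCIES.items():
        if config.get(f"{service}_enabled"):
            for dep in deps:
                config[f"{dep}_enabled"] = True
    return config

print(resolve_flags({"SPARK_enabled": True}))
# → {'SPARK_enabled': True, 'HADOOP_enabled': True}
```

With this pass done once in the header handling, the Spark service template no longer needs to know about Hadoop at all.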
The Hue service shows 2 errors after login:
Solr server could not be contacted properly: HTTPConnectionPool(host='localhost', port=8983): Max retries exceeded with url: /solr/admin/info/system?user.name=hue&doAs=hue&wt=json (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 111] Connection refused',))
and
database is locked
Hive and the HDFS browser seem to work, but it is not clear whether these errors cause side-effects.
Currently the stack uses the mujz/pagila image for PostgreSQL. Replace it with the official one to also get ARM support.
The reason pagila was used is to get some data pre-loaded; this should be possible with the official image as well and needs to be documented in the how-to.
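For the pre-loading, the official postgres image runs any *.sql / *.sh files placed in /docker-entrypoint-initdb.d on the first start against an empty data directory. A sketch (file names, paths and credentials are assumptions):

```yaml
postgresql:
  image: postgres:12
  environment:
    POSTGRES_USER: pagila
    POSTGRES_PASSWORD: sample
  volumes:
    # init scripts run once, in alphabetical order, on an empty data dir
    - ./init/pagila-schema.sql:/docker-entrypoint-initdb.d/01-schema.sql
    - ./init/pagila-data.sql:/docker-entrypoint-initdb.d/02-data.sql
```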
Add LDAP as a service so that it can be used for authentication by other services.
Look at https://hub.docker.com/r/osixia/openldap as a candidate.
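A possible service definition based on the osixia/openldap image and its documented environment variables; organisation, domain and password values are placeholder assumptions:

```yaml
openldap:
  image: osixia/openldap:1.3.0
  container_name: openldap
  environment:
    LDAP_ORGANISATION: "MDP"
    LDAP_DOMAIN: "mdp.local"
    LDAP_ADMIN_PASSWORD: "admin"
  ports:
    - "389:389"    # LDAP
    - "636:636"    # LDAPS
```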
Generate a documentation page with a table for the APIs and another table for the UIs offered by the stack.
The information can be retrieved from the labels in the generated docker-compose.yml file; these will be added by #36.
Metadata for a web UI will look like shown below:
kafka-manager:
  image: trivadis/kafka-manager:latest
  container_name: kafka-manager
  hostname: kafka-manager
  labels:
    com.mdps.service.webui.url: http://${PUBLIC_IP}:28038
The documentation could be part of a Markdown file or a web application started as a separate container.
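A sketch of how the Markdown table could be generated from the com.mdps.service.webui.* labels; the function name is made up, and it assumes the labels are written in mapping form and that docker-compose.yml has already been parsed (e.g. with PyYAML) into a dict:

```python
def webui_table(services):
    """Render a Markdown table of all services that expose a web UI.

    services: the 'services' mapping of a parsed docker-compose.yml.
    """
    rows = ["| Service | Web UI URL |", "| --- | --- |"]
    for name, svc in sorted(services.items()):
        # labels may be absent; only services carrying the webui label appear
        url = (svc.get("labels") or {}).get("com.mdps.service.webui.url")
        if url:
            rows.append(f"| {name} | {url} |")
    return "\n".join(rows)

# Example with the kafka-manager metadata shown above:
services = {
    "kafka-manager": {
        "image": "trivadis/kafka-manager:latest",
        "labels": {"com.mdps.service.webui.url": "http://${PUBLIC_IP}:28038"},
    }
}
print(webui_table(services))
```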
The mdps-generate script does not work if there is no docker folder.
Change the script to create the folder if it is not there.
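A minimal sketch of the fix, assuming the script generates into a docker folder relative to the current directory:

```shell
# create the output folder if it does not exist; -p makes this a no-op
# when the folder is already there, so it is safe to run every time
mkdir -p docker
```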
Hi Guido,
The URLs in the README of the docker subdirectory need some refactoring:
Download the docker-compose.yml file from GitHub using wget or curl, i.e.
(wget https://raw.githubusercontent.com/TrivadisBDS/modern-data-analytics-stack/master/docker-compose.yml)
The file was moved to:
(https://raw.githubusercontent.com/TrivadisBDS/modern-data-analytics-stack/master/docker/docker-compose.yml)
Also, the configuration files are needed in the conf folder. How about:
#!/bin/bash
mkdir -p conf
cd conf
base="https://raw.githubusercontent.com/TrivadisBDS/modern-data-analytics-stack/master/docker/conf/"
for file in hadoop-hive.env hadoop.env hive-site.xml hue.ini
do
  wget "$base$file"
done
Best Regards,
Philip
Add the https://hub.docker.com/r/streamsets/transformer image to the stack.
Add Maria DB as another RDBMS service. Use this image: https://hub.docker.com/_/mariadb.
In contrast to MySQL, the MariaDB image is also supported on ARM hardware.
mariadb:
  image: mariadb/server:10.3
  container_name: mariadb
  restart: unless-stopped
  environment:
    MYSQL_ROOT_PASSWORD: "${MYSQL_ROOT_PASSWORD}"
    MYSQL_DATABASE: ha_db
    MYSQL_USER: homeassistant
    MYSQL_PASSWORD: "${HA_MYSQL_PASSWORD}"
  user: "${LOCAL_USER}:${LOCAL_USER}"
  volumes:
    # Local path where the database will be stored.
    - <local db path>:/var/lib/mysql
  ports:
    - "3306:3306"
Add an option to the platys script so that the generated artifacts can be copied to a remote server, e.g.:
sshpass -p "hypriot" scp -rp docker [email protected]:/tmp
sshpass allows passing the password so that the user does not have to enter it manually. To avoid having to install sshpass, add it to the generator container so that it can be started through that container.
The current FTP server only runs on x86 architectures but not on ARM. Check for a replacement image which offers the same support as the current one but also runs on ARM. This one is a candidate: https://hub.docker.com/r/gists/pure-ftpd
The same as for AWS S3 should also be possible with Azure Storage.
Add the necessary dependencies to spark-defaults.conf and install the CLI into Zeppelin.
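A sketch of what the spark-defaults.conf additions could look like, assuming the hadoop-azure (WASB) connector; the artifact versions and the account/key placeholders are assumptions:

```
spark.jars.packages    org.apache.hadoop:hadoop-azure:2.7.7,com.microsoft.azure:azure-storage:8.6.0
spark.hadoop.fs.azure.account.key.<storage-account>.blob.core.windows.net    <access-key>
```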
Add the two services similar to the DLQ sample to the stack.
Add TimescaleDB, a time-series option on top of PostgreSQL, to the stack. TimescaleDB is already in use by some of our customers and should therefore be part of the MDP stack as well.
Use this image: https://hub.docker.com/r/timescale/timescaledb/
Add Zeek Network Security Monitor to stack as a streaming source and configure it to send data to Kafka. The docker image can be found here: https://hub.docker.com/r/blacktop/zeek
Add Presto to the stack.
For Presto:
presto-coordinator:
  image: johannestang/prestodb:0.215
  restart: always
  ports:
    - "8080:8080"
  environment:
    S3_ACCESS_KEY: ${MINIO_ACCESS_KEY}
    S3_SECRET_KEY: ${MINIO_SECRET_KEY}
    S3_ENDPOINT: "http://minio:9000"
Add Presto to the Hue configuration (check out https://github.com/johannestang/bigdata_stack/blob/master/docker-compose.yml).
Revise the usage of KAFKA_ADVERTISED_LISTENERS according to Robin's blog: https://rmoff.net/2018/08/02/kafka-listeners-explained/
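Following the pattern from that blog, the broker would get one listener for the internal Docker network and a second one advertised to external clients; a sketch using the Confluent image's environment variables (image tag and ports are assumptions):

```yaml
broker-1:
  image: confluentinc/cp-kafka:5.3.1
  environment:
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
    # INTERNAL is only resolvable inside the Docker network
    KAFKA_LISTENERS: INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:19092
    # EXTERNAL advertises the host address to clients outside Docker
    KAFKA_ADVERTISED_LISTENERS: INTERNAL://broker-1:9092,EXTERNAL://${PUBLIC_IP}:19092
    KAFKA_INTER_BROKER_LISTENER_NAME: INTERNAL
```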
Configure Hive to Use the LDAP provided with the Stack.
http://tate.cx/configuring-hive-hiveserver2-to-use-ldap-or-ldaps-for-authentication/
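A sketch of the hive-site.xml properties for HiveServer2 LDAP authentication; the LDAP host name and the baseDN are assumptions based on an OpenLDAP service in the stack:

```xml
<property>
  <name>hive.server2.authentication</name>
  <value>LDAP</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>ldap://openldap:389</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.baseDN</name>
  <value>ou=users,dc=mdp,dc=local</value>
</property>
```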
We can use the labels section of the docker-compose.yml to add metadata. One use case is to specify the URLs of Web UIs or REST APIs as labels, so that documentation with hyperlinks to the service UIs and APIs can be generated.
The following metadata should be added:

For APIs:
com.mdps.service.restapi.url - the URL of the REST API provided by the service
com.mdps.service.restapi.name - the name of the service

For UIs:
com.mdps.service.webui.url - the URL of the Web UI provided by the service
com.mdps.service.webui.name - the name of the Web UI provided by the service

A service can also specify both, if it offers both a REST API and a Web UI.
Add the Streamsets Edge container to the stack, so that it can be used as well for testing.
If you want to use a region that supports only Signature V4 in Spark, you can pass the flag -Dcom.amazonaws.services.s3.enableV4 to the driver options and executor options at runtime. For example:

spark-submit --conf spark.driver.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
  --conf spark.executor.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
  ... (other spark options)

With these settings, Spark is able to write to Frankfurt (and other V4-only regions) even with a not-so-fresh AWS SDK version (com.amazonaws:aws-java-sdk:1.7.4 in my case).
Rename the service, as in the long term broker-N is not the best name for it; we also support MQTT brokers, among others.
Currently the number of Zookeeper nodes is hardcoded to 1. Add support for multiple Zookeeper nodes (3). See this docker-compose.yml for reference: https://github.com/confluentinc/cp-docker-images/blob/5.3.1-post/examples/kafka-cluster/docker-compose.yml
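Modeled on that Confluent example, each of the three nodes would get its own ZOOKEEPER_SERVER_ID and the full server list; a sketch of the first node (image tag and timing values follow the example, service names are assumptions):

```yaml
zookeeper-1:
  image: confluentinc/cp-zookeeper:5.3.1
  environment:
    ZOOKEEPER_SERVER_ID: 1
    ZOOKEEPER_CLIENT_PORT: 2181
    ZOOKEEPER_TICK_TIME: 2000
    ZOOKEEPER_INIT_LIMIT: 5
    ZOOKEEPER_SYNC_LIMIT: 2
    # all three ensemble members, peer and leader-election ports
    ZOOKEEPER_SERVERS: zookeeper-1:2888:3888;zookeeper-2:2888:3888;zookeeper-3:2888:3888
```

zookeeper-2 and zookeeper-3 would be identical apart from their ZOOKEEPER_SERVER_ID.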
When using Zeppelin Spark Interpreter to access Spark, I always get a "bucket does not exists" error.
Add the https://hub.docker.com/_/python container to the stack.
Currently the data-transfer folder is mapped into some of the services (where it makes sense). This should be configurable, so that it can be disabled.
Currently Spark does not work properly when the property spark.sql.catalogImplementation is set to hive.
The Spark interpreter throws an exception when, upon initializing the session, dependencies have to be downloaded due to use of the spark.jars.packages property. This happens when the dependencies for the AWS integration are added.
We should be able to specify the OS architecture when generating the docker-compose.yml, so that the correct containers are used for the given platform.
We should support x86-64 and ARM (for Raspberry Pis).
Currently spark.history.fs.logDirectory and spark.eventLog.dir in spark-defaults.conf are hardcoded to use HDFS. This way Spark is dependent on a full Hadoop stack.
It would be better if we could choose where the logs are stored.
The mdps_generate.sh script currently works for Linux and Mac. We need support for Windows in the same way, so that it can be downloaded and run from Windows.
At the moment the mdps_generate script assumes that the first parameter references the custom.yml using an absolute path.
For automated, scripted deployment scenarios it would be nice if a URL could be passed as well. That way the config.yml could be hosted anywhere on the internet (e.g. as a simple gist) and would be downloaded by the mdps_generate script or inside the docker container which is run by the script.
This would be helpful to set up a simple stack on Lightsail.
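A sketch of how the script could accept either a local path or a URL; the function name and the download target file are assumptions:

```shell
# Return a local path for the given config argument: download it first
# when it is an http(s) URL, otherwise pass the path through unchanged.
resolve_config() {
  case "$1" in
    http://*|https://*)
      curl -fsSL "$1" -o custom.yml && echo custom.yml
      ;;
    *)
      echo "$1"
      ;;
  esac
}
```

Usage inside the script would then be something like config=$(resolve_config "$1").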
When starting the container after upgrading to 0.8.2, the UI no longer shows when entering the URL in the browser.
Test again setting the Schema Registry URL in SDC to http://schema-registry:8089 instead of http://analyticsplatform:8089. This worked in the docker environment within a VM, but not on Docker for Mac.
There are some documentation pages in the Static Stack which should be moved to the doc folder.
Add Node-RED to the platform with support for both amd64 and arm architectures. Use this image: https://hub.docker.com/r/nodered/node-red
Add the InfluxDB 2.0 service.