trivadispf / platys-modern-data-platform
Support for generating modern platforms dynamically with services such as Kafka, Spark, Streamsets, HDFS, ...
License: Apache License 2.0
Add the https://hub.docker.com/r/prom/prometheus/ image to the stack.
Add the InfluxData TICK Stack with the following services: Telegraf, InfluxDB, Chronograf and Kapacitor.
It would be nice to have the dfs and/or hadoop fs commands available through the %sh interpreter of Zeppelin.
Event Sim (https://github.com/josephadler/eventsim) is a data simulator which generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.
The visualization GUI Superset should be added to the stack.
Use this service as a base (https://github.com/johannestang/bigdata_stack/blob/master/docker-compose.yml):
superset:
  image: amancevice/superset:0.28.1
  restart: always
  depends_on:
    - superset-postgres
    - superset-redis
  environment:
    MAPBOX_API_KEY: ${MAPBOX_API_KEY}
  ports:
    - "8088:8088"
  volumes:
    - ./config/superset_config.py:/etc/superset/superset_config.py
The current Minio container only runs on x86 hardware. It would be interesting to have Minio support on ARM hardware as well.
Add information on the supported OS architecture to the Configuration file.
Currently, dependencies between containers are handled in the template when generating the container; for example, for Spark we make sure that Hadoop is also added.
It would be better to have a separate section in the header which just overwrites the setting of an XXXX_enabled flag, so that the service declaration is cleaner (it only checks the XXXX_enabled flag for the given service and not for dependencies as well).
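A minimal sketch (not the actual generator code) of that idea: resolve dependencies up front by forcing the XXXX_enabled flags, so each service template only has to check its own flag. The dependency map below is an assumption for illustration.

```python
# Hypothetical dependency map: Spark needs Hadoop (assumption for illustration).
DEPENDENCIES = {"SPARK": ["HADOOP"]}

def resolve_flags(config):
    """Force-enable the dependencies of every enabled service."""
    for service, deps in DEPENDENCIES.items():
        if config.get(f"{service}_enabled"):
            for dep in deps:
                config[f"{dep}_enabled"] = True
    return config

print(resolve_flags({"SPARK_enabled": True}))
# → {'SPARK_enabled': True, 'HADOOP_enabled': True}
```

With this pass done once in the header handling, the Spark service template no longer needs to know about Hadoop at all.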
The Hue service shows 2 errors after login:
Solr server could not be contacted properly: HTTPConnectionPool(host='localhost', port=8983): Max retries exceeded with url: /solr/admin/info/system?user.name=hue&doAs=hue&wt=json (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 111] Connection refused',))
and
database is locked
Hive and the HDFS browser seem to work, but it is not clear whether these errors cause side-effects.
Currently the stack uses the mujz/pagila image for PostgreSQL. Replace it with the official one to also get ARM support.
The reason pagila was used is to get some data pre-loaded; this should be possible with the official image as well and needs to be documented in the how-to.
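For the pre-loading, the official postgres image runs any *.sql / *.sh files placed in /docker-entrypoint-initdb.d on the first start against an empty data directory. A sketch (file names, paths and credentials are assumptions):

```yaml
postgresql:
  image: postgres:12
  environment:
    POSTGRES_USER: pagila
    POSTGRES_PASSWORD: sample
  volumes:
    # init scripts run once, in alphabetical order, on an empty data dir
    - ./init/pagila-schema.sql:/docker-entrypoint-initdb.d/01-schema.sql
    - ./init/pagila-data.sql:/docker-entrypoint-initdb.d/02-data.sql
```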
Add LDAP as a service so that it can be used for authentication by other services.
Look at https://hub.docker.com/r/osixia/openldap as a candidate.
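A possible service definition based on the osixia/openldap image and its documented environment variables; organisation, domain and password values are placeholder assumptions:

```yaml
openldap:
  image: osixia/openldap:1.3.0
  container_name: openldap
  environment:
    LDAP_ORGANISATION: "MDP"
    LDAP_DOMAIN: "mdp.local"
    LDAP_ADMIN_PASSWORD: "admin"
  ports:
    - "389:389"    # LDAP
    - "636:636"    # LDAPS
```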
Generate a documentation page with a table for the APIs and another table for the UIs offered by the stack.
The information can be retrieved from the labels in the generated docker-compose.yml file; these will be added by #36.
Metadata for a web UI will look like shown below:
kafka-manager:
  image: trivadis/kafka-manager:latest
  container_name: kafka-manager
  hostname: kafka-manager
  labels:
    com.mdps.service.webui.url: http://${PUBLIC_IP}:28038
The documentation could be part of a Markdown file or a web application started as a separate container.
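A sketch of how the Markdown table could be generated from the com.mdps.service.webui.* labels; the function name is made up, and it assumes the labels are written in mapping form and that docker-compose.yml has already been parsed (e.g. with PyYAML) into a dict:

```python
def webui_table(services):
    """Render a Markdown table of all services that expose a web UI.

    services: the 'services' mapping of a parsed docker-compose.yml.
    """
    rows = ["| Service | Web UI URL |", "| --- | --- |"]
    for name, svc in sorted(services.items()):
        # labels may be absent; only services carrying the webui label appear
        url = (svc.get("labels") or {}).get("com.mdps.service.webui.url")
        if url:
            rows.append(f"| {name} | {url} |")
    return "\n".join(rows)

# Example with the kafka-manager metadata shown above:
services = {
    "kafka-manager": {
        "image": "trivadis/kafka-manager:latest",
        "labels": {"com.mdps.service.webui.url": "http://${PUBLIC_IP}:28038"},
    }
}
print(webui_table(services))
```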
The mdps-generate script does not work if there is no docker folder.
Change the script to create the folder if it is not there.
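A minimal sketch of the fix, assuming the script generates into a docker folder relative to the current directory:

```shell
# create the output folder if it does not exist; -p makes this a no-op
# when the folder is already there, so it is safe to run every time
mkdir -p docker
```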
Hi Guido,
The URLs in the README of the docker subdirectory need some refactoring:
Download the docker-compose.yml file from GitHub using wget or curl, i.e.
(wget https://raw.githubusercontent.com/TrivadisBDS/modern-data-analytics-stack/master/docker-compose.yml)
The file was moved to:
(https://raw.githubusercontent.com/TrivadisBDS/modern-data-analytics-stack/master/docker/docker-compose.yml)
Also, the configuration files are needed in the conf folder. How about:
#!/bin/bash
mkdir -p conf
cd conf
base="https://raw.githubusercontent.com/TrivadisBDS/modern-data-analytics-stack/master/docker/conf/"
for file in hadoop-hive.env hadoop.env hive-site.xml hue.ini
do
  wget "$base$file"
done
Best Regards,
Philip
Add the https://hub.docker.com/r/streamsets/transformer image to the stack.
Add Maria DB as another RDBMS service. Use this image: https://hub.docker.com/_/mariadb.
In contrast to MySQL, the MariaDB image is also supported on ARM hardware.
mariadb:
  image: mariadb/server:10.3
  container_name: mariadb
  restart: unless-stopped
  environment:
    MYSQL_ROOT_PASSWORD: "${MYSQL_ROOT_PASSWORD}"
    MYSQL_DATABASE: ha_db
    MYSQL_USER: homeassistant
    MYSQL_PASSWORD: "${HA_MYSQL_PASSWORD}"
  user: "${LOCAL_USER}:${LOCAL_USER}"
  volumes:
    # Local path where the database will be stored.
    - <local db path>:/var/lib/mysql
  ports:
    - "3306:3306"
Add an option to the platys script so that the generated artifacts can be copied to a remote server, e.g.:
sshpass -p "hypriot" scp -rp docker [email protected]:/tmp
sshpass allows passing the password so that the user does not have to enter it manually. To avoid having to install sshpass, add it to the generator container so that it can be started through that container.
The current FTP server only runs on x86 architectures but not on ARM. Check for a replacement image which offers the same support as the current one but also runs on ARM. This one is a candidate: https://hub.docker.com/r/gists/pure-ftpd
The same as for AWS S3 should also be possible with Azure Storage.
Add the necessary dependencies to spark-defaults.conf and install the CLI into Zeppelin.
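A sketch of what the spark-defaults.conf additions could look like, assuming the hadoop-azure (WASB) connector; the artifact versions and the account/key placeholders are assumptions:

```
spark.jars.packages    org.apache.hadoop:hadoop-azure:2.7.7,com.microsoft.azure:azure-storage:8.6.0
spark.hadoop.fs.azure.account.key.<storage-account>.blob.core.windows.net    <access-key>
```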
Add the two services similar to the DLQ sample to the stack.
Add TimescaleDB, a time-series option on top of PostgreSQL, to the stack. TimescaleDB is already in use by some of our customers and should therefore be part of the MDP stack as well.
Use this image: https://hub.docker.com/r/timescale/timescaledb/
Add Zeek Network Security Monitor to stack as a streaming source and configure it to send data to Kafka. The docker image can be found here: https://hub.docker.com/r/blacktop/zeek
Add Presto to the stack.
For Presto:
presto-coordinator:
  image: johannestang/prestodb:0.215
  restart: always
  ports:
    - "8080:8080"
  environment:
    S3_ACCESS_KEY: ${MINIO_ACCESS_KEY}
    S3_SECRET_KEY: ${MINIO_SECRET_KEY}
    S3_ENDPOINT: "http://minio:9000"
Add Presto to the Hue configuration (check out https://github.com/johannestang/bigdata_stack/blob/master/docker-compose.yml).
Revise the usage of KAFKA_ADVERTISED_LISTENERS according to Robin's blog: https://rmoff.net/2018/08/02/kafka-listeners-explained/
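Following the pattern from that blog, the broker would get one listener for the internal Docker network and a second one advertised to external clients; a sketch using the Confluent image's environment variables (image tag and ports are assumptions):

```yaml
broker-1:
  image: confluentinc/cp-kafka:5.3.1
  environment:
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
    # INTERNAL is only resolvable inside the Docker network
    KAFKA_LISTENERS: INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:19092
    # EXTERNAL advertises the host address to clients outside Docker
    KAFKA_ADVERTISED_LISTENERS: INTERNAL://broker-1:9092,EXTERNAL://${PUBLIC_IP}:19092
    KAFKA_INTER_BROKER_LISTENER_NAME: INTERNAL
```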
Configure Hive to Use the LDAP provided with the Stack.
http://tate.cx/configuring-hive-hiveserver2-to-use-ldap-or-ldaps-for-authentication/
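A sketch of the hive-site.xml properties for HiveServer2 LDAP authentication; the LDAP host name and the baseDN are assumptions based on an OpenLDAP service in the stack:

```xml
<property>
  <name>hive.server2.authentication</name>
  <value>LDAP</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>ldap://openldap:389</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.baseDN</name>
  <value>ou=users,dc=mdp,dc=local</value>
</property>
```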
We can use the labels section of the docker-compose.yml to add metadata. One use case is to specify the URLs of Web UIs or REST APIs as labels, so that documentation with hyperlinks to the service UIs and APIs can be generated.
The following metadata should be added:

For APIs:
com.mdps.service.restapi.url - the URL of the REST API provided by the service
com.mdps.service.restapi.name - the name of the service

For UIs:
com.mdps.service.webui.url - the URL of the Web UI provided by the service
com.mdps.service.webui.name - the name of the Web UI provided by the service

A service can also specify both, if it offers both a REST API and a Web UI.
Add the Streamsets Edge container to the stack, so that it can be used as well for testing.
If you want to use a region that supports only Signature V4 in Spark, you can pass the flag -Dcom.amazonaws.services.s3.enableV4 to the driver options and executor options at runtime. For example:

spark-submit --conf spark.driver.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
  --conf spark.executor.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
  ... (other spark options)

With these settings, Spark is able to write to Frankfurt (and other V4-only regions) even with a not-so-fresh AWS SDK version (com.amazonaws:aws-java-sdk:1.7.4 in my case).
Rename the service, as in the long term broker-N is not the best name for it; we also support MQTT brokers, among others.
Currently the number of Zookeeper nodes is hardcoded to 1. Add support for multiple Zookeeper nodes (3). See this docker-compose.yml for reference: https://github.com/confluentinc/cp-docker-images/blob/5.3.1-post/examples/kafka-cluster/docker-compose.yml
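Modeled on that Confluent example, each of the three nodes would get its own ZOOKEEPER_SERVER_ID and the full server list; a sketch of the first node (image tag and timing values follow the example, service names are assumptions):

```yaml
zookeeper-1:
  image: confluentinc/cp-zookeeper:5.3.1
  environment:
    ZOOKEEPER_SERVER_ID: 1
    ZOOKEEPER_CLIENT_PORT: 2181
    ZOOKEEPER_TICK_TIME: 2000
    ZOOKEEPER_INIT_LIMIT: 5
    ZOOKEEPER_SYNC_LIMIT: 2
    # all three ensemble members, peer and leader-election ports
    ZOOKEEPER_SERVERS: zookeeper-1:2888:3888;zookeeper-2:2888:3888;zookeeper-3:2888:3888
```

zookeeper-2 and zookeeper-3 would be identical apart from their ZOOKEEPER_SERVER_ID.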
When using Zeppelin Spark Interpreter to access Spark, I always get a "bucket does not exists" error.
Add the https://hub.docker.com/_/python container to the stack.
Currently the data-transfer folder is mapped into some of the services (where it makes sense). This should be configurable, so that it can be disabled.
Currently Spark does not work properly when the property spark.sql.catalogImplementation is set to hive.
The Spark interpreter throws an exception when, upon initializing the session, dependencies have to be downloaded due to use of the spark.jars.packages property. This happens when the dependencies for the AWS integration are added.
We should be able to specify the OS architecture when generating the docker-compose.yml, so that the correct containers are used for the given platform.
We should support x86-64 and ARM (for Raspberry Pis).
Currently spark.history.fs.logDirectory and spark.eventLog.dir in spark-defaults.conf are hardcoded to use HDFS. This way Spark is dependent on a full Hadoop stack.
It would be better if we could choose where the logs are stored.
The mdps_generate.sh script currently works for Linux and Mac. We need support for Windows in the same way, so that it can be downloaded and run from Windows.
At the moment the mdps_generate script assumes that the first parameter references the custom.yml using an absolute path.
For automated, scripted deployment scenarios it would be nice if a URL could be passed as well. That way the config.yml could be hosted anywhere on the internet (e.g. as a simple gist) and would be downloaded by the mdps_generate script or inside the docker container which is run by the script.
This would be helpful to set up a simple stack on Lightsail.
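A sketch of how the script could accept either a local path or a URL; the function name and the download target file are assumptions:

```shell
# Return a local path for the given config argument: download it first
# when it is an http(s) URL, otherwise pass the path through unchanged.
resolve_config() {
  case "$1" in
    http://*|https://*)
      curl -fsSL "$1" -o custom.yml && echo custom.yml
      ;;
    *)
      echo "$1"
      ;;
  esac
}
```

Usage inside the script would then be something like config=$(resolve_config "$1").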
When starting the container after upgrading to 0.8.2, the UI no longer shows when entering the URL in the browser.
Test again setting the Schema Registry URL in SDC to http://schema-registry:8089 instead of http://analyticsplatform:8089. This worked in the docker environment within a VM, but not on Docker for Mac.
There are some documentation pages in the Static Stack which should be moved to the doc folder.
Add Node-RED to the platform with support for both amd64 and arm architectures. Use this image: https://hub.docker.com/r/nodered/node-red
Add the InfluxDB 2.0 service.