
kreuzlaker's Introduction

kreuzLaker


Developing a central data platform that leverages many different sources of an organization's data assets is not a trivial task. A lot of work must be done to ensure high quality, availability and security of the data, but also the scalability of the platform over time.

kreuzLaker provides a blueprint for developers to help them build a secure, data lake-based analytics platform on AWS. While the optimal solution always depends on the specific use case and the data to be provided, kreuzLaker offers modular components that can greatly ease the start of a data-related project.

kreuzLaker is currently still in alpha. In this article, we want to show you which components are included in the open source start of the project.

Batch Architecture

[Figure: xw-batch architecture]

In the starting architecture of the kreuzLaker project, clients can use various services such as AWS DMS, Amazon AppFlow, or AWS Lambda to ingest raw data from different sources into the raw area of the data lake S3 bucket.

AWS Glue, a serverless data integration service, is used to clean this data and store it in the "converted" area of the data lake S3 bucket, usually in Parquet format. Next, we can apply business transformations to the data via Glue or dbt and load the results into a bucket which we call the "data sets" area of the data lake. dbt is the cornerstone of the "modern data stack" (we have already written about it) because it enables more people to contribute to the data platform without deep knowledge of the underlying technology, as long as they know SQL and, of course, the data itself. In this project, dbt is executed against the serverless query service Amazon Athena.

On top of these buckets, the Glue Data Catalog stores metadata about the actual data in all the buckets. It provides a uniform view of the data residing in the data lake and enables users and applications to discover data more efficiently. The catalog gets updated by Glue crawlers (for the raw data) and during the cleaning and transformation jobs. AWS Lake Formation is used to secure and govern the data lake, making data available only to the users who should be allowed to see it, down to row or cell level. With the help of the data catalog, customers can plug in various AWS tools such as Athena, SageMaker, or Redshift. The first version of kreuzLaker ships with Apache Superset, an open source visualization tool, which can be exchanged for Amazon QuickSight or other third-party tools.

kreuzLaker is implemented with the AWS Cloud Development Kit (CDK), an infrastructure-as-code (IaC) tool, so that reproducibility, testing, and easy further development are guaranteed. Python has been chosen to define the cloud infrastructure because it is one of the most common programming languages in the data domain.
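For illustration, a minimal sketch of such a stack in CDK Python could look like the following (the construct IDs and bucket areas are illustrative placeholders, not the actual names used in xw-batch):

```python
from aws_cdk import App, RemovalPolicy, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct


class DataLakeStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # One bucket per data lake area: raw, converted, data sets
        for area in ["raw", "converted", "data-sets"]:
            s3.Bucket(
                self,
                f"{area}-bucket",
                encryption=s3.BucketEncryption.S3_MANAGED,
                block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
                removal_policy=RemovalPolicy.RETAIN,
            )


app = App()
DataLakeStack(app, "xw-batch-dev")
app.synth()
```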

Data lake architectures on AWS tend to show similar patterns in terms of the components used and how they handle different data workflows. With modularity in mind, our team came up with the idea of bringing these patterns together in a single place so that they can be customized for each customer. The blueprints significantly reduce development time for a robust data lake because they can be easily deployed using the AWS CDK. Platform users such as Data Engineers, Data Scientists, and Data Analysts need to worry less about setting up these core components and can instead focus on their core activities.

See xw-batch/README.md for how to get started with kreuzLaker.


kreuzlaker's Issues

Add appropriate permissions to User groups to list required buckets

Currently, users added to the DataLakeDebugging group cannot list buckets, meaning they cannot see the buckets in the S3 console (even though they have access to them). It would be great if they could list these buckets (and have at least read permissions).

Moreover, users in the DataLakeAthenaUser group cannot list the query results bucket. However, users can see historic and saved query results in the Athena console. It might make sense to give them access to the bucket. We have to align on this.

Tasks:

  • give DataLakeDebugging group access to list all buckets
  • align whether DataLakeAthenaUser group should be able to list the query results bucket
  • if yes: allocate the correct permissions to this group
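A minimal sketch of how the listing/read permissions could be granted in CDK Python, assuming the group and bucket constructs are already defined elsewhere in the stack:

```python
from aws_cdk import aws_iam as iam
from aws_cdk import aws_s3 as s3


def grant_debugging_access(debugging_group: iam.Group, buckets: list[s3.Bucket]) -> None:
    """Give the DataLakeDebugging group enough permissions to see and read the data lake buckets."""
    # Listing all buckets is what makes them visible in the S3 console
    debugging_group.add_to_policy(
        iam.PolicyStatement(actions=["s3:ListAllMyBuckets"], resources=["*"])
    )
    for bucket in buckets:
        # grant_read covers listing the bucket and reading its objects
        bucket.grant_read(debugging_group)
```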

Add BI Tool

Possible candidates:

  • Metabase, Superset (open source, no caching)
  • QuickSight + SPICE
  • Tableau

Tasks:

  • Decide on one
  • Set up the service
  • Set up dashboards

Add a Notebook solution

For data science, adding a notebook solution is a nice way to prototype transformations or build one-off analyses.

Tasks:

  • Figure out how to build a notebook environment (likely some hints in https://aws.amazon.com/blogs/startups/a-data-lake-as-code-featuring-chembl-and-opentargets/)
  • Implement it in CDK
  • If AWS has a Jupyter service:
    • Add a group for this and add the example user to it
    • Set up the Jupyter service in CDK and give that group access to it

DoD:

The example user (or a different user, separate from the deployment user) can open a notebook and access the data lake via some Python or SQL code
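For orientation, a sketch of how the example user could query the data lake from a notebook via Athena and boto3; the database, table, and results location below are placeholders that depend on what the stack actually creates:

```python
import time

import boto3

athena = boto3.client("athena")

# Placeholder database/table names and results bucket
query = athena.start_query_execution(
    QueryString="SELECT * FROM scoofy_journeys LIMIT 10",
    QueryExecutionContext={"Database": "data_lake_converted"},
    ResultConfiguration={"OutputLocation": "s3://<query-results-bucket>/notebook/"},
)

# Wait for the query to finish, then fetch the first page of results
while True:
    status = athena.get_query_execution(QueryExecutionId=query["QueryExecutionId"])
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

rows = athena.get_query_results(QueryExecutionId=query["QueryExecutionId"])["ResultSet"]["Rows"]
```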

Move scoofy data generation into this repo

Currently, the scoofy data generation happens outside of this project (basically a Lambda + an S3 bucket) and needs a cross-account setup to copy data into the data lake. It makes sense to simply add the S3 data directly into the raw data lake bucket, similar to what the copy job currently does. This would also mirror what we expect to happen in the CDC-from-RDBMS case (#7).
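A rough sketch of what such an in-repo data generation Lambda could look like (the environment variable, key layout, and payload are assumptions, not the current implementation):

```python
import json
import os
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    """Generate a small batch of example records and write it into the raw data lake bucket."""
    now = datetime.now(timezone.utc)
    # Stub payload; the real generator would produce scoofy-like journey data
    records = [{"journey_id": i, "generated_at": now.isoformat()} for i in range(100)]

    s3.put_object(
        Bucket=os.environ["RAW_BUCKET_NAME"],
        Key=f"scoofy/journeys/{now:%Y/%m/%d}/batch-{now:%H%M%S}.json",
        Body="\n".join(json.dumps(record) for record in records).encode("utf-8"),
    )
```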

Tasks:

  • Remove the copy job
  • Add the data generation lambda to put data into the same place

DoD:

  • Data generation happens here

(see also #21 for adjusting the current setup in multiple ways)

Add static checks for dbt and oidc parts into the pipeline

The dbt part and the OIDC part are currently not tested/linted during CI/CD. We want that...

  • dbt: SQL formatting, maybe more?
  • OIDC: run tests, some TypeScript linter?

DoD:

Static linters for the dbt and OIDC parts are run as part of the GitHub pipeline

Integrate Lake Formation into the project

We want to integrate AWS Lake Formation into our current stack via CDK. As L2 constructs do not exist, we will have to write our own.

Create the following stack:

  • IAM user: lf-datalake-admin
  • appropriate IAM roles, assigned to this user
  • assign the IAM user as the data lake admin
  • give the IAM user permissions (in LF) to create databases/tables
  • register our three buckets to be managed by Lake Formation as the data lake
  • two analyst users:
    • one with visibility to PII data
    • one without access/visibility to PII data
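A rough sketch of the L1 (`Cfn*`) wiring in CDK Python, assuming the admin user and buckets are created elsewhere in the stack:

```python
from aws_cdk import aws_iam as iam
from aws_cdk import aws_lakeformation as lakeformation
from aws_cdk import aws_s3 as s3
from constructs import Construct


def register_data_lake(scope: Construct, admin_user: iam.User, buckets: list[s3.Bucket]) -> None:
    """Make the given user a Lake Formation admin and register the buckets as data lake locations."""
    lakeformation.CfnDataLakeSettings(
        scope,
        "DataLakeSettings",
        admins=[
            lakeformation.CfnDataLakeSettings.DataLakePrincipalProperty(
                data_lake_principal_identifier=admin_user.user_arn
            )
        ],
    )
    for i, bucket in enumerate(buckets):
        lakeformation.CfnResource(
            scope,
            f"LakeFormationResource{i}",
            resource_arn=bucket.bucket_arn,
            use_service_linked_role=True,
        )
```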

Create an ETL Construct

Create an ETL construct consisting of

  • Glue Crawler
  • Glue Jobs (with Tables in Data Catalog)
  • Glue Catalog

Both will create a new Job/Crawler whenever there are new transformations.

This depends on #30.
This is linked to #16 #20.
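A minimal sketch of such a construct in CDK Python, here only covering the crawler part (role, database name, schedule, and bucket are assumed inputs):

```python
from aws_cdk import aws_glue as glue
from constructs import Construct


class EtlConstruct(Construct):
    """Wraps the Glue resources needed to catalog and transform one data source."""

    def __init__(
        self,
        scope: Construct,
        construct_id: str,
        *,
        crawler_role_arn: str,
        database_name: str,
        raw_bucket_name: str,
    ) -> None:
        super().__init__(scope, construct_id)

        # Crawler that keeps the Data Catalog in sync with the raw area
        self.crawler = glue.CfnCrawler(
            self,
            "RawCrawler",
            role=crawler_role_arn,
            database_name=database_name,
            targets=glue.CfnCrawler.TargetsProperty(
                s3_targets=[glue.CfnCrawler.S3TargetProperty(path=f"s3://{raw_bucket_name}/")]
            ),
            schedule=glue.CfnCrawler.ScheduleProperty(schedule_expression="cron(0 3 * * ? *)"),
        )
```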

Create Data Lake Construct

Create a Data Lake construct consisting of

  • raw S3 bucket with the option to enable encryption
  • clean S3 bucket with the option to enable encryption
  • Database in Data Catalog
  • Bucket Policies

This is linked to #16 #20.
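A minimal sketch of such a construct in CDK Python (the names and the encryption toggle are illustrative, not the final interface):

```python
from aws_cdk import Stack
from aws_cdk import aws_glue as glue
from aws_cdk import aws_s3 as s3
from constructs import Construct


class DataLakeConstruct(Construct):
    """Raw and clean buckets plus a Glue database for the data lake."""

    def __init__(self, scope: Construct, construct_id: str, *, encrypted: bool = True) -> None:
        super().__init__(scope, construct_id)

        encryption = (
            s3.BucketEncryption.S3_MANAGED if encrypted else s3.BucketEncryption.UNENCRYPTED
        )
        self.raw_bucket = s3.Bucket(self, "RawBucket", encryption=encryption)
        self.clean_bucket = s3.Bucket(self, "CleanBucket", encryption=encryption)

        # Database in the Glue Data Catalog that the crawlers/jobs write their tables into
        self.database = glue.CfnDatabase(
            self,
            "DataLakeDatabase",
            catalog_id=Stack.of(self).account,
            database_input=glue.CfnDatabase.DatabaseInputProperty(name="data_lake"),
        )
```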

Refactor `xw_batch_stack.py`

Currently, xw_batch_stack.py is too big.

Goal:
Split the code in the xw_batch_stack.py into smaller constructs.

Add example for PII data handling

Current idea:

  • Wait for Lake Formation getting implemented (#8)
  • Add a second Glue job which pulls PII data out of the original data: `(id, hash, unhashed data)`
  • Add a second dbt job which transforms this into a reasonable dataset usable by, e.g., a CRM team or so... (NOT analytics!)
  • Put restrictions on who can access that data (e.g. a second Superset account as an example?)

DoD:

  • we have an example of PII usage

Integrate RDBMS-via-cdc data into the data lake

A very common pattern is probably a PG/MySQL database which holds the transactional data and which should be ingested into the data lake. For that we would like to have a PG DB which regularly gets some data changes (via a Lambda) and streams these changes into the data lake into a raw-raw S3 place (via DMS). From there we again want to transform these into an event table in Parquet. It would also be nice to have a way to get the latest info per table (e.g. a query which uses a primary key and gets the latest row for it).

DoD

  • We have a PG data source which delivers data via cdc into the data lake into the "converted" place

Change Scoofy data generation & S3 cost model

(This happens outside of the current codebase!)

  • delete all current data
  • decrease daily data generation for Scoofy data set
  • change the S3 cost model so that the data consumer pays for the requests (instead of the scoofy bucket owner)

DoD:

  • significantly less new data is generated per day
  • consumers (data requesters) pay for the GET requests

Refactor stack to delete all user generated data

Currently, custom Athena workgroups and the dbt artifact repo are not deleted during a cdk destroy when there is user data within that workgroup. Refactor the stack flag for the dev environment to delete the workgroup automatically on destroy. Rename the stack flag so that it better fits the purpose.

Tasks:

  • rename the current stack flag to force_delete_flag
  • use this flag to enable the recursive_delete_option for the workgroup for the dev stack (& disabled for the prod stack)
  • use this flag to delete the dbt artifact repo

Resources: CfnWorkGroup
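A sketch of how the flag could be passed through to the workgroup in CDK Python (the workgroup name and output location are placeholders):

```python
from aws_cdk import aws_athena as athena
from constructs import Construct


def create_dbt_workgroup(
    scope: Construct, results_bucket_name: str, force_delete_flag: bool
) -> athena.CfnWorkGroup:
    """Create the dbt Athena workgroup; force_delete_flag maps to RecursiveDeleteOption."""
    return athena.CfnWorkGroup(
        scope,
        "DbtAthenaWorkGroup",
        name="dbt",
        # True in dev so `cdk destroy` removes the workgroup even if it contains saved
        # queries or query history; False in prod to protect user data
        recursive_delete_option=force_delete_flag,
        work_group_configuration=athena.CfnWorkGroup.WorkGroupConfigurationProperty(
            result_configuration=athena.CfnWorkGroup.ResultConfigurationProperty(
                output_location=f"s3://{results_bucket_name}/"
            )
        ),
    )
```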

Add fake credit card data from scoofy example data into the glue job

We added PII data so we can show how it is supposed to be handled. It is present in the raw data, but we are not selecting it when doing the Parquet conversion.

Tasks:

  • Figure out a hash function for Glue (ask @jankatins, he has some examples ...)
  • Add it to the Glue job which transforms the scoofy data
  • Push the hash through in dbt?

DoD:

  • hashed data is available in at least one of the final datasets (e.g. latest credit card hash)
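One possible approach (an assumption, not a decided solution) is Spark's built-in `sha2` inside the Glue job; the column names below are guesses for the scoofy example data:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def hash_credit_card(raw_df: DataFrame, salt: str) -> DataFrame:
    """Replace the plain credit card number with a salted SHA-256 hash."""
    return (
        raw_df.withColumn(
            "credit_card_hash",
            F.sha2(F.concat(F.lit(salt), F.col("credit_card_number")), 256),
        )
        # never carry the raw number into the converted area
        .drop("credit_card_number")
    )
```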

Integrate cost data

There is a cost report which can be delivered as a data set in S3. Let's integrate that:

Blog post: https://www.wellarchitectedlabs.com/cost/200_labs/200_cloud_intelligence/cost-usage-report-dashboards/dashboards/2a_cost_intelligence_dashboard/

Deployment via CLI and cloud formation: https://github.com/aws-samples/aws-cudos-framework-deployment/
Terraform version: https://github.com/nuuday/terraform-aws-cur

Tableau report directly on top of the data: https://www.tableau.com/about/blog/2019/3/monitor-aws-cloud-cost-usage-102383
Rough idea: add this as a data source which is exposed in Athena and QuickSight + add the dashboard from the blog post.

Research GDPR/BDSG compliant deletion/anonymization of user records

The German Federal Data Protection Act (Bundesdatenschutzgesetz – BDSG) and the General Data Protection Regulation (GDPR) provide rules for the processing of user data. One example is §75 BDSG, which requires user data to be deleted if it is no longer necessary for the purpose of the tasks. Alternatively, there are some laws where anonymization of user data is sufficient, meaning that the information cannot be traced back to specific persons.

While it is comparatively easy to delete records from transactional databases, it turns out to be a bit more complicated in a data lake setup. We have to research the possible approaches, such as making use of table formats (Apache Iceberg, Apache Hudi, Delta Lake, or Lake Formation Governed Tables) that enable deletions/inserts/updates, or making use of S3 Lifecycle Policies.

Tasks:

  • Research the possible approaches to delete user records in a data lake setup
  • Discuss the findings with the team

Add great expectations as a data validation framework

https://greatexpectations.io/ is a framework to test expectations against data ("Column should never be NULL", "Values should be between x and y",...) which is great for checking that input data and also output data keep the quality you expect.

https://docs.greatexpectations.io/docs/deployment_patterns/how_to_use_great_expectations_in_aws_glue/
https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/database/athena/ 

Tasks

  • Add it into the dev stack for business transformations
  • Run it during the dbt (and Glue?) runs

DoD:

GE is run automatically during dbt runs
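As a starting point, a minimal sketch using Great Expectations' pandas-based API; the dataframe and column names are placeholders for whatever the converted data actually looks like:

```python
import great_expectations as ge
import pandas as pd


def validate_journeys(journeys_df: pd.DataFrame) -> bool:
    """Run a couple of example expectations against a converted dataframe."""
    ge_df = ge.from_pandas(journeys_df)
    ge_df.expect_column_values_to_not_be_null("journey_id")
    ge_df.expect_column_values_to_be_between("distance_km", min_value=0, max_value=500)
    return bool(ge_df.validate()["success"])
```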

Create a Business Transformation Construct

Create a Business Transformation construct consisting of:

  • Glue Job
  • dbt Fargate service
  • target S3 bucket

The user has the option to use either the Glue Job or the dbt Fargate service.

This depends on #30.
This is linked to #16 #20.

Refactor CI pipeline to use GitHub Actions

  • Replace the OIDC stack to use GitHub (make sure only main can access the sandbox!)
  • (move the GitLab OIDC stack into its own repo)
  • Port the GitLab pipeline to GitHub Actions

DoD:

  • Pipeline runs on GitHub Actions and can diff against our sandbox account

Add machine learning as a data product

It would be interesting to integrate a basic machine learning solution, e.g. a regression for the scoofy data (it is generated from a formula, so a linear regression on transformed data should work nicely).

Tasks

  • Pick a nice and easy, but still representative, ML use case to do a prediction
  • Implement it

DoD:

We have a machine learning use case covered
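A minimal sketch of such a use case with scikit-learn, assuming a prepared pandas dataframe with hypothetical feature and target columns pulled from the data sets area:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def fit_price_model(journeys_df: pd.DataFrame) -> float:
    """Fit a linear regression for the journey price and return R^2 on a held-out split."""
    features = journeys_df[["distance_km", "duration_min"]]
    target = journeys_df["price"]
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    return model.score(X_test, y_test)
```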

Cleanup source data

The example source is too big and generates costs in the source account.

Tasks:

  • Delete all current data to start fresh
  • Reduce the amount of data per day (still daily, but fewer lines) -> it should stay daily so that the Glue jobs run each day
  • Configure the source S3 bucket to let the puller pay for the data copy

DoD

  • The puller pays for data copy
  • The data is smaller

Configure logs to get expired

Currently, log retention is set to forever; this should be changed to something reasonable.

DoD:

Logs get deleted after a certain time (1 month?)
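In CDK Python this could look roughly like the following (the log group name is a placeholder; the concrete log groups differ per stack):

```python
from aws_cdk import aws_logs as logs
from constructs import Construct


def add_expiring_log_group(scope: Construct, name: str) -> logs.LogGroup:
    """Create a log group whose logs expire after one month instead of being kept forever."""
    return logs.LogGroup(
        scope,
        "LogGroup",
        log_group_name=name,
        retention=logs.RetentionDays.ONE_MONTH,
    )
```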
