
kreuzlaker's Introduction

kreuzLaker


Developing a central data platform that leverages many different sources of an organization's data assets is not a trivial task. A lot of work must be done to ensure high quality, availability and security of the data, but also the scalability of the platform over time.

kreuzLaker provides a blueprint for developers to help them build a secure, data lake-based analytics platform on AWS. While the optimal solution always depends on the specific use case and the data to be provided, kreuzLaker offers modular components that can greatly ease the start of a data-related project.

kreuzLaker is currently still in alpha. In this article, we want to show you which components are included in the open source start of the project.

Batch Architecture

[Figure: xw-batch architecture]

In the starting architecture of the kreuzLaker project, clients can use various services such as AWS DMS, Amazon AppFlow, or AWS Lambda to ingest raw data from different sources into the raw area of the data lake S3 bucket.

AWS Glue, a serverless data integration service, is used to clean this data and store it in the "converted" area of the data lake S3 bucket, usually in Parquet format. Next, we can apply business transformations to the data via Glue or dbt and load the results into a bucket which we call the "data sets" area of the data lake. dbt is the cornerstone of the "modern data stack" (we have already written about it) because it enables more people to contribute to the data platform without deep knowledge of the underlying technology, as long as they know SQL and, of course, the data itself. In this project, dbt is executed against the serverless query service Amazon Athena.

On top of these buckets, the Glue Data Catalog stores metadata about the actual data in all the buckets. It provides a uniform view of the data residing in the data lake and enables users and applications to discover data more efficiently. The catalog gets updated by Glue crawlers (for the raw data) and during the cleaning and transformation jobs. AWS Lake Formation is used to secure and govern the data lake, making data available only to the users who should be allowed to see it, down to row or cell level. With the help of the data catalog, customers can plug in various AWS tools such as Athena, SageMaker, or Redshift. The first version of kreuzLaker ships with Apache Superset, an open source visualization tool, which can be exchanged for Amazon QuickSight or other third-party tools.

kreuzLaker is implemented with the AWS Cloud Development Kit (CDK), an infrastructure-as-code (IaC) tool, so that reproducibility, testing, and easy further development are guaranteed. Python has been chosen to define the cloud infrastructure because it is one of the most common programming languages in the data domain.
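For illustration, a minimal sketch of such a stack in CDK Python could look like the following (the construct IDs and bucket areas are illustrative placeholders, not the actual names used in xw-batch):

```python
from aws_cdk import App, RemovalPolicy, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct


class DataLakeStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # One bucket per data lake area: raw, converted, data sets
        for area in ["raw", "converted", "data-sets"]:
            s3.Bucket(
                self,
                f"{area}-bucket",
                encryption=s3.BucketEncryption.S3_MANAGED,
                block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
                removal_policy=RemovalPolicy.RETAIN,
            )


app = App()
DataLakeStack(app, "xw-batch-dev")
app.synth()
```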

Data lake architectures on AWS tend to show similar patterns in terms of the components used and how they handle different data workflows. With modularity in mind, our team came up with the idea of bringing these patterns together in a single place so that they can be customized for each customer. The blueprints significantly reduce development time for a robust data lake because they can be easily deployed using the AWS CDK. Platform users such as Data Engineers, Data Scientists, and Data Analysts need to worry less about setting up these core components and can instead focus on their core activities.

See xw-batch/README.md for how to get started with kreuzLaker.


kreuzlaker's Issues

Add appropriate permissions to User groups to list required buckets

Currently, users added to the DataLakeDebugging group cannot list buckets, meaning they cannot see the buckets in the S3 console (even though they have access to them). It would be great if they could list these buckets (and have at least read permissions).

Moreover, users in the DataLakeAthenaUser group cannot list the query results bucket. However, users can see historic and saved query results in the Athena console. It might make sense to give them access to the bucket. We have to align on this.

Tasks:

  • give DataLakeDebugging group access to list all buckets
  • align whether DataLakeAthenaUser group should be able to list the query results bucket
  • if yes: allocate the correct permissions to this group
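A minimal sketch of how the listing/read permissions could be granted in CDK Python, assuming the group and bucket constructs are already defined elsewhere in the stack:

```python
from aws_cdk import aws_iam as iam
from aws_cdk import aws_s3 as s3


def grant_debugging_access(debugging_group: iam.Group, buckets: list[s3.Bucket]) -> None:
    """Give the DataLakeDebugging group enough permissions to see and read the data lake buckets."""
    # Listing all buckets is what makes them visible in the S3 console
    debugging_group.add_to_policy(
        iam.PolicyStatement(actions=["s3:ListAllMyBuckets"], resources=["*"])
    )
    for bucket in buckets:
        # grant_read covers listing the bucket and reading its objects
        bucket.grant_read(debugging_group)
```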

Add BI Tool

Possible candidates:

  • Metabase, Superset (open source, no caching)
  • QuickSight + SPICE
  • Tableau

Tasks:

  • Decide on one
  • Set up the service
  • Set up dashboards

Add a Notebook solution

For data science, adding a notebook solution is a nice way to prototype transformations or build one-off analyses.

Tasks:

  • Figure out how to build a notebook environment (likely some hints in https://aws.amazon.com/blogs/startups/a-data-lake-as-code-featuring-chembl-and-opentargets/)
  • Implement it in CDK
  • If AWS has a Jupyter service:
    • Add a group for this and add the example user to it
    • Set up the Jupyter service in CDK and give that group access to it

DoD:

The example user (or a different user, separate from the deployment user) can open a notebook and access the data lake via some Python or SQL code
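For orientation, a sketch of how the example user could query the data lake from a notebook via Athena and boto3; the database, table, and results location below are placeholders that depend on what the stack actually creates:

```python
import time

import boto3

athena = boto3.client("athena")

# Placeholder database/table names and results bucket
query = athena.start_query_execution(
    QueryString="SELECT * FROM scoofy_journeys LIMIT 10",
    QueryExecutionContext={"Database": "data_lake_converted"},
    ResultConfiguration={"OutputLocation": "s3://<query-results-bucket>/notebook/"},
)

# Wait for the query to finish, then fetch the first page of results
while True:
    status = athena.get_query_execution(QueryExecutionId=query["QueryExecutionId"])
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

rows = athena.get_query_results(QueryExecutionId=query["QueryExecutionId"])["ResultSet"]["Rows"]
```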

Move scoofy data generation into this repo

Currently, the scoofy data generation happens outside of this project (basically a Lambda + an S3 bucket) and needs a cross-account setup to copy data into the data lake. It makes sense to simply add the S3 data directly into the raw data lake bucket, similar to what the copy job currently does. This would also mirror what we expect to happen in the CDC-from-RDBMS case (#7).
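A rough sketch of what such an in-repo data generation Lambda could look like (the environment variable, key layout, and payload are assumptions, not the current implementation):

```python
import json
import os
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    """Generate a small batch of example records and write it into the raw data lake bucket."""
    now = datetime.now(timezone.utc)
    # Stub payload; the real generator would produce scoofy-like journey data
    records = [{"journey_id": i, "generated_at": now.isoformat()} for i in range(100)]

    s3.put_object(
        Bucket=os.environ["RAW_BUCKET_NAME"],
        Key=f"scoofy/journeys/{now:%Y/%m/%d}/batch-{now:%H%M%S}.json",
        Body="\n".join(json.dumps(record) for record in records).encode("utf-8"),
    )
```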

Tasks:

  • Remove the copy job
  • Add the data generation lambda to put data into the same place

DoD:

  • Data generation happens here

(see also #21 for adjusting the current setup in multiple ways)

Add static checks for dbt and oidc parts into the pipeline

The dbt part and the OIDC part are currently not tested/linted during CI/CD. We want that...

  • dbt: SQL formatting, maybe more?
  • OIDC: run tests, some TypeScript linter?

DoD:

Static linters for the dbt and OIDC parts are run as part of the GitHub pipeline

Integrate Lake Formation into the project

We want to integrate AWS Lake Formation into our current stack via CDK. As L2 constructs do not exist, we will have to write our own.

Create the following stack:

  • IAM user: lf-datalake-admin
  • appropriate IAM roles, assigned to this user
  • assign the IAM user as the data lake admin
  • give the IAM user permissions (in LF) to create databases/tables
  • register our three buckets to be managed by Lake Formation as the data lake
  • two analyst users:
    • one with visibility to PII data
    • one without access/visibility to PII data
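A rough sketch of the L1 (`Cfn*`) wiring in CDK Python, assuming the admin user and buckets are created elsewhere in the stack:

```python
from aws_cdk import aws_iam as iam
from aws_cdk import aws_lakeformation as lakeformation
from aws_cdk import aws_s3 as s3
from constructs import Construct


def register_data_lake(scope: Construct, admin_user: iam.User, buckets: list[s3.Bucket]) -> None:
    """Make the given user a Lake Formation admin and register the buckets as data lake locations."""
    lakeformation.CfnDataLakeSettings(
        scope,
        "DataLakeSettings",
        admins=[
            lakeformation.CfnDataLakeSettings.DataLakePrincipalProperty(
                data_lake_principal_identifier=admin_user.user_arn
            )
        ],
    )
    for i, bucket in enumerate(buckets):
        lakeformation.CfnResource(
            scope,
            f"LakeFormationResource{i}",
            resource_arn=bucket.bucket_arn,
            use_service_linked_role=True,
        )
```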

Create an ETL Construct

Create an ETL construct consisting of

  • Glue Crawler
  • Glue Jobs (with Tables in Data Catalog)
  • Glue Catalog

Both will create a new Job/Crawler whenever there are new transformations.

This depends on #30.
This is linked to #16 #20.
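A minimal sketch of such a construct in CDK Python, here only covering the crawler part (role, database name, schedule, and bucket are assumed inputs):

```python
from aws_cdk import aws_glue as glue
from constructs import Construct


class EtlConstruct(Construct):
    """Wraps the Glue resources needed to catalog and transform one data source."""

    def __init__(
        self,
        scope: Construct,
        construct_id: str,
        *,
        crawler_role_arn: str,
        database_name: str,
        raw_bucket_name: str,
    ) -> None:
        super().__init__(scope, construct_id)

        # Crawler that keeps the Data Catalog in sync with the raw area
        self.crawler = glue.CfnCrawler(
            self,
            "RawCrawler",
            role=crawler_role_arn,
            database_name=database_name,
            targets=glue.CfnCrawler.TargetsProperty(
                s3_targets=[glue.CfnCrawler.S3TargetProperty(path=f"s3://{raw_bucket_name}/")]
            ),
            schedule=glue.CfnCrawler.ScheduleProperty(schedule_expression="cron(0 3 * * ? *)"),
        )
```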

Create Data Lake Construct

Create a Data Lake construct consisting of

  • raw S3 bucket with the option to enable encryption
  • clean S3 bucket with the option to enable encryption
  • Database in Data Catalog
  • Bucket Policies

This is linked to #16 #20.
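A minimal sketch of such a construct in CDK Python (the names and the encryption toggle are illustrative, not the final interface):

```python
from aws_cdk import Stack
from aws_cdk import aws_glue as glue
from aws_cdk import aws_s3 as s3
from constructs import Construct


class DataLakeConstruct(Construct):
    """Raw and clean buckets plus a Glue database for the data lake."""

    def __init__(self, scope: Construct, construct_id: str, *, encrypted: bool = True) -> None:
        super().__init__(scope, construct_id)

        encryption = (
            s3.BucketEncryption.S3_MANAGED if encrypted else s3.BucketEncryption.UNENCRYPTED
        )
        self.raw_bucket = s3.Bucket(self, "RawBucket", encryption=encryption)
        self.clean_bucket = s3.Bucket(self, "CleanBucket", encryption=encryption)

        # Database in the Glue Data Catalog that the crawlers/jobs write their tables into
        self.database = glue.CfnDatabase(
            self,
            "DataLakeDatabase",
            catalog_id=Stack.of(self).account,
            database_input=glue.CfnDatabase.DatabaseInputProperty(name="data_lake"),
        )
```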

Refactor `xw_batch_stack.py`

Currently, xw_batch_stack.py is too big.

Goal:
Split the code in the xw_batch_stack.py into smaller constructs.

Add example for PII data handling

Current idea:

  • Wait for Lake Formation getting implemented (#8)
  • Add a second Glue job which pulls PII data out of the original data: `(id, hash, unhashed data)`
  • Add a second dbt job which transforms this into a reasonable dataset usable by, e.g., a CRM team or so... (NOT analytics!)
  • Put restrictions on who can access that data (e.g. a second Superset account as an example?)

DoD:

  • we have an example of PII usage

Integrate RDBMS-via-cdc data into the data lake

A very common pattern is probably a PG/MySQL database which holds the transactional data and which should be ingested into the data lake. For that we would like to have a PG DB which regularly gets some data changes (via a Lambda) and streams these changes into the data lake into a raw-raw S3 place (via DMS). From there we again want to transform these into an event table in Parquet. It would also be nice to have a way to get the latest info per table (e.g. a query which uses a primary key and gets the latest row for it).

DoD

  • We have a PG data source which delivers data via cdc into the data lake into the "converted" place

Change Scoofy data generation & S3 cost model

(This happens outside of the current codebase!)

  • delete all current data
  • decrease daily data generation for Scoofy data set
  • change the S3 cost model so that the data consumer pays for the requests (instead of the scoofy bucket owner)

DoD:

  • significantly less new data is generated per day
  • consumers (data requesters) pay for the GET requests

Refactor stack to delete all user generated data

Currently, custom Athena workgroups and the dbt artifact repo are not deleted during a cdk destroy when there is user data within that workgroup. Refactor the stack flag for the dev environment to delete the workgroup automatically on destroy. Rename the stack flag so that it better fits the purpose.

Tasks:

  • rename the current stack flag to force_delete_flag
  • use this flag to enable the recursive_delete_option for the workgroup for the dev stack (& disabled for the prod stack)
  • use this flag to delete the dbt artifact repo

Resources: CfnWorkGroup
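A sketch of how the flag could be passed through to the workgroup in CDK Python (the workgroup name and output location are placeholders):

```python
from aws_cdk import aws_athena as athena
from constructs import Construct


def create_dbt_workgroup(
    scope: Construct, results_bucket_name: str, force_delete_flag: bool
) -> athena.CfnWorkGroup:
    """Create the dbt Athena workgroup; force_delete_flag maps to RecursiveDeleteOption."""
    return athena.CfnWorkGroup(
        scope,
        "DbtAthenaWorkGroup",
        name="dbt",
        # True in dev so `cdk destroy` removes the workgroup even if it contains saved
        # queries or query history; False in prod to protect user data
        recursive_delete_option=force_delete_flag,
        work_group_configuration=athena.CfnWorkGroup.WorkGroupConfigurationProperty(
            result_configuration=athena.CfnWorkGroup.ResultConfigurationProperty(
                output_location=f"s3://{results_bucket_name}/"
            )
        ),
    )
```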

Add fake credit card data from scoofy example data into the glue job

We added PII data so we can show how it is supposed to be handled. It is present in the raw data, but we are not selecting it when doing the Parquet conversion.

Tasks:

  • Figure out a hash function for Glue (ask @jankatins, he has some examples ...)
  • Add it to the Glue job which transforms the scoofy data
  • Push the hash through in dbt?

DoD:

  • hashed data is available in at least one of the final datasets (e.g. latest credit card hash)
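One possible approach (an assumption, not a decided solution) is Spark's built-in `sha2` inside the Glue job; the column names below are guesses for the scoofy example data:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def hash_credit_card(raw_df: DataFrame, salt: str) -> DataFrame:
    """Replace the plain credit card number with a salted SHA-256 hash."""
    return (
        raw_df.withColumn(
            "credit_card_hash",
            F.sha2(F.concat(F.lit(salt), F.col("credit_card_number")), 256),
        )
        # never carry the raw number into the converted area
        .drop("credit_card_number")
    )
```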

Integrate cost data

There is a cost report which can be delivered as a data set in S3. Let's integrate that:

Blog post: https://www.wellarchitectedlabs.com/cost/200_labs/200_cloud_intelligence/cost-usage-report-dashboards/dashboards/2a_cost_intelligence_dashboard/

Deployment via CLI and cloud formation: https://github.com/aws-samples/aws-cudos-framework-deployment/
Terraform version: https://github.com/nuuday/terraform-aws-cur

Tableau report directly on top of the data: https://www.tableau.com/about/blog/2019/3/monitor-aws-cloud-cost-usage-102383
Rough idea: add this as a data source which is exposed in Athena and QuickSight + add the dashboard from the blog post.

Research GDPR/BDSG compliant deletion/anonymization of user records

The German Federal Data Protection Act (Bundesdatenschutzgesetz – BDSG) and the General Data Protection Regulation (GDPR) provide rules for the processing of user data. One example is §75 BDSG, which requires user data to be deleted if it is no longer necessary for the purpose of the tasks. Alternatively, there are some laws where anonymization of user data is sufficient, meaning that the information cannot be traced back to specific persons.

While it is comparatively easy to delete records from transactional databases, it turns out to be a bit more complicated in a data lake setup. We have to research the possible approaches, such as making use of table formats (Apache Iceberg, Apache Hudi, Delta Lake, or Lake Formation Governed Tables) that enable deletions/inserts/updates, or making use of S3 Lifecycle Policies.

Tasks:

  • Research the possible approaches to delete user records in a data lake setup
  • Discuss the findings with the team

Add great expectations as a data validation framework

https://greatexpectations.io/ is a framework to test expectations against data ("Column should never be NULL", "Values should be between x and y",...) which is great for checking that input data and also output data keep the quality you expect.

https://docs.greatexpectations.io/docs/deployment_patterns/how_to_use_great_expectations_in_aws_glue/
https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/database/athena/ 

Tasks

  • Add it into the dev stack for business transformations
  • Run it during the dbt (and Glue?) runs

DoD:

GE is run automatically during dbt runs
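As a starting point, a minimal sketch using Great Expectations' pandas-based API; the dataframe and column names are placeholders for whatever the converted data actually looks like:

```python
import great_expectations as ge
import pandas as pd


def validate_journeys(journeys_df: pd.DataFrame) -> bool:
    """Run a couple of example expectations against a converted dataframe."""
    ge_df = ge.from_pandas(journeys_df)
    ge_df.expect_column_values_to_not_be_null("journey_id")
    ge_df.expect_column_values_to_be_between("distance_km", min_value=0, max_value=500)
    return bool(ge_df.validate()["success"])
```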

Create a Business Transformation Construct

Create a Business Transformation construct consisting of:

  • Glue Job
  • dbt Fargate service
  • target S3 bucket

The user has the option to use either the Glue Job or the dbt Fargate service.

This depends on #30.
This is linked to #16 #20.

Refactor CI pipeline to use GitHub Actions

  • Replace the OIDC stack to use GitHub (make sure only main can access the sandbox!)
  • (move the GitLab OIDC stack into its own repo)
  • Port the GitLab pipeline to GitHub Actions

DoD:

  • Pipeline runs on GitHub Actions and can diff against our sandbox account

Add machine learning as a data product

It would be interesting to integrate a basic machine learning solution, e.g. a regression for the scoofy data (it is generated from a formula, so a linear regression on transformed data should work nicely).

Tasks

  • Pick a nice and easy, but still representative, ML use case to do a prediction
  • Implement it

DoD:

We have a machine learning use case covered
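A minimal sketch of such a use case with scikit-learn, assuming a prepared pandas dataframe with hypothetical feature and target columns pulled from the data sets area:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def fit_price_model(journeys_df: pd.DataFrame) -> float:
    """Fit a linear regression for the journey price and return R^2 on a held-out split."""
    features = journeys_df[["distance_km", "duration_min"]]
    target = journeys_df["price"]
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    return model.score(X_test, y_test)
```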

Cleanup source data

The example source is too big and generates costs in the source account.

Tasks:

  • Delete all current data to start fresh
  • Reduce the amount of data per day (still daily, but fewer lines) -> it should stay daily so that the Glue jobs run each day
  • Configure the source S3 bucket to let the puller pay for the data copy

DoD

  • The puller pays for data copy
  • The data is smaller

Configure logs to get expired

Currently, log retention is set to forever; this should be changed to something reasonable.

DoD:

Logs get deleted after a certain time (1 month?)
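In CDK Python this could look roughly like the following (the log group name is a placeholder; the concrete log groups differ per stack):

```python
from aws_cdk import aws_logs as logs
from constructs import Construct


def add_expiring_log_group(scope: Construct, name: str) -> logs.LogGroup:
    """Create a log group whose logs expire after one month instead of being kept forever."""
    return logs.LogGroup(
        scope,
        "LogGroup",
        log_group_name=name,
        retention=logs.RetentionDays.ONE_MONTH,
    )
```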
