
dbt-databricks-c360's Introduction

dbt on Databricks demo


This demo shows how Databricks can run dbt pipelines, integrated with Databricks Workflows.

It replicates the Delta Live Tables (DLT) pipeline from the lakehouse c360 Databricks demo, available via dbdemos.install('lakehouse-retail-c360').

Running dbt on Databricks

This demo is part of the dbdemos.ai dbt bundle.
Do not clone this repo directly.

Instead, to install the full demo with the workflow and repo, run the following in a Databricks notebook:

%pip install dbdemos
import dbdemos
dbdemos.install('dbt-on-databricks')

The best way to run a production-grade dbt pipeline on Databricks is as a Databricks Workflow dbt Task.

Here is an overview of the workflow created by dbdemos:


Task 02 is a dbt task that runs directly in Databricks Workflows.
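
To make the shape of such a workflow concrete, here is a minimal sketch of a job definition with a dbt task placed between two notebook tasks. It uses the `tasks`, `depends_on`, `notebook_task` and `dbt_task` fields of the Databricks Jobs API, but the job name, task keys, notebook paths and dbt commands are hypothetical placeholders, not the values dbdemos actually generates. It also collapses the dbt work into a single task and leaves out compute and Git source settings.

```python
import json


def build_job_payload(warehouse_id: str, project_directory: str) -> dict:
    """Build a Jobs API style payload: ingest notebook -> dbt task -> ML notebook."""
    return {
        "name": "dbt-c360-demo",  # hypothetical job name
        "tasks": [
            {
                "task_key": "01_ingest_autoloader",
                # Hypothetical notebook path
                "notebook_task": {"notebook_path": "/Repos/demo/01-ingest-autoloader/ingest"},
            },
            {
                "task_key": "02_dbt_run",
                "depends_on": [{"task_key": "01_ingest_autoloader"}],
                "dbt_task": {
                    "project_directory": project_directory,
                    "warehouse_id": warehouse_id,
                    # Assumed commands; the real workflow may run others
                    "commands": ["dbt deps", "dbt run"],
                },
            },
            {
                "task_key": "03_ml_predict_churn",
                "depends_on": [{"task_key": "02_dbt_run"}],
                # Hypothetical notebook path
                "notebook_task": {"notebook_path": "/Repos/demo/03-ml-predict-churn/predict"},
            },
        ],
    }


payload = build_job_payload("<warehouse-id>", "/Repos/demo/dbt-project")
print(json.dumps(payload, indent=2))
```

The `depends_on` entries are what make the dbt task wait for ingestion and the churn notebook wait for the dbt models.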

Running dbt + Databricks locally

pip install dbt-databricks
export DBT_DATABRICKS_HOST=xxxx.cloud.databricks.com
export DBT_DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/xxx
export DBT_DATABRICKS_TOKEN=dapixxxxx
dbt run
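
Because a local run depends on those three environment variables, a small pre-flight check can save a confusing connection error. The sketch below is an optional helper, not part of the demo; the function name `missing_dbt_env` is hypothetical, and it only verifies that the variables are set, not that their values are valid.

```python
import os

# The three variables exported in the commands above
REQUIRED_VARS = (
    "DBT_DATABRICKS_HOST",
    "DBT_DATABRICKS_HTTP_PATH",
    "DBT_DATABRICKS_TOKEN",
)


def missing_dbt_env(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def describe_env(env=None):
    """Summarize whether the environment is ready for a local dbt run."""
    missing = missing_dbt_env(env)
    if missing:
        return "Missing environment variables: " + ", ".join(missing)
    return "All required environment variables are set."
```

Calling `print(describe_env())` before `dbt run` reports which variables still need to be exported.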

Project structure

This demo is broken up into the following building blocks. Review the sub-folders in the sequence indicated below to understand the overall flow:

  • 01-ingest-autoloader

    • This contains the notebook to ingest raw data incrementally into our Lakehouse (not a dbt task)
    • The goal is to ingest the new data once it is uploaded into cloud storage, so our dbt pipeline can do the transformations
    • Note that while dbt has a feature called seed that can load files, it is currently limited to CSV files
  • dbt_project.yml

    • Every dbt project requires a dbt_project.yml file - this is how dbt knows a directory is a dbt project
    • It contains information such as which connection profile to use for Databricks SQL Warehouses and where SQL transformation files are stored
  • profiles.yml

    • This file stores profile configuration which dbt needs to connect to Databricks compute resources
    • Connection details such as the server hostname, HTTP path, catalog, db/schema information are configured here
  • models

    • A model in dbt refers to a single .sql file containing a modular data transformation block
    • In this demo, we have modularized our transformations into 4 files in accordance with the Medallion Architecture
    • Within each file, we can configure how the transformation will be materialized - either as a table or a view
  • tests

    • Tests are assertions you make about your dbt models
    • They are typically used for data quality and validation purposes
    • We also have the ability to quarantine and isolate records that fail a particular assertion
  • 03-ml-predict-churn

    • This contains the notebook to load our churn prediction ML model from MLflow after the dbt transformations are complete (not a dbt task)
    • The model is loaded as a SQL function, then applied to the dbt_c360_gold_churn_features table that will be materialized at the end of the second dbt task in our workflow
  • seeds

    • This is an optional folder used to store sample, ad hoc CSV files to be loaded into the Lakehouse. The seeds aren't used in the default setup (we ingest with the autoloader instead)
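
The quarantine idea mentioned under tests can be illustrated outside dbt. The sketch below is plain Python, not dbt's test API: it applies one assertion to a set of records and separates those that pass from those that should be isolated. The function name, column names and sample rows are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, object]


def split_by_assertion(
    rows: Iterable[Row], assertion: Callable[[Row], bool]
) -> Tuple[List[Row], List[Row]]:
    """Return (passing rows, quarantined rows) for a single assertion."""
    passed: List[Row] = []
    quarantined: List[Row] = []
    for row in rows:
        # Rows that fail the assertion are isolated instead of dropped
        (passed if assertion(row) else quarantined).append(row)
    return passed, quarantined


# Hypothetical sample records; the assertion mimics a not-null check
rows = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": None, "email": "b@example.com"},
]
valid, quarantine = split_by_assertion(rows, lambda r: r["user_id"] is not None)
print(len(valid), "valid;", len(quarantine), "quarantined")
```

Keeping the failing rows in their own collection mirrors the point of quarantining: bad records stay available for inspection without contaminating downstream models.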

Feedback


Got comments and feedback?
Feel free to reach out to [email protected] or quentin.ambard.databricks.com

Contributors

quentinambard, mchan-github-demo
