
sungchun12 / fst


fst: flow state tool | smooth where you want it, friction where you need it when data engineering

Home Page: https://www.loom.com/share/ecfbdfb981e4443d94d2c95f16176118

License: Apache License 2.0

Python 100.00%
dbt fast flowstate hot-reload workflow

fst's Introduction

fst: flow state tool

fst (flow state tool): A tool to help you stay in flow state while developing dbt models.

Let's make it the overwhelming norm that these questions are answered in seconds or less when engineering data (think: you don't need 10+ tabs and 40+ mouse clicks to do your job).

Questions to Answer
  • Who else is touching this precious file of mine? I’m tired of pull request clashing

  • What’s the historical performance of this model, and am I beating it?

  • How often does this fail in production?

  • Who uses this model and how often?

  • What dashboards will this help vs. hurt?

  • What does a data preview based on my updates look like? (e.g. 5 rows)

  • How many scheduled data pipelines are tied to this model?

  • How much does this cost to run in production and am I helping vs. hurting?

  • What are existing database permissions on this model?

  • Is anyone working on pull requests in real time that rely on my work?

  • What does a data diff against current production data look like?

Note: This tool is still in development. Please feel free to contribute to this project.

Description

This is a file watcher for a dbt project using duckdb as the database. It runs dbt build and a query preview on a SQL file when it detects a modification. It also generates a test file for the modified SQL file if tests are not detected.

It works with any SQL file within the models/ directory of the dbt project. You must run this tool from the root directory of the dbt project.
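
The change-detection step described above can be sketched with the standard library alone. This is a minimal polling stand-in, not fst's actual implementation (the issue notes further down indicate the project uses watchdog); the function names `snapshot_sql_files`, `modified_files`, `model_name`, and `watch` are hypothetical and exist only for illustration.

```python
import time
from pathlib import Path
from typing import Callable, Dict, List


def snapshot_sql_files(models_dir: str) -> Dict[str, float]:
    """Map every .sql file under models_dir to its last-modified time."""
    return {
        str(path): path.stat().st_mtime
        for path in Path(models_dir).rglob("*.sql")
    }


def modified_files(before: Dict[str, float], after: Dict[str, float]) -> List[str]:
    """Return files that are new or whose modification time changed."""
    return sorted(
        path for path, mtime in after.items() if before.get(path) != mtime
    )


def model_name(sql_path: str) -> str:
    """By default a dbt model is named after its SQL file, minus the extension."""
    return Path(sql_path).stem


def watch(models_dir: str, on_change: Callable[[str], None], interval: float = 1.0) -> None:
    """Poll models_dir forever and call on_change(model) for each modified file."""
    previous = snapshot_sql_files(models_dir)
    while True:
        time.sleep(interval)
        current = snapshot_sql_files(models_dir)
        for path in modified_files(previous, current):
            on_change(model_name(path))
        previous = current
```

Calling `watch("models", print)` from the dbt project root would print the model name each time a SQL file is saved. Polling is simpler but less efficient than the native file-system events a library like watchdog provides.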

You'll notice that, for the sake of an MVP, I am running nested git clones to get this working. I'll release to PyPI soon.

# my command to run this tool in an infinite loop in a split terminal
git clone https://github.com/sungchun12/fst.git
cd fst
git clone https://github.com/dbt-labs/jaffle_shop_duckdb.git
cd jaffle_shop_duckdb
python -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
pip install -r requirements.txt # installs the dbt dependencies
pip install -e ../ # installs the fst package locally
dbt build # Create the duckdb database file and get commands working
# open up your IDE or another terminal to start the fst workbench
fst start
# example of running this tool on each modification to any SQL file within the `models/` directory
# pro tip: open up the compiled query in a split IDE window for hot reloading as you develop
2023-04-12 11:30:15 - INFO - Running `dbt build` with the modified SQL file (/Users/sung/fst/jaffle_shop_duckdb/models/new_file.sql)...
2023-04-12 11:30:21 - INFO - `dbt build` was successful.
2023-04-12 11:30:21 - INFO - 18:30:20  Running with dbt=1.4.5
18:30:20  Found 7 models, 22 tests, 0 snapshots, 0 analyses, 297 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
18:30:20  
18:30:20  Concurrency: 24 threads (target='dev')
18:30:20  
18:30:20  1 of 1 START sql table model main.new_file ..................................... [RUN]
18:30:21  1 of 1 OK created sql table model main.new_file ................................ [OK in 0.26s]
18:30:21  
18:30:21  Finished running 1 table model in 0 hours 0 minutes and 0.53 seconds (0.53s).
18:30:21  
18:30:21  Completed successfully
18:30:21  
18:30:21  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

2023-04-12 11:30:21 - WARNING - Warning: No tests were run with the `dbt build` command. Consider adding tests to your project.
2023-04-12 11:30:21 - WARNING - Generated test YAML file: /Users/sung/fst/jaffle_shop_duckdb/models/new_file.yml
2023-04-12 11:30:21 - WARNING - Running `dbt test` with the generated test YAML file...
2023-04-12 11:30:28 - INFO - `dbt test` with generated tests was successful.
2023-04-12 11:30:28 - INFO - 18:30:26  Running with dbt=1.4.5
18:30:27  Found 7 models, 24 tests, 0 snapshots, 0 analyses, 297 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
18:30:27  
18:30:27  Concurrency: 24 threads (target='dev')
18:30:27  
18:30:27  1 of 2 START test not_null_new_file_customer_id ................................ [RUN]
18:30:27  2 of 2 START test unique_new_file_customer_id .................................. [RUN]
18:30:27  1 of 2 PASS not_null_new_file_customer_id ...................................... [PASS in 0.20s]
18:30:27  2 of 2 PASS unique_new_file_customer_id ........................................ [PASS in 0.21s]
18:30:27  
18:30:27  Finished running 2 tests in 0 hours 0 minutes and 0.44 seconds (0.44s).
18:30:27  
18:30:27  Completed successfully
18:30:27  
18:30:27  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

2023-04-12 11:30:28 - INFO - Executing compiled query from: /Users/sung/fst/jaffle_shop_duckdb/target/compiled/jaffle_shop/models/new_file.sql
2023-04-12 11:30:28 - INFO - Using DuckDB file: jaffle_shop.duckdb
2023-04-12 11:30:28 - INFO - `dbt build` time: 6.76 seconds
2023-04-12 11:30:28 - INFO - Query time: 0.00 seconds
2023-04-12 11:30:28 - INFO - Result Preview
+---------------+--------------+-------------+---------------+---------------------+--------------------+---------------------------+
|   customer_id | first_name   | last_name   | first_order   | most_recent_order   |   number_of_orders |   customer_lifetime_value |
+===============+==============+=============+===============+=====================+====================+===========================+
|             1 | Michael      | P.          | 2018-01-01    | 2018-02-10          |                  2 |                        33 |
+---------------+--------------+-------------+---------------+---------------------+--------------------+---------------------------+
|             2 | Shawn        | M.          | 2018-01-11    | 2018-01-11          |                  1 |                        23 |
+---------------+--------------+-------------+---------------+---------------------+--------------------+---------------------------+
|             3 | Kathleen     | P.          | 2018-01-02    | 2018-03-11          |                  3 |                        65 |
+---------------+--------------+-------------+---------------+---------------------+--------------------+---------------------------+
|             6 | Sarah        | R.          | 2018-02-19    | 2018-02-19          |                  1 |                         8 |
+---------------+--------------+-------------+---------------+---------------------+--------------------+---------------------------+
|             7 | Martin       | M.          | 2018-01-14    | 2018-01-14          |                  1 |                        26 |
+---------------+--------------+-------------+---------------+---------------------+--------------------+---------------------------+
2023-04-12 11:30:28 - INFO - fst metrics saved to the database: fst_metrics.duckdb

Note: Tested with Python 3.8.9 on macOS (Intel).

fst's People

Contributors

sungchun12, xtomflo


fst's Issues

Blueprint general design

To better organize the given script into modular components, separate files where needed, and make it fully performant and maintainable, I would recommend the following structure:

  1. main.py: This file will include the main entry point, command-line functionality using Click, and import functions from other modules.
  2. logger.py: This file will include the logging setup and related functions.
    • setup_logger()
  3. query_handler.py: This file will contain the class and related functions for handling file events and executing queries.
    • class QueryHandler(FileSystemEventHandler)
    • handle_query(query, file_path)
  4. file_utils.py: This file will include all file-related utility functions.
    • get_active_file(file_path: str)
    • find_compiled_sql_file(file_path)
    • get_model_name_from_file(file_path: str)
    • generate_test_yaml(model_name, column_names, active_file_path)
    • get_model_paths()
  5. db_utils.py: This file will include all database-related utility functions.
    • execute_query(query: str, db_file: str)
    • get_duckdb_file_path()
    • get_project_name()
  6. directory_watcher.py: This file will include functions related to watching the directory for changes.
    • watch_directory(directory: str, callback, active_file_path: str)
# main.py
import click
import os

from logger import setup_logger
from query_handler import handle_query
from directory_watcher import watch_directory
from file_utils import get_active_file, get_model_paths


@click.group()
def main():
    """Entry point that groups the fst commands."""


@main.command()
def start():
    """Placeholder for `fst start`; the real body is elided here."""
    # ... (the remaining Click commands as you provided)


if __name__ == "__main__":
    main()

This modular approach will make the code more maintainable, as each module focuses on a specific functionality. It also makes individual components easier to test and profile, which helps when optimizing performance.
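
As one illustration of the file_utils.py helpers listed above, here is a minimal sketch of what generate_test_yaml could look like. The signature follows the blueprint, but the body is an assumption: it adds not_null and unique tests on the first column only, which matches the tests shown in the README log output (not_null_new_file_customer_id, unique_new_file_customer_id) but may not match the real implementation. Returning the path and text instead of writing the file is also a choice made for this sketch.

```python
from pathlib import Path
from typing import List, Tuple


def generate_test_yaml(
    model_name: str, column_names: List[str], active_file_path: str
) -> Tuple[str, str]:
    """Return (yaml_path, yaml_text) for a starter schema file next to the model.

    Assumption: only the first column gets not_null and unique tests.
    """
    if not column_names:
        raise ValueError("at least one column name is required")
    first_column = column_names[0]
    yaml_text = (
        "version: 2\n"
        "\n"
        "models:\n"
        f"  - name: {model_name}\n"
        "    columns:\n"
        f"      - name: {first_column}\n"
        "        tests:\n"
        "          - not_null\n"
        "          - unique\n"
    )
    # Place the YAML beside the SQL file, e.g. models/new_file.sql -> models/new_file.yml
    yaml_path = str(Path(active_file_path).with_suffix(".yml"))
    return yaml_path, yaml_text
```

Keeping the function free of file I/O makes it easy to unit test; the caller can write `yaml_text` to `yaml_path` once it has confirmed that no tests exist yet.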

Clearly scope problems to solve with v1

The main goal is to evoke the playful and productive look and feel of hot reloading. When a frontend engineer updates colors or buttons in their JavaScript, their website instantly reflects those changes. It builds momentum and gives directional correctness. No extra commands to run. Just staying in flow.

When a user opens up a dbt project, they should go through simple steps:

  1. Run fst start
  2. Any SQL model file they touch is detected for modifications, and fst runs the steps below with extra signals that the program is doing exactly as expected:
    • dbt build -s <model name>
    • duckdb preview query table
Considerations:

  • Set the preview row limit, e.g. 5, 10, or 100 rows; any integer value
  • If no tests exist, ask the user whether to generate tests and rerun with them. Maybe make this a config flag that can be disabled upfront; enabled by default
  • Save results to a markdown file in a flow_state/ subfolder so it can constantly reload results, e.g. flow_state/new_file.md, flow_state/customers.md
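
A minimal sketch of the first and last considerations, assuming the preview arrives as a list of column headers plus row tuples. The helper names (`validate_preview_limit`, `to_markdown_table`, `save_preview`) are hypothetical; only the flow_state/<model>.md layout comes from the notes above.

```python
from pathlib import Path
from typing import Sequence


def validate_preview_limit(value) -> int:
    """Accept only positive whole numbers such as 5, "10", or 100."""
    limit = int(str(value))  # raises ValueError for input like "abc" or 5.7
    if limit < 1:
        raise ValueError("preview limit must be a positive integer")
    return limit


def to_markdown_table(headers: Sequence[str], rows: Sequence[Sequence]) -> str:
    """Render headers and rows as a pipe-delimited markdown table."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join(lines) + "\n"


def save_preview(model_name: str, headers: Sequence[str], rows: Sequence[Sequence], root: str = ".") -> Path:
    """Overwrite flow_state/<model_name>.md with the latest preview."""
    out_dir = Path(root) / "flow_state"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{model_name}.md"
    out_path.write_text(to_markdown_table(headers, rows), encoding="utf-8")
    return out_path
```

Overwriting the same file on every run means an editor with a markdown preview open on flow_state/customers.md would refresh as the model changes.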

Write passing tests

This needs to be robust if strangers are going to trust this package. I'll need to focus on duckdb database locking issues that happen occasionally.
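
One common way to tolerate occasional lock contention is to retry with exponential backoff. The sketch below is generic and does not import DuckDB: `retry_on_lock` is a hypothetical helper, and the exception type that DuckDB raises for a locked database file must be supplied by the caller (the `OSError` default here is only a placeholder assumption).

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_on_lock(
    action: Callable[[], T],
    lock_errors: Tuple[Type[BaseException], ...] = (OSError,),  # placeholder assumption
    attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action(), retrying with exponential backoff while it raises a lock error."""
    for attempt in range(attempts):
        try:
            return action()
        except lock_errors:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

Injecting `sleep` keeps tests fast: a test can pass a function that records delays instead of actually waiting.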

Check out https://facebook.github.io/watchman/ and see if that's better than watchdog

Watchman and Watchdog are both popular tools for detecting file modifications in a directory, but they differ in performance, platform support, and implementation.

Watchman:
  • Developed and maintained by Facebook.
  • Written in C++, providing better performance in large projects.
  • Uses native operating system APIs for file system monitoring, making it highly efficient.
  • Supports Linux, macOS, Windows, and FreeBSD.
  • Requires installation of a separate service (the Watchman daemon) that runs in the background.
  • Comes with a command-line interface and a JSON-based IPC protocol for integration with other tools.
  • Used extensively in large projects and by large organizations, which suggests it is reliable and stable.

Watchdog:
  • A Python library, so it may not perform as well as Watchman, especially on large projects.
  • Cross-platform, supporting Linux, macOS, Windows, and BSD variants.
  • Uses native operating system APIs for file system monitoring where available (inotify on Linux, FSEvents on macOS, kqueue on BSD, ReadDirectoryChangesW on Windows), and falls back to polling when native APIs are not available.
  • Easier to integrate into Python applications, as it's a Python library.
  • No need to install or manage a separate service.
  • Might be less efficient and slower than Watchman when monitoring large directory trees.

In summary, Watchman tends to be more performant, especially for large projects, due to its lower-level implementation in C++. Watchdog is easier to integrate into Python applications and doesn't require a separate service to be installed and managed. Both are cross-platform and use native operating system APIs for file system monitoring.

get a data diff between dev and prod

Research

https://medium.com/snowflake/minimalist-snowflake-table-compare-using-data-diff-ba67cc4f904c

  • Look at audit_helper, because data-diff has a lot of rough edges
  • Use dbt 1.5+ to run invocations
  • Update the compare-to-prod chart so the y axis can show row counts, bytes, and other stats
  • Have audit_helper do hot reloading in the logs and store invocations automatically instead of a dropdown UI: https://hub.getdbt.com/dbt-labs/audit_helper/latest/
  • I'll probably have only one of these functions for the sake of the MVP
  • Compare dev to prod only. That's where it's most useful: compare_relation_columns
  • Set defaults for file generation; look at profiles.yml for schema info
  • Store historical audit reports for each iteration
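
To make the idea of a dev-versus-prod data diff concrete, here is a small in-memory sketch. It is not how audit_helper or data-diff work internally; `diff_rows` is a hypothetical helper that assumes both result sets fit in memory and share a primary key column, and the sample rows are illustrative only.

```python
from typing import Dict, List, Sequence


def diff_rows(
    dev_rows: Sequence[Sequence],
    prod_rows: Sequence[Sequence],
    key_index: int = 0,
) -> Dict[str, List]:
    """Compare two row sets by primary key and report which keys differ."""
    dev = {row[key_index]: tuple(row) for row in dev_rows}
    prod = {row[key_index]: tuple(row) for row in prod_rows}
    return {
        # Keys present only in dev would be added to prod by this change.
        "added": sorted(key for key in dev if key not in prod),
        # Keys present only in prod would disappear.
        "removed": sorted(key for key in prod if key not in dev),
        # Keys in both whose other column values differ.
        "changed": sorted(key for key in dev if key in prod and dev[key] != prod[key]),
    }


# Illustrative rows: (customer_id, first_name, customer_lifetime_value)
dev = [(1, "Michael", 33), (2, "Shawn", 25), (4, "Jimmy", 10)]
prod = [(1, "Michael", 33), (2, "Shawn", 23), (3, "Kathleen", 65)]
report = diff_rows(dev, prod)
```

Here `report` lists key 4 as added, key 3 as removed, and key 2 as changed. For production-sized tables the comparison would normally be pushed down into SQL rather than done in Python.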
