sungchun12 / fst

fst: flow state tool | smooth where you want it, friction where you need it when data engineering

Home Page: https://www.loom.com/share/ecfbdfb981e4443d94d2c95f16176118

License: Apache License 2.0

Python 100.00%

dbt fast flowstate hot-reload workflow

fst's Introduction

fst (flow state tool): A tool to help you stay in flow state while developing dbt models.

Let's make it the overwhelming norm that these questions are answered in seconds or less when engineering data (think: you don't need 10+ tabs and 40+ mouse clicks to do your job).

Questions to Answer

Who else is touching this precious file of mine? I’m tired of pull request clashes.
What’s the historical performance of this model, and am I beating it?
How often does this fail in production?
Who uses this model and how often?
What dashboards will this help vs. hurt?
What does a data preview based on my updates look like? (e.g. 5 rows)
How many scheduled data pipelines are tied to this model?
How much does this cost to run in production and am I helping vs. hurting?
What are the existing database permissions on this model?
Is anyone working on pull requests in real time that rely on my work?
What does a data diff against current production data look like?

Note: This tool is still in development. Please feel free to contribute to this project.

Description

This is a file watcher for a dbt project using duckdb as the database. It runs dbt build and a query preview on a SQL file when it detects a modification. It also generates a test file for the modified SQL file if tests are not detected.

It works with any SQL file within the models/ directory of the dbt project. You must run this tool from the root directory of the dbt project.
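As a rough illustration of the rule above (only SQL files under models/ trigger a build), here is a minimal sketch of how a changed path could be mapped to a dbt command. The helper names `model_name_for` and `build_command` are hypothetical and are not taken from the fst codebase; the sketch also assumes that the model name equals the file name without its extension, which is dbt's default.

```python
from pathlib import Path
from typing import List, Optional


def model_name_for(changed_path: str, project_root: str) -> Optional[str]:
    """Return the dbt model name for a changed file, or None if it is not a model.

    Hypothetical helper: a file counts as a model only if it lives under
    <project_root>/models/ and has a .sql extension.
    """
    models_dir = Path(project_root).resolve() / "models"
    path = Path(changed_path).resolve()
    try:
        relative = path.relative_to(models_dir)
    except ValueError:
        return None  # the file is outside models/
    if relative.suffix != ".sql":
        return None  # e.g. a .yml file next to the model
    return relative.stem


def build_command(model_name: str) -> List[str]:
    """Build the argument list for `dbt build` scoped to a single model."""
    return ["dbt", "build", "--select", model_name]


if __name__ == "__main__":
    name = model_name_for("/project/models/staging/stg_orders.sql", "/project")
    if name is not None:
        print(" ".join(build_command(name)))
```

Returning an argument list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting.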

For the sake of the MVP, I am running nested git clones to get this working. I'll release to PyPI soon.

# my command to run this tool in an infinite loop in a split terminal
git clone https://github.com/sungchun12/fst.git
cd fst
git clone https://github.com/dbt-labs/jaffle_shop_duckdb.git
cd jaffle_shop_duckdb
python -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
pip install -r requirements.txt # installs the dbt dependencies
pip install -e ../ # installs the fst package locally
dbt build # Create the duckdb database file and get commands working

# open up your IDE or another terminal to start the fst workbench
fst start

# example of running this tool on each modification to any SQL file within the `models/` directory
# pro tip: open up the compiled query in a split IDE window for hot reloading as you develop
2023-04-12 11:30:15 - INFO - Running `dbt build` with the modified SQL file (/Users/sung/fst/jaffle_shop_duckdb/models/new_file.sql)...
2023-04-12 11:30:21 - INFO - `dbt build` was successful.
2023-04-12 11:30:21 - INFO - 18:30:20  Running with dbt=1.4.5
18:30:20  Found 7 models, 22 tests, 0 snapshots, 0 analyses, 297 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
18:30:20  
18:30:20  Concurrency: 24 threads (target='dev')
18:30:20  
18:30:20  1 of 1 START sql table model main.new_file ..................................... [RUN]
18:30:21  1 of 1 OK created sql table model main.new_file ................................ [OK in 0.26s]
18:30:21  
18:30:21  Finished running 1 table model in 0 hours 0 minutes and 0.53 seconds (0.53s).
18:30:21  
18:30:21  Completed successfully
18:30:21  
18:30:21  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

2023-04-12 11:30:21 - WARNING - Warning: No tests were run with the `dbt build` command. Consider adding tests to your project.
2023-04-12 11:30:21 - WARNING - Generated test YAML file: /Users/sung/fst/jaffle_shop_duckdb/models/new_file.yml
2023-04-12 11:30:21 - WARNING - Running `dbt test` with the generated test YAML file...
2023-04-12 11:30:28 - INFO - `dbt test` with generated tests was successful.
2023-04-12 11:30:28 - INFO - 18:30:26  Running with dbt=1.4.5
18:30:27  Found 7 models, 24 tests, 0 snapshots, 0 analyses, 297 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
18:30:27  
18:30:27  Concurrency: 24 threads (target='dev')
18:30:27  
18:30:27  1 of 2 START test not_null_new_file_customer_id ................................ [RUN]
18:30:27  2 of 2 START test unique_new_file_customer_id .................................. [RUN]
18:30:27  1 of 2 PASS not_null_new_file_customer_id ...................................... [PASS in 0.20s]
18:30:27  2 of 2 PASS unique_new_file_customer_id ........................................ [PASS in 0.21s]
18:30:27  
18:30:27  Finished running 2 tests in 0 hours 0 minutes and 0.44 seconds (0.44s).
18:30:27  
18:30:27  Completed successfully
18:30:27  
18:30:27  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

2023-04-12 11:30:28 - INFO - Executing compiled query from: /Users/sung/fst/jaffle_shop_duckdb/target/compiled/jaffle_shop/models/new_file.sql
2023-04-12 11:30:28 - INFO - Using DuckDB file: jaffle_shop.duckdb
2023-04-12 11:30:28 - INFO - `dbt build` time: 6.76 seconds
2023-04-12 11:30:28 - INFO - Query time: 0.00 seconds
2023-04-12 11:30:28 - INFO - Result Preview
+---------------+--------------+-------------+---------------+---------------------+--------------------+---------------------------+
|   customer_id | first_name   | last_name   | first_order   | most_recent_order   |   number_of_orders |   customer_lifetime_value |
+===============+==============+=============+===============+=====================+====================+===========================+
|             1 | Michael      | P.          | 2018-01-01    | 2018-02-10          |                  2 |                        33 |
+---------------+--------------+-------------+---------------+---------------------+--------------------+---------------------------+
|             2 | Shawn        | M.          | 2018-01-11    | 2018-01-11          |                  1 |                        23 |
+---------------+--------------+-------------+---------------+---------------------+--------------------+---------------------------+
|             3 | Kathleen     | P.          | 2018-01-02    | 2018-03-11          |                  3 |                        65 |
+---------------+--------------+-------------+---------------+---------------------+--------------------+---------------------------+
|             6 | Sarah        | R.          | 2018-02-19    | 2018-02-19          |                  1 |                         8 |
+---------------+--------------+-------------+---------------+---------------------+--------------------+---------------------------+
|             7 | Martin       | M.          | 2018-01-14    | 2018-01-14          |                  1 |                        26 |
+---------------+--------------+-------------+---------------+---------------------+--------------------+---------------------------+
2023-04-12 11:30:28 - INFO - fst metrics saved to the database: fst_metrics.duckdb

Note: Tested with Python 3.8.9 on macOS (Intel).

fst's Issues

Set the preview row limit, e.g. 5, 10, or 100 rows; any integer value

Blueprint general design

To better organize the given script into modular components, separate files where needed, and make it fully performant and maintainable, I would recommend the following structure:

main.py: This file will include the main entry point, command-line functionality using Click, and import functions from other modules.
logger.py: This file will include the logging setup and related functions.
- setup_logger()
query_handler.py: This file will contain the class and related functions for handling file events and executing queries.
- class QueryHandler(FileSystemEventHandler)
- handle_query(query, file_path)
file_utils.py: This file will include all file-related utility functions.
- get_active_file(file_path: str)
- find_compiled_sql_file(file_path)
- get_model_name_from_file(file_path: str)
- generate_test_yaml(model_name, column_names, active_file_path)
- get_model_paths()
db_utils.py: This file will include all database-related utility functions.
- execute_query(query: str, db_file: str)
- get_duckdb_file_path()
- get_project_name()
directory_watcher.py: This file will include functions related to watching the directory for changes.
- watch_directory(directory: str, callback, active_file_path: str)

# main.py
import click
import os
from logger import setup_logger
from query_handler import handle_query
from directory_watcher import watch_directory
from file_utils import get_active_file, get_model_paths

@click.group()
def main():
    pass

@main.command()
def start():
    """Start the fst workbench (invoked as `fst start`)."""
    # ... (the remaining Click commands as you provided)

if __name__ == "__main__":
    main()

This modular approach makes the code more maintainable, since each module focuses on one responsibility. It also makes individual components easier to test and profile in isolation.

Clearly scope problems to solve with v1

The main goal is to evoke the playful and productive look and feel of hot reloading. When a frontend engineer updates colors or buttons in their JavaScript, the website instantly reflects those changes. That builds momentum and gives directional correctness. No extra commands to run. Just staying in flow.

When a user opens up a dbt project, they should go through a few simple steps:

Run fst start
Any SQL model file they modify should be detected and trigger the steps below, with extra signals that the program is doing exactly what is expected:
- dbt build -s <model name>
- duckdb preview query table

Considerations:

Set the preview row limit, e.g. 5, 10, or 100 rows; any integer value
If no tests exist, prompt the user to generate tests and rerun with them. Maybe make this a config flag that can be disabled upfront; enabled by default
Save results to a markdown file in a flow_state/ subfolder so it can constantly reload results. Example: flow_state/new_file.md, flow_state/customers.md
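The preview limit consideration could be sketched as follows. This is only an illustration under stated assumptions: `preview_query` and the `fst_preview` alias are hypothetical names, and wrapping the compiled SQL in a subquery is one possible approach, not necessarily how fst builds its preview.

```python
def preview_query(compiled_sql: str, limit: int = 5) -> str:
    """Wrap compiled model SQL so that at most `limit` rows are returned.

    Hypothetical helper: the limit must be a positive integer, matching the
    "any value that's an integer" requirement in the issue.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
        raise ValueError(f"preview limit must be a positive integer, got {limit!r}")
    # Drop surrounding whitespace and a trailing semicolon so the SQL nests cleanly.
    inner = compiled_sql.strip().rstrip(";")
    return f"select * from (\n{inner}\n) as fst_preview limit {limit}"


if __name__ == "__main__":
    print(preview_query("select customer_id, first_name from customers;", limit=10))
```

Validating the limit before formatting it into the SQL string also keeps arbitrary text from being interpolated into the query.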

Use the native Python API to run dbt programmatically instead of subprocess

Research

https://docs.getdbt.com/reference/programmatic-invocations
Potential to have the workbench point out missing descriptions, tests, stylistic things, etc. by parsing the manifest
manifest is the only one supported for now
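A minimal sketch of both ideas in this issue, under stated assumptions. `run_build` uses the `dbtRunner` entry point that dbt-core 1.5+ documents on the programmatic invocations page linked above; the import is deferred so the rest of the module works without dbt installed. `models_missing_docs_or_tests` is a hypothetical helper that reads the `nodes` section of `target/manifest.json` and reports models with an empty description or no attached test.

```python
import json
from pathlib import Path
from typing import Dict, List


def run_build(model_name: str):
    """Run `dbt build` for one model in-process (requires dbt-core 1.5+)."""
    from dbt.cli.main import dbtRunner  # deferred: only needed when actually building

    return dbtRunner().invoke(["build", "--select", model_name])


def load_manifest(path: str = "target/manifest.json") -> dict:
    """Load the manifest artifact that dbt writes into the target/ directory."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def models_missing_docs_or_tests(manifest: dict) -> Dict[str, List[str]]:
    """Map each model name to the list of gaps found for it (hypothetical report shape)."""
    nodes = manifest.get("nodes", {})

    # Collect the unique_ids of every node that at least one test depends on.
    tested = set()
    for node in nodes.values():
        if node.get("resource_type") == "test":
            tested.update(node.get("depends_on", {}).get("nodes", []))

    report: Dict[str, List[str]] = {}
    for unique_id, node in nodes.items():
        if node.get("resource_type") != "model":
            continue
        problems = []
        if not (node.get("description") or "").strip():
            problems.append("missing description")
        if unique_id not in tested:
            problems.append("no tests")
        if problems:
            report[node["name"]] = problems
    return report
```

Only models with at least one gap appear in the report, so an empty dictionary means every model is documented and tested.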

Write passing tests

This needs to be robust if strangers are going to trust this package. I'll need to focus on duckdb database locking issues that happen occasionally.
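One common way to tolerate an occasional lock error is to retry with exponential backoff. The sketch below is generic and makes no claim about how fst handles this: `retry` is a hypothetical helper, and the exception type to retry on is left to the caller because the exact exception DuckDB raises on a lock conflict is not stated in this document.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    action: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action`, retrying on the given exceptions with exponential backoff.

    The delay doubles after each failed attempt. The last failure is re-raised
    unchanged so callers still see the original error.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

Passing `sleep` as a parameter makes the helper easy to test without real waiting, which fits the goal of writing passing tests for the package.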

Check out https://facebook.github.io/watchman/ and see if that's better than watchdog

Watchman and Watchdog are both popular tools for detecting file modifications in a directory, but they differ in performance, platform support, and implementation.
Watchman:
- Developed and maintained by Facebook.
- Written in C++, providing better performance in large projects.
- Uses native operating system APIs for file system monitoring, making it highly efficient.
- Supports Linux, macOS, Windows, and FreeBSD.
- Requires installation of a separate service (the Watchman daemon) that runs in the background.
- Comes with a command-line interface and JSON-based IPC protocol for integration with other tools.
- Extensively used in large projects and by large organizations, indicating its reliability and stability.
Watchdog:
- A Python library; it may be slower than Watchman when monitoring large directory trees.
- Cross-platform, supporting Linux, macOS, and Windows.
- Uses native operating system APIs for file system monitoring where available (inotify on Linux, FSEvents on macOS, ReadDirectoryChangesW on Windows), and falls back to a polling-based approach when native APIs are not available.
- Easier to integrate into Python applications, as it's a Python library.
- No need to install or manage a separate service.
In summary, Watchman tends to be more performant, especially on large projects, and supports more platforms (including FreeBSD). Watchdog is easier to integrate into Python applications and doesn't require installing and managing a separate service.
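To make the polling fallback mentioned above concrete, here is a standard-library sketch of the idea: take a snapshot of modification times, take another later, and compare. It is not how Watchdog or Watchman are implemented; `snapshot` and `changed_files` are hypothetical names used only for illustration.

```python
from pathlib import Path
from typing import Dict, Set


def snapshot(directory: str, suffix: str = ".sql") -> Dict[str, int]:
    """Record the modification time (in nanoseconds) of every matching file."""
    return {
        str(path): path.stat().st_mtime_ns
        for path in Path(directory).rglob(f"*{suffix}")
        if path.is_file()
    }


def changed_files(before: Dict[str, int], after: Dict[str, int]) -> Set[str]:
    """Return files that are new or whose modification time differs.

    Deleted files are not reported, since there is nothing left to rebuild.
    """
    return {path for path, mtime in after.items() if before.get(path) != mtime}
```

A polling loop would call `snapshot` on an interval and pass consecutive results to `changed_files`. The cost grows with the number of files scanned on each pass, which is the main reason native notification APIs scale better on large directory trees.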

save results to a markdown file in a flow_state/ subfolder so it can constantly reload results. example: flow_state/new_file.md, flow_state/customers.md
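A minimal sketch of what writing a preview to flow_state/<model>.md could look like. The function names and the exact markdown layout are assumptions made for illustration; only the flow_state/ folder and the per-model file naming come from the issue title.

```python
from pathlib import Path
from typing import Iterable, Sequence


def to_markdown_table(columns: Sequence[str], rows: Iterable[Sequence[object]]) -> str:
    """Render column names and rows as a pipe-delimited markdown table."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = ["| " + " | ".join(str(value) for value in row) + " |" for row in rows]
    return "\n".join([header, divider, *body])


def save_preview(
    project_root: str,
    model_name: str,
    columns: Sequence[str],
    rows: Iterable[Sequence[object]],
) -> Path:
    """Overwrite flow_state/<model_name>.md with the latest preview and return its path."""
    out_dir = Path(project_root) / "flow_state"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"{model_name}.md"
    table = to_markdown_table(columns, rows)
    out_file.write_text(f"# {model_name}\n\n{table}\n", encoding="utf-8")
    return out_file
```

Overwriting the same file on every run means an editor with a markdown preview pane open on that file picks up the newest results without any extra commands.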

Make it really easy to follow code logic and entry and exit points
Bias towards one-way data flow, given that Streamlit reads the code top to bottom and most of the logic follows a linear path: https://dev.to/laserreindeer/one-way-data-flow-why-47fk
State travels from parent to child classes/objects
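One way to picture one-way data flow in Python is an immutable state object that each step receives and returns a new copy of, so nothing downstream can modify what came before. This is an illustrative sketch only; `RunState` and its fields are hypothetical and do not describe fst's actual classes.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Tuple


@dataclass(frozen=True)
class RunState:
    """State for one model run; frozen so it can only flow forward as new copies."""

    model_name: str
    build_seconds: float = 0.0
    preview_rows: Tuple[tuple, ...] = ()


def record_build(state: RunState, seconds: float) -> RunState:
    """Return a new state with the build time filled in."""
    return replace(state, build_seconds=seconds)


def record_preview(state: RunState, rows: Iterable[tuple]) -> RunState:
    """Return a new state with the preview rows filled in."""
    return replace(state, preview_rows=tuple(rows))


def render(state: RunState) -> str:
    """The last step only reads the state; it never writes back to it."""
    return (
        f"{state.model_name}: built in {state.build_seconds:.2f}s, "
        f"{len(state.preview_rows)} preview rows"
    )


if __name__ == "__main__":
    state = RunState(model_name="customers")
    state = record_build(state, 6.76)
    state = record_preview(state, [(1, "Michael"), (2, "Shawn")])
    print(render(state))
```

Because each step takes a state and returns a state, the code reads top to bottom in the same order the data moves, which keeps entry and exit points easy to follow.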

Create it as a command line tool and upload to pypi

Make installing feel elegant and guided

If no tests exist, prompt the user to generate tests and rerun with them. Maybe make this a config flag that can be disabled upfront; enabled by default
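A sketch of the decision and the generated file, under stated assumptions. The `unique` and `not_null` tests on a key column mirror the tests shown in the example log output (`unique_new_file_customer_id`, `not_null_new_file_customer_id`). The function names, the `enabled` flag, and the choice to build the YAML as plain text are hypothetical and are not taken from fst's implementation.

```python
def should_generate_tests(tests_found: int, enabled: bool = True) -> bool:
    """Generate tests only when the feature is enabled and none were detected."""
    return enabled and tests_found == 0


def render_test_yaml(model_name: str, key_column: str) -> str:
    """Render a dbt properties file with unique and not_null tests on one column."""
    return (
        "version: 2\n"
        "\n"
        "models:\n"
        f"  - name: {model_name}\n"
        "    columns:\n"
        f"      - name: {key_column}\n"
        "        tests:\n"
        "          - unique\n"
        "          - not_null\n"
    )


if __name__ == "__main__":
    if should_generate_tests(tests_found=0):
        print(render_test_yaml("new_file", "customer_id"))
```

Defaulting `enabled` to `True` matches the "enabled by default" note, while still letting a config flag switch generation off before any prompt is shown.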

Research

https://medium.com/snowflake/minimalist-snowflake-table-compare-using-data-diff-ba67cc4f904c

look at audit helper because datadiff has a lot of rough edges to it
use dbt 1.5+ to run invocations
update compare to prod chart to update y axis for row counts, bytes, other stats
have audit helper do hot reloading in the logs and store invocations automatically instead of a dropdown UI:
https://hub.getdbt.com/dbt-labs/audit_helper/latest/
I'll probably have only one of these functions for the sake of MVP
compare dev to prod only. That's where it's most useful: compare_relation_columns
set defaults for file generation, look at profiles.yml for schema info
store historical audit reports for each iteration
