re-data / re-data Goto Github PK

re_data - fix data issues before your users & CEO would discover them 😊

Home Page: https://docs.getre.io/latest/docs/start_here

License: Other

Python 1.97% Shell 0.05% HTML 91.80% JavaScript 0.06% CSS 0.22% TypeScript 5.80% Makefile 0.10%

data-monitoring data-analysis data-quality data-quality-monitoring open-source-tooling data-observability dataquality data-testing data-quality-checks dbt

re-data's Introduction

What is re_data?

re_data is an open-source data reliability framework for the modern data stack. 😊

Currently, re_data focuses on observing the dbt project (together with the underlying data warehouse - Postgres, BigQuery, Snowflake and Redshift).

Live demo

Check out our live demo of what re_data can do for you 😊

Getting started

Check our docs! 📓 📓 📓

Join re_data community on Slack

Support

If you like what we are building, support us! Star ⭐ re_data on Github.

License

re_data is licensed under the MIT license. See the LICENSE file for licensing information.

Contributing

We love all contributions 😍 bigger and smaller.

Check out the current list of issues here and see our Contributing Guide. Also, feel welcome to join our Slack and suggest ideas or set up a live session here.

re-data's People

Contributors

Stargazers

Watchers

Forkers

imreeciowy mateuszklimek jwarlander gab-microsoft arvicv zuba0 h5chauhan guicalare etacassiopeia noufal85 dlbas igorivanov mociarain lnicalo patrika1979 liamjtaylor arkady-emelyanov hishoss jessewei sbboakye maciejklimek wajih-o pyerbiz sweetpand genomicsnx stjordanis admariner noyrotbart pauljw28 casperhansen mifrill samgans maxenceroux patkearns10 re-cezpiw dejii hkuffel rparrapy mqtt-halogenki deepakdeepak5 donnyzhao sergey-vdovin wojtekidd stumelius hackstrap akinolu52 cezpiw adbmd o7s8r6 aesmin lumiqai just-hope-netizen zixi0825 karmitoherry yogeshkad teshager-mamo z3z1ma sundargopal britorj teslahenry gilbertabakahadjei enzomagalhaes landier singhald007 craigwilson-zoe raghavprabhu follyaka anuwatavis mbrukman yhoungdev rwatts3 sundarshankar89 dieselpowerby chirasmita-mallick funshoelias yu-iskw galaxyoneda explore-n-learn simitest mahmoudmosleh c0nn0rstevens liepieshov codep-ai franktub seruman andreax79 alexnikitchuk shakahl samox cloudnepal ahmedrad mtsadler-branch nicoekkart egorhm alexispvgm togobingi enriquecastellano wutao0914 bachng2017 tanglespace

re-data's Issues

Asynchronous data source import

Tell us about the problem you're trying to solve
Adding a new data source with several tables can take so much time on the data source view within the web UI, due to redata importing and configuring tables from the source.

Describe the solution you’d like
Make this table import a task queue using for eg. celery

I think the function to convert into an async task is the redata.checks.data_schema.check_for_new_tables which is used by redata.ui_admin.data_source.DataSourceView to import the data source from the ui.

[FEATURE] Customized empty values

Tell us about the problem you're trying to solve

Currently empty values are only recognized when text field is empty string: "". But sometimes it would be good to consider "N/A", "no value" and other custom options empty.

Describe the solution you’d like

This could be added as env variable for dbt project, which by default is empty string

[FEATURE] Add distinct/duplicate/unique metrics to re_data

Tell us about the problem you're trying to solve

Would like have to this kind of metrics:

distinct_values - different values which show up in a column example: ['aaa','bbb','ccc', 'ccc'] -> 3
distinct_rows - rows which are unique: ['aaa','bbb','ccc', 'ccc'] -> 2
duplicate_values - values which are not unique - ['aaa','bbb','ccc', 'ccc'] -> 1 ('ccc')
duplicate_rows - rows which are not unique - ['aaa','bbb','ccc', 'ccc'] -> 2

Those shouldn't be added automatically to every table (as computing them is quite heavy and we are not always interested in that, but we should add them as metric computed in our config file.

How about you don't spam people?

What's up with that?

Stop sending unsolicited email to other Github users to sell your product.

Add support for Vertica

How

Use sqlalchemy connector for this:

https://pypi.org/project/sqlalchemy-vertica/

Create new class inheriting from SqlAlchemy class, add Vertica specific types to it.

Testing

For testing that Vertical works use this docker image:
https://hub.docker.com/r/jbfavre/vertica

Add MongoDB support for Redata

How

It should be possible to use: https://pypi.org/project/sqlalchemy-mongobi/ for connecting to mongo.
Then inhering from Redata SQLAlchemy DB class and adding support for types supported in mongo.

Possibly use https://github.com/pajachiet/pymongo-schema for detecting schema

[FEATURE] Notify on Slack (other channels?) about new anomalies

Tell us about the problem you're trying to solve

Saving anomalies is great for analysis, but for quick response, it would be good to have a notification system for them

Describe the solution you’d like

It would be great if re_data notifies you on Slack/other channels? of new anomalies found.

Add ability to run custom checks on data

[FEATURE] Experiment with ML for alerts

Tell us about the problem you're trying to solve

Currently alerts are based on computing z-score for values from the past. It work quite good, but have limitations.

Describe the solution you’d like

https://facebook.github.io/prophet/ and similar libraries may give better results here.
One thing to remember about is speed of giving prediction.

Additional context

Not yet assuming actual production implementation, but comparing results for statistical models and current ML approach.

[FEATURE] Make some .env configuration editable on user profile page

Tell us about the problem you're trying to solve

Expecting to change .env file for some of the configs, it's not very comfortable.
It also requires to do stop/up of docker.

Describe the solution you’d like

Making it editable and stored in internal DB, would have better user experience.
This would need separate admin page to actually edit this types of settings.

Describe the alternative you’ve considered or used

Keeping it as is

Additional context

Some config parameters, like run time for airflow maybe not that easy to edit this way. It need to be checked if that's possible to adjust it for airflow during run time. Most of other settings (like regexp used for detecting time columns for example)
shouldn't be that problematic.

[FEATURE] Visualisations of data quality metrics

Tell us about the problem you're trying to solve

Currently re_data doesn't do visualisations of computed stats.
You can still do it yourself using your own BI tool and connecting it to re_data metrics table schema,
but it would be nice to have way of supporting this internally so that not work is required and visualisations can be adjusted to specific needs related to data quality topic.

Describe solution

Add application fetching data gathered in re_data schema and displaying those in visualisations

Add nullable to supported schema changes detection

What

Currently redata doesn't store information about this if given column in nullable. Adding that would be really helpful as it's one of quite often made changes on DB schema

[FEATURE] Add regex match / regex not match metrics

Tell us about the problem you're trying to solve

We want to know if table column contain values which we expected (this can be defined by some regexp).
re_data could check tables against this regexp and write how many correct/incorrect values it sees.

Describe the solution you’d like

New metrics: correct_count, correct_pr, incorrect_count, incorrect_pr which will store information about this.
There should be possibility to configure what correct means by passing regexp into re_data config.

Add automatic alerting based on past values in Grafana

Detecting json structure changes

Sometime schema stays the same, but JSON fields are different, detecting this situation would be helpful in a lot of cases

[FEATURE] configure schema for metadata tables

Tell us about the problem you're trying to solve
In our dbt project we use generate_schema_name to deploy models in multiple schemas different than the default one. Is there an easy way to set a schema for metadata tables (re_data_columns, re_data_tables)? i.e. different schema than the one defined in dbt profile.yml.

Describe the solution you’d like
Setting a config variable to define target schema.

Describe the alternative you’ve considered or used

Additional context
dbt: 0.20.1
re_data: 0.2.0

[FEATURE] Adding wrapper on a dbt test command

Tell us about the problem you're trying to solve

Dbt tests output may be hard to parse and some logic on top of it, could make using it easier

Describe the solution you’d like

Adding wrapper on dbt tests command can have benefits like:

adding notifications after tests (like Slack/Pagerduty etc.)
saving some of the tests artifacts to DB

Add ability to select checks run for specific tables/columns

Create DB table with this, and admin UI for editing it

Add support for generating metrics based on s3 data

What

Support for monitoring s3 data stored in Hive format eg.
x=1/date=2020/country=X/data.parquet

Admin UI for editing what's exactly monitored

One use case. is manually editing time_column which is currently chosen by a heuristic.

Add options to specify thresholds for alerts

What

Currently all alerting is based on zscore, but it was couple times mentioned that possibility to manually adjust
thresholds for alerting would be useful. Also in the future, we expect to have also different anomaly mechanism possible for tables.

How

This can be done by modifying check and adding new_column: alerts there.
Json for this column could look like this:

{
  'column_name': {
    'metric_name': {
      'check_for_anomly': 'redata.alerts.thresholds'
      'params': {
        'max': value
        'min': value
     }
    }
  }
}

Current code will assume that if that's not defined it normally runs z_score, but it gives possible to run other alerting functions in the future.

Add support for AWS Athena

Why

This enables using Redata on s3 with Athena configured

How

It's possible to query Athena through SQLAlchemy using: https://pypi.org/project/pyathena/
So idea is to create a data-source inheriting from SQLAlchemy one for that

[FEATURE] Show nulls and missing values percentages

Tell us about the problem you're trying to solve

Nulls and missing number values are not very telling, it would be useful to know percentage of values being nulls

Describe the solution you’d like

Show percentages of values which are nulls

Describe the alternative you’ve considered or used

Doing match in you head instead, comparing nulls values with total rows added

Add support for Apache Hive

What

Support for Hive as a backend

How

Use SQLAlchemy and PyHive for this and query data with it:

https://github.com/dropbox/PyHive

[BUG] TypeError: can't compare offset-naive and offset-aware datetimes

Describe the bug
The bug is from table.py on line 127 script where you are comparing a naive datetime object with a datetime object that is timezone aware.

Expected behavior
In the source code now_ts is not timezone aware but max_ts is aware.
variables max_ts and now_ts are both either supposed to be timezone aware or not.

To Reproduce
When I created a data source with a redshift cluster.

Screenshots

[BUG] Column name with spaces are not rendered properly for sql statement

Describe the bug
get_max_timestamp function uses column name of type datetime to get the max timestamp. This will work for a column with no spaces (e.g. date_of_transaction) in the name, but the moment a column with spaces (e.g. date of transaction) is used, an error is thrown.

Expected behavior
Assuming the column name is date of transaction and full_table_name is mv_stock

f"SELECT max({column}) as value FROM {table.full_table_name}"

I would expect to get this

f"SELECT max("date of transaction") as value FROM mv_stock"

To Reproduce
Set up a data source with any source type with tables that have a DateTime column with a column name that has spaces.

Screenshots

Redata vs Airflow logout

What is unexpected behavior

Switching between Redata UI and Airflow UI logs you out. (Both apps are using flask)

Suggested fix

Changes to Redata to not conflict with Airlow login session.

[BUG] errors on dbt compile

Describe the bug

# I commented out this line
# profile: 're_data_postgres'

then I ran:

(.venv) daniel@dbrtly-MBP dbt_shop % dbt compile
Running with dbt=0.20.0

Found 26 models, 44 tests, 0 snapshots, 0 analyses, 647 macros, 0 operations, 0 seed files, 4 sources, 1 exposure

20:04:50 | Concurrency: 4 threads (target='dev')
20:04:50 | 
Encountered an error:
Runtime Error
  Runtime Error in model re_data_base_metrics (models/intermediate/re_data_base_metrics.sql)
    404 Not found: Dataset daniel-bartley-sandbox:dbt_shop_re was not found in location US
    
    (job ID: 69d805c5-c7e3-4bf0-9cdd-ee8357fd0d38)

I manually created the dataset to see what would happen.

(.venv) daniel@dbrtly-MBP dbt_shop % dbt compile
Running with dbt=0.20.0
[WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- seeds

Found 26 models, 44 tests, 0 snapshots, 0 analyses, 647 macros, 0 operations, 0 seed files, 4 sources, 1 exposure

20:07:51 | Concurrency: 4 threads (target='dev')
20:07:51 | 
Encountered an error:
Runtime Error
  Runtime Error in model re_data_freshness_inc (models/intermediate/re_data_freshness_inc.sql)
    404 Not found: Table daniel-bartley-sandbox:dbt_shop_re.re_data_tables was not found in location US
    
    (job ID: 62bf3af6-1656-4fb2-88d0-42166b550568)

Expected behavior

This command should run with no errors related to the re_data package.

dbt compile

To Reproduce
Assuming you start with a functional project named jaffle_shop (1 source, 1 additional model) with python, a valid profile, permissions, etc. The following should run without an error:

cd jaffle_shop
 {
        echo 'packages:'
        echo '  - package: re-data/re_data'
        echo "    version: ['>=0.2.0', '<0.3.0']"
} > packages.yml
dbt deps
dbt compile

Screenshots
Not UI related

Logs and additional context
2021-08-02 10:07:52.639923 (Thread-2): Finished running node model.re_data.re_data_schema_changes
2021-08-02 10:07:52.640174 (Thread-1): Finished running node model.re_data.re_data_freshness_inc
2021-08-02 10:07:52.640377 (Thread-4): Finished running node model.re_data.re_data_base_metrics
2021-08-02 10:07:52.641022 (MainThread): Connection 'master' was properly closed.
2021-08-02 10:07:52.641607 (MainThread): Connection 'model.re_data.re_data_freshness_inc' was properly closed.
2021-08-02 10:07:52.641744 (MainThread): Connection 'model.re_data.re_data_schema_changes' was properly closed.
2021-08-02 10:07:52.641823 (MainThread): Connection 'model.dbt_shop.customer_orders' was properly closed.
2021-08-02 10:07:52.641893 (MainThread): Connection 'model.re_data.re_data_base_metrics' was properly closed.
2021-08-02 10:07:52.642121 (MainThread): Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10b8f1f10>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10b948f10>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10bcd3a30>]}
2021-08-02 10:07:52.642348 (MainThread): Flushing usage events
2021-08-02 10:07:53.707659 (MainThread): Encountered an error:
2021-08-02 10:07:53.707887 (MainThread): Runtime Error
Runtime Error in model re_data_freshness_inc (models/intermediate/re_data_freshness_inc.sql)
404 Not found: Table daniel-bartley-sandbox:dbt_shop_re.re_data_tables was not found in location US

(job ID: 62bf3af6-1656-4fb2-88d0-42166b550568)

2021-08-02 10:07:53.710593 (MainThread): Traceback (most recent call last):
File "/Users/daniel/git/dbrtly/dbt_shop/.venv/lib/python3.9/site-packages/dbt/main.py", line 125, in main
results, succeeded = handle_and_check(args)
File "/Users/daniel/git/dbrtly/dbt_shop/.venv/lib/python3.9/site-packages/dbt/main.py", line 203, in handle_and_check
task, res = run_from_args(parsed)
File "/Users/daniel/git/dbrtly/dbt_shop/.venv/lib/python3.9/site-packages/dbt/main.py", line 256, in run_from_args
results = task.run()
File "/Users/daniel/git/dbrtly/dbt_shop/.venv/lib/python3.9/site-packages/dbt/task/runnable.py", line 425, in run
result = self.execute_with_hooks(selected_uids)
File "/Users/daniel/git/dbrtly/dbt_shop/.venv/lib/python3.9/site-packages/dbt/task/runnable.py", line 384, in execute_with_hooks
res = self.execute_nodes()
File "/Users/daniel/git/dbrtly/dbt_shop/.venv/lib/python3.9/site-packages/dbt/task/runnable.py", line 339, in execute_nodes
self.run_queue(pool)
File "/Users/daniel/git/dbrtly/dbt_shop/.venv/lib/python3.9/site-packages/dbt/task/runnable.py", line 246, in run_queue
self._raise_set_error()
File "/Users/daniel/git/dbrtly/dbt_shop/.venv/lib/python3.9/site-packages/dbt/task/runnable.py", line 222, in _raise_set_error
raise self._raise_next_tick
dbt.exceptions.RuntimeException: Runtime Error
Runtime Error in model re_data_freshness_inc (models/intermediate/re_data_freshness_inc.sql)
404 Not found: Table daniel-bartley-sandbox:dbt_shop_re.re_data_tables was not found in location US

(job ID: 62bf3af6-1656-4fb2-88d0-42166b550568)

2021-08-02 10:07:53.710829 (MainThread): unclosed running multiprocessing pool <multiprocessing.pool.ThreadPool state=RUN pool_size=4>

Make generated home default page in Grafana

It should be possible to call Grafana API to make it home

Make re_data work in case of no creation time column

What

Currently for rePdata to work for tables need to have creation_time column representing time when given record was added to a table. It's not always possible to have that (although it's most likely very good practice to have it) It's possible to produce some stats even if that column doesn't exist in following cases:

If there is incremental index on table, it's possible to use this column instead of creation_time for filtering, but it requires for each redata run saving last created index with current timestamp.
If there is no incremental index or column, it's impossible to create many stats for last_day etc, but it's still possible to detect schema changes and compute custom metrics. Redata most likely should enable that. Currently table is just skipped.

[BUG] - BigQuery resources exceeded during query execution

Describe the bug
Receive an error when running

dbt run --models package:re_data --vars \ '{ "re_data:time_window_start": "2021-06-01 00:00:00", "re_data:time_window_end": "2021-08-05 00:00:00" }'

`Completed with 1 error and 0 warnings:

Database Error in model re_data_base_metrics (models/intermediate/re_data_base_metrics.sql)
  Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.
  compiled SQL at target/run/re_data/models/intermediate/re_data_base_metrics.sql`

Expected behavior
dbt job should be able to complete successfully

To Reproduce
Steps to reproduce the behavior:
run
dbt run --models package:re_data --vars \ '{ "re_data:time_window_start": "2021-06-01 00:00:00", "re_data:time_window_end": "2021-08-05 00:00:00" }'

Running on 2 schemas, 25 tables, 489 rows

Screenshots
If applicable, add screenshots to help explain your problem.

Logs and additional context
If application, any other context, logs etc.here

Running version 0.20.0 dbt and re_data

Add stress tests for monitoring hundreds of tables

[FEATURE] Add null % and missing % metric

Tell us about the problem you're trying to solve

It would be nice to learn % of values which are missing or nulls in a given column.
This maybe a better metric for computing anomalies in those and also more clear to the users.

Describe the solution you’d like

Adding re_data_missing_percent re_data_null_percent models and missing_percent null_percent as metrics computed.

Add method of deploying to AWS easily

[FEATURE] last day var parameters

Tell us about the problem you're trying to solve
It's currently hard to schedule re_data jobs in dbt cloud (it's impossible to set dynamic re_data:time_window_end, re_data:time_window_start in job definition)

It would be nice if re_data dbt package enabled easy way to compute last day stats.

Describe the solution you’d like

Another var parameters, which if passed will cause re_data to compute last day stats. Output of run with this parameters will be equivalent of run with: re_data:time_window_start, re_data:time_window_end set to previous day.

multi-database support

Tell us about the problem you're trying to solve
My current production data warehouse architecture incorporates several layers of load and transformation, across several databases, over many terabytes of data (and growing). Example: my cleaning step copies data from raw sources to my 'cleaned' database, my enrichment step copies from 'cleaned' to 'enriched', and a third modeling step copies from 'enriched' to a third DB + schema for consumption. I would like to leverage re_data against all of these database/schema combinations, but the latest version of re_data is limited to reading from one database at a time. I would also like to materialize my re_data models in a dedicated database apart from my sources, but there are a number of problems with this, including:

This means I need to create a handful of new connection profiles in dbt profiles.yml just to run re_data queries
The databases configured there may be different from those in dbt_project.yml, which makes maintenance even more complicated
The re_data_tables model is incremental, and as such, adapts to multiple databases well. If you include schemas from multiple DBs in the re_data:schemas config, it catches them all. But the re_data_columns model is materialized as a table, so it is rebuilt on every run, and only picks up columns from the schema you've configured as both your source and destination.

Finally, I would like to be able to install and configure re_data natively to my production dbt ETL project, but due to collisions with other dbt packages, existing dbt_project.yml and profiles.yml settings and mappings, I can't do this.

Describe the solution you’d like
Suggested fixes include separation of source and destination, addition of database + schema var configuration, and/or incrementalize re_data_columns somehow (this would also offer schema change detection, which I believe has already been requested). Possible custom macros to generate schema and/or database mappings.

Describe the alternative you’ve considered or used
For now, I am maintaining three separate dbt projects separate from my primary dbt ETL project. This means three additional versions of typical dbt artifacts, and upon automation, additional layers that I need to consider in scripting container builds and commands, Airflow invocations, etc.

Additional context

Snowflake user

Add Kubernetes support

[FEATURE] Allow SSL to be used if server requires it

Tell us about the problem you're trying to solve
When connecting to a redshift cluster using the ip address of the cluster as host in the connection string, an error stating server certificate for <redshift-cluster-endpoint> does not match host name <cluster-ip-address>.

Describe the solution you’d like
Simply adding connect_args={'sslmode': 'allow'} parameter to the create_engine function in db_operations.py and data_source.py will solve the issue.

Describe the alternative you’ve considered or used
I have tried and tested the above solution and it works.

[BUG] Editing source name duplicates tables

Describe the bug

When editing source_db name all tables after edit, Redata is loosing connection between this source and previously created tables. It's because string not id relation in DB.

Expected behavior

Editing name shouldn't cause tables to be rediscovered under new data source name

Fix viewing logs in airflow webserver

Currently trying to view logs for a specific run in airflow webserver throw error:

*** Log file does not exist: /opt/airflow/logs/validation_dag/run_checks_REDATA/2021-01-08T16:30:00+00:00/1.log
*** Fetching from: http://e78f9bc39c5d:8793/log/validation_dag/run_checks_REDATA/2021-01-08T16:30:00+00:00/1.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='e78f9bc39c5d', port=8793): Max retries exceeded with url: /log/validation_dag/run_checks_REDATA/2021-01-08T16:30:00+00:00/1.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7efe8ae72550>: Failed to establish a new connection: [Errno 111] Connection refused',))

It's a known problem, some possible solutions are described here: puckel/docker-airflow#44

Sharing logs between webserver/scheduler sounds like not bad idea ;)

[FEATURE] Add distinct values count (for text field)

Tell us about the problem you're trying to solve

Add distinct values support for re_data

Describe the solution you’d like

Number of distinct values computed for all text columns in DB

Describe the alternative you’ve considered or used

If it's too expensive to compute that for every column, we may consider doing this
only when specific check added to DB. This maybe needed for big DBs.

[FEATURE] LImit re_data run command to re_data models

Tell us about the problem you're trying to solve

Currently, if the dbt_project project contains different models than ones created by re_data those will be also computed one running re_data.

Describe the solution you’d like

The solution should be simple, just use a package filter on models computed, when running dbt

MS SQL support

What

Support for MS SQL as DB in Redata

How

For testing, it would be good to use this docker: https://hub.docker.com/_/microsoft-mssql-server

It should be possible to connect to it using sqlalchemy - some details described here:
https://docs.sqlalchemy.org/en/14/core/engines.html#microsoft-sql-server

[BUG] Deleting source doesn't delete tables from interface

Describe the bug

When you delete data source, tables connected to it still exist and are monitored.

Expected behavior

It would be expected that those are also deleted. Although warning about this to happen should be showed

To Reproduce

Delete data source which already had tables discovered related to it.

Add support for Amazon Redshift

Add automatic tests

Add tests for re_data command line & template dbt project

[INTEGRATION] Improve Exasol integration

Tell us about the problem you're trying to solve

Exasol is one of currently supported DBs, but there are 2 problems with current integration:

no easy way for for testing it (for other DBs I either have either dev accounts setup or docker images with DBs to test against
current integration is separate from SqlAlchemy based DBs I expect it could be easier to maintain if it used same code as SqlAlchemy integrations

Describe the solution you’d like

Use: https://github.com/exasol/docker-db and add running Exasol to compose running sample databases
Move to SqlAlchemy integration

Describe the alternative you’ve considered or used

Moving to SqlAlchemy is optional but in case of skipping it, it would be good to add features to Exasol integration to make it on par with other supported DBs

Adding black + isort

It will be really good to automatically format code which is commited

Render query variables in the sql custom checks

Tell us about the problem you're trying to solve
In the custom query check functionality, query strings containing variables eg. select col_a from {{table_name}} are currently failing when these variables are written a bit differently than expected eg. {{ table_name }}. This is due to check_custom_query function expecting the name of these variables in exactly the same format within the code, which then uses a string replace.

Describe the solution you’d like
This is easily solved by a templating framework like jinja2. For eg: you can easily write

select col_a from {{table_name}}
select col_a from {{ table_name }}, etc.

which are all valid. We'll have to update the check_custom_query with something along the lines

from jinja2 import Template

def check_custom_query(self, table, query, conf):
       ...
        query = Template(query)
        query = query.render(
            table_name=table.full_table_name,
            period_start=period_start,
            period_end=period_end
        )

Describe the alternative you’ve considered or used
An alternative can be to use a regex pattern.

Investigate trino

What

It's possible with Trino (former Presto) we would be able to add support
for streaming and possibly also s3 (still staying with using SQL)

Redata would then need to start internal Trino cluster most likely and use to to query
streaming/filesystem datasets. It's definitely more complicated setup then current
queries to DB backends but it could solve monitoring for many currently not supported backends.
List of connections Trino can connect to is here: https://trino.io/docs/current/connector.html

re-data / re-data Goto Github PK

re-data's Introduction

What is re_data?

Live demo

Getting started

Support

License

Contributing

re-data's People

Contributors

Stargazers

Watchers

Forkers

re-data's Issues

How

Testing

How

What

What

What

How

Why

How

What

How

What is unexpected behavior

Suggested fix

What

What

How

What

Recommend Projects

Recommend Topics

Recommend Org