Comments (6)
@abegong: Looking at this with some mixed feelings.
I just created a PR to do it as it stands now, but my concern is that those regexes might get too closely identified with GE and/or be hard to version. Certainly many of them could be better than they are now, and we'll want to be able to improve them without breaking people's expectations later on.
Adding a GE version to an expectation config may address that.
from great_expectations.
Instead of including named regexes as first-class citizens, we're going to make it simple to include named regexes (and other variables) in an expectation_config "meta" object.
Proposed API:
Suppose meta looks like this:
{
    "__version__" : "0.2.3",
    "regexes" : {
        "namelike" : "^[A-Z][a-z]+$",
        "leading_or_trailing_whitespace" : "^\W.*?\W$",
        "double_vowels" : [
            "aa", "ee", "ii", "oo", "uu", "yy"
        ]
    }
}
...then these are all valid function calls:
expect_column_values_to_match_regex("my_column", {"$meta_var" : "regexes.namelike"})
expect_column_values_to_not_match_regex("my_column", {"$meta_var" : "regexes.leading_or_trailing_whitespace"})
expect_column_values_to_match_regex_list("my_column", {"$meta_var" : "regexes.double_vowels"}, "any")
# Match against 'aa'
expect_column_values_to_match_regex("my_column", {"$meta_var" : "regexes.double_vowels.0"})
# Match against 'aa' or 'yy'
expect_column_values_to_match_regex_list("my_column", [{"$meta_var" : "regexes.double_vowels.0"}, {"$meta_var" : "regexes.double_vowels.5"}], "any")
At runtime, each $meta_var expression is evaluated by traversing the meta object's tree. This syntax is roughly equivalent to MongoDB's dot notation for querying nested objects. (Mongo also supports matching on multiple criteria, which we don't need here.)
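A minimal sketch of that traversal, assuming paths are split on dots and that a segment applied to a list is treated as an integer index. The function name `resolve_meta_var` is hypothetical and not part of great_expectations:

```python
def resolve_meta_var(meta, path):
    """Walk `meta` along a dotted path such as "regexes.double_vowels.0"."""
    node = meta
    for segment in path.split("."):
        if isinstance(node, list):
            # Assumption: segments applied to a list are integer indexes.
            node = node[int(segment)]
        else:
            node = node[segment]
    return node


meta = {
    "regexes": {
        "namelike": "^[A-Z][a-z]+$",
        "double_vowels": ["aa", "ee", "ii", "oo", "uu", "yy"],
    }
}

print(resolve_meta_var(meta, "regexes.namelike"))
print(resolve_meta_var(meta, "regexes.double_vowels.5"))
```

A missing key or out-of-range index raises `KeyError` or `IndexError` here; a real implementation would probably want a clearer error message that names the offending path.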
In the config, these will create entries like:
{
    "expectation_type" : "expect_column_values_to_match_regex",
    "kwargs" : {
        "column" : "my_column",
        "regex" : {"$meta_var" : "regexes.namelike"}
    }
}
{
    "expectation_type" : "expect_column_values_to_not_match_regex",
    "kwargs" : {
        "column" : "my_column",
        "regex" : {"$meta_var" : "regexes.leading_or_trailing_whitespace"}
    }
}
{
    "expectation_type" : "expect_column_values_to_match_regex_list",
    "kwargs" : {
        "column" : "my_column",
        "regex_list" : {"$meta_var" : "regexes.double_vowels"},
        "match_on" : "any"
    }
}
{
    "expectation_type" : "expect_column_values_to_match_regex",
    "kwargs" : {
        "column" : "my_column",
        "regex" : {"$meta_var" : "regexes.double_vowels.0"}
    }
}
{
    "expectation_type" : "expect_column_values_to_match_regex_list",
    "kwargs" : {
        "column" : "my_column",
        "regex_list" : [
            {"$meta_var" : "regexes.double_vowels.0"},
            {"$meta_var" : "regexes.double_vowels.5"}
        ],
        "match_on" : "any"
    }
}
Note: all of these examples are given in terms of regexes, but this pattern can be applied to virtually any expectation argument.
Therefore, I think the most straightforward way to implement this proposal is by adding logic to @expectation which will iterate over all args and kwargs to find instances of {"$meta_var": ...} and perform the appropriate substitution.
from great_expectations.
Thanks for capturing this discussion! One small tweak: rather than __version__, what about ge_version? That avoids confusion about name mangling (since it's a dict key, not a class variable, anyway) and also confusion about whether it's a version of the config (it's not) or a version of the GE library (it is).
from great_expectations.
How about we remove all ambiguity and go with great_expectations.__version__?
That's explicit enough that a developer encountering a config for the first time (with no background in great_expectations at all) could still Google and figure it out.
from great_expectations.
Sounds good to me.
from great_expectations.
Confirming a discussion we had here:
- We want to build a mechanism that makes clear when access to the meta object will be explicit and when it will be implicit.
from great_expectations.