Comments (3)
Hi,
Here's some context for this issue. Deequ lets you run checks on your data by constructing a VerificationSuite object. You can build a VerificationSuite by calling onData(), which then exposes the addCheck() method. You can call that repeatedly to add multiple checks to your suite, for example:
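import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}

// `data` is the Spark DataFrame under test.
val verificationResult = VerificationSuite()
  .onData(data)
  .addCheck(Check(CheckLevel.Error, "must have 5 rows").hasSize(_ == 5))
  .addCheck(Check(CheckLevel.Error, "must have no nulls").isComplete("id"))
  .run()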
All checks within the same VerificationSuite are processed before Spark is called, and Deequ comes up with a plan to calculate all the necessary statistics without making unnecessary passes over the data.
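To make the "no unnecessary passes" point concrete, here is a toy model in plain Scala. It is only a sketch of the idea and not how Deequ is implemented: the names SharedScanSketch, Metric, ToyCheck, computeAll and run are all made up for illustration, and an in-memory Seq stands in for a Spark DataFrame. Each check declares which statistics it needs, the suite takes the union of them, and a single fold over the rows computes them all.

```scala
// Toy model, NOT Deequ's code: every name here is hypothetical.
object SharedScanSketch {
  // A row maps column names to optional values (None models a null).
  type Row = Map[String, Option[Any]]

  sealed trait Metric
  case object RowCount extends Metric
  final case class NonNullCount(column: String) extends Metric

  // A check names the metrics it needs and asserts on their computed values.
  final case class ToyCheck(
      description: String,
      required: Set[Metric],
      assertion: Map[Metric, Long] => Boolean)

  // Single pass: each row updates every requested metric at once.
  def computeAll(rows: Seq[Row], metrics: Set[Metric]): Map[Metric, Long] = {
    val zero: Map[Metric, Long] = metrics.map(m => m -> 0L).toMap
    rows.foldLeft(zero) { (acc, row) =>
      acc.map { case (m, n) =>
        val counts = m match {
          case RowCount        => true
          case NonNullCount(c) => row.get(c).exists(_.isDefined)
        }
        m -> (if (counts) n + 1 else n)
      }
    }
  }

  // Collect the metrics of all checks first, scan once, then evaluate.
  def run(rows: Seq[Row], checks: Seq[ToyCheck]): Map[String, Boolean] = {
    val needed = checks.flatMap(_.required).toSet // deduplicated across checks
    val values = computeAll(rows, needed)
    checks.map(c => c.description -> c.assertion(values)).toMap
  }
}
```

Two checks that both need the row count share one RowCount entry in this model, which is the kind of saving that is only possible when all checks are known before the data is scanned.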
However, for the comparison operations @rdsharma26 is mentioning above, there is no Check object and therefore they cannot be added to a VerificationSuite. This means two things:
- There is no way for Deequ to optimize the execution of these comparisons
- The code to run any of these checks is inconsistent with the rest of the Deequ syntax, as you need to invoke methods directly rather than constructing a cohesive suite of tests.
The ask here is to merge the two operations with the standard Deequ APIs, so a user can create a verification suite that contains a mix of cross-dataset and in-dataset tests. This will probably require a bit of refactoring in the VerificationRunBuilder, because unlike any Checks we have today, the cross-dataset checks require a reference dataset in addition to the primary dataset (passed using onData(), which returns a VerificationRunBuilder).
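One possible shape for such a mixed suite is sketched below in plain Scala. Nothing in it exists in Deequ: MixedSuiteSketch, SuiteCheck, InDatasetCheck, CrossDatasetCheck and RunBuilder are hypothetical names, and Dataset is an in-memory stand-in for a DataFrame. The point it illustrates is that a cross-dataset check has to carry its own reference dataset, because the builder returned by onData() only knows about the primary one.

```scala
// Hypothetical design sketch: none of these types are part of Deequ.
object MixedSuiteSketch {
  type Dataset = Seq[Map[String, Any]] // stand-in for a Spark DataFrame

  sealed trait SuiteCheck { def description: String }

  // Evaluated against the primary dataset alone.
  final case class InDatasetCheck(
      description: String,
      assertion: Dataset => Boolean) extends SuiteCheck

  // Carries the reference dataset it compares the primary dataset against.
  final case class CrossDatasetCheck(
      description: String,
      reference: Dataset,
      assertion: (Dataset, Dataset) => Boolean) extends SuiteCheck

  // Immutable builder: addCheck returns a new builder with the check appended.
  final case class RunBuilder(
      primary: Dataset,
      checks: Vector[SuiteCheck] = Vector.empty) {

    def addCheck(check: SuiteCheck): RunBuilder = copy(checks = checks :+ check)

    // Both kinds of check are evaluated from the same suite.
    def run(): Map[String, Boolean] = checks.map {
      case InDatasetCheck(d, f)         => d -> f(primary)
      case CrossDatasetCheck(d, ref, f) => d -> f(primary, ref)
    }.toMap
  }

  def onData(primary: Dataset): RunBuilder = RunBuilder(primary)
}
```

With this shape, a call such as onData(orders).addCheck(InDatasetCheck(...)).addCheck(CrossDatasetCheck(..., customers, ...)).run() reads like today's suites. Whether the real change should attach the reference dataset to the check or to the builder is an open design question that this sketch does not settle.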
Let me know if that's not clear or you have any follow-up questions.
@rdsharma26 could you add more details on what enhancement we are looking at here? I could take a stab at the implementation. Thanks.
@mentekid thanks, I can take a stab at this, will circle back once PR is ready.