Comments (5)
Hi @eapframework, thank you for raising this issue. Could you provide an example of the behavior you're seeing and the behavior you expect/want instead?
We're actually looking into a similar issue. Here's an example unit test using the Completeness analyzer:
For the dataframe:
def getDfCompleteAndInCompleteColumns(sparkSession: SparkSession): DataFrame = {
  import sparkSession.implicits._

  Seq(
    ("1", "a", "f"),
    ("2", "b", "d"),
    ("3", "a", null),
    ("4", "a", "f"),
    ("5", "b", null),
    ("6", "a", "f")
  ).toDF("item", "att1", "att2")
}
This test checks the row-level results:
"return row-level results for columns filtered" in withSparkSession { session =>
  val data = getDfCompleteAndInCompleteColumns(session)

  val completenessAtt2 = Completeness("att2", Option("att1 = \"a\""))
  val state = completenessAtt2.computeStateFrom(data)
  val metric: DoubleMetric with FullColumn = completenessAtt2.computeMetricFrom(state)

  data.withColumn("new", metric.fullColumn.get).collect().map(_.getAs[Boolean]("new")) shouldBe
    Seq(true, false, false, true, false, true)
}
Using the verification suite on a similar test:
+----+----+----+-----+
|item|att1|att2|rule1|
+----+----+----+-----+
|   1|   a|   f| true|
|   2|   b|   d|false|
|   3|   a|null|false|
|   4|   a|   f| true|
|   5|   b|null|false|
|   6|   a|   f| true|
+----+----+----+-----+
Here we can see that rows that are EITHER filtered out (rows 2 and 5, where att1 is not "a") or fail the check (row 3, where att2 is null) are marked as false.
Would you expect rows 2,5 to show true/None in this case?
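The output above is consistent with reading the row-level result as the conjunction of the filter and the check. Below is a minimal plain-Scala sketch of that reading, without Spark or deequ. The `FilteredRowsSketch` object, the `Row` case class, and the `rowLevel` helper are hypothetical names used only for illustration; they are not deequ's API or its implementation.

```scala
object FilteredRowsSketch {
  // Hypothetical stand-in for one row of the test dataframe;
  // Option models the nullable att2 column.
  final case class Row(item: String, att1: String, att2: Option[String])

  val rows: Seq[Row] = Seq(
    Row("1", "a", Some("f")),
    Row("2", "b", Some("d")),
    Row("3", "a", None),
    Row("4", "a", Some("f")),
    Row("5", "b", None),
    Row("6", "a", Some("f"))
  )

  // A row is marked true only when it passes the filter (att1 = "a")
  // AND att2 is present. Filtered-out rows therefore look like failures.
  def rowLevel(row: Row): Boolean =
    row.att1 == "a" && row.att2.isDefined

  val observed: Seq[Boolean] = rows.map(rowLevel)
}
```

Under this reading, a `false` in the row-level column cannot distinguish "excluded by the filter" from "failed the check", which is the ambiguity the question is about.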
from deequ.
Thanks for your feedback @eapframework,
We're working through different use cases for different users and I'm planning a PR for this soon. We're planning on providing a configuration so users can set filtered rules as Null or True - so setting this configuration to True should meet your use-case. I'll tag you on the PR once we have that out as well.
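To make the two proposed settings concrete, here is a rough plain-Scala sketch of how a filtered row could be reported as either true or null. Everything in it (`FilteredOutcomeSketch`, `FilteredOutcome`, `AsTrue`, `AsNull`, `rowLevel`) is a hypothetical name chosen for illustration and should not be taken as the configuration that the PR actually exposes.

```scala
object FilteredOutcomeSketch {
  // Hypothetical setting: how rows excluded by the filter appear
  // in the row-level output.
  sealed trait FilteredOutcome
  case object AsTrue extends FilteredOutcome
  case object AsNull extends FilteredOutcome

  // Option models the nullable att2 column.
  final case class Row(item: String, att1: String, att2: Option[String])

  val rows: Seq[Row] = Seq(
    Row("1", "a", Some("f")),
    Row("2", "b", Some("d")),
    Row("3", "a", None),
    Row("4", "a", Some("f")),
    Row("5", "b", None),
    Row("6", "a", Some("f"))
  )

  // None in the result stands in for a SQL null in the row-level column.
  def rowLevel(row: Row, outcome: FilteredOutcome): Option[Boolean] = {
    val passesFilter = row.att1 == "a"
    if (passesFilter) Some(row.att2.isDefined) // the check applies to this row
    else outcome match {                       // the row was filtered out
      case AsTrue => Some(true)
      case AsNull => None
    }
  }

  val asTrue: Seq[Option[Boolean]] = rows.map(rowLevel(_, AsTrue))
  val asNull: Seq[Option[Boolean]] = rows.map(rowLevel(_, AsNull))
}
```

With `AsTrue`, rows 2 and 5 are reported as true and only row 3 is false. With `AsNull`, rows 2 and 5 carry no verdict, so a consumer can still tell filtered rows apart from passing ones.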
Thanks for your response. This is the same issue I am facing.
I am expecting rows 2 and 5 to show true because those are not failed records.
Expected result:
+----+----+----+-----+
|item|att1|att2|rule1|
+----+----+----+-----+
|   1|   a|   f| true|
|   2|   b|   d| true|
|   3|   a|null|false|
|   4|   a|   f| true|
|   5|   b|null| true|
|   6|   a|   f| true|
+----+----+----+-----+
Hi @eapframework, we've merged PR #532, which addresses this issue for the Uniqueness and Completeness analyzers, and we have another one open for the other analyzers: #535
Please let us know if you have any feedback on these PRs and add comments or open a PR if this doesn't quite meet your use-case.