Comments (7)
After much thought, the solution to the problem of specifying additional columns in a `Column` `Check` was to introduce a `groupby` argument to the `Check` class. This enables the user to group a `Column` of interest by some other column in the dataframe, with some reasonable guard rails.
If we went with the `depends_on` solution proposed in https://github.com/cosmicBboy/pandera/issues/16#issuecomment-494469311, the API would be too generic, in that the `deps` dataframe could be used to make arbitrary assertions, which isn't quite the desired behavior: what we want is to validate properties of the `Column` of interest, not of the dependency columns. This use case isn't necessarily "bad", but it can be enabled by something like #14.
The `groupby` argument groups the `Column` of interest into a `dict` mapping each discrete group key to a `Series`, which lets the user assert properties about subsets of the `Column` within each group.
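
For example, an illustrative sketch of the resulting API (the column names and data here are made up):

```python
import pandas as pd
from pandera import Check, Column, DataFrameSchema, Float, String

schema = DataFrameSchema({
    # assert a property of "height" within each discrete group of "group"
    "height": Column(
        Float,
        Check(
            lambda groups: groups["A"].mean() > groups["B"].mean(),
            groupby="group",
        ),
    ),
    "group": Column(String),
})

schema.validate(pd.DataFrame({
    "height": [6.4, 5.6, 7.1, 4.0],
    "group": ["A", "B", "A", "B"],
}))
```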
---
What's your view on handling `nullable`, `strict`, and `allow_duplicates` when supplying a tuple of columns for checking? Do you see something like this?
```python
DataFrameSchema({
    "a": Column(Float, nullable=True),
    ("b", "c", "d"): Columns(
        (
            Column(Int, nullable=True, strict=True),
            Column(Float, nullable=True, allow_duplicates=True),
            Column(String, nullable=True, strict=False),
        ),
        Check(lambda b, c, d: b * c == d > 10),
    ),
    "e": Column(Bool, nullable=True),
})
```
---
This is an interesting problem... I'm wondering if it would make more sense to extend the `Check` class so that specifying a tuple of column names that the column depends on modifies the signature of the check function, like so:
```python
DataFrameSchema({
    "a": Column(Float, nullable=True, checks=[
        Check(lambda a: a > 0),
        Check(
            lambda a, b, c: a[(b == "foo") & (c == "bar")] > 10,
            depends_on=("b", "c"),
        ),
    ]),
    "b": Column(String, nullable=True),
    "c": Column(String, nullable=True),
})
```
This way the user only needs to specify column properties once and modify the function signature of a `Check`. I'm concerned that my original proposal would make schemas more convoluted to read.
One problem with the above proposal that I haven't thought about too much is whether the order of execution of `Check`s matters. Before, each check was atomic in the sense that it depended on only one column... but now with `depends_on`, I wonder whether under the hood we'd need to run the checks on `"b"` and `"c"` before executing the check `a[(b == "foo") & (c == "bar")] > 10`.
We can probably check at schema initialization whether there are circular dependencies, and error out immediately if there are any, e.g.:
```python
DataFrameSchema({
    "a": Column(Float, nullable=True, checks=[
        Check(lambda a, b: a[b == "foo"] > 10, depends_on=("b",)),
    ]),
    "b": Column(String, nullable=True, checks=[
        Check(lambda b, a: b[a > 5] == "bar", depends_on=("a",)),
    ]),
})
# SchemaError: circular dependency in columns "a" and "b"
```
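
For what it's worth, a dependency graph over the columns would address both the execution-order question and the circular-dependency check: a depth-first topological sort yields a valid order and fails fast on cycles. A minimal standalone sketch (the step that builds the graph from the `depends_on` kwargs is assumed):

```python
def check_execution_order(depends_on_graph):
    """Return a valid check-execution order for the columns, or raise on a cycle.

    `depends_on_graph` maps each column name to the columns its checks
    depend on, e.g. {"a": ["b"], "b": ["a"]} for the circular case above.
    """
    order, visiting, done = [], set(), set()

    def visit(col):
        if col in done:
            return
        if col in visiting:
            # pandera would raise a schema error here instead
            raise ValueError('circular dependency involving column "%s"' % col)
        visiting.add(col)
        for dep in depends_on_graph.get(col, ()):
            visit(dep)
        visiting.discard(col)
        done.add(col)
        order.append(col)  # dependencies are appended before dependents

    for col in depends_on_graph:
        visit(col)
    return order
```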
---
Specifying `depends_on` before those columns are defined seems a bit odd on first read (it makes me think of the classic IDE error "variable referenced before assignment").
Maybe the use of `b` in the check of `a` could be handled implicitly, i.e. the fact that `b` exists is enough for building a `depends_on` dependency graph behind the scenes?
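
For illustration, the dependency names could in principle be read off the check function's signature itself; a minimal sketch of that implicit approach (not a proposed pandera API, just the mechanics):

```python
import inspect

def implicit_depends_on(check_fn):
    """Treat every argument after the first as a column dependency."""
    params = list(inspect.signature(check_fn).parameters)
    return tuple(params[1:])  # the first argument is the primary column

implicit_depends_on(lambda a, b, c: a[(b == "foo") & (c == "bar")] > 10)
# -> ("b", "c")
```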
That still suffers from the readability issue.
One option could be an entirely separate `checks` list, separate from `columns`, for readability, something like this:
```python
DataFrameSchema(
    columns={
        "a": Column(Float, nullable=True),
        "b": Column(String, nullable=True),
        "c": Column(String, nullable=True),
    },
    checks=[
        Check(lambda a: a > 0),
        Check(lambda a, b, c: a[(b == "foo") & (c == "bar")] > 10),
    ],
)
```
---
The separate `checks` idea is challenging because I think checks should be bound to the data provided by the `columns` keys. There really isn't a straightforward way to let a `Check` know what its function signature should be besides associating the `Check` with column names in an explicit way.
Really, the intent behind checks that involve multiple columns is to enable things like hypothesis testing (https://github.com/cosmicBboy/pandera/issues/17), which I think is the strongest use case for a solution where additional column dependencies are specified at the `Check` level, one possible proposal being the `depends_on` kwarg.
Basically, the backend implementation of the `one_sided_t_test` described in #17 would be something like:
```python
from scipy import stats

def two_sample_one_sided_ttest(group1, group2, relationship="gt", alpha=0.05, **kwargs):
    ttest = stats.ttest_ind(group1, group2, **kwargs)
    is_significant = ttest.pvalue / 2 < alpha
    # `relationship` refers to the relationship of group1 w.r.t. group2
    if relationship == "gt":
        return ttest.statistic > 0 and is_significant
    elif relationship == "lt":
        return ttest.statistic < 0 and is_significant
    else:
        raise ValueError("relationship %s not recognized" % relationship)
```
```python
DataFrameSchema({
    "weight": Column(
        Float,
        checks=[
            Check(
                lambda w, s: two_sample_one_sided_ttest(w[s == "men"], w[s == "women"]),
                depends_on=("sex",),
            )
        ],
    ),
    "sex": Column(String, Check(lambda s: s.isin(["men", "women"]).all())),
})
```
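
For illustration, the helper itself can be exercised on synthetic samples, independently of the proposed `depends_on` API (the data below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
men = rng.normal(loc=80, scale=10, size=100)    # hypothetical weights in kg
women = rng.normal(loc=65, scale=10, size=100)  # hypothetical weights in kg

# men's weights are expected to be greater than women's ("gt" relationship)
assert two_sample_one_sided_ttest(men, women, relationship="gt")
```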
> Specifying `depends_on` before those columns are defined seems a bit odd on first read (it makes me think of the classic IDE error "variable referenced before assignment").
I think this makes sense for imperative programming models, but the goal with pandera is that the schema definition is (somewhat) declarative, such that the properties of a dataframe are asserted up front and the details of how validation happens are abstracted away from the user... I say "somewhat" because the validators are still fairly low-level, e.g. asserting that the values of a column are positive via `lambda s: s > 0`.
In this way we're not really defining columns in an imperative way; we're declaring that a `Check` associated with the column `weight` depends on this other column `sex`. What are the properties of the `sex` column? Well, the user would be expected to define those properties with `{"sex": Column(...)}`.
Not sure if this thought process makes sense... we can continue mulling this over, but two of the things I'd like to shoot for are:
- keeping the API as clear and concise as possible
- discouraging an API that leads to over-specification of column properties
---
I like your thinking: column-wise checks follow the logic of "this column depends on these others", and that fits the primary use case. The syntax gets clunky if a user writes too many overlapping checks, but that can easily be handled outside of pandera if required.
I looked at implementing it, and one observation: if the user made a mistake in specifying the `lambda` function or `depends_on`, it would be difficult to diagnose/debug, e.g.:
```python
DataFrameSchema({
    "weight": Column(
        Float,
        checks=[
            Check(
                lambda w, s, h, a: w + s + a + h,
                depends_on=("sex", "age", "height"),
            )
        ],
    )
})
```
In the above function, the positional order of `depends_on` binds `h` to the `age` column and `a` to the `height` column, which is probably not what the user intended. Not sure if or how to make this better.
One possible implementation leaves open the possibility of changing the syntax relatively flexibly in the future: passing a dataframe through `SeriesSchemaBase` and into `Check`.
So the `__call__` of `Check` goes from:

```python
def __call__(self, parent_schema, series, index):
```

to this:

```python
def __call__(self, parent_schema, dataframe, index):
```
To make this work, the whole dataframe could be passed into `SeriesSchemaBase`, and the existing code adapted to reference only the named column, i.e. using `df[self._name]` where `Column` currently uses the series.
Then the way the checks are called in `SeriesSchemaBase` would change from:

```python
for i, check in enumerate(self._checks):
    check_results.append(check(self, series, i))
```

into something like:

```python
for i, check in enumerate(self._checks):
    check_results.append(check(self, df[[self._name, *check.depends_on]], i))
```
Effectively, passing a DataFrame into `Check` (either the subsetted dataframe or the full dataframe) instead of multiple series leaves open the possibility of extending `DataFrameSchema` to also have checks, or of changing the syntax.
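
A rough sketch of what the dataframe-based `Check.__call__` could look like under this proposal (the attribute names `_fn` and `depends_on` are assumptions for illustration):

```python
class Check:
    def __init__(self, fn, depends_on=()):
        self._fn = fn  # user-supplied check function
        self.depends_on = tuple(depends_on)

    def __call__(self, parent_schema, dataframe, index):
        # the primary column comes first, dependency columns follow positionally
        primary = dataframe[parent_schema._name]
        deps = [dataframe[col] for col in self.depends_on]
        return self._fn(primary, *deps)
```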
---
> In the above function, the positional order of `depends_on` binds `h` to the `age` column and `a` to the `height` column, which is probably not what the user intended. Not sure if or how to make this better.
Good point. I do like the direction of making the API for specifying checks clearer, and I'd like to maintain the separation between the primary column (the one specified in the `columns` key) and the `depends_on` columns. For example:
```python
DataFrameSchema({
    "weight": Column(
        Float,
        checks=[
            Check(
                lambda w, d: w + d["sex"] + d["age"] + d["height"],
                depends_on=("sex", "age", "height"),
            )
        ],
    )
})
```
Basically, specifying `depends_on` modifies the expected `Check` function signature to `(series, dataframe) -> bool | pd.Series[bool]`, where `dataframe` contains the columns specified in `depends_on`.
I like this because `weight` is still a first-class argument to the check function, while `sex`, `age`, and `height` are contained within the second argument. It's then the user's responsibility to use the contents of the dependencies dataframe to make an assertion about the primary column.
On a side note, more complex `check` functions (with many dependencies and complex logic) should probably be defined beforehand with the `def myfunc(x, df)` pattern.
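
For example, a minimal sketch under the proposed `(series, dataframe)` signature (the helper and column properties are made up):

```python
def weight_consistent_with_height(weight, deps):
    """Weights must be positive wherever a height was recorded."""
    recorded = deps["height"].notna()
    return weight[recorded] > 0

DataFrameSchema({
    "weight": Column(
        Float,
        checks=[Check(weight_consistent_with_height, depends_on=("height",))],
    ),
    "height": Column(Float, nullable=True),
})
```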