Comments (7)

cosmicBboy commented on May 22, 2024

After much thought, the solution to the problem of specifying additional columns in a Column Check was to introduce a groupby argument to the Check class.

The reason is that it lets the user group a Column of interest by some other column in the dataframe, with some reasonable guard rails.

If we went with the depends_on solution proposed in https://github.com/cosmicBboy/pandera/issues/16#issuecomment-494469311, the API would be too generic: the deps dataframe could be used to make arbitrary assertions, which isn't quite the desired behavior. What we want is to validate properties of the Column of interest, not of the dependency columns. This use case isn't necessarily "bad", but it can be enabled by something like #14.

The groupby argument groups the Column of interest into a dict mapping each discrete group key to a Series of that group's values. This lets the user assert properties about subsets of the Column by the desired group.
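
The grouping semantics can be illustrated in plain pandas, independent of pandera's internals (the column names here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "height": [5.6, 6.4, 4.0, 7.1],
    "group": ["A", "A", "B", "B"],
})

# group the column of interest ("height") by another column ("group"),
# yielding a dict that maps each discrete group key to a Series
grouped = {key: series for key, series in df.groupby("group")["height"]}

# a check function can then assert properties about each subset,
# e.g. that every group has a positive mean
assert all(series.mean() > 0 for series in grouped.values())
```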

mastersplinter commented on May 22, 2024

What's your view on handling nullable, strict, and allow_duplicates when supplying a tuple of columns for checking?

Do you see something like this?:

DataFrameSchema({
    "a": Column(Float, nullable=True),
    ("b", "c", "d"): Columns(
        (
            (Int, nullable=True, strict=True),
            (Float, nullable=True, allow_duplicates=True),
            (String, nullable=True, strict=False),
        ),
        Check(lambda b, c, d: b * c == d > 10),
    ),
    "e": Column(Bool, nullable=True),
})

cosmicBboy commented on May 22, 2024

This is an interesting problem... I'm wondering if it would make more sense to extend the Check class so that specifying a tuple of column names the column depends_on modifies the signature of the check function, like so:

DataFrameSchema({
    "a": Column(Float, nullable=True, checks=[
        Check(lambda a: a > 0),
        Check(lambda a, b, c: a[(b == "foo") & (c == "bar")] > 10, depends_on=("b", "c"))
    ]),
    "b": Column(String, nullable=True),
    "c": Column(String, nullable=True),
})
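
The intended semantics of that second check, written out in plain pandas (an illustration only, not pandera internals):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [11.0, 2.0, 15.0],
    "b": ["foo", "foo", "baz"],
    "c": ["bar", "bar", "bar"],
})

# a[(b == "foo") & (c == "bar")] > 10 only inspects the rows
# where both dependency conditions hold (rows 0 and 1 here)
mask = (df["b"] == "foo") & (df["c"] == "bar")
result = df["a"][mask] > 10

# row 1 has a == 2.0, so the check fails there
assert list(result) == [True, False]
```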

This way the user only needs to specify column properties once, and only the function signature of the Check changes.

I'm concerned that my original proposal will make schemas more convoluted to read.

One problem with the above proposal that I haven't thought about too much is whether the order of execution of Checks matters. Before, each check was atomic in the sense that it only depended on one column... but now with depends_on, I wonder if under the hood we'd need to run the checks on "b" and "c" before executing the check a[(b == "foo") & (c == "bar")] > 10.

We can probably check at schema initialization whether there are circular dependencies, and error out immediately if there are any, e.g.

DataFrameSchema({
    "a": Column(Float, nullable=True, checks=[
        Check(lambda a, b: a[(b == "foo")] > 10, depends_on=("b",))
    ]),
    "b": Column(String, nullable=True, checks=[
        Check(lambda b, a: b[(a > 5)] == "bar", depends_on=("a",))
    ]),
})

# SchemaError: circular dependency in columns "a" and "b"
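
A minimal sketch of that initialization-time check, treating the declared depends_on pairs as a directed graph and running a depth-first search for back edges (the dict-of-dependencies representation is an assumption, not pandera's actual data model):

```python
def has_circular_dependency(deps):
    """deps maps each column name to the columns its checks depend on."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / in progress / done
    color = {col: WHITE for col in deps}

    def visit(col):
        color[col] = GRAY
        for dep in deps.get(col, ()):
            if color.get(dep, BLACK) == GRAY:
                return True  # back edge => cycle
            if color.get(dep) == WHITE and visit(dep):
                return True
        color[col] = BLACK
        return False

    return any(color[col] == WHITE and visit(col) for col in deps)

# the schema above declares "a" -> "b" and "b" -> "a": a cycle
assert has_circular_dependency({"a": ["b"], "b": ["a"]})
assert not has_circular_dependency({"a": ["b"], "b": []})
```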

mastersplinter commented on May 22, 2024

Specifying depends_on before those columns are defined seems a bit odd on first read (it makes me think of the classic IDE error "variable referenced before assignment").

Maybe the use of b in the check of a could be handled implicitly, i.e. the fact that b exists is enough for building a depends_on dependency graph behind the scenes?

That still suffers from the readability issue.

One option could be a checks list kept entirely separate from the columns for readability, something like this:

DataFrameSchema(
    columns={
        "a": Column(Float, nullable=True),
        "b": Column(String, nullable=True),
        "c": Column(String, nullable=True)
    },
    checks=[
        Check(lambda a: a > 0),
        Check(lambda a, b, c: a[(b == "foo") & (c == "bar")] > 10)
    ]
)
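
For dataframe-level checks like these, the check would presumably have to infer which columns it needs from the lambda's argument names; a hypothetical dispatch sketch (run_df_check is not a pandera function):

```python
import inspect

import pandas as pd

def run_df_check(check_fn, df):
    # infer the columns a check needs from its argument names,
    # then call it with the matching Series from the dataframe
    columns = list(inspect.signature(check_fn).parameters)
    return check_fn(*(df[name] for name in columns))

df = pd.DataFrame({
    "a": [11.0, 12.0],
    "b": ["foo", "foo"],
    "c": ["bar", "bar"],
})
assert run_df_check(lambda a: (a > 0).all(), df)
assert run_df_check(lambda a, b, c: (a[(b == "foo") & (c == "bar")] > 10).all(), df)
```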

cosmicBboy commented on May 22, 2024

The separate checks idea is challenging because I think checks should be bound to the data provided by the columns keys. There really isn't a straightforward way to let a Check know what its function signature should be besides associating it with column names in an explicit way.

Really, the intent behind checks that involve multiple columns is to enable things like hypothesis testing (https://github.com/cosmicBboy/pandera/issues/17). I think that's the strongest use case for specifying additional column dependencies at the Check level, one possible proposal being the depends_on kwarg.

Basically the backend implementation of the one_sided_t_test described in #17 would be something like:

from scipy import stats

def two_sample_one_sided_ttest(group1, group2, relationship="gt", alpha=0.05, **kwargs):
    ttest = stats.ttest_ind(group1, group2, **kwargs)
    is_significant = ttest.pvalue / 2 < alpha
    # `relationship` refers to the relationship of group1 w.r.t. group2
    if relationship == "gt":
        return ttest.statistic > 0 and is_significant
    elif relationship == "lt":
        return ttest.statistic < 0 and is_significant
    else:
        raise ValueError("relationship %s not recognized" % relationship)

DataFrameSchema({
    "weight": Column(
        Float,
        checks=[
            Check(
                lambda w, s: two_sample_one_sided_ttest(w[s == "men"], w[s == "women"]),
                depends_on=("sex",)
            )
        ]
    ),
    "sex": Column(String, Check(lambda s: s.isin(["men", "women"]).all()))
})
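
As a sanity check on the helper's logic with clearly separated samples (illustrative data only): for relationship="gt" the statistic should be positive and the one-sided p-value small.

```python
from scipy import stats

# group1 is plainly greater than group2, so the "gt" relationship
# should hold: positive t-statistic and a significant one-sided p-value
group1 = [10.0, 11.0, 12.0, 13.0]
group2 = [1.0, 2.0, 3.0, 4.0]
ttest = stats.ttest_ind(group1, group2)
assert ttest.statistic > 0
assert ttest.pvalue / 2 < 0.05
```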

Specifying depends_on before those columns are defined seems a bit odd on first read (it makes me think of the classic IDE error "variable referenced before assignment").

I think this makes sense for imperative programming models, but the goal with pandera is that the schema definition is (somewhat) declarative: the properties of a dataframe are asserted up front, and the details of how validation happens are abstracted away from the user. I say somewhat because the validators are still fairly low-level, e.g. asserting that the values of a column are positive via lambda s: s > 0.

In this way we're not really defining columns imperatively; we're declaring that a Check associated with the column weight depends on this other column sex. What are the properties of the sex column? The user would be expected to define them with {"sex": Column(...)}.

Not sure if this thought process makes sense... we can continue mulling this over, but two of the things I'd like to shoot for are:

  • keeping the API as clear and concise as possible
  • discouraging an API that leads to over-specification of column properties

mastersplinter commented on May 22, 2024

I like your thinking: column-wise checks follow the logic of "this column depends on these others", which fits the primary use case. The syntax gets clunky if a user writes too many overlapping checks, but that can easily be handled outside of pandera if required.

I looked at implementing it, and one observation: if the user made a mistake in specifying the lambda function or depends_on, it would be difficult to diagnose/debug, e.g.:

DataFrameSchema({
    "weight": Column(
        Float,
        checks=[
            Check(
                lambda w, s, h, a: w+s+a+h,
                depends_on=("sex","age","height")
            )
        ]
    )
})

In the above function, the positional order of depends_on sets h=age and a=height, which is probably not what the user intended (presumably h stands for height and a for age). Not sure if or how to make this better.

One possible implementation leaves open the possibility of changing the syntax relatively flexibly in future: pass the dataframe through SeriesSchemaBase and into Check.

So the call of Check goes from:

    def __call__(self, parent_schema, series, index):

and becomes this:

    def __call__(self, parent_schema, dataframe, index):

To make this work, the whole dataframe could be passed into SeriesSchemaBase, and the existing code adapted to reference only the named column, i.e. using df[self._name] where the column's series is currently used.

Then modifying the way the checks are called in SeriesSchemaBase from:

        for i, check in enumerate(self._checks):
            check_results.append(check(self, series, i))

into something like:

        for i, check in enumerate(self._checks):
            check_results.append(check(self, df[[this_column, *depends_on]], i))

Effectively, passing a DataFrame into Check (either the subsetted dataframe or the full one) instead of multiple series leaves open the possibility of extending DataFrameSchema to also have checks, or of changing the syntax later.

cosmicBboy commented on May 22, 2024

In the above function, the positional order of depends_on sets h=age and a=height, which is probably not what the user intended. Not sure if or how to make this better.

Good point. I do like the direction of making the API for specifying checks clearer, and I'd like to maintain the separation between the primary column (the one specified in the columns key) and the depends_on columns.

For example:

DataFrameSchema({
    "weight": Column(
        Float,
        checks=[
            Check(
                lambda w, d: w + d["sex"] + d["age"] + d["height"],
                depends_on=("sex","age","height")
            )
        ]
    )
})

Basically, specifying depends_on modifies the expected Check function signature to:

(series, dataframe) -> bool|pd.Series[bool]

where dataframe contains the columns specified in depends_on.

I like this because weight is still a first-class argument to the check function and sex, age, height are contained within the second argument.

It's then the user's responsibility to use the contents in the dependencies dataframe to make an assertion about the primary column.

On a side note, more complex check functions (with many dependencies and complex logic) should probably be defined beforehand with the def myfunc(x, df) pattern.
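
For example, a named check under the proposed (series, dataframe) contract might look like this (a sketch of the proposed API shape; weight_by_sex_positive is a hypothetical function, not part of pandera):

```python
import pandas as pd

# hypothetical check following the proposed
# (series, dataframe) -> bool | pd.Series[bool] contract:
# the primary column arrives as a Series, the depends_on
# columns arrive together in a DataFrame
def weight_by_sex_positive(weight, deps):
    men = weight[deps["sex"] == "men"]
    women = weight[deps["sex"] == "women"]
    return men.mean() > 0 and women.mean() > 0

weight = pd.Series([70.0, 80.0, 55.0, 60.0])
deps = pd.DataFrame({"sex": ["men", "men", "women", "women"]})
assert weight_by_sex_positive(weight, deps)
```

Under the proposal, this would presumably be attached as Check(weight_by_sex_positive, depends_on=("sex",)).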
