Describe the bug Calling schema.example

Hypothesis examples are all the same about pandera HOT 7 CLOSED

tmcclintock commented on July 19, 2024

Hypothesis examples are all the same

from pandera.

Comments (7)

cosmicBboy commented on July 19, 2024

Okay, so it seems like generating smaller dataframes yields higher entropy results:

print(schema.example(size=5))

# generates different datasets
               column1  column2        column3 column4
0                  152        1   9.007199e+15     BBB
1  9223372036854775807        1   1.192093e-07     CCC
2  4148323564460896226       56   6.189641e+16     BBB
3                  123       83   6.103516e-05     CCC
4                32240        2  1.112537e-308     BBB

print(schema.example(size=10))

# we see this consistently
   column1  column2  column3 column4
0    31078        1      0.0     AAA
1        0        1      0.0     AAA
2        0        1      0.0     AAA
3        0        1      0.0     AAA
4        0        1      0.0     AAA
5        0        1      0.0     AAA
6        0        1      0.0     AAA
7        0        1      0.0     AAA
8        0        1      0.0     AAA
9        0        1      0.0     AAA

@tmcclintock my recommendations would be:

Generate a bunch of smaller dataframes and concat them; dataframes of about size 5 seem to be the magic number.
Restrict your schemas to have only one check per column (this is pretty unreasonable though).
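
The first recommendation could be sketched roughly as follows. This is only an illustration: `chunk_sizes` and `example_in_chunks` are hypothetical helper names, not part of pandera, and the default chunk size of 5 is just the "magic number" observed above rather than a documented threshold.

```python
def chunk_sizes(total, chunk=5):
    """Split `total` rows into pieces of at most `chunk` rows."""
    full, rest = divmod(total, chunk)
    return [chunk] * full + ([rest] if rest else [])


def example_in_chunks(schema, total, chunk=5):
    """Build a `total`-row frame from several small schema.example() calls.

    `schema` is assumed to be a pandera DataFrameSchema (hypothetical helper).
    """
    import pandas as pd  # imported lazily so chunk_sizes works without pandas

    if total <= 0:
        # pd.concat rejects an empty list, so ask for an empty frame directly.
        return schema.example(size=0)

    parts = [schema.example(size=n) for n in chunk_sizes(total, chunk)]
    # ignore_index=True renumbers rows 0..total-1 across the chunks.
    return pd.concat(parts, ignore_index=True)
```

With the schema from this thread, `example_in_chunks(schema, 20)` would call `schema.example(size=5)` four times and stack the results. Each call is still subject to the same `.example()` caveats, so this is a workaround for interactive use, not a fix.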

@Zac-HD any ideas on how to address this? On the pandera side, it would make sense to collect all the schema statistics and combine them all into a single element strategy so we don't have to rely on filter, but that'll require a larger refactoring project.


cosmicBboy commented on July 19, 2024

Looks like this is an issue with the way pandera strategies try to chain together multiple checks. For example, using a single check per column:

from pandera import Check, Column, DataFrameSchema

schema = DataFrameSchema(
    {
        "column1": Column(int, Check.ge(0)),
        "column2": Column(int, [Check.in_range(1, 100)]),  # 👈 use a single in_range check instead of ge and le
        "column3": Column(float, Check.ge(0)),
        "column4": Column(str, Check.isin(["AAA", "BBB", "CCC"])),
    }
)

produces

6.100.1 0.0.0+dev0
                column1  column2        column3 column4
0                     0        1   3.402823e+38     AAA
1                     0        1   2.882304e+16     CCC
2                     0        6   2.000010e+00     BBB
3                   247       47   9.999900e-01     BBB
4                 19526       50  1.390036e+164     AAA
5                 56223       63  2.225074e-308     AAA
6                    42       15   7.357397e+15     BBB
7                    97       62   9.999900e-01     CCC
8                     0       69   3.293796e+09     AAA
9   9216616637413720064        4   1.000000e+07     AAA
10    23090105669335094       14   5.397605e-78     CCC
11                    0       50   1.192093e-07     CCC
12           1260840409       98   1.500000e+00     AAA
13                21966       68   1.100000e+00     AAA
14                23289       21   3.333333e-01     CCC
15   912854047966763290       27   6.519203e+16     BBB
16  8876389219764502267        9  5.706631e-178     CCC
17                40004       40   1.500000e+00     CCC
18                  247       77   5.742309e+16     BBB
19                47285       17   1.175494e-38     AAA


Zac-HD commented on July 19, 2024

Check whether you see more-diverse outputs if you actually run the test. Strategies' .example() method is often biased toward simpler values (for complicated internal reasons), and dataframes are typically 'sparse' as well, so you might get a fill-value and then few-or-no other values.
Eventually you're going to have to do that project, yeah. The filter-rewriting should be able to handle this case though, so I suspect that there's a simpler fix for this specific issue somewhere in Pandera.


tmcclintock commented on July 19, 2024

Thanks, both. Feel free to close this issue if you feel like it. FWIW, IMO the .example() API of pandera is one of its strongest features. I'd love for it to be performant one day!


cosmicBboy commented on July 19, 2024

It might make sense to bring back the warning that hypothesis raises with example. It's really meant more for interactively debugging and examining strategies, and not for any serious production context. The intended use of it really is as demonstrated here https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html#usage-in-unit-tests.


tmcclintock commented on July 19, 2024

Yup, that's what I use it for. Pandera lets me mock entire machine learning pipelines. The issue is, usually I want more than 5 rows of mock data :).


cosmicBboy commented on July 19, 2024

Closing this issue. @tmcclintock FYI, I created #1625 to articulate what would be needed to improve the performance of pandera strategies overall.
