Comments (7)
Okay, so it seems like generating smaller dataframes yields higher entropy results:
print(schema.example(size=5))
# generates different datasets
column1 column2 column3 column4
0 152 1 9.007199e+15 BBB
1 9223372036854775807 1 1.192093e-07 CCC
2 4148323564460896226 56 6.189641e+16 BBB
3 123 83 6.103516e-05 CCC
4 32240 2 1.112537e-308 BBB
print(schema.example(size=10))
# we see this consistently
column1 column2 column3 column4
0 31078 1 0.0 AAA
1 0 1 0.0 AAA
2 0 1 0.0 AAA
3 0 1 0.0 AAA
4 0 1 0.0 AAA
5 0 1 0.0 AAA
6 0 1 0.0 AAA
7 0 1 0.0 AAA
8 0 1 0.0 AAA
9 0 1 0.0 AAA
@tmcclintock recommendations would be:
- generate a bunch of smaller dataframes and concat them; about 5 rows per dataframe seems to be the magic number.
- restrict your schemas to have only one check (this is pretty unreasonable though).
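A minimal sketch of the first recommendation, assuming pandas is installed. The helper name `example_in_chunks` and its parameters are hypothetical and not part of pandera; `make_chunk` stands in for whatever produces a small dataframe, for instance `lambda n: schema.example(size=n)`.

```python
from typing import Callable

import pandas as pd


def example_in_chunks(
    make_chunk: Callable[[int], pd.DataFrame],
    total_rows: int,
    chunk_size: int = 5,
) -> pd.DataFrame:
    """Build a frame of ``total_rows`` rows out of several small generated chunks.

    ``make_chunk`` is a hypothetical callable that returns a dataframe with the
    requested number of rows, e.g. ``lambda n: schema.example(size=n)``.
    """
    if total_rows < 0 or chunk_size <= 0:
        raise ValueError("total_rows must be >= 0 and chunk_size must be > 0")

    chunks = []
    remaining = total_rows
    while remaining > 0:
        n = min(chunk_size, remaining)  # the last chunk may be smaller
        chunks.append(make_chunk(n))
        remaining -= n

    if not chunks:
        # Nothing requested: ask the generator for an empty frame.
        return make_chunk(0)

    # ignore_index=True renumbers the rows 0..total_rows-1.
    return pd.concat(chunks, ignore_index=True)
```

Usage would look something like `example_in_chunks(lambda n: schema.example(size=n), total_rows=20)`. Note that constraints spanning the whole frame (uniqueness, for example) are only enforced per chunk here, so the concatenated result may need re-validating.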
@Zac-HD any ideas on how to address this? On the pandera side, it would make sense to collect all the schema statistics and combine them all into a single element strategy so we don't have to rely on filter, but that'll require a larger refactoring project.
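To illustrate the idea of combining check statistics rather than chaining filters, here is a rough sketch in plain Python. It is not how pandera is implemented: `combine_bounds` and the `("ge", value)` / `("le", value)` tuple format are hypothetical, chosen only to show several bound checks collapsing into one range.

```python
from typing import Iterable, Optional, Tuple

Bound = Optional[float]


def combine_bounds(checks: Iterable[Tuple[str, float]]) -> Tuple[Bound, Bound]:
    """Fold ("ge", x) / ("le", x) pairs into one (min_value, max_value) range."""
    lower: Bound = None
    upper: Bound = None
    for kind, value in checks:
        if kind == "ge":
            # Keep the tightest lower bound seen so far.
            lower = value if lower is None else max(lower, value)
        elif kind == "le":
            # Keep the tightest upper bound seen so far.
            upper = value if upper is None else min(upper, value)
        else:
            raise ValueError(f"unsupported check kind: {kind!r}")
    if lower is not None and upper is not None and lower > upper:
        raise ValueError("checks are contradictory: no value satisfies them")
    return lower, upper


# Two chained checks collapse into a single range. A bounded strategy such as
# hypothesis.strategies.integers(min_value=..., max_value=...) could then draw
# from that range directly instead of filtering rejected values.
min_value, max_value = combine_bounds([("ge", 1), ("le", 100)])
```

Drawing from one bounded range avoids generating values only to discard them, which is where chained filters tend to lose both speed and diversity.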
from pandera.
Looks like this is an issue with the way pandera strategies try to chain together multiple checks. For example, using a single check per column:
from pandera import Check, Column, DataFrameSchema

schema = DataFrameSchema(
    {
        "column1": Column(int, Check.ge(0)),
        "column2": Column(int, [Check.in_range(1, 100)]),  # 👈 use a single in_range check instead of ge and le
        "column3": Column(float, Check.ge(0)),
        "column4": Column(str, Check.isin(["AAA", "BBB", "CCC"])),
    }
)
produces
6.100.1 0.0.0+dev0
column1 column2 column3 column4
0 0 1 3.402823e+38 AAA
1 0 1 2.882304e+16 CCC
2 0 6 2.000010e+00 BBB
3 247 47 9.999900e-01 BBB
4 19526 50 1.390036e+164 AAA
5 56223 63 2.225074e-308 AAA
6 42 15 7.357397e+15 BBB
7 97 62 9.999900e-01 CCC
8 0 69 3.293796e+09 AAA
9 9216616637413720064 4 1.000000e+07 AAA
10 23090105669335094 14 5.397605e-78 CCC
11 0 50 1.192093e-07 CCC
12 1260840409 98 1.500000e+00 AAA
13 21966 68 1.100000e+00 AAA
14 23289 21 3.333333e-01 CCC
15 912854047966763290 27 6.519203e+16 BBB
16 8876389219764502267 9 5.706631e-178 CCC
17 40004 40 1.500000e+00 CCC
18 247 77 5.742309e+16 BBB
19 47285 17 1.175494e-38 AAA
- Check whether you see more-diverse outputs if you actually run the test? Strategies' .example() method often biases toward simpler examples (for complicated internal reasons), and dataframes are typically 'sparse' as well - so you might get a fill-value and then few-or-no other values.
- Eventually you're going to have to do that project, yeah. The filter-rewriting should be able to handle this case though, so I suspect that there's a simpler fix for this specific issue somewhere in Pandera.
Thanks, both. Feel free to close this issue if you feel like it. FWIW, IMO the .example() API of pandera is one of its strongest features. I'd love for it to be performant one day!
It might make sense to bring back the warning that hypothesis raises with example. It's really meant more for interactively debugging and examining strategies, and not for any serious production context. The intended use of it really is as demonstrated here https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html#usage-in-unit-tests.
Yup, that's what I use it for. Pandera lets me mock entire machine learning pipelines. The issue is, usually I want more than 5 rows of mock data :).
Closing this issue. @tmcclintock FYI, I created #1625 to articulate what would be needed to improve the performance of pandera strategies overall.