Comments (3)
Nice find! This looks like a bug in the missing-column insertion logic. It occurs when multiple columns that are not in the schema, in this case "col_b" and "col_c", are positioned after the missing column's location in the dataframe being validated. Thank you for the great working example. I'll submit a pull request shortly.
from pandera.
Thank you for the investigation and quick fix!
I'm waiting for your pull request to be merged :)
First, thank you very much for this fantastic package.
The code in OP's example now runs as intended with pandera 0.18.0, but a dataframe with non-unique column names triggers a similar column-duplication problem.
import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema(
    {
        "col_a": pa.Column(str),
        "col_missing": pa.Column(str, nullable=True),
    },
    add_missing_columns=True,
)
df = pd.DataFrame({
    "col_a": ["a", "b", "c"],
    "col_b": ["d", "e", "f"],
})
print(schema.validate(df))
# -> works well
#   col_a col_missing col_b
# 0     a        None     d
# 1     b        None     e
# 2     c        None     f
df.columns = ["col_a", "col_a"]
print(schema.validate(df))
# -> duplicates columns
#   col_a col_a col_missing col_a col_a
# 0     a     d        None     a     d
# 1     b     e        None     b     e
# 2     c     f        None     c     f
Expected behavior
Only one col_missing column should be added, and the existing columns should not be repeated:
#   col_a col_a col_missing
# 0     a     d        None
# 1     b     e        None
# 2     c     f        None
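As a possible illustration of why duplicate labels are risky here, the sketch below uses only pandas. It shows that selecting a duplicated label returns every column carrying that label, which is one way label-based column reassembly could repeat columns. Whether this is the actual cause inside pandera is an assumption, not something confirmed from its source. The helper `find_duplicate_columns` is hypothetical and only serves as a guard to run before `schema.validate`.

```python
import pandas as pd

# Same data as the failing case above, with both columns labelled "col_a".
df = pd.DataFrame(
    [["a", "d"], ["b", "e"], ["c", "f"]],
    columns=["col_a", "col_a"],
)

# Selecting a duplicated label yields a DataFrame with *both* columns,
# not a single Series.
selected = df["col_a"]
print(type(selected).__name__, selected.shape)


def find_duplicate_columns(frame: pd.DataFrame) -> list:
    """Return the column labels that occur more than once (hypothetical helper)."""
    return frame.columns[frame.columns.duplicated()].unique().tolist()


dupes = find_duplicate_columns(df)
if dupes:
    print(f"Duplicate column labels found before validation: {dupes}")
```

Checking for duplicates up front at least makes the situation visible instead of silently returning a frame with repeated columns.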