pandantic seemed to be such a nice and simple implementation that I decided to adapt your model to work with Polars DataFrames, and figured I would share the results.
I only recently began using polars so there might be more efficient ways, but here were the changes I had to make to your model:
- There is no index, so replaced it using
with_row_count()
to get the row number for errors
- Chunk logic can be handled by
iter_slices()
where the n_rows can be determined by the total rows / CPU count
- Instead of
to_dict()
, we use iter_rows(named=True)
to pass each row into the validator
- We use
filter()
to exclude the error rows if errors is set to "filter"
import logging
import math
import os

import polars as pl
from multiprocess import Process, Queue, cpu_count
from pydantic import BaseModel, ValidationError
class PolarsModel(BaseModel):
    """Pydantic model that can bulk-validate every row of a Polars DataFrame.

    Subclass this with your field definitions, then call ``parse_df`` on a
    DataFrame whose columns match the model's fields.
    """

    @classmethod
    def parse_df(
        cls,
        dataframe: pl.DataFrame,
        errors: str = "raise",
        context: dict[str, object] | None = None,
        n_jobs: int = 1,
        verbose: bool = True,
    ) -> pl.DataFrame:
        """Validate each row of *dataframe* against this model.

        Args:
            dataframe: DataFrame whose columns map to the model's fields.
            errors: "raise" to raise on any invalid row, "filter" to drop
                invalid rows from the result, "ignore" to log-and-keep them.
            context: Optional context dict forwarded to ``model_validate``.
            n_jobs: Number of worker processes; 1 runs in-process, values
                <= 0 use all available cores.
            verbose: Log each validation error at INFO level.

        Returns:
            The input data (without the temporary row-number column),
            filtered if ``errors == "filter"``.

        Raises:
            ValueError: If ``errors`` is not a recognized mode, or if
                invalid rows were found and ``errors == "raise"``.
        """
        # Fail fast on a typo'd mode instead of silently ignoring errors.
        if errors not in ("raise", "filter", "ignore"):
            raise ValueError(f"errors must be 'raise', 'filter' or 'ignore', got {errors!r}")
        errors_index: list[int] = []
        # Polars has no index; add a "row_nr" column so errors are addressable.
        # clone() keeps the caller's frame untouched.
        dataframe = dataframe.clone().with_row_count()
        logging.info(f"Validating {dataframe.height} rows")
        logging.debug(f"Amount of available cores: {cpu_count()}")
        if n_jobs != 1:
            # n_jobs == 0 previously caused ZeroDivisionError in the chunk
            # size computation; treat anything <= 0 as "use all cores".
            if n_jobs <= 0:
                n_jobs = cpu_count()
            chunk_size = math.ceil(dataframe.height / n_jobs)
            chunks = list(dataframe.iter_slices(n_rows=chunk_size))
            total_chunks = len(chunks)
            logging.info(f"Split the dataframe into {total_chunks} chunks to process {chunk_size} rows per chunk.")
            processes = []
            q = Queue()
            for chunk in chunks:
                p = Process(target=cls._validate_row, args=(chunk, q, context, verbose), daemon=True)
                p.start()
                processes.append(p)
            # Each worker puts one None sentinel when done; drain the queue
            # until every chunk has signaled completion.
            num_stops = 0
            while num_stops < total_chunks:
                index = q.get()
                if index is None:
                    num_stops += 1
                else:
                    errors_index.append(index)
            for p in processes:
                p.join()
        else:
            for row in dataframe.iter_rows(named=True):
                try:
                    cls.model_validate(obj=row, context=context)
                # Catch only validation failures; genuine bugs still raise.
                except ValidationError as exc:
                    if verbose:
                        logging.info(f"Validation error found at row {row['row_nr']}\n{exc}")
                    errors_index.append(row["row_nr"])
        logging.info(f"# invalid rows: {len(errors_index)}")
        if errors_index and errors == "raise":
            raise ValueError(f"{len(errors_index)} validation errors found in dataframe.")
        if errors_index and errors == "filter":
            # drop("row_nr"): the columns= keyword was removed from modern
            # polars DataFrame.drop; positional names work on all versions.
            return dataframe.filter(~pl.col("row_nr").is_in(errors_index)).drop("row_nr")
        return dataframe.drop("row_nr")

    @classmethod
    def _validate_row(cls, chunk: pl.DataFrame, q: Queue, context=None, verbose=True) -> None:
        """Worker: validate every row of *chunk*, put each failing row number
        on *q*, then put a ``None`` sentinel to signal completion."""
        for row in chunk.iter_rows(named=True):
            try:
                cls.model_validate(obj=row, context=context)
            except ValidationError as exc:
                if verbose:
                    logging.info(f"Validation error found at row {row['row_nr']}\n{exc}")
                q.put(row["row_nr"])
        q.put(None)
I tested this on a dataframe which I duplicated a bunch of times until the row count was > 1 million rows to check if n_jobs was functioning correctly.
With n_jobs = 1:
With n_jobs = 4 (twice as fast):
Example validation error if verbose=True:
Example with errors="filter", the resulting dataframe has the expected rows: