wesselhuising / pandantic

Enriches the Pydantic BaseModel class by adding the ability to validate dataframes using the schema and custom validators of the same BaseModel class.

Home Page: https://pandantic-rtd.readthedocs.io

dataframes pandas pydantic validation


pandantic's Issues

License

Currently there is no license statement. Could you please add a LICENSE file that contains the license text?

Thanks.

Issue installing in Python 3.8

I can't seem to install the package using Python 3.8. Are there specific Python versions it works with?

The error I get is:
pip install pandantic
Collecting pandantic
ERROR: Could not find a version that satisfies the requirement pandantic (from versions: none)
ERROR: No matching distribution found for pandantic

I also can't find the package itself on PyPI - is it not available at the moment, or not anymore?
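
(For anyone hitting this: a quick way to see which Python versions a release supports is to read the requires_python field from PyPI's JSON API. A minimal sketch using only the standard library; nothing here is pandantic-specific beyond the package name in the URL:)

import json
from urllib.request import urlopen

# Fetch the package metadata that pip consults when resolving versions.
with urlopen("https://pypi.org/pypi/pandantic/json") as resp:
    meta = json.load(resp)

print(meta["info"]["requires_python"])   # supported Python range for the latest release
print(sorted(meta["releases"]))          # every version ever published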

Support for Polars Dataframes

pandantic seemed to be such a nice and simple implementation that I decided to edit your model to work with Polars DataFrames, and I figured I would share the results.

I only recently began using polars, so there might be more efficient ways, but here are the changes I had to make to your model:

  1. There is no index, so I replaced it using with_row_count() to get a row number for error reporting
  2. Chunk logic can be handled by iter_slices(), where n_rows is the total row count divided by the CPU count
  3. Instead of to_dict(), we use iter_rows(named=True) to pass each row into the validator
  4. We use filter() to exclude the failing rows if errors is set to "filter"
import logging
import math

from multiprocess import Process, Queue, cpu_count  # dill-based fork of multiprocessing
import polars as pl
from pydantic import BaseModel

class PolarsModel(BaseModel):

    @classmethod
    def parse_df(
        cls,
        dataframe: pl.DataFrame,
        errors: str = "raise",
        context: dict[str, object] | None = None,
        n_jobs: int = 1,
        verbose: bool = True,
    ) -> pl.DataFrame:
        
        errors_index = []
        dataframe = dataframe.clone().with_row_count()
        
        logging.info(f"Validating {dataframe.height} rows")
        logging.debug(f"Amount of available cores: {cpu_count()}")
        
        if n_jobs != 1:
            if n_jobs < 0:
                n_jobs = cpu_count()

            chunk_size = math.ceil(len(dataframe) / n_jobs)
            chunks = list(dataframe.iter_slices(n_rows=chunk_size))
            total_chunks = len(chunks)

            logging.info(f"Split the dataframe into {total_chunks} chunks to process {chunk_size} rows per chunk.")

            processes = []
            q = Queue()

            for chunk in chunks:
                p = Process(target=cls._validate_row, args=(chunk, q, context, verbose), daemon=True)
                p.start()
                processes.append(p)

            # Each worker puts the row numbers of failing rows on the queue,
            # followed by a None sentinel once its chunk is done.
            num_stops = 0
            while num_stops < total_chunks:
                index = q.get()
                if index is None:
                    num_stops += 1
                else:
                    errors_index.append(index)

            for p in processes:
                p.join()

        else:
            for row in dataframe.iter_rows(named=True):
                try:
                    cls.model_validate(obj=row, context=context)
                except Exception as exc:
                    if verbose:
                        logging.info(f"Validation error found at row {row['row_nr']}\n{exc}")
                    errors_index.append(row["row_nr"])

        logging.info(f"# invalid rows: {len(errors_index)}")

        if len(errors_index) > 0 and errors == "raise":
            raise ValueError(f"{len(errors_index)} validation errors found in dataframe.")
            
        if len(errors_index) > 0 and errors == "filter":
            return dataframe.filter(~pl.col("row_nr").is_in(errors_index)).drop("row_nr")

        return dataframe.drop("row_nr")

    @classmethod
    def _validate_row(cls, chunk: pl.DataFrame, q: Queue, context=None, verbose=True) -> None:
        for row in chunk.iter_rows(named=True):
            try:
                cls.model_validate(obj=row, context=context)
            except Exception as exc:
                if verbose:
                    logging.info(f"Validation error found at row {row['row_nr']}\n{exc}")
                q.put(row["row_nr"])
        q.put(None)
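
For context, a minimal usage sketch of the class above (UserModel and the sample data are hypothetical, not from the original post):

class UserModel(PolarsModel):
    name: str
    age: int  # a null age fails validation, since int does not accept None

df = pl.DataFrame({"name": ["alice", "bob"], "age": [30, None]})

# errors="filter" drops invalid rows instead of raising.
valid = UserModel.parse_df(df, errors="filter")
print(valid)  # only the "alice" row survives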

I tested this on a dataframe which I duplicated until the row count was > 1 million rows, to check whether n_jobs was functioning correctly.

With n_jobs = 1:
[screenshot: single-process timing]

With n_jobs = 4 (roughly twice as fast):
[screenshot: four-process timing]

Example validation error if verbose=True:
[screenshot: logged validation error]

Example with errors="filter", where the resulting dataframe has the expected rows:
[screenshot: filtered dataframe]

Construct as a Pandas plugin

Hi! So I was thinking of making a very similar project with one core difference: having the validator function as a Pandas plugin that takes a Pydantic BaseModel or Dataclass as an input.

For example:

df.pandantic.validate(schema: pydantic.BaseModel | pydantic.dataclasses.dataclass)

See: https://pandas.pydata.org/docs/development/extending.html

Wondering what you think about this refactor? I like the idea of being more agnostic to the type of Pydantic schema object being passed in, as dataclasses are more analogous to a pandas DataFrame.

Additionally, it allows one to import and use normal Pydantic instead of a wrapper, and normal pandas can be used too, given that the plugin is imported.

If you are amenable to this idea, I am happy to make a PR. Otherwise I may just make my own project, pandas-pydantic. I would keep your logic largely the same, and test whether it works with dataclasses as well.
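
For what it's worth, the pandas extension API linked above makes the plugin shape straightforward. A minimal sketch; the accessor name "pandantic" and the validate() behavior are illustrative, mirroring the proposal rather than any existing code:

import pandas as pd
from pydantic import BaseModel, ValidationError

@pd.api.extensions.register_dataframe_accessor("pandantic")
class PandanticAccessor:
    def __init__(self, pandas_obj: pd.DataFrame):
        self._df = pandas_obj

    def validate(self, schema: type[BaseModel]) -> pd.DataFrame:
        # Validate row by row, collecting the index labels of failing rows.
        bad = []
        for idx, row in zip(self._df.index, self._df.to_dict("records")):
            try:
                schema.model_validate(row)
            except ValidationError:
                bad.append(idx)
        if bad:
            raise ValueError(f"{len(bad)} invalid rows: {bad}")
        return self._df

Once the registering module is imported, df.pandantic.validate(MySchema) works on any DataFrame, much like the call shown in the proposal.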

Optional columns in the dataframe

Hello! I'm doing some testing with this library (it looks promising), but I found that if I want an optional column in the dataframe, the validation fails.

Imagine I define a schema with two columns, A and B, but I want to validate one dataset that contains both columns and another that contains only column A.

Right now, even if I set the column as optional, pandantic expects the column to exist in the dataframe; if the column is missing, it raises an error.

Do you plan to implement something like this in the near future? And what about the opposite: complaining if the dataframe has columns that are not defined in the schema?

Thanks!
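
For reference, both behaviors are expressible at the schema level in plain Pydantic v2; whether pandantic honors them depends on how it builds the per-row dicts. A small sketch (the field names are made up):

from typing import Optional

from pydantic import BaseModel, ConfigDict

class Schema(BaseModel):
    # extra="forbid" rejects keys that are not declared on the model,
    # which is the "opposite" behavior asked about above.
    model_config = ConfigDict(extra="forbid")

    a: int
    b: Optional[int] = None  # optional: a missing "b" key validates fine

Schema.model_validate({"a": 1})          # passes; b defaults to None
Schema.model_validate({"a": 1, "c": 2})  # raises: extra key "c" is forbidden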

use df.itertuples instead of df.to_dict()

Hi there - since a key motivation here is performance, you should consider using itertuples instead of to_dict for iterating over the dataframe; I've found the snippet below to be about 3x faster.

col_names = list(df.columns)
dict_iter = (dict(zip(col_names, row)) for row in df.itertuples(index=False, name=None))
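
A quick way to reproduce the comparison (a hedged sketch; the 3x figure will vary with column count and dtypes):

import timeit

import pandas as pd

df = pd.DataFrame({"a": range(100_000), "b": range(100_000)})
cols = list(df.columns)

# Row-oriented dicts via to_dict("records") vs. rebuilding dicts from itertuples.
t_dict = timeit.timeit(lambda: df.to_dict("records"), number=10)
t_tuples = timeit.timeit(
    lambda: [dict(zip(cols, row)) for row in df.itertuples(index=False, name=None)],
    number=10,
)
print(f"to_dict: {t_dict:.2f}s  itertuples: {t_tuples:.2f}s")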
