Preprocessy

Preprocessy is a framework that provides data preprocessing pipelines for machine learning. It bundles the common preprocessing steps used to prepare data for machine learning models, and it aims to do so independently of the source and type of the dataset. To that end, it provides a set of functions that have been generalised to different types of data.

The pipelines are composed of these functions and are flexible: users can customise them by adding their own processing functions or removing built-in pipeline functions as needed. The pipelines thus give users an abstract, high-level interface.
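
The idea of a pipeline built from interchangeable functions can be sketched in plain Python. This is only an illustration under stated assumptions, not Preprocessy's actual API: the class `SketchPipeline`, its `add`, `remove` and `process` methods, the shared `params` dictionary, and all step functions are hypothetical names invented for this example.

```python
from typing import Any, Callable, Dict, List, Optional

Params = Dict[str, Any]
Step = Callable[[Params], None]


class SketchPipeline:
    """Hypothetical stand-in for a pipeline; not Preprocessy's own class."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = list(steps)

    def add(self, step: Step, before: Optional[str] = None) -> None:
        # Insert before the named step, or append when no name is given.
        names = [s.__name__ for s in self.steps]
        position = names.index(before) if before else len(self.steps)
        self.steps.insert(position, step)

    def remove(self, name: str) -> None:
        # Drop every step whose function name matches.
        self.steps = [s for s in self.steps if s.__name__ != name]

    def process(self, params: Params) -> Params:
        # Each step reads from and writes to the shared params dictionary.
        for step in self.steps:
            step(params)
        return params


def drop_null_rows(params: Params) -> None:
    params["rows"] = [r for r in params["rows"] if None not in r.values()]


def min_max_scale(params: Params) -> None:
    col = params["scale_col"]
    values = [r[col] for r in params["rows"]]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    for r in params["rows"]:
        r[col] = (r[col] - lo) / span


def add_is_adult(params: Params) -> None:
    # A user-defined step added alongside the "built-in" ones above.
    for r in params["rows"]:
        r["is_adult"] = r["age"] >= 18


pipeline = SketchPipeline([drop_null_rows, min_max_scale])
pipeline.add(add_is_adult, before="min_max_scale")
result = pipeline.process(
    {"rows": [{"age": 10}, {"age": None}, {"age": 30}], "scale_col": "age"}
)
```

Passing one dictionary through every step keeps the step signature uniform, which is what makes adding or removing a function a one-line change in this sketch.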

Pipeline Structure

The pipelines are divided into three logical stages:

Stage 1 - Pipeline Input

Input datasets with the following extensions are supported: .csv, .tsv, .xls, .xlsx, .xlsm, .xlsb, .odf, .ods, .odt
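
As a rough illustration of dispatching on the file extension, the sketch below uses only the Python standard library. It is not how Preprocessy reads files: `load_table`, `DELIMITERS` and `SPREADSHEETS` are hypothetical names, and the spreadsheet formats are deliberately left unimplemented because they need a third-party reader.

```python
import csv
from pathlib import Path
from typing import Dict, List

# Delimited text formats can be read with the standard csv module.
DELIMITERS = {".csv": ",", ".tsv": "\t"}
# Spreadsheet formats from the list above need a third-party reader.
SPREADSHEETS = {".xls", ".xlsx", ".xlsm", ".xlsb", ".odf", ".ods", ".odt"}


def load_table(path: str) -> List[Dict[str, str]]:
    """Hypothetical loader: returns one dict per row, keyed by header."""
    suffix = Path(path).suffix.lower()
    if suffix in DELIMITERS:
        with open(path, newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle, delimiter=DELIMITERS[suffix]))
    if suffix in SPREADSHEETS:
        raise NotImplementedError(
            f"{suffix} needs a spreadsheet library; not covered by this sketch"
        )
    raise ValueError(f"Unsupported file extension: {suffix!r}")
```

Checking the extension before opening the file means an unsupported dataset is rejected early, with an error that names the offending suffix.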

Stage 2 - Processing

This is the main part of the pipeline and consists of the processing functions. The following functions are provided out of the box, both as individual functions and as part of the pipelines:

  • Handling Null Values
  • Handling Outliers
  • Normalisation and Scaling
  • Label Encoding
  • Correlation and Feature Extraction
  • Training and Test set splitting
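
To make the steps listed above concrete, here is a minimal standard-library sketch of five of them (correlation and feature extraction are omitted for brevity). The function names, defaults and strategies (mean imputation, IQR clipping, min-max scaling) are assumptions chosen for illustration; Preprocessy's own functions may use different names, options and algorithms.

```python
import random
from statistics import mean, quantiles


def fill_nulls(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def clip_outliers(values, k=1.5):
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lo), hi) for v in values]


def min_max_scale(values):
    """Rescale values linearly so the minimum is 0 and the maximum is 1."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [(v - lo) / span for v in values]


def label_encode(labels):
    """Map each distinct label to an integer, in sorted label order."""
    mapping = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


def train_test_split(rows, test_size=0.2, seed=0):
    """Shuffle a copy of rows and return (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_size)
    return rows[n_test:], rows[:n_test]


ages = fill_nulls([22, None, 25, 24, 23, 95])
ages = min_max_scale(clip_outliers(ages))
codes, mapping = label_encode(["cat", "dog", "cat"])
train, test = train_test_split(range(10), test_size=0.2)
```

Each function takes a plain list and returns a new one, so the steps can be chained in any order or used on their own, mirroring the "individual functions as well as pipelines" idea.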

Stage 3 - Pipeline Output

The output consists of the processed dataset and the pipeline parameters, depending on the verbosity required.

Contributing

Please read our Contributing Guide before submitting a Pull Request to the project.

Support

Feel free to contact any of the maintainers. We're happy to help!

Roadmap

Check out our roadmap to stay informed of the latest features released and the upcoming ones. Feel free to give us your insights!

Documentation

The documentation can be found here. Some parts of it are still under development. All contributions are welcome! Please see our Contributing Guide.

Research Paper and Citations

Preprocessy: A Customisable Data Preprocessing Framework with High-Level APIs was presented at the 2022 7th International Conference on Data Science and Machine Learning Applications (CDMA) and is published in IEEE Xplore.

Link to full paper: https://ieeexplore.ieee.org/document/9736366

If you're using Preprocessy as part of scientific research, please use the citations below.

Plain Text Citation

S. Kazi et al., "Preprocessy: A Customisable Data Preprocessing Framework with High-Level APIs," 2022 7th International Conference on Data Science and Machine Learning Applications (CDMA), 2022, pp. 206-211, doi: 10.1109/CDMA54072.2022.00039.

BibTeX Citation

@INPROCEEDINGS{9736366,
  author={Kazi, Saif and Vakharia, Priyesh and Shah, Parth and Gupta, Riya and Tailor, Yash and Mantry, Palak and Rathod, Jash},
  booktitle={2022 7th International Conference on Data Science and Machine Learning Applications (CDMA)},
  title={Preprocessy: A Customisable Data Preprocessing Framework with High-Level APIs},
  year={2022},
  volume={},
  number={},
  pages={206-211},
  doi={10.1109/CDMA54072.2022.00039}}

License

See the LICENSE file for licensing information.
