

DataSAIL: Data Splitting Against Information Leaking


DataSAIL is a tool that splits data while minimizing information leakage. It formulates the splitting of a dataset as a constrained minimization problem and computes the assignment of data points to splits while minimizing an objective function that accounts for information leakage.

Internally, DataSAIL uses disciplined quasi-convex programming and binary quadratic programs to formulate the optimization task. To solve it, DataSAIL relies on SCIP, one of the fastest non-commercial solvers for this type of problem, and MOSEK, a commercial solver that offers free licenses for academic use.
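
To make the idea of a leakage-minimizing assignment concrete, here is a toy sketch. It is not DataSAIL's actual formulation and does not use SCIP or MOSEK: it assumes a small, made-up similarity matrix and simply brute-forces every binary assignment of items to two splits, keeping the one with the lowest total similarity across the split boundary under a size constraint. The names `leakage`, `best_split`, and `sim` are hypothetical.

```python
from itertools import product


def leakage(sim, assign):
    """Sum of pairwise similarities between items placed in different splits."""
    n = len(assign)
    return sum(
        sim[i][j]
        for i in range(n)
        for j in range(i + 1, n)
        if assign[i] != assign[j]
    )


def best_split(sim, min_test, max_test):
    """Exhaustively search binary assignments (0 = train, 1 = test)."""
    best, best_cost = None, float("inf")
    for assign in product((0, 1), repeat=len(sim)):
        # Constraint: the test split must have an admissible size.
        if not (min_test <= sum(assign) <= max_test):
            continue
        cost = leakage(sim, assign)
        if cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost


# Made-up similarities: items 0-2 resemble each other, as do items 3-4.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.7, 0.0, 0.1],
    [0.8, 0.7, 1.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 1.0, 0.9],
    [0.0, 0.1, 0.0, 0.9, 1.0],
]

assignment, cost = best_split(sim, min_test=2, max_test=2)
print(assignment, round(cost, 2))  # (0, 0, 0, 1, 1) 0.3
```

Exhaustive search grows exponentially with the number of items, which is why a real tool hands such problems to a dedicated solver.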

Beyond this short overview, a more detailed explanation of the tool is available on ReadTheDocs.

Installation

DataSAIL is installable from conda (for example, using mamba). Run

conda create -n sail -c conda-forge -c kalininalab -c bioconda MPP
conda activate sail
pip install grakel

to install it into a new empty environment or

conda install -c conda-forge -c kalininalab -c bioconda MPP
pip install grakel

to install DataSAIL in an existing environment. Because of the dependencies of the clustering algorithms, the latter may cause conflicts with packages already installed in that environment.

DataSAIL supports Python 3.8 and newer.

Usage

DataSAIL is installed as a command-line tool. In the conda environment where DataSAIL is installed, you can run

sail --e-type P --e-data <path_to_fasta> --e-sim mmseqs --output <path_to_output_path> --technique C1e

to split a set of proteins that have been clustered using mmseqs. For a full list of arguments, run sail -h and check out ReadTheDocs.
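
If you prefer to drive the command above from a script, one option is to assemble the argument list in Python. The sketch below only builds and prints the command; it uses exactly the flags shown above. The helper name `build_sail_command` and the paths `proteins.fasta` and `splits` are hypothetical placeholders.

```python
import shlex


def build_sail_command(fasta_path, output_dir):
    """Assemble the argument list for the protein-splitting call shown above."""
    return [
        "sail",
        "--e-type", "P",
        "--e-data", str(fasta_path),
        "--e-sim", "mmseqs",
        "--output", str(output_dir),
        "--technique", "C1e",
    ]


# Placeholder paths; replace them with your own FASTA file and output folder.
cmd = build_sail_command("proteins.fasta", "splits")
print(shlex.join(cmd))

# To actually run it (requires DataSAIL in the active environment):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems when paths contain spaces.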

When to use DataSAIL and when not to use it

One can distinguish two main ways to train a machine learning model on biological data.

  • Either the model will be applied to data that is substantially different from the training data. In this case, it is important to have test cases that properly model this real-world application scenario by being as dissimilar as possible from the training data.
  • Or the training dataset already covers the full space of possible samples shown to the model.

DataSAIL is designed to compute complex splits by separating data based on similarities, which yields the splits needed for the first scenario. Therefore, use DataSAIL when your model will be applied to data that differs from your training data, but not when the application data is more or less the same as the training data.
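
The difference between the two scenarios can be illustrated with a toy check of how close each test item is to the training set. This is only a sketch on a made-up similarity matrix and is not part of DataSAIL; `max_train_similarity` is a hypothetical helper.

```python
def max_train_similarity(sim, train, test):
    """For each test item, the highest similarity to any training item."""
    return {t: max(sim[t][r] for r in train) for t in test}


# Made-up similarities: items 0-2 resemble each other, as do items 3-4.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.7, 0.0, 0.1],
    [0.8, 0.7, 1.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 1.0, 0.9],
    [0.0, 0.1, 0.0, 0.9, 1.0],
]

# A split that cuts through groups of similar items: every test item
# has a near-duplicate in the training set.
mixed = max_train_similarity(sim, train=[0, 1, 3], test=[2, 4])

# A split that keeps similar items together: test items are far from
# everything the model was trained on.
separated = max_train_similarity(sim, train=[0, 1, 2], test=[3, 4])

print(mixed)      # {2: 0.8, 4: 0.9}
print(separated)  # {3: 0.1, 4: 0.1}
```

A test set like `separated` estimates performance on genuinely new data (the first scenario), while one like `mixed` mostly measures how well the model handles samples close to what it has already seen.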
