
This project aims to understand and implement all the cross validation techniques used in Machine Learning.


Cross-validation Techniques

Overview:

One of the commonly encountered problems in machine learning is that our data sets are usually of limited size, i.e., a few hundred observations when we are lucky. Dividing the available data into a modeling set and a test set is difficult: either the test set is too small to give representative and reliable results, or the modeling set is too small to build a refined predictive model. A reasonable compromise is a cross-validation procedure. To get an impression of how well the model performs when applied to different data sets, i.e., its prediction accuracy, cross-validation should be carried out.

Depending on the availability and size of the data set, various cross-validation techniques can be used:

  • K-fold Cross-Validation
  • Stratified K-fold Cross-Validation
  • Holdout method
  • Leave-p-out Cross-Validation
  • Leave-one-out Cross-Validation
  • Monte Carlo Cross-Validation

K-fold Cross-Validation:

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter, k, that refers to the number of groups the data sample is to be split into; hence the name k-fold cross-validation. When a specific value of k is chosen, it may be substituted for k in the name of the method, e.g., k=10 becomes 10-fold cross-validation.
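
As a minimal sketch of the procedure using scikit-learn (this example is illustrative and not taken from the repository's notebook), each of the k folds serves once as the test set while the remaining folds form the training set:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: the data is split into 5 folds; each fold is used
# once for testing while the other 4 are used for training.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf)
print(f"Fold accuracies: {np.round(scores, 3)}")
print(f"Mean accuracy:   {scores.mean():.3f}")
```

Averaging the per-fold scores gives a single estimate of model performance that uses every observation for both training and testing.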


Check out the K-fold implementation.

Stratified K-fold Cross-Validation:

Stratification seeks to ensure that each fold is representative of all strata of the data. For classification this is generally done in a supervised way, aiming to ensure that each class is (approximately) equally represented across the test folds (which are, of course, combined in a complementary way to form the training folds).
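
A small sketch with scikit-learn's StratifiedKFold (illustrative, with made-up imbalanced labels) shows how each test fold preserves the overall class proportions:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 80 samples of class 0, 20 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each test fold preserves the 80/20 ratio: 16 zeros, 4 ones.
    counts = np.bincount(y[test_idx])
    print(f"Fold {fold}: test class counts = {counts}")
```

With plain KFold the minority class could easily be missing from some folds; stratification prevents that.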


Check out the Notebook for more details.

Holdout Method:

The holdout method is the simplest kind of cross-validation. The data set is separated into two sets, called the training set and the testing set. The model is fit using the training set only and is then asked to predict the output values for the data in the testing set, which it has never seen before. The accumulated errors give the mean absolute test-set error, which is used to evaluate the model. The advantage of this method is that it is simple and cheap to compute. However, its evaluation can have high variance: the result may depend heavily on which data points end up in the training set and which end up in the test set, so the evaluation can differ significantly depending on how the division is made.
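
The holdout method can be sketched with scikit-learn's train_test_split (an illustrative example, not the repository's code): a single random split, one fit, one score.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# A single 70/30 split: the model sees only the training portion
# and is scored once on the held-out 30%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Holdout accuracy: {model.score(X_test, y_test):.3f}")
```

Changing random_state changes which points land in the test set, which is exactly the variance the paragraph above warns about.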


Check the implementation here.

Leave-p-out Cross-Validation:

Another type of cross-validation is the Leave-p-out method. Of the n data points in the sample, p are set aside for testing and the remaining n-p are used to train the model. This is repeated for every possible way of choosing p points out of n, and the results are averaged.
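
Because every subset of p points is held out once, the number of splits is C(n, p) and grows combinatorially, so the sketch below (using scikit-learn's LeavePOut, with a deliberately tiny toy sample) uses n = 5 and p = 2:

```python
import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(5).reshape(-1, 1)  # n = 5 samples
lpo = LeavePOut(p=2)             # every pair of points is held out once

# Number of splits is C(n, p) = C(5, 2) = 10.
splits = list(lpo.split(X))
print(f"Number of splits: {len(splits)}")
for train_idx, test_idx in splits[:3]:
    print(f"train={train_idx}, test={test_idx}")
```

For realistic sample sizes the split count explodes (C(100, 2) is already 4950), which is why Leave-p-out is rarely used beyond p = 1.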

Check the code here.

Leave-one-out Cross-Validation:

The Leave-One-Out Cross-Validation, or LOOCV, procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.

It is computationally expensive to perform, although it yields a reliable, nearly unbiased estimate of model performance. While it is simple to use and requires no configuration, there are times when the procedure should not be used, such as when you have a very large dataset or a computationally expensive model to evaluate.
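
LOOCV is simply k-fold cross-validation with k = n, as this scikit-learn sketch shows (illustrative only; on the 150-sample Iris data it fits 150 separate models):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One model is fit per sample; each is tested on the single
# held-out point, so every score is either 0.0 or 1.0.
loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)
print(f"Number of fits: {len(scores)}")
print(f"LOOCV accuracy: {scores.mean():.3f}")
```

The n-fits cost is why LOOCV becomes impractical for large datasets or slow-to-train models.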


Check out the implementation here.

Monte Carlo Cross-Validation:

Monte Carlo cross-validation creates multiple random splits of the dataset into training and validation data. For each split, the model is fit to the training data and its predictive accuracy is assessed on the validation data; the results are then averaged over the splits. The advantage of this method over k-fold cross-validation is that the proportion of the training/validation split does not depend on the number of iterations (i.e., the number of partitions). The disadvantage is that some observations may never be selected into a validation subsample, whereas others may be selected more than once; in other words, validation subsets may overlap. The method also exhibits Monte Carlo variation: the results will vary if the analysis is repeated with different random splits.
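
In scikit-learn this is available as ShuffleSplit; the sketch below (illustrative, with arbitrarily chosen split ratio and repetition count) draws 10 independent random 75/25 splits:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)

# 10 independent random 75/25 splits; unlike k-fold, the split
# ratio is fixed no matter how many repetitions we request.
mc = ShuffleSplit(n_splits=10, test_size=0.25, random_state=7)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=mc)
print(f"Mean accuracy over 10 random splits: {scores.mean():.3f}")
```

Because each split is drawn independently, a given sample may appear in several validation sets or in none, which is the overlap trade-off described above.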


Check out the implementation here.

Implementation:

Libraries: NumPy, pandas, scikit-learn, Matplotlib

Learnings:

Cross-validation Techniques

References:

Cross validation techniques
Cross Validation

Feedback

If you have any feedback, please reach out at [email protected]

🚀 About Me

Hi, I'm Pradnya! 👋

I am an AI enthusiast and a data science & ML practitioner.
