
MLSP 2014 Schizophrenia Classification Challenge: 2nd position (solution)

Author: Alexander V. Lebedev, MD

(University of Bergen, Norway)

Date: 26/07/2014

1. Summary

The goal of the competition (https://www.kaggle.com/c/mlsp-2014-mri) was to automatically detect subjects with schizophrenia based on multimodal features derived from magnetic resonance imaging (MRI) data. For this challenge, I implemented so-called "feature trimming", consisting of 1) introducing a random vector into the feature set, 2) calculating feature importances, and 3) removing the features whose importance fell below that of the "dummy" feature. As a first step, I ran a Random Forest [2] model and performed the trimming based on the Gini index [2]. Then, after estimating the inverse kernel width parameter ("sigma"), I tuned the C-parameter for my final model, a Support Vector Machine with a Gaussian kernel (RBF-SVM) [1].

2. Feature Selection

The key step was "feature trimming". The remaining steps were deliberately simple: no sophisticated approaches (such as ensembling or hierarchical models) were used. In general, I tried to keep the design as simple as possible because of the limited number of subjects available in the training set and the resulting risk of overfitting.

3. Code Description

3.1 Preparatory step:

3.1.1 Load the libraries:
library(caret)
library(randomForest)
library(e1071)
library(kernlab)
library(doMC)
library(foreach)
library(RColorBrewer)

3.1.2 Read the data:
# Training set:
trFC <- read.csv('/YOUR-PATH/Kaggle/SCH/Train/train_FNC.csv')
trSBM <- read.csv('/YOUR-PATH/Kaggle/SCH/Train/train_SBM.csv')
tr <- merge(trFC, trSBM, by='Id')

# Test set:
tstFC <- read.csv('/YOUR-PATH/Kaggle/SCH/Test/test_FNC.csv')
tstSBM <- read.csv('/YOUR-PATH/Kaggle/SCH/Test/test_SBM.csv')
tst <- merge(tstFC, tstSBM, by='Id')

y <- read.csv('/YOUR-PATH/Kaggle/SCH/Train/train_labels.csv')

3.2 Analysis

3.2.1 "Feature Trimming"

Registering 6 cores to speed up my computations:

registerDoMC(cores=6)

Converting the y-label vector into an appropriate format (caret's classProbs option requires factor levels that are valid R names, hence the 'X.' prefix):

y <- as.factor(paste('X.', y[,2], sep = ''))

Introducing a random vector into my feature set:

# Append a random "dummy" feature as column 412
all <- cbind(tr, rnorm(nrow(tr)))
colnames(all)[412] <- 'rand'
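Of note, the trimming threshold depends on this randomly generated vector, so the retained feature set can vary slightly between runs. Fixing a seed (not part of the original code; the value below is arbitrary) would make the step reproducible:

set.seed(2014)  # hypothetical seed; call this before generating the random vector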

Now I train a Random Forest on this (full) feature set:

rf.mod <- foreach(ntree=rep(2500, 6), .combine=combine, .multicombine=TRUE,
                  .packages='randomForest') %dopar% {
                    # one 2500-tree forest per worker; randomForest::combine merges
                    # the six forests into a single model
                    randomForest(all[,2:412], y, ntree=ntree)
                  }

Looking at the feature importances:

color <- brewer.pal(n = 8, "Dark2")
imp <- as.data.frame(rf.mod$importance[order(rf.mod$importance),])
barplot(t(imp), col=color[1])
points(which(imp==imp['rand',]),0.6, col=color[2], type='h', lwd=2)

(Figure: bar plot of the sorted feature importances; the orange vertical line marks the random "dummy" feature.)

Any feature whose importance falls below that of our "dummy" feature (the random vector) can likely be ignored. So we "cut" everything to the left of the orange line:

imp <- subset(imp, imp>imp['rand',])  # keep only features more important than 'rand'

Saving the data in one rda-file for further analyses:

save('all', 'y', 'tst', 'imp',  file = '/YOUR-PATH/Kaggle/SCH/Train/AllData.rda')

Now I reduce my feature set:

dat <- all[,rownames(imp)]

3.2.2 Final Model

I usually start with an SVM and then proceed to ensemble methods. In this competition, however, boosted trees did not yield superior performance, so I stopped there.
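For reference, a boosted-tree comparison can be run through the same caret interface. The following is a minimal sketch, not the author's original code; the grid values are illustrative and the gbm package is assumed to be installed:

gbmGrid <- expand.grid(n.trees = c(500, 1000),
                       interaction.depth = c(1, 2, 3),
                       shrinkage = 0.01,
                       n.minobsinnode = 10)
gbmFit <- train(dat, y,
                method = "gbm",
                verbose = FALSE,   # silence gbm's per-iteration output
                tuneGrid = gbmGrid,
                trControl = trainControl(method = "cv", number = 10, classProbs = TRUE))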

First, I estimate "sigma" (the inverse kernel width parameter of the RBF-SVM). Of note, I sometimes use only a subset of the data for this step, but here I used the whole training set because of its very limited size:

sigDist <- sigest(y ~ as.matrix(dat), data=dat, frac = 1)
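sigest() returns three candidate values (quantiles of the estimated sigma range); the grid below fixes sigma at the first of these:

sigDist  # low quantile, median, high quantile of the sigma estimates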

Creating a tune grid for the subsequent C-parameter selection:

svmTuneGrid <- data.frame(.sigma = sigDist[1], .C = 2^(-20:100))
## Warning: row names were found from a short variable and have been
## discarded

And training the final RBF-SVM model:

svmFit <- train(dat, y,
                method = "svmRadial",
                preProc = c("center", "scale"),
                tuneGrid = svmTuneGrid,
                # 86 folds, one per training subject, i.e. leave-one-out CV
                trControl = trainControl(method = "cv", number = 86, classProbs = TRUE))
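Before predicting, it is worth checking which C-value was selected; these are standard caret accessors, not part of the original write-up:

svmFit$bestTune  # the selected sigma and C
plot(svmFit)     # cross-validated accuracy as a function of C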

Making predictions:

ttst <- tst[,rownames(imp)]  # apply the same feature trimming to the test set
predTst <- predict(svmFit, ttst, type='prob')
predTst <- predTst[,2]       # second column: probability of class 'X.1' (patients)

Formatting submission:

pred <- cbind(as.integer(tst$Id), as.numeric(predTst))
colnames(pred) <- c('Id', 'Probability')

Writing:

write.table(pred, file = '/YOUR-PATH/Kaggle/SCH/submissions/submission_rbfSVM_RFtrimmed.csv', sep=',', quote=F, row.names=F, fileEncoding = 'UTF-16LE')

4. Dependencies

To execute the code, the following libraries must be installed: caret [3], randomForest [4], e1071 [5], kernlab [6], doMC [7], foreach [8], and RColorBrewer [9].
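All of them are available on CRAN and can be installed in a single call (standard R, not part of the original write-up):

install.packages(c('caret', 'randomForest', 'e1071', 'kernlab',
                   'doMC', 'foreach', 'RColorBrewer'))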

5. Additional Comments and Observations

In general, it was somewhat difficult to evaluate the performance of the models, since there was a substantial mismatch between the cross-validated accuracies and the leaderboard feedback I received for my submissions. This was one of the reasons I decided not to go further with feature selection and more complex modeling approaches.

6. References

[1] V.N. Vapnik (1995) The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc. New York, NY, USA;

[2] L. Breiman (2001) Random Forests. Machine Learning Volume 45, Number 1: 5-32;

[3] M. Kuhn; contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer and Allan Engelhardt (2012). caret: Classification and Regression Training. R package version 5.15-023. http://cran.r-project.org/packages/caret/;

[4] L. Breiman, A. Cutler, R port by Andy Liaw and Matthew Wiener (2014). randomForest: Breiman and Cutler's random forests for classification and regression. http://cran.r-project.org/web/packages/randomForest/;

[5] D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, F. Leisch, C-C. Chang, C-C. Lin. e1071: Misc Functions of the Department of Statistics (e1071), TU Wien. http://cran.r-project.org/web/packages/e1071/;

[6] A. Karatzoglou, A. Smola, K. Hornik (2013). kernlab: Kernel-based Machine Learning Lab. http://cran.r-project.org/web/packages/kernlab/;

[7] Revolution Analytics. doMC (2014): Foreach parallel adaptor for the multicore package. http://cran.r-project.org/web/packages/doMC/;

[8] Revolution Analytics, Steve Weston (2014). foreach: Foreach looping construct for R. http://cran.r-project.org/web/packages/foreach/;

[9] Erich Neuwirth (2011). RColorBrewer: ColorBrewer palettes. http://cran.r-project.org/web/packages/RColorBrewer/.
