Giter Site home page Giter Site logo

btopcuoglu / topcuoglu_ml_mbio_2020 Goto Github PK

View Code? Open in Web Editor NEW

This project forked from schlosslab/topcuoglu_ml_mbio_2020

0.0 1.0 0.0 380.78 MB

Best practices for applying machine learning to bacterial 16S rRNA gene sequencing data

Home Page: https://mbio.asm.org/content/11/3/e00434-20

License: MIT License

Python 11.22% R 73.88% TeX 13.34% Makefile 0.68% Shell 0.88%

topcuoglu_ml_mbio_2020's Introduction

A framework for effective application of machine learning to microbiome-based classification problems

Abstract

Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made towards developing ML models that predict health outcomes using bacterial abundances, but inconsistent adoption of training and evaluation methods call the validity of these models into question. Furthermore, there appears to be a preference by many researchers to favor increased model complexity over interpretability. To overcome these challenges, we trained seven models that used fecal 16S rRNA sequence data to predict the presence of colonic screen relevant neoplasias (SRNs; n=490 patients, 261 controls and 229 cases). We developed a reusable open-source pipeline to train, validate, and interpret ML models. To show the effect of model selection, we assessed the predictive performance, interpretability, and training time of L2-regularized logistic regression, L1 and L2-regularized support vector machines (SVM) with linear and radial basis function kernels, decision trees, random forest, and gradient boosted trees (XGBoost). The random forest model performed best at detecting SRNs with an AUROC of 0.695 [IQR 0.651-0.739] but was slow to train (83.2 h) and not inherently interpretable. Despite its simplicity, L2-regularized logistic regression followed random forest in predictive performance with an AUROC of 0.680 [IQR 0.625-0.735], trained faster (12 min), and was inherently interpretable. Our analysis highlights the importance of choosing an ML approach based on the goal of the study, as the choice will inform expectations of performance and interpretability.

Overview

project
|- README         		# the top level description of content (this doc)
|- CONTRIBUTING    		# instructions for how to contribute to your project
|- LICENSE         		# the license for this project
|
|- data/           		# raw and primary data, are not changed once created
| |- process/     		# .tsv and .csv files generated with main.R that runs the models
| |- baxter.0.03.subsample.shared      	# subsampled mothur generated file with OTUs from Marc Sze's analysis
| |- metadata.tsv     		        # metadata with clinical information from Marc Sze's analysis 		
|- code/          			# any programmatic code
| |- learning/    			# generalization performance of model
| |- testing/     			# building final model
|
|- results/        			# all output from workflows and analyses
| |- tables/      			# tables and .Rmd code of the tables to be rendered with kable in R
| |- figures/     			# graphs, likely designated for manuscript figures
|
|- submission/
| |- manuscript.Rmd 			# executable Rmarkdown for this study, if applicable
| |- manuscript.md 			# Markdown (GitHub) version of the *.Rmd file
| |- manuscript.tex 			# TeX version of *.Rmd file
| |- manuscript.pdf 			# PDF version of *.Rmd file
| |- header.tex 			# LaTeX header file to format pdf version of manuscript
| |- references.bib 			# BibTeX formatted references
|
|- Makefile	 # Reproduce the manuscript, figures and tables

How to use the outlined ML pipeline for your own project

How to regenerate this repository in R

Please take a look at the Makefile for more information about the workflow. Please also read the submission/manuscript.pdf to get a more detailed look on what we achieve with this ML pipeline.

  1. Clone the Github Repository and change directory to the project directory.
git clone https://github.com/SchlossLab/Topcuoglu_ML_XXX_2019.git
cd DeepLearning
  1. Our dependencies:

    • R version 3.5.0

    • The R packages which needs to be installed in our environment: caret ,rpart, xgboost, randomForest, kernlab,LiblineaR, pROC, tidyverse, cowplot, ggplot2, vegan,gtools, reshape2.

    • Everything needs to be run from project directory.

    • We need to download 2 datasets (OTU abundances and colonoscopy diagnosis of 490 patients) from *Sze MA, Schloss PD. 2018. Leveraging existing 16S rRNA gene surveys to identify reproducible biomarkers in individuals with colorectal tumors. mBio 9:e00630โ€“18.

    • We update the caret package with my modifications by running (Take a look at this script to change the R packages directory where caret is installed.):

      Rscript code/learning/load_caret_models.R

    These modifications are in data/caret_models/svmLinear3.R and data/caret_models/svm_Linear4.R

  2. Follow the Makefile to generate the manuscript.

    • The Makefile uses code/learning/main.R to run the pipeline which sources 4 other scripts that are part of the pipeline.

      • To choose the model and model hyperparemeters:code/learning/model_selection.R'

      • To preprocess and split the dataset 80-20 and to train the model: code/learning/model_pipeline.R

      • To save the results of each model for each datasplit: code/learning/generateAUCs.R

      • To interpret the models: code/learning/permutation_importance.R

topcuoglu_ml_mbio_2020's People

Contributors

btopcuoglu avatar pschloss avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.