schlosslab / mikropml

User-Friendly R Package for Supervised Machine Learning Pipelines

Home Page: http://www.schlosslab.org/mikropml

License: Other

R 84.67% TeX 15.33%
machine-learning r-package rstats

mikropml's Introduction

mikropml

(pronounced "meek-ROPE em el")

User-Friendly R Package for Supervised Machine Learning Pipelines


An interface to build machine learning models for classification and regression problems. mikropml implements the ML pipeline described by Topçuoğlu et al. (2020) with reasonable default options for data preprocessing, hyperparameter tuning, cross-validation, testing, model evaluation, and interpretation steps. See the website for more information, documentation, and examples.

Installation

You can install the latest release from CRAN:

install.packages('mikropml')

or the development version from GitHub:

# install.packages("devtools")
devtools::install_github("SchlossLab/mikropml")

or install from a terminal using conda or mamba:

mamba install -c conda-forge r-mikropml

Dependencies

  • Imports: caret, dplyr, e1071, glmnet, kernlab, MLmetrics, randomForest, rlang, rpart, stats, utils, xgboost
  • Suggests: assertthat, doFuture, forcats, foreach, future, future.apply, furrr, ggplot2, knitr, progress, progressr, purrr, rmarkdown, rsample, testthat, tidyr

Usage

Check out the introductory vignette for a quick start tutorial. For a more in-depth discussion, read all the vignettes and/or take a look at the reference documentation.
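
For a taste of the interface, here is a minimal example in the spirit of the introductory vignette (otu_mini_bin is a small dataset bundled with the package; treat the specifics as a sketch rather than canonical usage):

library(mikropml)

# Train and evaluate a model with the package defaults;
# otu_mini_bin's outcome column is "dx"
results <- run_ml(otu_mini_bin,
                  method = "glmnet",
                  outcome_colname = "dx",
                  seed = 2019)

# The returned list includes the trained model and performance metrics
results$performance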

You can watch the Riffomonas Project series of video tutorials covering mikropml and other skills related to machine learning.

We also provide a Snakemake workflow for running mikropml locally or on an HPC. We highly recommend running mikropml with Snakemake or another workflow management system for reproducibility and scalability of ML analyses.

Help & Contributing

If you come across a bug, open an issue and include a minimal reproducible example.

If you have questions, create a new post in Discussions.

If you’d like to contribute, see our guidelines here.

Code of Conduct

Please note that the mikropml project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

License

The mikropml package is licensed under the MIT license. Text and images included in this repository, including the mikropml logo, are licensed under the CC BY 4.0 license.

Citation

To cite mikropml in publications, use:

Topçuoğlu BD, Lapp Z, Sovacool KL, Snitkin E, Wiens J, Schloss PD (2021). “mikropml: User-Friendly R Package for Supervised Machine Learning Pipelines.” Journal of Open Source Software, 6(61), 3073. doi:10.21105/joss.03073, https://joss.theoj.org/papers/10.21105/joss.03073.

A BibTeX entry for LaTeX users is:

 @Article{,
  title = {{mikropml}: User-Friendly R Package for Supervised Machine Learning Pipelines},
  author = {Begüm D. Topçuoğlu and Zena Lapp and Kelly L. Sovacool and Evan Snitkin and Jenna Wiens and Patrick D. Schloss},
  journal = {Journal of Open Source Software},
  year = {2021},
  volume = {6},
  number = {61},
  pages = {3073},
  doi = {10.21105/joss.03073},
  url = {https://joss.theoj.org/papers/10.21105/joss.03073},
} 

Why the name?

The word “mikrop” (pronounced “meek-ROPE”) is Turkish for “microbe”. This package was originally implemented as a machine learning pipeline for microbiome-based classification problems (see Topçuoğlu et al. 2020). We realized that these methods are applicable in many other fields too, but stuck with the name because we like it!

mikropml's People

Contributors

agarretto96, aj-kozik, btopcuoglu, courtneyarmour, github-actions[bot], jmastough, jmoltzau, kelly-sovacool, lucas-bishop, nlesniak, pschloss, sbrifkin, sklucas, tomkoset, wclose, zenalapp


mikropml's Issues

Make this an R package

Reasons:

  • Install the package with devtools::install_github().
  • Use testthat with continuous integration for automated testing.
  • Generate documentation with roxygen2 & pkgdown.

Required steps:

  • roxygen2 comments (#58)
    • Currently only comment skeletons are in place; the actual documentation needs to be written before #58 can be resolved.
  • pkg::fcn() syntax (#62)
  • fix modified caret models (#74)
    • Begüm removed the custom models for now.
  • pass devtools::check()

Remove obsolete code

There are code snippets and scripts that might not be used. Let's remove them to create a concise repo.

randomize features

Some models (L2-regularized logistic regression) have a position-dependent ranking issue that can result in a high weight being assigned to an all-zero feature. In the attached plot, the red points (all-zero features) correlate with rank/position: one of the last features receives a high weight and a low rank. Randomizing the feature order eliminates the weighting and ranking of all-zero features.
[Figure: logit_features_of_all_otus_otu_v_rank — feature weight vs. rank for all OTUs, with all-zero features shown in red]
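
A minimal sketch of the workaround (assumes the outcome is in the first column; the seed just makes the shuffle reproducible):

# Randomize the order of the feature columns, keeping the outcome first
set.seed(42)
data <- data[, c(1, sample(2:ncol(data)))]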

Change default hyperparameters

  • Use a large range of values for the default hyperparameters.
  • Check the random forest hyperparameter selection in model_selection.R and modify it if needed.

Fix data/caret_models structure and code/R/load_caret_models.R

Currently, there are R scripts in data/caret_models. Most of them are source code from the caret package, downloaded from GitHub. Two of them, svmLinear3.R and svmLinear4.R, are modified versions of the original code from caret. The script code/R/load_caret_models.R copies all of these files into the default R Library path. The purpose was to customize the svmLinear models. We need to preserve this purpose but fix some problems with the current implementation:

  • All R code should go in the R/ dir, per R Package requirements.
  • load_caret_models.R should not modify the R Library, as that will have unintended consequences for users who use caret in other projects. The ultimate solution will likely delete this script.
  • Any code from caret, modified or otherwise, should credit the original authors.

One potential solution is to fork the caret repo, modify those two files mentioned above, and include the forked version of caret as a dependency of this project (related SO post).

This issue must be fixed before this project can become an R package (#46).

update README to include --level in example scripts

The Rscript code/R/main.R ... examples do not work due to the missing --level argument. The error message for a missing argument is also unclear; it does not tell you that all of the arguments are required:

Error:
usage: main.R --seed=<num> --model=<name> --data=<csv> --hyperparams=<csv>
       --outcome=<colname> --level=<level> [--permutation]
usage: main.R --configfile=<yml>
usage: main.R --help
Execution halted

Add output option

Currently the pipeline writes the model results to the data folder inside the repository. Eventually the user will need to be able to run the pipeline from their own project directory and specify an output directory for the model results.
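
A sketch of what that could look like (the --outdir argument and the results object are hypothetical):

# Hypothetical output-directory option, defaulting to the working directory
outdir <- if (is.null(args$outdir)) getwd() else args$outdir
dir.create(outdir, showWarnings = FALSE, recursive = TRUE)
saveRDS(results, file.path(outdir, "model_results.rds"))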

Add your name to README.md to test Gitflow

To test our Gitflow, we will start by adding our names to README.md. Everyone should clone the repo, pull, make a branch, add their name, and open a pull request for my review.

need to make correlation matrix name vary

When running the model on different subsets of the data or on different levels of features, a different correlation matrix is needed. Currently the same file would have to be overwritten, which prevents running different datasets through the pipeline at the same time.

Update README

The main README.md file has vestiges of the paper repo this package was extracted from. It is not clear how I am supposed to interface with the package.

Add hyperparameter input file argument

A dataframe where the first column is the name of the parameter, the second column is the parameter value, and the third column is the model name.

Example:

param  val  model
cost   0.1  L2_Logistic_Regression
cost   1    L2_Logistic_Regression
cost   10   L2_Logistic_Regression
sigma  0.1  L2_Logistic_Regression
  • Go to tuning_grid.R and look at how the hyperparameters are currently defined.
  • Make an example file.
  • Make an input flag for the hyperparameter file.
  • Change the code in tuning_grid.R to take the input file, read it in, and extract each type of hyperparameter and its values (e.g. cost, sigma, etc.), instead of passing tuning_grid() a list as the hyperparameter value (see the sketch after this list).
  • Add a model column to the hyperparameters.
  • Test the code to see if it works.
  • Document how to use the hyperparameter flag, including what the input file should look like and the exact values needed for each ML method.
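
A minimal sketch of the reading/extraction step, assuming the file layout above (the file path is illustrative):

# Read the hyperparameter CSV and build a named list of values per parameter
hp <- read.csv("test/data/hyperparams.csv")        # columns: param, val, model
hp <- hp[hp$model == "L2_Logistic_Regression", ]   # keep rows for one model
hyperparams_list <- split(hp$val, hp$param)        # e.g. list(cost = c(0.1, 1, 10), sigma = 0.1)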

Hardcoded correlation matrix

The correlation matrix used during permutation tests is built from OTU data, so when I tried to run a permutation with my own assembly data, it didn't recognize any of the columns in the correlation matrix.

It would be nice to have the pipeline automatically generate a correlation matrix when using --permutation, but we could also explain how to do this in the documentation.
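
A minimal sketch of generating the matrix from whatever feature table the user supplies (the spearman choice is illustrative; assumes the outcome is in the first column):

# Build the correlation matrix from the feature columns of the input data
features <- data[, -1]                      # drop the outcome column
corr_matrix <- cor(features, method = "spearman")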

Fix license

  • Typo in file name
  • Information not filled in

Warning about OTUs with no variation

It's a bit disconcerting for users to see a warning message that there are OTUs with no variation:

Warning in preProcess.default(data, method = "range") :
  No variation for for: Otu00174, Otu00225, Otu00244, Otu00250, Otu00256, Otu00277, Otu00290, Otu00294, Otu00302, Otu00308, Otu00318, Otu00319, Otu00328, Otu00339, Otu00340, Otu00344, Otu00348, Otu00360, Otu00368, Otu00370, Otu00371, Otu00375, Otu00385, Otu00397, Otu00398, Otu00400, Otu00405, Otu00410, Otu00413, Otu00419, Otu00423, Otu00424, Otu00427, Otu00433, Otu00443, Otu00456, Otu00462, Otu00467, Otu00468, Otu00469, Otu00473, Otu00474, Otu00485, Otu00491, Otu00493, Otu00501, Otu00502, Otu00507, Otu00510, Otu00525, Otu00534, Otu00540, Otu00545, Otu00546, Otu00548, Otu00551, Otu00552, Otu00554, Otu00555, Otu00559, Otu00560, Otu00562, Otu00573, Otu00576, Otu00581, Otu00586, Otu00587, Otu00593, Otu00598, Otu00599, Otu00603, Otu00610, Otu00615, Otu00616, Otu00631, Otu00632, Otu00635, Otu00640, Otu00647, Otu00650, Otu00652, Otu00653, Otu00656, Otu00660, Otu00669, Otu00670, Otu00672, Otu00674, Otu00675, Otu00677, Otu00678, Otu00680, Otu00682, Otu00686, Otu00689, Otu00690, Otu00694, Otu00696 [... truncated]

Would you entertain a PR to remove these automatically without a warning?
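
For what it's worth, a minimal sketch of the automatic removal (assumes data holds only the feature columns; whether to stay silent or emit a message instead of a warning is up for discussion):

# Drop features that have no variation (a single unique value)
zero_var <- vapply(data, function(x) length(unique(x)) == 1, logical(1))
data <- data[, !zero_var, drop = FALSE]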

Write unit tests

  • Functionize code (“Every function should have a function.”)
  • Passing and failing test(s) for each function as a general rule of thumb.

Monitor the codecov report to detect code that isn't yet covered by a test.
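
A minimal sketch of what one such pair of tests might look like with testthat (the remove_zero_variance function is hypothetical):

# tests/testthat/test-preprocess.R (illustrative)
library(testthat)

test_that("zero-variance features are removed", {
  df <- data.frame(a = c(1, 2, 3), b = c(5, 5, 5))
  result <- remove_zero_variance(df)  # hypothetical function under test
  expect_equal(names(result), "a")
})

test_that("non-dataframe input fails", {
  expect_error(remove_zero_variance("not a dataframe"))
})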

Running Decision Tree causes error

[1] "Not running permutation importance" [1] "first outcome: normal" "second outcome: cancer" [1] "Machine learning formula:" dx ~ . <environment: 0x55698aa08028> [1] "Decision_Tree" Error in { : task 1 failed - "need at least two non-NA values to interpolate" Calls: run_model ... train.default -> nominalTrainWorkflow -> %op% -> <Anonymous> In addition: There were 50 or more warnings (use warnings() to see the first 50) Execution halted

command line run does not work due to bug with config file option

Getting an error when I run the test from the command line without the config file.
I run:

Rscript code/R/main.R --seed 1 --model L2_Logistic_Regression --data  test/data/small_input_data.csv --hyperparams test/data/hyperparams.csv --outcome dx

I get the error:

Error in read_yaml(args$configfile) : 
  'file' must be a character string or connection
Execution halted

If I comment out the config-file section in main.R, then it works:

#if ("configfile" %in% names(args)) {
#  args <- read_yaml(args$configfile)
#}
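
A likely fix, sketched below: if main.R uses a docopt-style parser, every documented option appears in names(args) even when it was not supplied, so the %in% test is always true. Checking for NULL instead only reads the YAML when --configfile was actually given:

if (!is.null(args$configfile)) {
  args <- read_yaml(args$configfile)
}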

Array jobs are not working with the current setup

#!/bin/bash

###############################
#                             #
#  1) Job Submission Options  #
#                             #
###############################

# Name
#SBATCH --job-name=L2Logistic


# Resources
# For MPI, increase ntasks-per-node
# For multithreading, increase cpus-per-task
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4GB
#SBATCH --time=20:00:00


# Account
#SBATCH --account=pschloss1
#SBATCH --partition=standard

# Logs
#SBATCH --mail-user=[email protected]
#SBATCH --mail-type=FAIL

# Environment
##SBATCH --array=1-100
##SBATCH --export=ALL
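
# NOTE: lines beginning with a double '#' are plain comments that Slurm
# ignores; only a single '#SBATCH' prefix is parsed as a directive. With
# --array disabled this way, no array tasks are ever scheduled, which would
# explain why array jobs are not working.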

# --array is array parameter, the same job will be submitted the length of the input,
# each with its own unique array id ($SLURM_ARRAY_TASK_ID)

# Load Modules:
#  1) R  2) Bioinformatics

#####################
#                   #
#  2) Job Commands  #
#                   #
#####################

# Vector index starts at 0 so shift array by one
seed=$(($SLURM_ARRAY_TASK_ID - 1))

# Print out which model is being run in each job
echo Using "L2 Logistic Regression"

# Using $SLURM_ARRAY_TASK_ID to select parameter set
Rscript code/R/main.R --seed $seed --model L2_Logistic_Regression --data  test/data/small_input_data.csv --hyperparams test/data/hyperparams.csv --outcome dx

NZV

Maybe consider adding Pat's code to model_pipeline.R to remove the warnings that occur when there are near-zero-variance OTUs.

# Identify the columns that have no variance and then remove them. The
# following code assumes that the data frame has the outcome variable column
# first, as was set in the if-else code block above. This helps us avoid a
# nuisance warning that OTUs have zero variance.
non_zero_variance_cols <- logical()
non_zero_variance_cols[outcome] <- TRUE
non_zero_variance_cols <- c(non_zero_variance_cols, apply(data[, 2:ncol(data)], 2, sd) != 0)
data <- data[, non_zero_variance_cols]

preProcValues <- caret::preProcess(data, method = "range")
dataTransformed <- predict(preProcValues, data)

All inputs and outputs should be R objects

  • Remove hard-coded file paths from code
  • The user should provide e.g. a dataframe or other object in a specific format, not a file path. The user is responsible for reading the file or otherwise creating the object, not the package.
    • Outcome vector
    • Feature dataframe
    • Hyperparameter dataframe (optional)
    • Correlation matrix (or make this for them based on certain parameters?)
  • Functions should return objects, not write files (see the sketch below).
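
A sketch of the kind of signature this implies (every name here is illustrative, not a final API):

# Hypothetical object-in, object-out entry point
run_pipeline <- function(features, outcome, hyperparams = NULL, corr_matrix = NULL) {
  stopifnot(is.data.frame(features), length(outcome) == nrow(features))
  # ... preprocess, train, evaluate ...
  list(model = NULL, performance = NULL)  # return objects; never write files
}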

Data file as an input in the command-line

Right now, the data file needed to create the model is not optional: you have to go into main.R and edit the code to change it. We don't want that. We need to make sure the user can supply the data file.

  • Define what the data should look like: the first column is the outcome, and the rest of the columns are features (see the sketch after this list).
  • Modify the code in main.R to take in the file.
  • Add a command-line argument for the data.
  • Check that the code works. :)
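
A minimal sketch of reading and splitting the agreed-upon layout (the file path and outcome name are illustrative):

# First column is the outcome; all remaining columns are features
data <- read.csv("test/data/small_input_data.csv")
outcome  <- data[[1]]                 # e.g. the "dx" column
features <- data[, -1, drop = FALSE]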
