schlosslab / mikropml

User-Friendly R Package for Supervised Machine Learning Pipelines

Home Page: http://www.schlosslab.org/mikropml

License: Other

R 84.67% TeX 15.33%
machine-learning r-package rstats

mikropml's Introduction

mikropml

(pronounced "meek-ROPE em el")

User-Friendly R Package for Supervised Machine Learning Pipelines


An interface to build machine learning models for classification and regression problems. mikropml implements the ML pipeline described by Topçuoğlu et al. (2020) with reasonable default options for data preprocessing, hyperparameter tuning, cross-validation, testing, model evaluation, and interpretation steps. See the website for more information, documentation, and examples.

Installation

You can install the latest release from CRAN:

install.packages('mikropml')

or the development version from GitHub:

# install.packages("devtools")
devtools::install_github("SchlossLab/mikropml")

or install from a terminal using conda or mamba:

mamba install -c conda-forge r-mikropml

Dependencies

  • Imports: caret, dplyr, e1071, glmnet, kernlab, MLmetrics, randomForest, rlang, rpart, stats, utils, xgboost
  • Suggests: assertthat, doFuture, forcats, foreach, future, future.apply, furrr, ggplot2, knitr, progress, progressr, purrr, rmarkdown, rsample, testthat, tidyr

Usage

Check out the introductory vignette for a quick start tutorial. For a more in-depth discussion, read all the vignettes and/or take a look at the reference documentation.
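
For a taste of the interface, here is a minimal example in the spirit of the introductory vignette (otu_mini_bin is a small dataset bundled with the package; treat the specifics as a sketch rather than canonical usage):

library(mikropml)

# Train and evaluate a model with the package defaults;
# otu_mini_bin's outcome column is "dx"
results <- run_ml(otu_mini_bin,
                  method = "glmnet",
                  outcome_colname = "dx",
                  seed = 2019)

# The returned list includes the trained model and performance metrics
results$performance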

You can watch the Riffomonas Project series of video tutorials covering mikropml and other skills related to machine learning.

We also provide a Snakemake workflow for running mikropml locally or on an HPC. We highly recommend running mikropml with Snakemake or another workflow management system for reproducibility and scalability of ML analyses.

Help & Contributing

If you come across a bug, open an issue and include a minimal reproducible example.

If you have questions, create a new post in Discussions.

If you’d like to contribute, see our guidelines here.

Code of Conduct

Please note that the mikropml project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

License

The mikropml package is licensed under the MIT license. Text and images included in this repository, including the mikropml logo, are licensed under the CC BY 4.0 license.

Citation

To cite mikropml in publications, use:

Topçuoğlu BD, Lapp Z, Sovacool KL, Snitkin E, Wiens J, Schloss PD (2021). “mikropml: User-Friendly R Package for Supervised Machine Learning Pipelines.” Journal of Open Source Software, 6(61), 3073. doi:10.21105/joss.03073, https://joss.theoj.org/papers/10.21105/joss.03073.

A BibTeX entry for LaTeX users is:

 @Article{,
  title = {{mikropml}: User-Friendly R Package for Supervised Machine Learning Pipelines},
  author = {Begüm D. Topçuoğlu and Zena Lapp and Kelly L. Sovacool and Evan Snitkin and Jenna Wiens and Patrick D. Schloss},
  journal = {Journal of Open Source Software},
  year = {2021},
  volume = {6},
  number = {61},
  pages = {3073},
  doi = {10.21105/joss.03073},
  url = {https://joss.theoj.org/papers/10.21105/joss.03073},
} 

Why the name?

The word “mikrop” (pronounced “meek-ROPE”) is Turkish for “microbe”. This package was originally implemented as a machine learning pipeline for microbiome-based classification problems (see Topçuoğlu et al. 2020). We realized that these methods are applicable in many other fields too, but stuck with the name because we like it!

mikropml's People

Contributors

agarretto96, aj-kozik, btopcuoglu, courtneyarmour, github-actions[bot], jmastough, jmoltzau, kelly-sovacool, lucas-bishop, nlesniak, pschloss, sbrifkin, sklucas, tomkoset, wclose, zenalapp


mikropml's Issues

Make this an R package

Reasons:

  • Install the package with devtools::install_github().
  • Use testthat with continuous integration for automated testing.
  • Generate documentation with roxygen2 & pkgdown.

Required steps:

  • roxygen2 comments (#58)
    • Currently only comment skeletons are in place; the actual documentation needs to be written before #58 can be resolved.
  • pkg::fcn() syntax (#62)
  • fix modified caret models (#74)
    • Begüm removed the custom models for now.
  • pass devtools::check()

Remove obsolete code

There are code snippets and scripts that might not be used. Let's remove them to create a concise repo.

randomize features

Some models (L2-regularized logistic regression) have a position-dependent ranking issue that can result in a high weight being assigned to an all-zero feature. In the attached plot, the red points (all-zero features) correlate with rank/position: one of the last features receives a high weight and a low rank. Randomizing the feature order eliminates the weighting and ranking of all-zero features.
[Figure: logit_features_of_all_otus_otu_v_rank — feature weight vs. rank for all OTUs, with all-zero features shown in red]
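
A minimal sketch of the workaround (assumes the outcome is in the first column; the seed just makes the shuffle reproducible):

# Randomize the order of the feature columns, keeping the outcome first
set.seed(42)
data <- data[, c(1, sample(2:ncol(data)))]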

Change default hyperparameters

  • Use a large range of values for the default hyperparameters.
  • Check the random forest hyperparameter selection in model_selection.R and modify it if needed.

Fix data/caret_models structure and code/R/load_caret_models.R

Currently, there are R scripts in data/caret_models. Most of them are source code from the caret package, downloaded from GitHub. Two of them, svmLinear3.R and svmLinear4.R, are modified versions of the original code from caret. The script code/R/load_caret_models.R copies all of these files into the default R Library path. The purpose was to customize the svmLinear models. We need to preserve this purpose but fix some problems with the current implementation:

  • All R code should go in the R/ dir, per R Package requirements.
  • load_caret_models.R should not modify the R Library, as that will have unintended consequences for users who use caret in other projects. The ultimate solution will likely delete this script.
  • Any code from caret, modified or otherwise, should credit the original authors.

One potential solution is to fork the caret repo, modify those two files mentioned above, and include the forked version of caret as a dependency of this project (related SO post).

This issue must be fixed before this project can become an R package (#46).

update README to include --level in example scripts

The Rscript code/R/main.R ... examples do not work due to the missing --level argument. The error message for a missing argument is also unclear; it does not tell you that all of the arguments are required:

Error:
usage: main.R --seed=<num> --model=<name> --data=<csv> --hyperparams=<csv>
       --outcome=<colname> --level=<level> [--permutation]
usage: main.R --configfile=<yml>
usage: main.R --help
Execution halted

Add output option

Currently the pipeline writes the model results to the data folder inside the repository. Eventually the user will need to be able to run the pipeline from their own project directory and specify an output directory for the model results.
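
A sketch of what that could look like (the --outdir argument and the results object are hypothetical):

# Hypothetical output-directory option, defaulting to the working directory
outdir <- if (is.null(args$outdir)) getwd() else args$outdir
dir.create(outdir, showWarnings = FALSE, recursive = TRUE)
saveRDS(results, file.path(outdir, "model_results.rds"))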

Add your name to README.md to test Gitflow

To test our Gitflow, we will start by adding our names to README.md. Everyone should clone the repo, pull, make a branch, add their name, and open a pull request for my review.

need to make correlation matrix name vary

When running the model on different subsets of the data or on different levels of features, a different correlation matrix is needed. Currently the same file would have to be overwritten, which prevents running different datasets through the pipeline at the same time.

Update README

The main README.md file has vestiges of the paper repo this package was extracted from. It is not clear how I am supposed to interface with the package.

Add hyperparameter input file argument

A dataframe where the first column is the name of the parameter, the second column is the parameter value, and the third column is the model name.

Example:

param  val  model
cost   0.1  L2_Logistic_Regression
cost   1    L2_Logistic_Regression
cost   10   L2_Logistic_Regression
sigma  0.1  L2_Logistic_Regression
  • Go to tuning_grid.R and look at how the hyperparameters are currently defined.
  • Make an example file.
  • Make an input flag for the hyperparameter file.
  • Change the code in tuning_grid.R to take the input file, read it in, and extract each type of hyperparameter and its values (e.g. cost, sigma, etc.), instead of passing tuning_grid() a list as the hyperparameter value (see the sketch after this list).
  • Add a model column to the hyperparameters.
  • Test the code to see if it works.
  • Document how to use the hyperparameter flag, including what the input file should look like and the exact values needed for each ML method.
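
A minimal sketch of the reading/extraction step, assuming the file layout above (the file path is illustrative):

# Read the hyperparameter CSV and build a named list of values per parameter
hp <- read.csv("test/data/hyperparams.csv")        # columns: param, val, model
hp <- hp[hp$model == "L2_Logistic_Regression", ]   # keep rows for one model
hyperparams_list <- split(hp$val, hp$param)        # e.g. list(cost = c(0.1, 1, 10), sigma = 0.1)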

Hardcoded correlation matrix

The correlation matrix used during permutation tests is built from OTU data, so when I tried to run a permutation with my own assembly data, it didn't recognize any of the columns in the correlation matrix.

It would be nice to have the pipeline automatically generate a correlation matrix when using --permutation, but we could also explain how to do this in the documentation.
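
A minimal sketch of generating the matrix from whatever feature table the user supplies (the spearman choice is illustrative; assumes the outcome is in the first column):

# Build the correlation matrix from the feature columns of the input data
features <- data[, -1]                      # drop the outcome column
corr_matrix <- cor(features, method = "spearman")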

Fix license

  • Typo in file name
  • Information not filled in

Warning about OTUs with no variation

It's a bit disconcerting for users to see a warning message that there are OTUs with no variation:

Warning in preProcess.default(data, method = "range") :
  No variation for for: Otu00174, Otu00225, Otu00244, Otu00250, Otu00256, Otu00277, Otu00290, Otu00294, Otu00302, Otu00308, Otu00318, Otu00319, Otu00328, Otu00339, Otu00340, Otu00344, Otu00348, Otu00360, Otu00368, Otu00370, Otu00371, Otu00375, Otu00385, Otu00397, Otu00398, Otu00400, Otu00405, Otu00410, Otu00413, Otu00419, Otu00423, Otu00424, Otu00427, Otu00433, Otu00443, Otu00456, Otu00462, Otu00467, Otu00468, Otu00469, Otu00473, Otu00474, Otu00485, Otu00491, Otu00493, Otu00501, Otu00502, Otu00507, Otu00510, Otu00525, Otu00534, Otu00540, Otu00545, Otu00546, Otu00548, Otu00551, Otu00552, Otu00554, Otu00555, Otu00559, Otu00560, Otu00562, Otu00573, Otu00576, Otu00581, Otu00586, Otu00587, Otu00593, Otu00598, Otu00599, Otu00603, Otu00610, Otu00615, Otu00616, Otu00631, Otu00632, Otu00635, Otu00640, Otu00647, Otu00650, Otu00652, Otu00653, Otu00656, Otu00660, Otu00669, Otu00670, Otu00672, Otu00674, Otu00675, Otu00677, Otu00678, Otu00680, Otu00682, Otu00686, Otu00689, Otu00690, Otu00694, Otu00696 [... truncated]

Would you entertain a PR to remove these automatically without a warning?
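
For what it's worth, a minimal sketch of the automatic removal (assumes data holds only the feature columns; whether to stay silent or emit a message instead of a warning is up for discussion):

# Drop features that have no variation (a single unique value)
zero_var <- vapply(data, function(x) length(unique(x)) == 1, logical(1))
data <- data[, !zero_var, drop = FALSE]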

Write unit tests

  • Functionize code (“Every function should have a function.”)
  • Passing and failing test(s) for each function as a general rule of thumb.

Monitor the codecov report to detect code that isn't yet covered by a test.
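
A minimal sketch of what one such pair of tests might look like with testthat (the remove_zero_variance function is hypothetical):

# tests/testthat/test-preprocess.R (illustrative)
library(testthat)

test_that("zero-variance features are removed", {
  df <- data.frame(a = c(1, 2, 3), b = c(5, 5, 5))
  result <- remove_zero_variance(df)  # hypothetical function under test
  expect_equal(names(result), "a")
})

test_that("non-dataframe input fails", {
  expect_error(remove_zero_variance("not a dataframe"))
})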

Running Decision Tree causes error

[1] "Not running permutation importance" [1] "first outcome: normal" "second outcome: cancer" [1] "Machine learning formula:" dx ~ . <environment: 0x55698aa08028> [1] "Decision_Tree" Error in { : task 1 failed - "need at least two non-NA values to interpolate" Calls: run_model ... train.default -> nominalTrainWorkflow -> %op% -> <Anonymous> In addition: There were 50 or more warnings (use warnings() to see the first 50) Execution halted

command line run does not work due to bug with config file option

Getting an error when I run the test from the command line without the config file.
I run:

Rscript code/R/main.R --seed 1 --model L2_Logistic_Regression --data  test/data/small_input_data.csv --hyperparams test/data/hyperparams.csv --outcome dx

I get the error:

Error in read_yaml(args$configfile) : 
  'file' must be a character string or connection
Execution halted

If I comment out the config-file section in main.R, then it works:

#if ("configfile" %in% names(args)) {
#  args <- read_yaml(args$configfile)
#}
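
A likely fix, sketched below: if main.R uses a docopt-style parser, every documented option appears in names(args) even when it was not supplied, so the %in% test is always true. Checking for NULL instead only reads the YAML when --configfile was actually given:

if (!is.null(args$configfile)) {
  args <- read_yaml(args$configfile)
}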

Array jobs are not working with the current setup

#!/bin/bash

###############################
#                             #
#  1) Job Submission Options  #
#                             #
###############################

# Name
#SBATCH --job-name=L2Logistic


# Resources
# For MPI, increase ntasks-per-node
# For multithreading, increase cpus-per-task
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4GB
#SBATCH --time=20:00:00


# Account
#SBATCH --account=pschloss1
#SBATCH --partition=standard

# Logs
#SBATCH --mail-user=[email protected]
#SBATCH --mail-type=FAIL

# Environment
##SBATCH --array=1-100
##SBATCH --export=ALL
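
# NOTE: lines beginning with a double '#' are plain comments that Slurm
# ignores; only a single '#SBATCH' prefix is parsed as a directive. With
# --array disabled this way, no array tasks are ever scheduled, which would
# explain why array jobs are not working.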

# --array is array parameter, the same job will be submitted the length of the input,
# each with its own unique array id ($SLURM_ARRAY_TASK_ID)

# Load Modules:
#  1) R  2) Bioinformatics

#####################
#                   #
#  2) Job Commands  #
#                   #
#####################

# Vector index starts at 0 so shift array by one
seed=$(($SLURM_ARRAY_TASK_ID - 1))

# Print out which model is being run in each job
echo Using "L2 Logistic Regression"

# Using $SLURM_ARRAY_TASK_ID to select parameter set
Rscript code/R/main.R --seed $seed --model L2_Logistic_Regression --data  test/data/small_input_data.csv --hyperparams test/data/hyperparams.csv --outcome dx

NZV

Maybe consider adding Pat's code to model_pipeline.R to remove the warnings that occur when there are near-zero-variance OTUs.

# Identify the columns that have no variance and then remove them. The
# following code assumes that the data frame has the outcome variable column
# first, as was set in the if-else code block above. This helps us avoid a
# nuisance warning that OTUs have zero variance.
non_zero_variance_cols <- logical()
non_zero_variance_cols[outcome] <- TRUE
non_zero_variance_cols <- c(non_zero_variance_cols, apply(data[, 2:ncol(data)], 2, sd) != 0)
data <- data[, non_zero_variance_cols]

preProcValues <- caret::preProcess(data, method = "range")
dataTransformed <- predict(preProcValues, data)

All inputs and outputs should be R objects

  • Remove hard-coded file paths from code
  • The user should provide e.g. a dataframe or other object in a specific format, not a file path. The user is responsible for reading the file or otherwise creating the object, not the package.
    • Outcome vector
    • Feature dataframe
    • Hyperparameter dataframe (optional)
    • Correlation matrix (or make this for them based on certain parameters?)
  • Functions should return objects, not write files (see the sketch below).
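
A sketch of the kind of signature this implies (every name here is illustrative, not a final API):

# Hypothetical object-in, object-out entry point
run_pipeline <- function(features, outcome, hyperparams = NULL, corr_matrix = NULL) {
  stopifnot(is.data.frame(features), length(outcome) == nrow(features))
  # ... preprocess, train, evaluate ...
  list(model = NULL, performance = NULL)  # return objects; never write files
}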

Data file as an input in the command-line

Right now, the data file needed to create the model is not optional: you have to go into main.R and edit the code to change it. We don't want that. We need to make sure the user can supply the data file.

  • Define what the data should look like: the first column is the outcome, and the rest of the columns are features (see the sketch after this list).
  • Modify the code in main.R to take in the file.
  • Add a command-line argument for the data.
  • Check that the code works. :)
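
A minimal sketch of reading and splitting the agreed-upon layout (the file path and outcome name are illustrative):

# First column is the outcome; all remaining columns are features
data <- read.csv("test/data/small_input_data.csv")
outcome  <- data[[1]]                 # e.g. the "dx" column
features <- data[, -1, drop = FALSE]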
