rprops / Phenoflow_package
R package offering functionality for the advanced analysis of microbial flow cytometry data
License: GNU General Public License v2.0
Direct analysis of growth curves from FCM data.
Build passed on Travis CI but installation failed on a Unix platform:
Need to troubleshoot (problem lies with flowCore installation).
Previous mention: http://stackoverflow.com/questions/40721182/error-in-installing-flowcore-package-r
UNIX information:
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.2.1511 (Core)
Release: 7.2.1511
Codename: Core
Everything installs smoothly on cmet server:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial
When multi-threading with diversity_rf for certain samples the memory footprint gets excessive.
We should consider:
A similarity-sensitive diversity metric may be of added value. Phyloseq's data structure allows including a phylogenetic tree within the object, from which it should be relatively easy to extract distance information. However, the question remains: how would you make a similarity-sensitive metric on flow cytometric fingerprint data?
I can start with this if you can supply some data/guidelines.
Ultimately, using a repository such as CRAN for the stable release version of our package would be more user friendly.
A simple R alternative to python fcsparser:parse (https://github.com/eyurtsev/fcsparser/blob/master/fcsparser/api.py) should be able to do the trick.
There's a new Bioconductor software package for FCM data online that allows ggplot-style plotting of flow cytometry-specific data structures. Might be nice to look into?
Here's the paper: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty441/5026650
Issue lies with flowFDA, flowCore or matrixStats:
* installing *source* package 'flowFDA' ...
** R
** inst
** preparing package for lazy loading
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
there is no package called 'matrixStats'
Error : package 'flowCore' could not be loaded
ERROR: lazy loading failed for package 'flowFDA'
* removing 'C:/Users/fpkerckh/Documents/R/win-library/3.3/flowFDA'
After installing flowCore, flowViz still needs to be installed:
* installing *source* package 'flowFDA' ...
** R
** inst
** preparing package for lazy loading
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
there is no package called 'IDPmisc'
Error : package 'flowViz' could not be loaded
ERROR: lazy loading failed for package 'flowFDA'
* removing 'C:/Users/fpkerckh/Documents/R/win-library/3.3/flowFDA'
Then, an issue with multcomp appears to be present:
* installing *source* package 'flowFDA' ...
** R
** inst
** preparing package for lazy loading
Error : package 'TH.data' required by 'multcomp' could not be found
ERROR: lazy loading failed for package 'flowFDA'
* removing 'C:/Users/fpkerckh/Documents/R/win-library/3.3/flowFDA'
Which appears to clear the issue.
So Phenoflow could only be installed after manual installation of flowViz, flowCore and multcomp.
Documentation for resampling strategies here
http://r-pkgs.had.co.nz/package.html : see name criteria
Make function that creates in silico communities from real data of axenic cultures.
in_silico()
This would make the current implementation more sound and robust. The downside is that it will be computationally intensive and require more memory due to the inefficiency of foreach.
The examples section of each command should contain executable code of data that can be loaded from the package.
In the wiki, at one point,
### Export ecological data to .csv file in the chosen directory
write.csv2(file="results.metrics.csv",
cbind(Diversity.fbasis, Evenness.fbasis,
Structural.organization.fbasis,
Coef.var.fbasis))
is called.
However, the objects Evenness.fbasis, Structural.organization.fbasis and Coef.var.fbasis are currently not generated in the alpha-diversity section.
In the corresponding vignette, I will remove them from this chunk. However @rprops: how would you like to proceed here? Shall we keep them in the wiki? Should they be added to the vignette?
Where is this initialized? Is this a global variable?
Phenoflow_package/R/RandomF_predict.R
Line 47 in 00687dc
In the Phenotypic Diversity Analysis wiki, at a given point we select maxval <- max(summary[,9]). Here the column index depends on which parameter sits in that column (FL1-H for the BD Accuri C6, but possibly something completely different for e.g. the BD FACSVerse). Why don't we change mytrans to mytrans <- function(x) x/max(x)?
Have made a test dataset from the PLOS paper available in Phenoflow.
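For the maxval selection in the Phenotypic Diversity Analysis wiki discussed above, one option that avoids the hard-coded column index is to look the channel up by name. The sketch below is only an illustration: summary_max is a toy matrix standing in for the per-sample maxima, and the channel name "FL1-H" is an assumption that would differ per instrument.

```r
# Toy stand-in for the per-sample maxima (rows = samples, columns = parameters).
# In the wiki this matrix is called `summary`; the values here are made up.
summary_max <- matrix(
  c(5.1, 4.8, 12.3, 11.9, 9.0, 9.4),
  nrow = 2,
  dimnames = list(c("sample1", "sample2"), c("FSC-H", "FL1-H", "FL3-H"))
)

# Assumed channel name; replace with the instrument's green fluorescence channel.
channel <- "FL1-H"
stopifnot(channel %in% colnames(summary_max))

# Select by name instead of by position, so a different column order cannot
# silently pick the wrong parameter.
maxval <- max(summary_max[, channel])
mytrans <- function(x) x / maxval
```

Selecting by name fails loudly when the channel is absent, whereas summary[,9] would quietly return whatever parameter happens to be in the ninth column.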
Recently, we saw with the Accuri that in high-throughput experiments, issues with data stability can go unnoticed. Could we implement a standardized check? We already (briefly) evaluated flowClean (http://bioconductor.org/packages/release/bioc/html/flowClean.html), but it appears to be quite slow.
Suggestions:
See e.g.
Phenoflow_package/R/RandomF_predict.R
Line 44 in 00687dc
This would allow creating multiple random forest models, one for each "group". For example, if you have three groups of strains that you would like to distinguish on a strain-by-strain basis per group. This should be straightforward to implement by combining apply() calls and returning the results in a list().
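The per-group idea can be sketched with split() and lapply(), which return one fitted object per group in a named list. Everything here is hypothetical: fit_per_group() is not a Phenoflow function, the toy data are random, and the fitting function is a simple per-strain mean used as a stand-in where a call such as randomForest::randomForest() would go.

```r
# Fit one model per group and return the results as a named list.
# `fit_fun` is any function that takes a data frame and returns a model object.
fit_per_group <- function(data, group_col, fit_fun) {
  pieces <- split(data, data[[group_col]])
  lapply(pieces, fit_fun)
}

# Toy data: three groups, two strains per group, one fluorescence channel.
set.seed(1)
toy <- data.frame(
  group  = rep(c("g1", "g2", "g3"), each = 20),
  strain = rep(c("A", "B"), times = 30),
  FL1    = rnorm(60)
)

# Stand-in "model": per-strain mean of FL1 within each group.
# A real implementation would train a classifier here instead.
models <- fit_per_group(
  toy, "group",
  function(d) aggregate(FL1 ~ strain, data = d, FUN = mean)
)
```

Each element of models can then be inspected or used for prediction independently, for example models$g1.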
This makes interpretation counter-intuitive, as the goal for most classification purposes is to achieve a diagonal matrix.
This should be fairly straightforward by translating prubbens' Python code (https://github.com/prubbens/InSilicoFlow/blob/master/insilico.py)
to R with: https://cran.r-project.org/web/packages/randomForest/index.html and https://cran.r-project.org/web/packages/party/index.html
Have to look into how to automatically install non-CRAN packages (flowCore, flowViz, easyGgplot2)
So that users have confidence that devtools::install_github() will deliver
https://docs.travis-ci.com/user/languages/r/
Consider using parallel::mclapply to speed up the computation by putting different samples on different nodes.
It could replace the for loop at https://github.com/rprops/Phenoflow_package/blob/master/R/Diversity_16S.R#L44
However:
In the long run using Rcpp would be the more efficient solution here.
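A minimal sketch of the mclapply idea, assuming the per-sample work can be written as a function of a single sample. sample_diversity() and the toy counts are hypothetical stand-ins, not the code in Diversity_16S.R. Because mclapply relies on forking, it only runs with mc.cores = 1 on Windows, so the sketch falls back to one core there.

```r
library(parallel)

# Hypothetical per-sample computation standing in for the body of the for loop:
# the Hill number of order 1 (exponential of Shannon entropy).
sample_diversity <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]  # drop empty categories so log() is defined
  exp(-sum(p * log(p)))
}

# Toy input: one count vector per sample.
samples <- list(
  a = c(10, 10, 10, 10),
  b = c(40, 0, 0, 0),
  c = c(25, 25, 0, 0)
)

# Forking is unavailable on Windows, so use a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each sample is processed in its own worker; names are preserved.
res <- unlist(mclapply(samples, sample_diversity, mc.cores = n_cores))
```

Each forked worker holds its own copy of whatever it modifies, so the number of cores is also a knob for the memory footprint.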
Will require custom css file.
When beta release is ready, use tags to create release branch
From Caret documentation for RandomForest feature importances:
"Random Forest: from the R package: “For each tree, the prediction accuracy on the out-of-bag portion of the data is recorded. Then the same is done after permuting each predictor variable. The difference between the two accuracies are then averaged over all trees, and normalized by the standard error. For regression, the MSE is computed on the out-of-bag data for each tree, and then the same computed after permuting a variable. The differences are averaged and normalized by the standard error. If the standard error is equal to 0 for a variable, the division is not done.”"
Now that the latest bioconductor release deals with buggy Accuri data, shouldn't we consider making flowAI part of the default packages loaded/installed with Phenoflow?
Implement createTimeSlices for random forest inference on time series FCM data.
Replace the flowClean approach in FCS_clean with the flowAI one, and call this function in Diversity_rf and RandomF_FCS
install.packages("formatR")
formatR::tidy_dir("R")
install.packages("lintr")
lintr::lint_package()
Problem: different plots from div_rf()
RandomF_FCS throws warnings when more than 1 dash occurs in the selected params (e.g. PerCP-Cy5.5-A). This is specifically due to
add_measuredparam <- unique(do.call(rbind, strsplit(param,"-"))[, 2])[1]
The warning that is returned is:
Warning message:
In (function (..., deparse.level = 1) : number of columns of result is not a multiple of vector length (arg 1)
Although this (luckily) still returns the desired parameter in the current cases, it is rather unpredictable behaviour that we should avoid. As a suggestion, maybe we should rather grep (greedily) or use a final-character anchor like .*-[AWH]$ or .*-[A-Z]$ to be more generic?
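The anchored-pattern suggestion could look like the sketch below, which keeps only the single capital letter after the last dash. The parameter names are examples and the variable names are chosen for illustration. Note that sub() returns its input unchanged when the pattern does not match, so names without a trailing -[A-Z] would pass through as-is.

```r
# Example parameter names, including one with more than one dash.
params <- c("FL1-H", "FSC-A", "PerCP-Cy5.5-A")

# Greedy ".*" consumes everything up to the last dash; the anchor "$" ensures
# the captured letter is the final character of the name.
suffix <- sub("^.*-([A-Z])$", "\\1", params)

# Same selection as the original line, without the ragged rbind().
add_measuredparam <- unique(suffix)[1]
```

Because each name is handled independently, the number of dashes no longer matters and the rbind() warning cannot occur.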
This will allow pre-merging prior to training the model if unbalanced data sets per group are provided.
FCS_pool on groups
FCS_resample to downsample to a specified number of cells
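The pool-then-downsample step can be illustrated on plain matrices of events (rows = cells). This is a sketch under stated assumptions, not the package's FCS_pool or FCS_resample: pool_groups() and resample_cells() are hypothetical helpers, and real code would operate on flowCore objects rather than matrices.

```r
# Pool samples that belong to the same group by stacking their events.
# `samples` is a list of matrices, `groups` a vector with one label per sample.
pool_groups <- function(samples, groups) {
  lapply(split(samples, groups), function(g) do.call(rbind, g))
}

# Downsample a matrix of events to a fixed number of cells.
resample_cells <- function(events, n_cells, replace = FALSE) {
  if (!replace && n_cells > nrow(events)) {
    stop("n_cells exceeds the number of available cells")
  }
  idx <- sample.int(nrow(events), size = n_cells, replace = replace)
  events[idx, , drop = FALSE]
}

# Toy data: three samples of unequal size, two of them in the same group.
set.seed(42)
samples <- list(
  matrix(rnorm(200), ncol = 2),
  matrix(rnorm(100), ncol = 2),
  matrix(rnorm(60),  ncol = 2)
)
groups <- c("g1", "g1", "g2")

pooled   <- pool_groups(samples, groups)
balanced <- lapply(pooled, resample_cells, n_cells = 30)
```

After this step every group contributes the same number of cells, which is the balancing the issue asks for before model training.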