cameronmartino / deicode-benchmarking
deicode benchmarking repo
I was looking back through the code for your simulations in the mSystems paper, trying to understand how I would use gemelli with my own data. I'm wondering whether you could clarify for me how the rank/n_components argument in deicode/gemelli is supposed to be used.
In your Gemelli tutorials, I noticed that the default value of n_components (i.e., 3) is used in the IBD tutorial, where there are two groups (i.e., control and Crohn's), and that the Moving Pictures tutorial uses auto-rpca, where there isn't a clear number of groups to use.
However, when I looked through the code for your case studies, it appears that you're using rank=2 for the Sponge and rank=3 for the Sleep Apnea datasets when running deicode_rpca (see here). The rank for the Sponge dataset corresponds with the two levels of health_status, but I'd expect the rank for the Sleep Apnea dataset to be 2, since there are two levels in the exposure_type column of the metadata file. Meanwhile, in the negative and positive control simulations you use the default when running rclr and OptSpace, although I'd expect it to be rank=2.
When I compared the output using different n_components values for the same dataset, there does appear to be substantial variation in the resulting distances, which makes me think that the value chosen is important. Here's an example (I noticed that the default value for --min-feature-count changed between the paper and gemelli: it was 10 and is now zero):
gemelli rpca --in-biom case_studies/data/Sponges/table.biom --output-dir rank_2 --min-sample-count 1000 --min-feature-count 10 --n-components 2
gemelli rpca --in-biom case_studies/data/Sponges/table.biom --output-dir rank_3 --min-sample-count 1000 --min-feature-count 10 --n-components 3
gemelli rpca --in-biom case_studies/data/Sponges/table.biom --output-dir rank_9 --min-sample-count 1000 --min-feature-count 10 --n-components 9
gemelli auto-rpca --in-biom case_studies/data/Sponges/table.biom --output-dir rank_auto --min-sample-count 1000 --min-feature-count 10
Here's a comparison of the resulting distances, with the distances from --n-components 3 on the x-axis. The auto-rpca approach estimates the rank at 3.
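One way to put a number on this kind of comparison is to correlate the pairwise distances from two runs. The sketch below is only illustrative. It assumes each output directory contains a square, tab-separated distance matrix with sample IDs in the header row and first column. The file name in the comment is an assumption, and the helper names are hypothetical rather than part of gemelli.

```python
import csv
import io
import math


def read_distance_matrix(handle):
    """Parse a square TSV distance matrix into (ids, {(id_a, id_b): distance})."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    ids = rows[0][1:]  # header row: empty corner cell, then sample IDs
    dist = {}
    for row in rows[1:]:
        for col_id, value in zip(ids, row[1:]):
            dist[(row[0], col_id)] = float(value)
    return ids, dist


def paired_distances(ids_x, dist_x, ids_y, dist_y):
    """Collect each unique sample pair once, using only samples in both runs."""
    shared = sorted(set(ids_x) & set(ids_y))
    xs, ys = [], []
    for i, a in enumerate(shared):
        for b in shared[i + 1:]:
            xs.append(dist_x[(a, b)])
            ys.append(dist_y[(a, b)])
    return xs, ys


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy matrices standing in for real output; with real data one might instead
# open e.g. "rank_3/distance-matrix.tsv" (assumed file name).
toy_rank_3 = "\tA\tB\tC\nA\t0\t1\t2\nB\t1\t0\t3\nC\t2\t3\t0\n"
toy_rank_2 = "\tA\tB\tC\nA\t0\t2\t4\nB\t2\t0\t6\nC\t4\t6\t0\n"

ids_3, dist_3 = read_distance_matrix(io.StringIO(toy_rank_3))
ids_2, dist_2 = read_distance_matrix(io.StringIO(toy_rank_2))
xs, ys = paired_distances(ids_3, dist_3, ids_2, dist_2)
print(pearson(xs, ys))
```

Because the entries of a distance matrix are not independent, a correlation computed this way is descriptive only. A Mantel test would be the usual choice if a p-value were needed.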
The first axis explains 84, 66, and 38% of the variation when using --n-components values of 2, 3, and 9, respectively.
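Part of that drop could simply be arithmetic. If the proportion explained is normalized over only the retained components (an assumption about how the values are computed, not something confirmed here), then the first axis's share shrinks as more axes are kept. The sketch below uses a made-up spectrum, not the Sponge data, and a hypothetical helper name.

```python
def proportion_explained(singular_values, n_components):
    """Share of variance per retained axis, normalized over the retained axes only."""
    kept = singular_values[:n_components]
    total = sum(s ** 2 for s in kept)
    return [s ** 2 / total for s in kept]


# Hypothetical singular values, chosen only for illustration.
spectrum = [9.0, 4.0, 3.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0]

for k in (2, 3, 9):
    first_axis = proportion_explained(spectrum, k)[0]
    print(f"n_components={k}: first axis share = {first_axis:.2f}")
```

In a real fit the singular values themselves can change with the requested rank, so this shows only the normalization effect and does not explain the differences in the distances.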
I think I've proven this to myself as I file this issue, but am I correct that we should pick an --n-components value that corresponds to the number of levels we expect in the major treatment group if it is a cross-sectional study? I'm just a bit uneasy, since something else seemed to be going on in the mSystems paper and, for my example, auto-rpca didn't seem to pick the correct rank.