The deicode-benchmarking from cameronmartino

How is `n_components` to be selected?

I was looking back through the code for your simulations in the mSystems paper trying to understand how I would use gemelli with my own data. I'm wondering whether you could clarify for me how the rank/n_components argument in deicode/gemelli is supposed to be used.

In your gemelli tutorials, I noticed that the IBD tutorial uses the default value of n_components (i.e., 3) when there are two groups (i.e., control and Crohn's), while the Moving Pictures tutorial uses auto-rpca when there isn't a clear number of groups to use.

However, when I looked through the code for your case studies, it appears that you're using rank=2 for the Sponge and rank=3 for the Sleep Apnea datasets when running deicode_rpca (see here). The rank for the Sponge dataset corresponds with the two levels of health_status, but I'd expect the rank for the Sleep Apnea dataset to be 2 since there are two levels in the exposure_type column of the metadata file. Meanwhile, in the negative and positive control simulations you use the default when running rclr and OptSpace although I'd expect it to be rank=2.

When I compared the output using different n_components values for the same dataset, there does appear to be significant variation in the resulting distances, which makes me think that the value chosen is important. Here's an example (I noticed that the default value for --min-feature-count changed from 10 in the paper to zero in gemelli)...

gemelli rpca --in-biom case_studies/data/Sponges/table.biom --output-dir rank_2 --min-sample-count 1000 --min-feature-count 10 --n-components 2
gemelli rpca --in-biom case_studies/data/Sponges/table.biom --output-dir rank_3 --min-sample-count 1000 --min-feature-count 10 --n-components 3
gemelli rpca --in-biom case_studies/data/Sponges/table.biom --output-dir rank_9 --min-sample-count 1000 --min-feature-count 10 --n-components 9
gemelli auto-rpca --in-biom case_studies/data/Sponges/table.biom --output-dir rank_auto --min-sample-count 1000 --min-feature-count 10

Here's a comparison of the distances from each run, with the distances from --n-components 3 on the x-axis. The auto-rpca approach estimates the rank at 3.
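
For anyone who wants to reproduce this kind of comparison without the Sponge data, here is a minimal sketch. It does not call gemelli: it uses a plain truncated SVD in NumPy as a stand-in for rclr plus OptSpace, and the synthetic table, the helper names `low_rank_distances` and `distance_correlation`, and the two-group shift are all hypothetical. It only illustrates how distances from different ranks can be compared against the rank-3 distances.

```python
import numpy as np


def low_rank_distances(matrix, n_components):
    """Pairwise Euclidean distances between samples (rows) in a rank-k SVD embedding.

    Stand-in for an RPCA distance matrix; this is NOT gemelli's OptSpace solver.
    """
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    coords = u[:, :n_components] * s[:n_components]  # sample coordinates
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def distance_correlation(d1, d2):
    """Pearson correlation between the upper triangles of two distance matrices."""
    iu = np.triu_indices_from(d1, k=1)
    return np.corrcoef(d1[iu], d2[iu])[0, 1]


# Hypothetical data: 30 samples x 50 features, two groups that differ in 5 features.
rng = np.random.default_rng(0)
groups = np.repeat([0, 1], 15)
data = rng.normal(size=(30, 50))
data[:, :5] += 3.0 * groups[:, None]

reference = low_rank_distances(data, 3)
for k in (2, 3, 9):
    r = distance_correlation(low_rank_distances(data, k), reference)
    print(f"n_components={k}: correlation with rank-3 distances = {r:.3f}")
```

With real gemelli output, the same `distance_correlation` helper could be applied to the distance matrices written to each output directory, once they are loaded as square arrays with samples in the same order.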

The first axis explains 84, 66, and 38% of the variation when using --n-components values of 2, 3, and 9, respectively.
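
Part of this drop may simply be arithmetic. A minimal sketch, assuming the reported proportion is normalized over the retained components only (an assumption about how the tool reports it, not something confirmed here): the first axis's share is then s_1^2 divided by the sum of s_i^2 for i up to n_components, so it can only shrink as more components are kept. The code below uses a plain SVD on synthetic data rather than OptSpace, which refits the decomposition for each rank, so the real numbers will not follow this exactly. The function name `first_axis_share` is hypothetical.

```python
import numpy as np


def first_axis_share(singular_values, n_components):
    """Share of variance on axis 1 when normalizing over the first k components only."""
    kept = np.asarray(singular_values[:n_components], dtype=float) ** 2
    return kept[0] / kept.sum()


# Hypothetical data: 30 samples x 50 features of random noise.
rng = np.random.default_rng(0)
data = rng.normal(size=(30, 50))
centered = data - data.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)  # singular values, descending

shares = {k: first_axis_share(s, k) for k in (2, 3, 9)}
for k, share in shares.items():
    print(f"n_components={k}: first axis share = {share:.2%}")
```

Under that assumption, the percentages from runs with different --n-components values are not directly comparable, because each one uses a different denominator.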

I think I've proven this to myself as I file this issue, but am I correct that, for a cross-sectional study, we should pick a --n-components value that corresponds to the number of levels we expect in the major treatment group? I'm just a bit uneasy since something else seemed to be going on in the mSystems paper, and for my example, auto-rpca didn't seem to pick the correct rank.
