deicode-benchmarking's Issues

How is `n_components` to be selected?

I was looking back through the code for your simulations in the mSystems paper trying to understand how I would use gemelli with my own data. I'm wondering whether you could clarify for me how the rank/n_components argument in deicode/gemelli is supposed to be used.

In your Gemelli tutorials, I noticed that the default value of n_components (i.e., 3) is used in the IBD tutorial, where there are two groups (i.e., control and Crohn's), and that the Moving Pictures tutorial uses auto-rpca, where there isn't a clear number of groups to use.

However, when I looked through the code for your case studies, it appears that you're using rank=2 for the Sponge and rank=3 for the Sleep Apnea datasets when running deicode_rpca (see here). The rank for the Sponge dataset corresponds with the two levels of health_status, but I'd expect the rank for the Sleep Apnea dataset to be 2, since there are two levels in the exposure_type column of the metadata file. Meanwhile, in the negative and positive control simulations you use the default rank when running rclr and OptSpace, although I'd expect rank=2.

When I compared the output using different n_components values for the same dataset, there does appear to be significant variation in the resulting distances, which makes me think that the value chosen is important. Here's an example (I noticed that the default value for --min-feature-count changed between the paper and gemelli: it was 10 and is now zero)...

gemelli rpca --in-biom case_studies/data/Sponges/table.biom --output-dir rank_2 --min-sample-count 1000 --min-feature-count 10 --n-components 2
gemelli rpca --in-biom case_studies/data/Sponges/table.biom --output-dir rank_3 --min-sample-count 1000 --min-feature-count 10 --n-components 3
gemelli rpca --in-biom case_studies/data/Sponges/table.biom --output-dir rank_9 --min-sample-count 1000 --min-feature-count 10 --n-components 9
gemelli auto-rpca --in-biom case_studies/data/Sponges/table.biom --output-dir rank_auto --min-sample-count 1000 --min-feature-count 10
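
The four runs above differ only in the subcommand and the rank, so they could also be generated from a list of ranks. The following is a minimal sketch, assuming only the flags already shown in the commands above; the helper name `build_rpca_commands` and the convention of using `None` for the auto-rpca run are made up for illustration and are not part of gemelli.

```python
import shlex


def build_rpca_commands(table, ranks, min_sample_count=1000, min_feature_count=10):
    """Return one gemelli command (as an argv list) per requested rank.

    A rank of None stands for the auto-rpca run, which takes no --n-components.
    (Hypothetical helper; only the flags come from the commands in the issue.)
    """
    commands = []
    for rank in ranks:
        subcommand = "auto-rpca" if rank is None else "rpca"
        out_dir = "rank_auto" if rank is None else f"rank_{rank}"
        argv = [
            "gemelli", subcommand,
            "--in-biom", table,
            "--output-dir", out_dir,
            "--min-sample-count", str(min_sample_count),
            "--min-feature-count", str(min_feature_count),
        ]
        if rank is not None:
            argv += ["--n-components", str(rank)]
        commands.append(argv)
    return commands


commands = build_rpca_commands("case_studies/data/Sponges/table.biom", [2, 3, 9, None])
for argv in commands:
    # Only prints the command lines; to execute one, pass argv to
    # subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Keeping the filtering flags in one place makes it harder for the runs to differ in anything other than the rank.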

Here's a comparison of the resulting distances, with the distances from --n-components 3 on the x-axis. The auto-rpca approach estimates the rank to be 3.

[Figure: comparison of the distances from each run against the distances with --n-components 3]
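
One way such a comparison could be put into numbers is to correlate the pairwise distances from two runs. This is a minimal sketch, assuming both distance matrices are square, symmetric, and list the samples in the same order; the two small matrices are made-up toy values, not output from the Sponges data, and in practice they would be read from the distance matrix each run writes to its output directory.

```python
from itertools import combinations
from math import sqrt


def upper_triangle(matrix):
    """Flatten the strict upper triangle (each sample pair exactly once)."""
    n = len(matrix)
    return [matrix[i][j] for i, j in combinations(range(n), 2)]


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Toy distance matrices for three samples (hypothetical values).
dist_rank_3 = [[0.0, 1.0, 2.0],
               [1.0, 0.0, 3.0],
               [2.0, 3.0, 0.0]]
dist_rank_2 = [[0.0, 2.0, 4.0],
               [2.0, 0.0, 6.0],
               [4.0, 6.0, 0.0]]

r = pearson(upper_triangle(dist_rank_3), upper_triangle(dist_rank_2))
print(f"Pearson r between runs: {r:.3f}")
```

Note that a correlation only captures whether the two runs rank sample pairs similarly. It ignores differences in scale, so two runs can correlate perfectly (as in the toy example) while still producing distances of different magnitude.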

The first axis explains 84%, 66%, and 38% of the variation when using --n-components values of 2, 3, and 9, respectively.
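
Part of this drop may be a normalization effect rather than a change in the data. The sketch below assumes that the proportion explained is computed as each squared singular value divided by the sum over the retained components only; that assumption has not been checked against the gemelli source here, and the singular values are made up. Under that assumption, the first axis's share shrinks whenever more components are retained, even if its own singular value stays the same.

```python
def proportion_explained(singular_values):
    """Share of each axis: s_i^2 / sum_j s_j^2 over the retained axes only."""
    squared = [s ** 2 for s in singular_values]
    total = sum(squared)
    return [value / total for value in squared]


# Hypothetical singular values, largest first (not taken from any real run).
hypothetical_singular_values = [10.0, 4.0, 3.0, 2.0]

first_axis_share = {}
for k in (2, 3, 4):
    # Keep only the first k components, as a rank-k fit would.
    retained = hypothetical_singular_values[:k]
    first_axis_share[k] = proportion_explained(retained)[0]
    print(f"rank {k}: first axis share = {first_axis_share[k]:.3f}")
```

In a real run the fitted singular values can also change with the rank, so this only illustrates the normalization part of the effect.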

I think I've proven this to myself as I file this issue, but am I correct that we should be picking a --n-components value that we expect to correspond to the number of levels in the major treatment group if it is a cross-sectional study? I'm just a bit uneasy, since something else seemed to be going on in the mSystems paper and, for my example, auto-rpca didn't seem to pick the correct rank.
