Comments (9)
You want to run Scoary on continuous traits?
I'm working on an update for Scoary, but it's not ready yet. Approximately another month until testing makes sense.
I will use GaussianMixture for splitting.
You could simply pre-process your continuous traits with GaussianMixture yourself and then feed them into Scoary.
from scoary.
Dear @MrTomRod
Yes, I've been wanting to add that feature for five years now. I'm not sure about the best way to implement it though, since much of the Scoary functionality is so intimately tied to having binary categories. Some of the approaches I've given some thought to:
- Breaking continuous traits up into quantiles and treating these as binary, then using permutations to get "null distribution" test estimators.
- Phylogenetic GLS. (A bit hard to harmonize with the Scoary approach since it would likely involve some kind of dimensionality reduction algorithm for the tree structure and an explicit evolutionary model.)
- Replicated simulations of the continuous phenotype on the input tree with no respect to the genotype, again to create a null distribution to compare against. Maybe this would also require an explicit evolutionary model. My impression is that an Ornstein-Uhlenbeck process would be the best alternative, although there might be better alternatives I'm not up to speed with.
As you've noticed, Scoary is not exactly in very active development at the moment due to other pressing obligations, but I maintain the ambition to continue development on it. I will happily accept PRs or spin-offs as long as you credit the original work. Thanks for offering!
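To make the first option above a bit more concrete, here is a minimal sketch of binarizing a continuous trait at a quantile and building a permutation null for one gene. The function names, the test statistic (difference in gene-presence rate between the two trait groups), and the toy data are all assumptions made for illustration; this is not how Scoary does it, and plain label shuffling ignores the tree entirely.

```python
import random

def binarize_at_quantile(values, q=0.5):
    """Return 1 for values above the q-quantile cutoff, else 0 (hypothetical helper)."""
    ordered = sorted(values)
    cutoff = ordered[int(q * (len(ordered) - 1))]
    return [1 if v > cutoff else 0 for v in values]

def presence_gap(gene, trait):
    """Gene-presence rate in trait-positive strains minus the rate in trait-negative strains."""
    pos = [g for g, t in zip(gene, trait) if t == 1]
    neg = [g for g, t in zip(gene, trait) if t == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)

def permutation_pvalue(gene, trait, n_perm=2000, seed=0):
    """Two-sided permutation p-value: shuffle trait labels, recompute the gap each time."""
    rng = random.Random(seed)
    observed = abs(presence_gap(gene, trait))
    shuffled = list(trait)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(presence_gap(gene, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Toy data (made up): 12 strains, gene present only in the high-trait half.
continuous = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2]
gene = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

trait = binarize_at_quantile(continuous, q=0.5)
p_perm = permutation_pvalue(gene, trait)
print(trait, p_perm)
```

A phylogeny-aware version would have to replace the plain shuffle with something that respects the tree, which is where the evolutionary-model question in the other two options comes in.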
All the best,
Ola
from scoary.
I have been using GaussianMixture to split by a continuous trait. This is a histogram of an example trait:
- Blue is group 1
- Green is group 2
- Grey are those where the classifier was less than 85 % confident about which group the strain belongs to.
(has / has_not indicate whether a strain has the highest-scoring orthogene. Sorry for not plotting the estimated Gaussian distributions.)
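For anyone who wants to try this kind of split themselves, here is a rough sketch using scikit-learn's `GaussianMixture` with a posterior cutoff like the 85 % mentioned above. The function name `split_trait`, the use of -1 for low-confidence strains, and the choice to order groups by component mean are assumptions for this example, not necessarily what was used to produce the histogram.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def split_trait(values, min_confidence=0.85, seed=0):
    """Fit a two-component mixture; return 0/1 labels, or -1 where confidence is too low."""
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=seed).fit(x)
    posterior = gmm.predict_proba(x)  # shape (n_strains, 2)
    # Reorder columns by component mean so label 1 is always the high-valued group.
    order = np.argsort(gmm.means_.ravel())
    posterior = posterior[:, order]
    labels = posterior.argmax(axis=1)
    # Strains the classifier is unsure about are marked -1 (the "grey" ones).
    labels[posterior.max(axis=1) < min_confidence] = -1
    return labels

# Toy data (made up): two well-separated groups of 100 strains each.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(8.0, 1.0, 100)])
labels = split_trait(values)
print(np.unique(labels, return_counts=True))
```

Strains labelled -1 would then be left out (or treated as missing) when the binarized trait is handed to the association test.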
My approach is simple and straightforward, but maybe not too powerful. I need something to quickly work on thousands of continuous traits.
I'm not sure if I fully understand your suggestions, will have to think about that some more. Would you be willing to discuss this sometime or perhaps even support me a little if I decided to do this?
Btw, I've been using Boschloo's test instead of Fisher's, since it perfectly matches the problem and is more powerful. It is slower, though. Not sure if it's worth it.
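As a quick illustration of the two tests side by side, both are available in SciPy (`scipy.stats.fisher_exact` and `scipy.stats.boschloo_exact`, the latter from SciPy 1.7 on). The 2x2 counts below are made up, and the layout (rows: gene present/absent, columns: trait positive/negative) is an assumption for this sketch.

```python
from scipy.stats import boschloo_exact, fisher_exact

# Hypothetical contingency table:
#                 trait+  trait-
# gene present      12       3
# gene absent        4      11
table = [[12, 3],
         [4, 11]]

# One-sided tests in the same direction, so the two p-values are comparable.
_, fisher_p = fisher_exact(table, alternative="greater")
boschloo_p = boschloo_exact(table, alternative="greater").pvalue

print(f"Fisher:   {fisher_p:.5f}")
print(f"Boschloo: {boschloo_p:.5f}")
```

For a one-sided alternative, Boschloo's p-value should not exceed Fisher's, which is where the extra power comes from. The cost is a search over the nuisance parameter for every table, hence the slowdown.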
from scoary.
That looks promising! And thanks for teaching me about Boschloo's test.
Absolutely willing to work with you on this in the time I can contribute. You can get in contact with me at any time through my e-mail: [email protected].
from scoary.
I wanted to compare Fisher's vs Boschloo's test. To do this, I simulated 10 pangenomes for each combination of sample size: [25, 50, 75, 100, 150, 200] and penetrance: [90, 75]. (As in the paper.) While the p-values from Boschloo's test are uniformly lower, the sensitivity decreased! These are the results:
Each dot represents the results from one simulated pangenome. The x-axis is the rank of the 'causal' gene in the final table computed using Fisher's minus the rank computed using Boschloo's. In other words, if the resulting value is negative, Fisher's performed better, and if it is positive, Boschloo's performed better.
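In code, the quantity on the x-axis could be computed roughly like this. The gene names and p-values are invented, and ranking by p-value with ties sharing the best rank is just one possible convention assumed for this sketch.

```python
def rank_by_pvalue(pvalues, gene):
    """1-based rank of `gene`; genes with equal p-values share the best (lowest) rank."""
    target = pvalues[gene]
    return 1 + sum(1 for p in pvalues.values() if p < target)

# Hypothetical results for one simulated pangenome.
fisher = {"causal": 1e-6, "geneA": 1e-7, "geneB": 0.02}
boschloo = {"causal": 5e-7, "geneA": 6e-7, "geneB": 0.01}

# Fisher's rank minus Boschloo's rank: negative favours Fisher's, positive favours Boschloo's.
diff = rank_by_pvalue(fisher, "causal") - rank_by_pvalue(boschloo, "causal")
print(diff)
```

In this toy case the causal gene is ranked second by Fisher's and first by Boschloo's, so the difference is positive.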
I performed a Wilcoxon signed-rank test to see if Fisher and Boschloo perform differently: pvalue=0.057.
While Boschloo's test (imo justifiably) gives a lower p-value, Fisher's seems to perform better at ranking genes with the simulated data. I have no clue why that is, though.
from scoary.
Updated plot with improved ranking, based on pvalue instead of position in table. Didn't change the result.
pvalue=0.0024
from scoary.
I performed the same analysis with my fast-fisher library. It is now incredibly fast.
The causal gene always got the same rank as with scipy's implementation, except for two simulated datasets: in one, the rank was one higher, in the other, it was one lower.
from scoary.
Heya,
Any chance this feature will come soon? It's exactly what I have been looking for and would perfectly fit into my workflow. I'm also happy to test.
Best wishes,
Tse
from scoary.
I think @MrTomRod made the necessary updates and released them as Scoary-2. Kindly check his GitHub repo.
from scoary.