sa-lee / starmie Goto Github PK
View Code? Open in Web Editor NEWstarmie: plotting and inference for population structure models :star2:
License: Other
starmie: plotting and inference for population structure models :star2:
License: Other
on line 43 of clummp.R index is wrong.
clummp.R throws an error on greedy when
Error in Q_1 - Q_2 : non-conformable arrays
Hi We have just generate a few structure outputs using command line structure:
structure -K 5 -D 9 -i o pop_K5_D9
The output is found as pop_K5_D9.out_f
However, when I try to load it to starmie:
library(starmie)
K5<-loadStructure("pop_K5_D9.out_f")
I get the following error
Error in dimnames(x) <- dn
length of 'dimnames' [2] not equal to array extent
Is there anything wrong with the parsing? I have not changed anything inside the output of structure. I am using structure v2.3.4 with standard parameters from SNP data (VCF converted to structure).
I look forward to hearing from you.
Kind regards
see http://www.cs.cmu.edu/~epxing/Class/10701-08s/recitation/dirichlet.pdf and
https://en.wikipedia.org/wiki/Dirichlet_distribution
k <- 20
r <- 20
n_samples <- 100
# generate Q matrix by sampling from diricihlet
# keep alpha hyper param constant (same mean and variance for each K)
Q_1 <- MCMCpack::rdirichlet(n_samples, rep(1, k))
# permute labels by shuffling and adding gaussian noise noise over number of runs
permute_and_add_noise <- function(Q_mat, r) {
logit <- function(x) { log(x / (1-x)) }
perturb <- function(Q_mat) {
# add gaussian noise to each row of matrix
q_random <- apply(Q_mat, 1,
function(y) 1 / (1 + exp(-rnorm(length(y),
mean = logit(y),
sd = 0.01))))
# normalise rows
q_normalised <- q_random / rowSums(q_random)
# shuffle labels
q_normalised[, sample(seq_len(ncol(q_normalised)))]
}
replicate(r - 1, perturb(Q_mat))
}
permute_and_add_noise(Q_1, r)
Will have to fix labelling at some point but it is a start.
better than CLUMPP possibly which has problem of multimodals
i.e. construct mainparams, extraparams
i.e. something like
makeReport(struct_objects)
will produce an interface with the tests, MCMC plots etc.
i.e. prestored as struct lists in a function call, makes running examples look cleaner.
i.e.
# Run example
microsat <- starmieGet("structure_mulitple_runs")
instead of the corresponding admix/struct/fast object.
For ADMIXTURE:
For STRUCTURE:
Since structure model is the same thing as LDA (with 'documents' being samples, 'topics' as latent populations, and 'words' as genetic markers) we could use some of the ideas in
http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf
https://github.com/cpsievert/LDAvis
to create some more informative/compelling plots
See also https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
A function to load multiple Admixture output files into a useful data frame.
Input:
Multiple Admixture .P and .Q files
The initial input file fed to Admixture
Output:
A sensible data frame that make for easy downstream analysis
plus implement AIC, BIC, DIC metrics.
http://www.genetics.org/content/early/2016/06/10/genetics.115.180992.long
simulation methods for testing evanno etc.
sample from posterior distrubition with known K, and admixture coeffecients
Do you have a reference for the formula you use in averagePairWiseSimilarityH?
this is generally inconsistent in the package, some methods return both the data and plot, others just the plot.
A function that runs the method of Evanno and plots ln(Pr(X|K) values in order to identify the k for which Pr(K=k) is highest (as described in STRUCTURE's manual, section 5.1)
Is it possible to load multiple STRUCTURE output files for the CLUMPPING together at once? At the moment it looks like that you have to load each output file separately
k6_msat <- loadStructure(k6_file),.... and combine them with k6_all <- structList(k6_msat, k6_run2) which can then be used to Q_list <- lapply(k6_all, getQ). It would be much easier if it was possible to load for example all 10 repeat files of a certain K at once with something like allfiles <- choose.files() and then Q_list <- lapply(allfiles, getQ). Or is that possible already?
Might be nice to have a function to convert input types although this is less important
if you have any good example datasets we could use from the output of each of these programs let me know.
Incriminating code is here for ancestral_df
ances_lines <- s_f[(which(grepl("^Inferred ancestry of.*",
s_f)) + 1):(which(grepl("^Estimated Allele Frequencies .*",
s_f)) - 1)]
ances_lines <- str_trim(ances_lines)
ances_lines <- gsub("[(:)]", "", ances_lines)
ances_lines <- str_split_fixed(ances_lines, "\\s+", n = 4 +
pops)
header <- ances_lines[1, ][1:3]
ances_lines <- ances_lines[2:nrow(ances_lines), 2:ncol(ances_lines)]
# if Label (column 1 contains a string value the next line will produce NAs
class(ances_lines) <- "numeric"
ancest_df <- data.frame(ances_lines)
colnames(ancest_df)[1:3] <- header
colnames(ancest_df)[4:ncol(ancest_df)] <- paste("Cluster",
seq(1, ncol(ancest_df) - 3))
Considering most of the utility of our package comes from combining multiple lists of struct objects - is it worth creating an S3 object that stores multiple Ks/runs?
A function to load multiple STRUCTURE output files.
Perhaps we should split this into multiple functions that iteratively add the output of the post processing programs (clump, harvester etc).
Inputs:
Output of STRUCTURE, clump, clumpak? and harvester.
Output:
A useful data frame for further downstream analysis.
Hi, when I go to add the data generated (from runStructure) via loadStructure, it keeps introducing NA's into the last column of the file! Therefore when I generate clumpp files, the final cluster is NA and I can't generate plots etc. !
k3_1 <- loadStructure("/R_STRUCTURE/run01_K_03.out_f")
k3_2 <- loadStructure("/R_STRUCTURE/run02_K_03.out_f")
k3_3 <- loadStructure("/R_STRUCTURE/run03_K_03.out_f")
NAs introduced by coercion
k3_struct <- structList(k3_1, k3_2, k3_3) #works
Q_k3 <- lapply(k3_struct, getQ) #works but then introduces NA's in last column e.g.:
Can I please have some help!
Take the output of the load admixture and load structure function and generate a stacked barplot split by K.
Input:
The output of load structure or load admixture
Output:
A ggplot2 object
Would be good to include support for the bootstrapping files for admixture. I.e when the option -B is used.
implement clumpak algorithm
The current populations parameter requires a data.frame with isolate then population in two seperate columns. This is clunky to use in practice as we usually just want to give it a vector. If you agree with this change @sa-lee I can implement it in the DEV although it will mean we have to update the docs.
i.e. get_admixture, get_sample etc etc
we've tried combinations of
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.