The starmie's discuss from sa-lee

Bug: clummp greedyLargeK broken

on line 43 of clummp.R index is wrong.

clummp.R throws an error on greedy when
Error in Q_1 - Q_2 : non-conformable arrays

Cannot load structure output

Hi We have just generate a few structure outputs using command line structure:

structure -K 5 -D 9 -i o pop_K5_D9
The output is found as pop_K5_D9.out_f

However, when I try to load it to starmie:

library(starmie)
K5<-loadStructure("pop_K5_D9.out_f")

I get the following error

Error in dimnames(x) <- dn
     length of 'dimnames' [2] not equal to array extent

Is there anything wrong with the parsing? I have not changed anything inside the output of structure. I am using structure v2.3.4 with standard parameters from SNP data (VCF converted to structure).

I look forward to hearing from you.

Kind regards

simulate Q matrix with repeated runs from dirichlet distribution

see http://www.cs.cmu.edu/~epxing/Class/10701-08s/recitation/dirichlet.pdf and
https://en.wikipedia.org/wiki/Dirichlet_distribution

k <- 20
r <- 20
n_samples <- 100
# generate Q matrix by sampling from diricihlet
# keep alpha hyper param constant (same mean and variance for each K)
Q_1 <- MCMCpack::rdirichlet(n_samples, rep(1, k))
# permute labels by shuffling and adding gaussian noise noise over number of runs
permute_and_add_noise <- function(Q_mat, r) {
  logit <- function(x) { log(x / (1-x)) }
  perturb <- function(Q_mat) { 
    # add gaussian noise to each row of matrix
    q_random <- apply(Q_mat, 1, 
                      function(y) 1 / (1 + exp(-rnorm(length(y), 
                                                      mean = logit(y), 
                                                      sd = 0.01))))
    # normalise rows 
    q_normalised <- q_random / rowSums(q_random)
    # shuffle labels 
    q_normalised[, sample(seq_len(ncol(q_normalised)))]
  }

  replicate(r - 1, perturb(Q_mat))
}

permute_and_add_noise(Q_1, r)

Will have to fix labelling at some point but it is a start.

fixed k-means for looking at best permutation

better than CLUMPP possibly which has problem of multimodals

Generate STRUCTURE minimal files

i.e. construct mainparams, extraparams

local shiny interface

i.e. something like

makeReport(struct_objects)

will produce an interface with the tests, MCMC plots etc.

cleaner example datasets

i.e. prestored as struct lists in a function call, makes running examples look cleaner.
i.e.

# Run example
microsat <- starmieGet("structure_mulitple_runs")

bug: subsetting an admixList/structList/fastList to a single object returns a list

instead of the corresponding admix/struct/fast object.

diagnostic plots

For ADMIXTURE:

plot CV error against K
plot final loglikelihood against K

For STRUCTURE:

plot MCMC reps against loglikelihood/alpha values over each K values

Since structure model is the same thing as LDA (with 'documents' being samples, 'topics' as latent populations, and 'words' as genetic markers) we could use some of the ideas in
http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf
https://github.com/cpsievert/LDAvis
to create some more informative/compelling plots

functions that make use of function from another library only once should be in suggests

Vignette preparation

how to run structure so it actually works with our package
how to run admixture so it actually works with our package
about the starmie object
autoplot and fortify methods

print method for struct object

Load Admixture Data

A function to load multiple Admixture output files into a useful data frame.

Input:
Multiple Admixture .P and .Q files
The initial input file fed to Admixture

Output:
A sensible data frame that make for easy downstream analysis

clean up bestK function

plus implement AIC, BIC, DIC metrics.

http://www.genetics.org/content/early/2016/06/10/genetics.115.180992.long

simulate from STRUCTURE model

simulation methods for testing evanno etc.
sample from posterior distrubition with known K, and admixture coeffecients

averagePairWiseSimilarityH

Do you have a reference for the formula you use in averagePairWiseSimilarityH?

plot methods should return ggplot object

this is generally inconsistent in the package, some methods return both the data and plot, others just the plot.

Estimate the best K

A function that runs the method of Evanno and plots ln(Pr(X|K) values in order to identify the k for which Pr(K=k) is highest (as described in STRUCTURE's manual, section 5.1)

cody tidy up before release

remove loadMetaData.R
move plotRanks.R to dev branch
move simulateQmatrix.R to dev branch

inputs to functions should take struct, admix or Q matrix (either as lists or individually)

correlation metric for label switching problem

Loading multiple files at once

Is it possible to load multiple STRUCTURE output files for the CLUMPPING together at once? At the moment it looks like that you have to load each output file separately
k6_msat <- loadStructure(k6_file),.... and combine them with k6_all <- structList(k6_msat, k6_run2) which can then be used to Q_list <- lapply(k6_all, getQ). It would be much easier if it was possible to load for example all 10 repeat files of a certain K at once with something like allfiles <- choose.files() and then Q_list <- lapply(allfiles, getQ). Or is that possible already?

Convert Admixture inputs to Structure and vice versa

Might be nice to have a function to convert input types although this is less important

loadAdmixture fails on the log files if cv hasnt been run.

test datasets

if you have any good example datasets we could use from the output of each of these programs let me know.

Bug: loadStructure deletes sample/pop labels if they are non-numeric

Incriminating code is here for ancestral_df

   ances_lines <- s_f[(which(grepl("^Inferred ancestry of.*", 
        s_f)) + 1):(which(grepl("^Estimated Allele Frequencies .*", 
        s_f)) - 1)]
    ances_lines <- str_trim(ances_lines)
    ances_lines <- gsub("[(:)]", "", ances_lines)
    ances_lines <- str_split_fixed(ances_lines, "\\s+", n = 4 + 
        pops)
    header <- ances_lines[1, ][1:3]
    ances_lines <- ances_lines[2:nrow(ances_lines), 2:ncol(ances_lines)]
# if Label (column 1 contains a string value the next line will produce NAs
    class(ances_lines) <- "numeric"
    ancest_df <- data.frame(ances_lines)
    colnames(ancest_df)[1:3] <- header
    colnames(ancest_df)[4:ncol(ancest_df)] <- paste("Cluster", 
        seq(1, ncol(ancest_df) - 3))

starmieList object?

Considering most of the utility of our package comes from combining multiple lists of struct objects - is it worth creating an S3 object that stores multiple Ks/runs?

Load Structure Data

A function to load multiple STRUCTURE output files.

Perhaps we should split this into multiple functions that iteratively add the output of the post processing programs (clump, harvester etc).

Inputs:
Output of STRUCTURE, clump, clumpak? and harvester.

Output:
A useful data frame for further downstream analysis.

loadStructure introducing NAs

Hi, when I go to add the data generated (from runStructure) via loadStructure, it keeps introducing NA's into the last column of the file! Therefore when I generate clumpp files, the final cluster is NA and I can't generate plots etc. !

k3_1 <- loadStructure("/R_STRUCTURE/run01_K_03.out_f")
k3_2 <- loadStructure("/R_STRUCTURE/run02_K_03.out_f")
k3_3 <- loadStructure("/R_STRUCTURE/run03_K_03.out_f")

NAs introduced by coercion

k3_struct <- structList(k3_1, k3_2, k3_3) #works
Q_k3 <- lapply(k3_struct, getQ) #works but then introduces NA's in last column e.g.:

Can I please have some help!

reduce package size

run admixture on smaller dataset
delete logs unless they are used by an example function

Generate barplot

Take the output of the load admixture and load structure function and generate a stacked barplot split by K.

Input:
The output of load structure or load admixture

Output:
A ggplot2 object

Clumpp needs to handle k=1 gracefully

add phlyogentic trees to bar plots

Add in bootstrapping support for admixture

Would be good to include support for the bootstrapping files for admixture. I.e when the option -B is used.

label switching package

https://cran.r-project.org/web/packages/label.switching/label.switching.pdf

clumpak

implement clumpak algorithm

populations parameter should be a character vector

The current populations parameter requires a data.frame with isolate then population in two seperate columns. This is clunky to use in practice as we usually just want to give it a vector. If you agree with this change @sa-lee I can implement it in the DEV although it will mean we have to update the docs.

linkage/no-linkage
-admixture/no-admixture
a few others.
need to see if struct objects are still given sensisble output if main/extraparams change

sa-lee / starmie Goto Github PK

starmie's Issues

Recommend Projects

Recommend Topics

Recommend Org