gtonkinhill / rhierbaps Goto Github PK

View Code? Open in Web Editor NEW

33.0 5.0 3.0 1.67 MB

rhierbaps: R implementation of hierBAPS

License: Other

R 100.00%

population-genomics population-genetics population-structure r

rhierbaps's Introduction

rhierbaps

We have recently developed a faster verion of the BAPs clustering method. It can be found here.

Installation

rhierbaps is available on CRAN.

install.packages("rhierbaps")

The development version is available on github. It can be installed with devtools

install.packages("devtools")

devtools::install_github("gtonkinhill/rhierbaps")

If you would like to also build the vignette with your installation run:

devtools::install_github("gtonkinhill/rhierbaps", build_vignettes = TRUE)

Quick Start

Run hierBAPS.

# install.packages('rhierbaps')
library(rhierbaps)

fasta.file.name <- system.file("extdata", "seqs.fa", package = "rhierbaps")
snp.matrix <- load_fasta(fasta.file.name)
hb.results <- hierBAPS(snp.matrix, max.depth = 2, n.pops = 20, quiet = TRUE)
head(hb.results$partition.df)
#>   Isolate level 1 level 2
#> 1       1       1       1
#> 2       2       1       1
#> 3       3       1       1
#> 4       4       2       5
#> 5       5       3       9
#> 6       6       3       9

The hierBAPS algorithm was introduced in (Cheng et al. 2013) and provides a method for hierarchically clustering DNA sequence data to reveal nested population structure. Previously the algorithm was available as a compiled MATLAB binary. We provide a convenient R implementation and include a number of useful additional options including the ability to use multiple cores, save the log marginal likelihood scores and run the algorithm until local convergence. Furthermore, we provide a wrapper to a ggtree plotting function allowing for easy exploration of sub-clusters.

Things to keep in mind before running hierBAPS

hierBAPS uses a uniform prior for K.
The prior for a site depend on the available snps, i.e. if a site only has ‘AC,’ then the prior for ‘ACGT’ is (1/2, 1/2, 0, 0)
The initial sequence partition is generated using hierarchical clustering with complete linkage based on a Hamming distance matrix.
The initial number of populations should be set much higher than the expected number of populations.
More search rounds of the algorithm can be added using the n.extra.rounds parameter.
To get reproducible results the seed in R must be set.

Libraries

library(rhierbaps)
library(ggtree)
library(phytools)
library(ape)

set.seed(1234)

Loading data

We first need to load a multiple sequence alignment in fasta format. We can then generate the required SNP matrix.

fasta.file.name <- system.file("extdata", "seqs.fa", package = "rhierbaps")
snp.matrix <- load_fasta(fasta.file.name)

If you wish to include singleton SNPs (those that appear in only one isolate) then set keep.singletons=FALSE. However, this is currently advised against as these SNPs lead to a higher number of parameters in the model and do not provide information about shared ancestry.

It is also possible to load an ape DNAbin object. Here me make use of the woodmouse dataset in ape.

data(woodmouse)
woodmouse.snp.matrix <- load_fasta(woodmouse)

Running hierBAPS

We now need to decide how many levels of clustering we are interested in and the number of initial clusters to start from. It is a good idea to choose n.pops to be significantly larger than the number of clusters you expect.

To run hierBAPS with 2 levels and 20 initial clusters we run

hb.results <- hierBAPS(snp.matrix, max.depth = 2, n.pops = 20, quiet = TRUE)
head(hb.results$partition.df)
#>   Isolate level 1 level 2
#> 1       1       1       1
#> 2       2       1       1
#> 3       3       1       1
#> 4       4       2       5
#> 5       5       3       9
#> 6       6       3       9

This produces a list which includes a data frame indicating the resulting partition of the isolates at the difference levels. The isolate names in this data frame are taken from the fasta headers and thus for plotting it is important that these match the isolate names in any tree used later. This function also outputs the log marginal likelihoods at the different levels of clustering.

hierBAPS can also be run until the algorithm converges to a local optimum as

hb.results <- hierBAPS(snp.matrix, max.depth = 2, n.pops = 20, n.extra.rounds = Inf, 
    quiet = TRUE)

We can also check how long hierBAPS takes to run on the test dataset of 515 samples and 744 SNPs.

system.time(hierBAPS(snp.matrix, max.depth = 2, n.pops = 20, quiet = TRUE))
#>    user  system elapsed 
#>  84.184  17.152 102.219

Plotting results

To plot the results it is useful to consider a tree of the same isolates. We clustered the example isolates using Iqtree (Kalyaanamoorthy et al. 2017). The ggtree (Yu et al. 2017) package then allows us to plot the results.

First we need to load the newick file.

newick.file.name <- system.file("extdata", "seqs.fa.treefile", package = "rhierbaps")
iqtree <- phytools::read.newick(newick.file.name)

A simple coloured tree allows us to see the top level cluster assignment from hierBAPS.

gg <- ggtree(iqtree, layout = "circular")
gg <- gg %<+% hb.results$partition.df
gg <- gg + geom_tippoint(aes(color = factor(`level 1`)))
gg

As there are many more clusters at the second level using colours to distinguish them can get confusing. Instead we can label the tips with their corresponding clusters.

gg <- ggtree(iqtree, layout = "circular", branch.length = "none")
gg <- gg %<+% hb.results$partition.df
gg <- gg + geom_tippoint(aes(color = factor(`level 1`)))
gg <- gg + theme(legend.position = "right")
gg <- gg + geom_tiplab(aes(label = `level 2`), size = 1, offset = 1)
gg

We can also zoom in on a particular top level cluster to get a better idea of how it is partitioned at the lower level. As an example we zoom in on sub cluster 9 at level 1.

plot_sub_cluster(hb.results, iqtree, level = 1, sub.cluster = 9)

Finally, we can inspect the log marginal likelihoods given for each level.

hb.results$lml.list
#> $`Depth 0`
#>         1 
#> -50858.92 
#> 
#> $`Depth 1`
#>          1          2          3          4          5          6          7 
#> -2121.8599 -4012.3594 -4237.7639 -3095.1865 -1525.7356 -3180.7572 -4015.5020 
#>          8          9         10         11         12         13 
#> -2104.5277 -1736.0192  -780.0635  -810.7793  -688.5214  -163.3198

Caculating assignment probabilities

We can also calculate the individual probabilities of assignment to each cluster. Here we make use of the woodmouse dataset loaded earlier.

hb.results.woodmouse <- hierBAPS(woodmouse.snp.matrix, max.depth = 2, n.extra.rounds = Inf, 
    quiet = TRUE, assignment.probs = TRUE)
head(hb.results.woodmouse$cluster.assignment.prob[[1]])
#>            Cluster 1    Cluster 2    Cluster 3
#> No305   9.997868e-01 2.104112e-04 2.805482e-06
#> No304   5.620947e-06 9.999944e-01 1.699254e-11
#> No306   8.996214e-03 9.910038e-01 9.626735e-09
#> No0906S 9.965743e-01 3.425673e-03 1.902359e-08
#> No0908S 9.911304e-01 8.869359e-03 2.068655e-07
#> No0909S 2.615477e-09 1.105831e-10 1.000000e+00

Saving results

For runs that take a long time it is a good idea to save the output. We can save the partition file as

write.csv(hb.results$partition.df, file = file.path(tempdir(), "hierbaps_partition.csv"), 
    col.names = TRUE, row.names = FALSE)

save_lml_logs(hb.results, file.path(tempdir(), "hierbaps_logML.txt"))

Citing rhierbaps

If you use rhierbaps in a research publication please cite both

Tonkin-Hill, Gerry, John A. Lees, Stephen D. Bentley, Simon D. W. Frost, and Jukka Corander. 2018. “RhierBAPS: An R Implementation of the Population Clustering Algorithm hierBAPS.” Wellcome Open Research 3 (July): 93.

Cheng, Lu, Thomas R. Connor, Jukka Sirén, David M. Aanensen, and Jukka Corander. 2013. “Hierarchical and Spatially Explicit Clustering of DNA Sequences with BAPS Software.” Molecular Biology and Evolution 30 (5): 1224–28.

References

Cheng, Lu, Thomas R Connor, Jukka Sirén, David M Aanensen, and Jukka Corander. 2013. “Hierarchical and Spatially Explicit Clustering of DNA Sequences with BAPS Software.” Mol. Biol. Evol. 30 (5): 1224–28. https://doi.org/10.1093/molbev/mst028.

Kalyaanamoorthy, Subha, Bui Quang Minh, Thomas K F Wong, Arndt von Haeseler, and Lars S Jermiin. 2017. “ModelFinder: Fast Model Selection for Accurate Phylogenetic Estimates.” Nat. Methods 14 (6): 587–89. https://doi.org/10.1038/nmeth.4285.

Paradis, Emmanuel, Julien Claude, and Korbinian Strimmer. 2004. “APE: Analyses of Phylogenetics and Evolution in R Language.” Bioinformatics 20 (2): 289–90. https://doi.org/10.1093/bioinformatics/btg412.

Revell, Liam J. 2012. “Phytools: An R Package for Phylogenetic Comparative Biology (and Other Things).” Methods Ecol. Evol. 3 (2): 217–23. https://doi.org/10.1111/j.2041-210X.2011.00169.x.

Yu, Guangchuang, David K Smith, Huachen Zhu, Yi Guan, and Tommy Tsan-Yuk Lam. 2017. “Ggtree: An r Package for Visualization and Annotation of Phylogenetic Trees with Their Covariates and Other Associated Data.” Methods Ecol. Evol. 8 (1): 28–36. https://doi.org/10.1111/2041-210X.12628.

rhierbaps's People

Contributors

Stargazers

Watchers

Forkers

msnunes adachornelia rnaimehaom

rhierbaps's Issues

patch work

Hello! I want to make a patch work with rhierbaps, but I cannot find the function. Can you tell me how to make a patchwork?

Issue installing/compiling RcppParallel dependency with clang

I had an issue installing most likely due to compiling with clang. I installed RcppParallel separately via: devtools::install_github("RcppCore/RcppParallel") and it seemed to address the issue. I am not sure how pervasive this is, but I thought it was worth sharing.

Error getting assignment probabilities

I have a snp dataset that is working fine in rhierbaps, except when I set assignment.probs = TRUE. Then, I invariable receive this error:

Error in cluster.allele.freqs[cbind(seq, 1:ncol(cluster.allele.freqs))] :
subscript out of bounds

Is this a known issue? Is there a work-around?
thank you.

Devtools installation fail. WITH SOLUTION.

Hi.

On Linux, I have experienced the following:

devtools::install_github("gtonkinhill/rhierbaps")

from URL https://api.github.com/repos/gtonkinhill/rhierbaps/zipball/master
Installation failed: error in running command

Following instructions here I was able to install.

Basically, do the following:

options(unzip = "internal") # before running this, I checked with getOption, and the unzip was unset
devtools::install_github("gtonkinhill/rhierbaps")

Cheers. Anders.

Random introduction of "-"

Dear everyone,

I have been using Rhierbaps for some months now and have recently realised a weird behaviour that I would like to point out.

I have loaded a multialignment fasta file using the load.fasta function and when I checked the dataframe I realised that there were "-" along the different samples. This at first wasn't strange, since there are some samples that have one or two ambiguous calls like K or R. However, what called my attention was the fact that the "-" were appearing in more samples, including those that did not have any ambiguous calls. In the example below, you can see that sample 10 has a "-" at position n3, whereas in the msa this should be a "T".

I asked a colleague of mine to repeat this using her computer and she got the same results. She then checked the dataframes from her data and realised that this happens as well on her data:

Again, there is a "T" at position n1 that is being called as a "-". We have also seen that this happens as well with "Gs".

This appears to happen randomly throughout the data frame, but the errors do appear in the same places each time we run the load.fasta function. We would like to know if you have seen this behaviour before and know what could be the cause?

Our Rhierbaps version is 1.1.4 and R version are 4.2.3 and 4.1.3, respectively.

Thanks!

What is the meaning of levels in partition.df?

Hi Dr Gerry,

I have a question about the hierBAPS result.
I loaded my fasta file and looked at the partition.df as follows

snp.matrix <- load_fasta("test.aln.snp_sites.aln")
hb.results <- hierBAPS(snp.matrix, max.depth = 2, n.pops = 20, n.extra.rounds = Inf, quiet = TRUE)
clusters <- data.frame(hb.results$partition.df)

This provides a data matrix with the column names of 'Isolate', 'level.1' and 'level.2'

Running unique(clusters$level.1) resulted in clusters as 1,2,3,4

Out of curiosity I did

library(tidyverse)

#Extract the rows when level.1 = 1 in clusters table
clusters.level.one.equal.one <- clusters %>% filter(level.1 == 1))

#Extract the rows when level.1 = 2 in clusters table
clusters.level.one.equal.two <- clusters %>% filter(level.1 == 2)

unique(clusters.level.one.equal.one $level.2) resulted in 1 2 3 4

unique(clusters.level.one.equal.two$level.2) resulted in 5 6

The RhierBAPS paper mentions "The hierBAPS algorithm extends this approach by enabling the investigation of a population at multiple resolutions. This is achieved by initially clustering the entire dataset using the BAPS algorithm before iteratively applying the algorithm to each of the resulting clusters."

Is it correct to say that the algorithm first generates level.1 and then run on each cluster identified in level.1 to generate level.2 ? I was expecting unique(clusters.level.one.equal.two$level.2) to show 1 2 instead of 5 6
What would be the use of assignment.probs (whether or not to calculate the assignment probabilities to each cluster) in the hierBAPS function?

memory allocation problem

Hi,
I have used rhierbaps previously on 27 genomes' core alignment and it worked fine. But now I am working on 557 genomes's core alignment (file size = 1.6 gb). I have used this command and following message shows up:

> snp.matrix <- load_fasta("./clean.full.aln")
Error: cannot allocate vector of size 11.7 Gb

Is it issue with my system or the package?
My Linux PC has 4 cores and 8gb.
Thanks

Error when trying to plot the results

Hello!
I am trying to use rhierbaps on my data. I am working with R version 4.1.1 and
phytools_1.5-1 maps_3.4.0 ape_5.7 ggtree_3.2.1 rhierbaps_1.1.4

The first issue I have is that when I try to run

gg <- ggtree(iqtree_Current_Katsares_16S_COI,layout="circular")
gg <- gg%<+%hb.results_Current_Katsares_16S_COI$partition.df
gg <- gg+geom_tippoint(aes(color=factor(level 1)))
gg

I get an error

Error: Must request at least one colour from a hue palette.

Also, when I try to run

plot_sub_cluster(hb.results_Current_Katsares_16S_COI, iqtree_Current_Katsares_16S_COI, level = 1, sub.cluster = 1)

I get the error
Error in UseMethod("left_join") : no applicable method for 'left_join' applied to an object of class "waiver" In addition: Warning message: In drop.tip.phylo(tree, tree$tip.label[!tree$tip.label %in% cluster.isolate]) : drop all tips of the tree: returning NULL

Do you have any idea how to solve this?

Error in load alignment file

Hi, I am trying to make a hierarchical cluster of a total 245 bacterial samples. I ran Snippy and got a core.aln file.
If I provide the .aln file as input in the rhierbaps method, it does not recognize the alignment format. Please let me know how to prepare an alignment file for this method to make a hierarchical cluster. I am following the link https://cran.r-project.org/web/packages/rhierbaps/vignettes/introduction.html.

fasta.file.name <- system.file("extdata", "seqs.fa", package = "rhierbaps") snp.matrix <- load_fasta(fasta.file.name)

error: Error in load_fasta(fasta.file.name) : Invalid msa or the file does not exist!

hierBAPS doesn't appear to run in parallel?

Hello Gerry,

I'm running rhierbaps on a cluster (it's a bigger alignment, with 3k sequences and ~ 60k polymorphic sites). It seems to be running OK, but it seems to only be using 1 CPU despite the command looking like this:

hb.results <- hierBAPS(snp.matrix, max.depth = 2, n.pops = 20, n.cores = 64)

Any ideas why this is? Thank you in advance.

Big fasta input can't be done.

Hi,

I have a big fasta file containing 2000 sequence alignment results with11G.

When I use the "snp.matrix <- load_fasta(fasta.file.name)" command, it will run for a very long time, may a few hours, and finally, the program will be automatically killed because of "out of memory".

I'm curious if there are other ways to approach this, in the face of very large alignment fasta files (too many sequences), or is there a suggested file size for input fasta ?

All levels and all partition in one group

Dear @gtonkinhill

I am struggling to figure out why "good_example" would result in several different partitions, while "bad_example" would result in all isolates being in the same group. I guess that it is related to gaps but I'm not sure how to overcome this problem. Could you please help me with this?

Thanks in advance!

# bad example
bad_example <- rbind(
  b01=c("a","t","-","c","-","c","g","a","a","c","c","a","t","t","t","a","g","g","t","t"),
  b02=c("a","-","c","t","t","c","g","a","a","c","c","a","-","t","t","a","g","g","t","t"),
  b03=c("a","t","c","t","t","c","g","a","a","c","c","a","t","t","t","a","g","g","t","t"),
  b04=c("a","t","c","c","t","c","g","a","a","c","c","a","t","t","t","a","g","-","t","t"),
  b05=c("a","t","c","t","t","c","g","a","a","t","c","-","t","t","t","a","g","g","-","t"),
  b06=c("a","t","c","t","t","c","g","a","g","t","-","a","t","-","t","a","g","g","t","t"),
  b07=c("-","t","c","t","t","c","g","a","g","t","c","a","t","t","t","a","g","g","t","t"),
  b08=c("a","t","c","t","t","-","g","-","g","t","c","a","t","t","t","a","g","g","t","t"),
  b09=c("a","t","c","c","t","c","a","a","a","t","c","a","t","t","t","a","-","g","t","-"),
  b10=c("a","t","c","c","t","c","a","a","a","t","c","a","t","t","-","-","g","g","t","t")
  )

baps_bad <- hierBAPS(bad_example, max.depth = 3)
baps_bad$partition.df$`level 3`

# good example
good_example <- rbind(
  g01=c("a","g","c","a","a","c","g","t","a","g","a","t","a","g","g","t","g","a","g","g"),
  g02=c("a","g","c","a","a","c","-","t","a","g","a","t","a","g","g","t","g","a","g","g"),
  g03=c("a","g","c","a","a","c","g","t","-","g","a","t","a","g","g","t","g","a","g","-"),
  g04=c("g","g","c","a","g","c","g","t","a","g","a","c","g","-","-","t","g","g","g","g"),
  g05=c("a","g","c","a","a","c","g","t","a","g","a","c","g","g","g","t","-","g","g","g"),
  g06=c("a","g","c","a","a","-","g","t","a","-","a","t","g","g","g","-","g","g","g","g"),
  g07=c("a","g","c","a","a","c","g","t","a","g","a","t","g","g","g","t","g","g","g","g"),
  g08=c("a","g","-","-","a","c","g","t","a","g","a","t","g","g","g","t","g","g","g","g"),
  g09=c("a","g","c","a","a","c","g","t","a","g","-","t","g","g","g","t","g","g","g","g"),
  g10=c("g","-","c","a","g","c","g","-","a","g","a","t","g","g","g","t","g","g","-","g")
)

baps_good <- hierBAPS(good_example, max.depth = 3)
baps_good$partition.df$`level 3`

RAM and time to run on the included test data

Gerry,

A colleague ran rhierbaps on 1 core with the test data (which seems to be about 500 samples x 3000 sites), and it took 2 days to complete.

Does this sound correct?
How much RAM does it need? (as a function of of samples, sites, ...)
Does it need the full alignment, or just the variant sites?

Any help appreciated!

Torsten

Problem read my multifasta

Dear team rhierbaps

I have problem read my multifasta.

library(rhierbaps)
fasta.file.name <- system.file("extdata", "new_3_snp.fasta", package = "rhierbaps")
fasta.file.name
[1] ""
snp.matrix <- load_fasta(fasta.file.name)
Error in load_fasta(fasta.file.name) :
Invalid msa or the file does not exist!

I can see that the fasta.file.name is empty.

I check if all if okey with the data(woodmouse) example, all run okey.

I put a resume of my multifasta:

FD01851321.fasta.ref
CGGACAGCTTCAAGGTTGTTGTTGCTCCCTAGTCAGAGTCTGCGCGATGGGAGTCAACAGCTGGCCTCCAACCTTTTTGC
GCTTAATTTTGTCTGGAACCAGCGGCATGGACCTCCGAATATTTCCGTACAACGACTATCTGTCGCAGGGACGGGGTGAT
CGCTCGGGTAGGCTATATCTTTGTTCTACAAACCCATCGCGAAGTCTATACATGCTGACCTTTGCTAGGTACCTCCGTCT
AGCGGCCCGCTCCGAATAGCCGGTACTTATACTAGAGTCTCGGAGGGATTAAGCGGTATGCCAAGAGAACGAGGCTGTTG
TGATATGATCAATCTTATAAAAGGTAGGAGTGACGTCTTATAGTGTGCATTGTTTCTGTCAGAATGATAGCCCTCTAGTA
AGAACAGTCCGAATCCCCGGAACGGGCGGGAAGAGACTGACGGTAAAATAAGCTGCGATCCCGCACGTATTTACTCTACA
GAGTTTTCGCTCTCGCTCCTATCCCTCCGTCCGTCCTAAGAGGATGCATGTGCGTTACTGGATGCTGTGTGTCGTCAAAA
AGCCCGAAGCGCCATTGGTCCGCTAGCGCATACGATGCGCGTATGAACGCAGCCGGTCAACTGTTCGTTTGGGCTTGGTG
AAAGACTTGGGTTTAACTGCATGGTCGAATATAAGCTATCACTAAAACCCTAGTTCCTAACGGGGAAGCGCCTTCAGGGT
GAGGGAAGACACCGTCAGAGTAGGTATCAGAGGGACAGGCAACATGGGCTCATATAGCCTTGGGCTCGACCAAGAGGCAA
GGTGCAGATCCGAGTCTAGAGATTCGGACGCATTGGCGCGGAGGTCGAGTTGGCCGCATGGGAAAATGCTACGTCAGGAA
ATGCAGCTAGGCGCACAGGCGCCAGACGAGCATATGGGGCACCATCTCGATCCCGACCCCGTGGCCAGGGCAGCTGTTGG
GGTGGGTAGCCGGCCGGGAATACGGATCGGAACCGTGGTGAAACACAGATTCTGCAGCCATCGACTTTCTTACGATGTAA
GGGGCTCAGTTTTTGCAGGCGGCGTACGGCGACGTCGCTACCTGCCCTTTGGAAACACACATACTGTGCGCGAAACCGCT
CCGGAACATGGGTCCCACCCTCACTCCCAACTTAGCGTTTGGAGCTCGTACACGTATGATACAAAATAAGGACGATCCTT
ACATCATCGGCGGAACCGTGTTACCCGTCATGCTAATTATCATGGATCTCCAATTTGTATACCAGGCAATACCGGGCAAA
AAGATCGCGAAGGGAGACTCAACACTGGCAACGGTCCCAGCCGTACGGGTTTCATATATGCAGCTCCGTTGAAACGGTCG
TTGGCTACATTCTGTAACGGCCGCGGTAGGAATCTGCGCGGTCCGCCTCCGCCGCATTTTCTGGTTATAAAATACATTGG
CTGTTGTGTATCGCCCCCGCACACCGAATTACGCGACTACTTACCATGGCTTTGGCACCGTTGGTCGGGTGTGAACCATG
GACAGGCTCGTCAGGACTATGCCGGTTGGAGTACTAAGCCTCAGCCAGTTACTGTTCTGCGTTACTAGCCGTGCGATTCA
TCTAGTCTGTCTCTCAACAACGATGGTTTTTCAGAAATATCTCCACCCAAGTCCCCGAATACGCCGGGACGTTCATCAGT
TGAGCAGTGCATAACTACCTGGAGATCTTTAAGCCCACGCATAGATGACGCTCTATCAGCATCCGGCGGACATAGACCAG
GCAACTGGAACCATGTGGTGTAAAAACAGATGCTCAAGCACACCGGGACGGCCCACACTCTTCATCGGCACCCAAGGAAA
AGCAGCTTCCCTCGGGTTTCAGGCTACTTAACTTAGGCCGGGGAACCCAGAGGTGCTATGCACTCGGGATGTTCGTCTAA
GAAATCGCAGATCCTTTGGATGGTTTGCGAACGGTGTATGCAATTTCGCCTTACAATCCATTTCGTCTAACACCAAGCCA
GCCTTCTTGTGAATGTCGTGCCAAAAGAAGAGTAGAGAATAGATGAACGCTTAGTCTCAGACCGGGAACTAGCGAGTATG
GCATTAAACGTGGTCCTGCAATTTGGACCTAGGGTGCCAATCAGACCTGTGCAACCGCCGCGACTACAGCCATTTGTTTT
TCAAAGACCACAGGTTTATCGTCCGCATAATGAGCGTGAACGCCGCGTGGCCACACACAGAAATACCGGTTTCATTTCCA
TCGAATTTGAGGTCTTTAGGCTGACGCGATGACGTGGGTCCAACGAGATTGCTGTTCGCGATATCGCTCGACAAATCGAC
GGCGAATACCGAATGGCATCTCAAGCCCCGCCCTATTTAGTACACTGGAGACGAGGCACAGCGATCCCCTGCCGCGAGGC
GCGTTAGCCGCCGCGGGTACAAGGTGAACCGCTTCAAGAAAACCTATGAGCTGCGAGCGAGAGGTCCGACCCATTAACAT
AGCACTGAGGCTCAGGAATCCAATCCCGCGCAAGACGATGTCCGATGAGCAACCTCCCAGCCTCCTGAGGGGAAGACGCC
AGTATCGGAAATACTAACCCCACAAAGATAGCCTCTTAGTTGGGTGGCCATCTTAAGTATCAAGGACGAAAAAGAGTCCG
CTAGGTTAAGCCCAAGG
895.fasta
CAGACAGCTTCAAGGTTGTTGTTGCTCCCTAGTCAGAGTCTGCGCGATGGGAGTCAACAGCTGGCCTCCAACCTTTTTGC
GCTTAATTTTGTCTGGAACCAGCGGCATGGACTTCCGAATATTTCCGTACAACGACTATCTGTCGCAGGGACGGGGTGCT
CGCTCGGGTAGGCTATATCTTTGTTCTACAAACCCATCGCGAAGTCTATACATGCTGACCTTTGCTAGGTACCTCTGTCT
AGCGGCCCGCTCCGAATAGCCGGTACTTATACCAGAGTCTCGGAGGGATTAAGCGGTATGCCAAGAGAACGAGGCTGTTG
TGATCTGATCAATCTTATAAAAGGTAGGGGTGACGTCTTATAGTGTGCATTGTTTCTGTCAGAATGATAGCCCTCTAGTA
AGAACAGTCCGAATCCCCGGAACGGGCGGGAAGAGACTGACGGTAAAATAATCTGCGATCCCACACGTATTTACTCTACA
GAGTTTTCGCTCTCGCTCCTATCCCACCGTCCGTCCTAAGAGGACGCATGTGCGTTACTGGATGCTGTGTGTCGTCAAAA
AGCCCGAAGCGCCATTGGTCCGCTAGCGCATACGATGCGCCTACGAACGCAGCAAGTCAACCGTTCGTTTGGGCTTGGTG
AAAGACTTGGGTTTAACTGCATGGTCGAATATAAGCTATCACTAAAACCCTAGTTCCTAACGGGGAAGCGCCTTCAGGGT
GAGGGAAGACACCGTCAGAGTAGGTATCAGAGGGACAGGCAACATGGGCTCATATAGCCTTGGGCTCGACCGGGAGGCAA
GGTGCAGATCCGAGTCTAGAGATTCAGACGCATTGGCGCGGAGGTCGAGCTGGCCGCATGGGAAAATGCTACGTCAGGAG
AAGCAGCTAGGCGCACAGGCGCCAGACGAGCATATGGGGCACCATCTCGATCCCGACCCCGTGGCCAGGGCTGTTGTTGG
GGTGGGTAGCCGGCCGGGAATACGGATCGGAACCATGGTGAAACACAGATTCTGCCACCATCGACTTTCTTACGATGTAA
GGGGCTCAGTTTTTGCAGGCAGCGTACGGCGACGTCGCTACCTGCCCTTTGGAAACACACATACTGTGCGCGAAACCACT
CCGGAACATGGGTCCCACCCTCACTCCCAACTTAGCGTTTGGAGCTCGTACACGTATGATACAAAATAAGGACTATCCTC
ACATCATCGGCGGAACCGTGTTACCCGTCATGCTAATTATCATGGATCTCCAATTTGTATACCAGGCAATACCGGGCAAA
AAGACCGCGAAGGGAGACTCAACACTGGTAACGGTCCCAGCCGTACGGGTTTCATATATGCAGCTCTGTTGAAACGGTCG
TTGGCTACATTCTGTAACGTCCGCGGTAGGAATCTGCGCGGTCCGCCTCCGCCGCATTTTCTGGTCATAAGATACATTGG
CTGTTGTGTATCGCCCCCGCACACCGAATTACGCGACGACTTACCATGGCTTTGGCACCGTTGGTCGGGTGCGAACCATG
GACAGGCTCGTCAGGACTATGCCGGTTGGAGTACTAAGCCTCAGCCAGTTACTGTTCTGTGTTACTAGCCTTGCGATTCA
TCTAGTCTGTCACTCAACAACGATGGTTTTTCAGAAATATCTCCACCCAAGTCCCCGAATACGCCGGGACGTTCATCAGT
TGAGCAGTGCATAACTACCTGGAGATCTTTAAGCCCACGCATAGATGACGCTCTATCAGTATCCGGCGGACATAGACCAG
GCAACTGGAACCATGTGGTGTAAAAACAGATGCTCAAGCACACCGGGACGGCCCACACTCTTCGTCGGCACCCAGGGAAA
AGCAGCTTCCCTCGGGTTTCAGGCTACTTAACTTAGGCCGGGGAGCCCAGAGGTGCTATGCACTCGGGATGTTCGTCTAA
GAAATCGCAGATCCTTTGGATGGTTTGCGAACGGTGTATGCAATTTCGCCTTACAATCCATTTCGTCTAACAACAAGCCA
GCCTTCTTGTGAATGTCGTGCCAAAAGAAGAGTAGAGAATAGATGAACGCTTAGTCTCAGACCGAGAACTAGCGAGTATA
GCATTAAACGTGGTCCTGCAATTTGGACCTAGGGTGCCAATCATACCTGGGCAACCGCCGCGACTACAGCCATTTGTTTT
TCAAAGACCACAGGTTTATCGTCCGCATAACGAGCGTGAACGCCGCGTGGCCACACACAGAAATACCGGTTTCATTTCCA
TCGAATTTGAGGTCTTTAGGCTGACGCGATGACGTGGGTCCAACGAGATTGCTGTTCGCGATATCGCTCGACAAATCGAC
GGCGAATACCGAATGGCATCTCAAGCCCCGCCCTATTTAGTACACTGGAGACGAGGCACAGCTACCCCCTGCCGCGAGGC
GCGTTAGCCGTCGCGAGTATAAGGTGAACCGCTTCAAGAAAACCTATGAGCTGCGAGCGAGAGGTCCGACCCATTAACAT
AGCACTGGGGCTCAGGAATCCAATCCCGCGCAAGACGATGCCCGATGAGCAACCTCCCAGCCTCCTGAGGGGAAAACGCC
AGTATCGGAAATACTAACCACACAAAGATATCCTCTTAGTTGGGTGCCCATCTTAAGTATCAAGGACGAAAAAGAGTCCG
CTAGGTTAAGCCCAAGG
and so on

Please help

Admixture assignments

Hello,

How are the cluster probabilities optionally produced by rhierBAPS related to admixture? I may be fundamentally misunderstanding something here, but do they correspond to the proportion of each isolate's genome that is thought to come from each cluster? I.e. do they correspond to what would be plotted on a STRUCTURE type bar plot?

(Different from #7 as I'm interested in the concept behind these probabilities -- not in generating a plot)

Speed-up r-hierbaps analysis

Hi @gtonkinhill ,

I was wondering if you have any suggestions on how to speed-up the r-hierbaps analysis.

I am running r-hierbaps on 3 datasets (output from Roary) with different sizes.
The first dataset has ~1,300 sequences. Running r-hierbaps on 1 node and 1 core took ~4 hours and ~130 GB of RAM. I am also using depth=1 and 150 populations.
For my second dataset with ~2,400 sequences, r-hierbaps ran for 11 days and consumed ~300GB of RAM.
The third dataset is significantly larger, with ~20,000 sequences. It is currently running on our cluster using 8 cores, but considering the previous running times and memory used, I am worried that it might not finish soon.

Thus, I was wondering if it is possible to run sub-parts of r-hierbaps in parallel and merge them afterwards, so the running time can be improved?
I know the MATLAB version of r-hierbaps is recommended for larger datasets, but I am having some issues with the MCR versions...

I would really appreciate any suggestions regarding this!

Thank you,
Natasha

Big data loading.

Hello

When I load a large fasta file, my R reports the following error.

sparse.data <- import_fasta_sparse_nt(fasta.file.name)
Error: cannot allocate vector of size 15.3 Gb
`

I think this is a generic problem that users encounter when using R, so I was just wondering if you had a good solution, especially when using fastbaps.