richelbilderbeek / raket Goto Github PK

What is the error we make, when nature has protracted speciation, and inference ignore this?

License: GNU General Public License v3.0

R 80.33% Shell 1.30% Rebol 0.27% TeX 18.09%

raket's Introduction

raket

raket is not intended to be fully tested!
The experiment itself is the test

Branch
master
develop

raket provides the R code used by [1]
raket_werper provides the scripts used by [1]

raket is an R package that uses

PBD: to generate incipient species trees and sample a species tree
pirouette: to convert a species phylogeny to a posterior
nLTT: to compare a species tree to all species trees in the posterior

Download raket locally

git clone https://github.com/richelbilderbeek/raket

Pipeline

See the README in the scripts folder

Installation

If you use the devtools R package, this is easy:

devtools::install_github("richelbilderbeek/raket")

raket assumes that BEAST2 is installed. To install BEAST2, from R do:

beastier::install_beast2()

This will download and extract BEAST2 to:

OS	Full path
Linux	`~/.local/share/beast/bin/beast.jar`
Windows	`C:/Users/<username>/Local/beast/bin/beast.jar`

FAQ

See FAQ.

There is a feature I miss

See CONTRIBUTING, at Submitting use cases

I want to collaborate

See CONTRIBUTING, at 'Submitting code'

I think I have found a bug

See CONTRIBUTING, at 'Submitting bugs'

There's something else I want to say

Sure, just add an Issue. Or send an email.

Package dependencies

Package
babette
beautier
beastier
phangorn
pirouette
tracerer

External links

BEAST2 GitHub

References

[1] Bilderbeek, Richel JC, and Rampal S. Etienne. "The error when inferring phylogenies with incipient species by a birth-death model.". First registration at the Open Source Framework on 2018-06-14. Second registration at the Open Source Framework on 2018-09-03.

raket's People

Contributors

Watchers

raket's Issues

Prerequisite: PBD with Inf as speciation completion rate

I can call PBD::sim_pbd with a SCR of 10^6, but maybe it does support Inf as well.

Exceeded step memory limit

p230198@peregrine:scripts cat process_1_873377.log
slurmstepd: error: Unable to create TMPDIR [/local/873123]: Permission denied
slurmstepd: error: Setting TMPDIR to /tmp
flechette::create_output_file("4028.RDa","out_4028.RDa")
sh: line 1: 190440 Killed                  java -jar ~/Programs/beast/lib/beast.jar -seed 1 -statefile /local/873377/RtmpG2I6Vm/beast2_2e6d9686c7a80.xml.state -overwrite /local/873377/RtmpG2I6Vm/beast2_2e6d91e4d62bc.xml > /dev/null 2> /dev/null
Error: exit_code == 0 is not TRUE
Execution halted
slurmstepd: error: Exceeded step memory limit at some point.

Prerequisite: phangorn::simSeq is correct

I cannot predict the number of non-silent mutations by phangorn::simSeq. Whatever I do, my prediction is off. I assume the problem is at my side, but it would be great to use another tool (seq-gen) to verify phangorn is correct (and hopefully I will find out what I overlooked in the process).

---
title: 'raket: mutation rate'
author: "Richel J.C. Bilderbeek"
date: "April 18, 2018"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggtree)

Problem

What is the best mutation rate to pick for a crown age of 15M years?

Here, we assume to use the full DNA, but modify the fraction of DNA
actually used, f_dna_used, below:

f_dna_used <- 1.0
testit::assert(f_dna_used >= 0.0 && f_dna_used <= 1.0)

What should the alignment length be to have one mutation per 1K years?

Hypothesis

Once per 15M years

crown_age <- 15 # million years
mutation_rate <- f_dna_used * 1.0 / crown_age # chance per nucleotide per million years

1 nucleotide has a resolution of 15M years.
2 nucleotides have a resolution of 7.5M years
15 nucleotides have a resolution of 1M years
15K nucleotides have a resolution of 1K years

sequence_length <- 15000 # base pairs

Methods

1. Simulate a phylogeny with only two branches and the correct crown age

#set.seed(1)
sum_edge_length <- crown_age * 2

# Simulate a tree with two taxa and the desired summed edge length 
while (1) {
  tree <- TreeSim::sim.bd.age(age = crown_age, numbsim = 1, lambda = 1.0 / crown_age, mu = 0.0, frac = 1.0, mrca = TRUE, complete = FALSE)[[1]]
  if (length(tree$tip.label) == 2) break
}
testit::assert(length(tree$tip.label) == 2) # 2 taxa
testit::assert(sum(tree$edge.length) == sum_edge_length)
ggtree(tree) + geom_treescale(x = 0, width = crown_age, color = "red", offset = 0.01)

2. Predict the number of non-silent mutations

On average all nuceotides will change.
Of these mutations, each one has a 25% chance to be silent,
for example, to go from adenine to adenine.

The number of expected observable mutations is then:

expected_n_diffs_naive <- mutation_rate * sequence_length * crown_age * 0.75
print(expected_n_diffs_naive)

However, there will be some nucleotides that will be picked twice or more,
as there will be nucleotides that will never be picked.

Here we run a simple simulation to add this to our expectation:

calc_exp_n_diffs <- function(sequence_length, f_dna_used) {
  # Create an alignment of zeroes
  nucleotides <- rep(0, sequence_length)
  # One mutation per base pair
  n_mutations <- sequence_length * f_dna_used
  # Pick the indices that will have a mutation
  random_indices <- 1 + (sort(floor(runif(min = 0, max = sequence_length, n = n_mutations))))
  # Put a random base pair there
  testit::assert(all(random_indices >= 1))
  testit::assert(all(random_indices <= length(nucleotides)))
  for (random_index in random_indices) {
    nucleotides[random_index] <- sample(x = 0:3, size = 1)
  }
  # Return the number of non-silent nucleotides
  testit::assert(sum(nucleotides != 0) > 0)
  sum(nucleotides != 0)
}
expected_n_diffs_sim <- mean(replicate(n = 100, calc_exp_n_diffs(sequence_length, f_dna_used)))
print(expected_n_diffs_sim)

From C++ I expect 7111.68 (code is at appendix):

expected_n_diffs_cpp_naive <- 7111.68 # Expects 15K mutations
expected_n_diffs_cpp_smart <- 7112.42 # Number of mutations also follows an exponential distribution

From Wikipedia:

# http://treethinkers.org/jukes-cantor-model-of-dna-substitution/
expected_n_diffs_tt <- (3/4)*(1 - exp(-8 * mutation_rate * crown_age / 2)) * sequence_length
print(expected_n_diffs_tt)

3. Simulate some alignments

At the root of the tree, we put a sequence of just as.

root_sequence <- rep("a", sequence_length)
n_replicates <- 100
diffs_1 <- rep(NA, n_replicates)
diffs_2 <- rep(NA, n_replicates)

for (i in seq(1, n_replicates))
{
  set.seed(i)
  data <- phangorn::simSeq(
    tree, 
    l = sequence_length, 
    rootseq = root_sequence, 
    type= "DNA", 
    rate = mutation_rate
  )
  diffs_1[i] <- sum(as.character(data)[1, ] != root_sequence)
  diffs_2[i] <- sum(as.character(data)[2, ] != root_sequence)
}
n_diffs_measured <- (sum(diffs_1) + sum(diffs_2)) / (n_replicates * 2)
print(n_diffs_measured)

Here is the last alignment:

ape::image.DNAbin(ape::as.DNAbin(data), col = c("white", "red", "green", "blue"))

4. Plot the results

library(ggplot2)

ggplot(
  data = data.frame(n = c(diffs_1, diffs_2)),
  aes(x = n)
) + geom_histogram(binwidth = 100) + 
  geom_vline(xintercept = n_diffs_measured, col = "black") + 
  geom_vline(xintercept = expected_n_diffs_naive, col = "pink") + 
  geom_vline(xintercept = expected_n_diffs_sim, col = "red") + 
  geom_vline(xintercept = expected_n_diffs_cpp, col = "green") + 
  geom_vline(xintercept = expected_n_diffs_tt, col = "yellow") + 
  scale_x_continuous(limits = c(0, sequence_length))

Appendix

C++ Naive

#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>
#include <random>

int calc_non_silent(std::mt19937& rng_engine)
{
    const int sequence_lenghth{15000};
    std::vector<int> nucleotides(sequence_lenghth, 0);

    //Use -1, as distr is inclusive
    std::uniform_int_distribution<int> index_distr(0, sequence_lenghth - 1);

    std::uniform_int_distribution<int> nucleotide_distr(0, 3);

    const int n_mutations{sequence_lenghth};
    for (int i=0; i!=n_mutations; ++i)
    {
      const int index = index_distr(rng_engine);
      nucleotides.at(index) = nucleotide_distr(rng_engine);
    }
    return std::count_if(
      std::begin(nucleotides),
      std::end(nucleotides),
      [](const int x) { return x != 0; }
    );
}

int main()
{
  std::mt19937 rng_engine;

  const int n{1000};
  std::vector<int> v;
  for (int i=0; i!=n; ++i)
  {
    v.push_back(calc_non_silent(rng_engine));
  }
  const int sum = std::accumulate(std::begin(v), std::end(v), 0);
  const double average = static_cast<double>(sum) / static_cast<double>(n);
  std::cout << average << '\n';
}

C++ smart

#include <algorithm>
#include <iostream>
#include <iterator>
#include <numeric>
#include <vector>
#include <random>

int pick_n_mutations(
  std::mt19937& rng_engine,
  const double mut_rate)
{
  std::exponential_distribution<double> mut_distr(mut_rate);
  double t{0.0};
  int i{0};
  while (1)
  {
    //std::cout << i << ' ' << t << '\n';
    const double dt{mut_distr(rng_engine)};
    t += dt;
    ++i;
    if (t > 15.0) break;
  }
  return i;
}

int calc_non_silent(
  std::mt19937& rng_engine,
  const double mut_rate
)
{
    const int n_mutations = pick_n_mutations(rng_engine, mut_rate);

    const int sequence_lenghth{15000};
    std::vector<int> nucleotides(sequence_lenghth, 0);

    //Use -1, as distr is inclusive
    std::uniform_int_distribution<int> index_distr(0, sequence_lenghth - 1);

    std::uniform_int_distribution<int> nucleotide_distr(0, 3);

    for (int i=0; i!=n_mutations; ++i)
    {
      const int index = index_distr(rng_engine);
      nucleotides.at(index) = nucleotide_distr(rng_engine);
    }
    return std::count_if(
      std::begin(nucleotides),
      std::end(nucleotides),
      [](const int x) { return x != 0; }
    );
}


int main()
{
  std::mt19937 rng_engine;
  const double p_mutation = 1.0 / 15.0;
  const double mut_rate = 15000.0 * p_mutation;

  const int n{1000};
  std::vector<int> v;
  for (int i=0; i!=n; ++i)
  {
    v.push_back(calc_non_silent(rng_engine, mut_rate));
  }
  std::copy(std::begin(v), std::end(v), std::ostream_iterator<int>(std::cout, " "));
  const int sum = std::accumulate(std::begin(v), std::end(v), 0);
  const double average = static_cast<double>(sum) / static_cast<double>(n);
  std::cout << average << '\n';
}

Are sequential seeds correlated?

Run this code:

# If a random number if drawn with one seed,
# how long until it is drawn by differerently seeded
# RNG?
set.seed(2)
value <- runif(n = 1)
set.seed(1)
i <- 0
while (value != runif(n = 1)) {
  i <- i + 1
}  
print(i)

Results:

First seed	Second seed	Value printed
1	1	0
1	2	Takes too long
2	1	Takes too long

Conclusion: seeds are independent

Prerequisite: can create such a big data frame

A data.frame has its limits. Check these.

Improve and test pbd_expected_n_extant

In pbd_expected_n_extant:

Add code into the left-open tests
Replace testit::assert by proper if and stop. Also test the error messages, under abuse

Peregrine issue: file.exists(beast_log_filename) is not TRUE

flechette::create_output_file("8.RDa","out_8.RDa")
Error: file.exists(beast_log_filename) is not TRUE

4_create_nltt_files

Create parameters that create their files in the right folder

Currently, each raket_params stores all its file in a Peregrine-friendly temporary folder, which is great for testing, but useless on Peregrine:

if a file with path ~/raket_werper/data/1/parameters.RDa is created, all its files should be put in that same folder.

Add 'collect_pbd_params'

Test parameter set should be short

Travis takes too long to run them...

Be sure about number of mutations in twin alignment

Set crown age

Rscript -e raket::run_raket_from_file("./data/1/parameters.RDa")
Error in becosys::bco_pbd_sim(pbd_params = raket_params$pbd_params, crown_age = raket_params$inference_params$mrca_prior$mrca_distr$mean$value) : 
  'crown_age' or 'stem_age' must be set
Calls: <Anonymous> -> run_raket -> <Anonymous>
Execution halted

Put files in the right folders

p230198@peregrine:data find . 
.
./10
./10/parameters.RDa
./5
./5/parameters.RDa
./1
./1/parameters.RDa
./9
./9/parameters.RDa
./4
./4/parameters.RDa
./12
./12/parameters.RDa
./3
./3/parameters.RDa
./11
./11/parameters.RDa
./6
./6/parameters.RDa
./8
./8/parameters.RDa
./7
./7/parameters.RDa
./2
./2/parameters.RDa

Prerequisite: data frame in long form plottable as violin plot

system call failed: Cannot allocate memory

See the Travis logs here.

Need to 🕵️‍♀️

Prerequisite: after discarding a burn-in of 10%, 1000 values must remain

So I have to create 1111.11 values?

Check for Peregrine friendly filenames

There are Peregrine-unfriendly filenames in the raket parameters, e.g. (from log below) /local/tmp/RtmpcVLdF8/file1cc08453580ed.log. Check for these.

From a Peregrine run:

Rscript -e raket::run_raket_from_file("./data/10/parameters.RDa")
Registered S3 method overwritten by 'geiger':
  method            from
  unique.multiPhylo ape 
Error in value[[3L]](cond) : Could not estimate the marginal likelihood. 
Error message: File 'output_log_filename' not found. Could not find file with path '/local/tmp/RtmpcVLdF8/file1cc08453580ed.log'
site_model$name: JC69
clock_model$name: strict
tree_prior$name: birth_death
epsilon: 1e-12
rng_seed: NA
verbose: FALSE
beast2_working_dir: /home/p230198/.cache/file1cae877f27cff/beast2_tmp_folder1cae84973ab1d
beast2_bin_path: /home/p230198/.local/share/beast/bin/beast
Calls: <Anonymous> ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
In addition: Warning messages:
1: In file.rename(from = actual_log_filename, to = output_log_filename) :
  cannot rename file '/home/p230198/.cache/file1cae877f27cff/beast2_tmp_folder1cae84973ab1d/pbd_evidence.log' to '/local/tmp/RtmpcVLdF8/file1cc08453580ed.log', reason 'Invalid cross-device link'
2: In file.rename(from = actual_trees_filename, to = to) :
  cannot rename file '/home/p230198/.cache/file1cae877f27cff/beast2_tmp_folder1cae84973ab1d/pbd_evidence.trees' to '/local/tmp/RtmpcVLdF8/file1cc0815a59692.trees', reason 'Invalid cross-device link'
Execution halted

Create general parameter set

Create without error
Follow raket naming conventions
Match article

Create sampling parameters set

Create without error
Follow raket naming conventions
Match article

Create general parameter set file with razzo file setup

For the razzo project, me and @Giappo discussed how and where to store files. raket should follow this same setup.

Problems installing Rtools on R 3.5.0 on Windows 7

From @Giappo:

I am trying to install raket but I keep on getting the error message
complaining about Rtools version like:

install.packages("Rtools"); library(Rtools)
Installing package into ‘C:/Users/P274829/Documents/R/win-library/3.5’
(as ‘lib’ is unspecified)
Warning in install.packages :
package ‘Rtools’ is not available (for R version 3.5.0)
Warning in install.packages :
unable to access index for repository
http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5:
cannot open URL
'http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5/PACKAGES'
Error in library(Rtools) : there is no package called ‘Rtools’

Or, using some workaround found on internet, the best I can get is:

"Downloading GitHub repo richelbilderbeek/tracerer@master
from URL
https://api.github.com/repos/richelbilderbeek/tracerer/zipball/master
WARNING: Rtools is required to build R packages, but no version of
Rtools compatible with R 3.5.0 was found. (Only the following
incompatible version(s) of Rtools were found:3.3,3.4)

Please download and install Rtools 3.5 from
http://cran.r-project.org/bin/windows/Rtools/.
Installation failed: Could not find build tools necessary to build tracerer"

Do you know how to solve this?

Remove rkt_ prefix

2_create_input_files_general

Clock model and site model usage

Je zegt "Making the strict clock explicity is ignored by BEAST2, thus I
will not set it explicitly myself either.

Ik begrijp niet wat je bedoelt. Kun je dit nader uitleggen? Ik denk dat
het goed is om zowel een strict clock als een relaxed clock te proberen.
Hetzelfde geldt voor JC en GTR. De reden hiervoor is als volgt: vanuit
een theoretisch perspectief wil je precies weten wat de fout is die
gemaakt wordt door BEAST2 door de verkeerde prior te nemen. Daarvoor is
het dan nodig een strict clock en een JC substitutiemodel te gebruiken
als je dat ook gedaan hebt bij het simuleren van de data. Vanuit een
empirisch perspectief zou je echter kunnen zeggen dat het min of meer
standaard is om een relaxed clock en een GTR substitutiemodel te nemen,
omdat we niet weten wat het onderliggende model is bij echte data. Door
hier "verkeerde" keuzes te maken zou je misschien de error die je maakt
door een verkeerde prior te nemen (birth-death in plaats van PBD) kunnen
VERKLEINEN! Je geeft de inferentie dus meer flexibiliteit, en misschien
resulteert dat in een betere boom, ondanks dat de modellen die je erin
stopt niet overeen komen met wat de echte genererende modellen zijn. Ik
weet niet zeker of het waar is, maar het is sowieso de moeite waard om
te bekijken. Ook als de error GROTER wordt in plaats van kleiner, kun je
nog steeds zeggen dat het een empirisch perspectief is, en dat de
empiricus in dat geval een nog grotere fout maakt dan de theoreticus.

Safeguard against CBS

 raket_params <- create_test_raket_params()
>   raket_params$sampling_method <- "random"
>   out <- rkt_run(raket_params = raket_params)
Error in pirouette::pir_run(phylogeny = true_phylogeny, pir_params = raket_params$pir_params) : 
  Too few taxa to use a Coalescent Bayesian Skyline tree prior
Called from: pirouette::pir_run(phylogeny = true_phylogeny, pir_params = raket_params$pir_params)
Browse[1]>

Prerequisite: babette provides for constant known strict clock rate

Create general parameter set input files

It takes too long to create the full sampling data set

PBD::pbd_sim takes on average a couple of seconds to complete for the highest diversification rate of 0.4. This causes the tests to last longer than 10 mins, which is over the time Travis CI allows (without doing workarounds).

For which parameter settings are less than 1000 taxons expected?

AFAIK, there is no equation of the expected number of extant species for the PBD model [*], i.e.:

pbd_expected_n_extant <- function(crown_age, scr, sirg, siri, erg, eri) { }

I suggest to write this function for the PBD package (hence the pbd_ prefix) by simulating PBD trees and averaging. Clumsy, but AFAIK, the best we can do.

[*] To be precise, [1] claims it is too hard, and I could not find it in later papers.

References

[1] Etienne, Rampal S., and James Rosindell. "Prolonging the past counteracts the pull of the present: protracted speciation can explain observed slowdowns in diversification." Systematic Biology 61.2 (2012): 204-213.

Create test parameter set input files

3_create_posterior_files

Each parameter set must have an equal amount of elements

See each params_set element must be the same

Prerequisite: consistent sampling

See issue at PBD repo for most elegant solution.

If author accepts, call correct sampling method from PBD
If author declines, emulate this myself in raket

6_to_long

Prerequisite: data frame can be converted to long form

A data frame of 40K rows and 1024 cols (1K measurements per row), must be convertible to long form of 4M rows with 25 cols.

Check!

If not, use less rows.

If yes, you can assign yourself at the right of this text, at 'Assignees'.

No NA's in est_evidences table

test_that("use, random", {

  if (!beastier::is_on_travis()) return()

  skip("Sum of evidences deviates too much from one")
  raket_params <- create_test_raket_params()
  raket_params$sampling_method <- "random"
  expect_silent(run_raket(raket_params))
})

Error in est_evidences(fasta_filename = fasta_filename, experiments = pir_params$experiments,  : 
  Sum of evidences (aka marginal likelihoods) deviates too much from  one 
Sum: 0
Tolerance: 0.1

Call pirouette from raket

A nice test case.

Peregrine issue: cannot remove file 'beast2.xml'

flechette::create_output_file("8.RDa","out_8.RDa")

In file.remove(beast2_input_filename) :
cannot remove file 'beast2.xml', reason 'No such file or directory'

richelbilderbeek / raket Goto Github PK

raket's Introduction

raket

Download raket locally

Pipeline

Installation

FAQ

There is a feature I miss

I want to collaborate

I think I have found a bug

There's something else I want to say

Package dependencies

External links

References

raket's People

Contributors

Watchers

raket's Issues

Problem

Hypothesis

Methods

1. Simulate a phylogeny with only two branches and the correct crown age

2. Predict the number of non-silent mutations

3. Simulate some alignments

4. Plot the results

Appendix

C++ Naive

C++ smart

References

Recommend Projects

Recommend Topics

Recommend Org