Giter Site home page Giter Site logo

richelbilderbeek / raket Goto Github PK

View Code? Open in Web Editor NEW
0.0 3.0 0.0 17.39 MB

What is the error we make, when nature has protracted speciation, and inference ignore this?

License: GNU General Public License v3.0

R 80.33% Shell 1.30% Rebol 0.27% TeX 18.09%

raket's Introduction

raket

raket is not intended to be fully tested!
The experiment itself is the test
Branch Travis CI logo AppVeyor logo Codecov logo
master Build Status Build status codecov.io
develop Build Status Build status codecov.io
  • raket provides the R code used by [1]
  • raket_werper provides the scripts used by [1]

raket is an R package that uses

  • PBD: to generate incipient species trees and sample a species tree
  • pirouette: to convert a species phylogeny to a posterior
  • nLTT: to compare a species tree to all species trees in the posterior

Download raket locally

git clone https://github.com/richelbilderbeek/raket

Pipeline

See the README in the scripts folder

Installation

If you use the devtools R package, this is easy:

devtools::install_github("richelbilderbeek/raket")

raket assumes that BEAST2 is installed. To install BEAST2, from R do:

beastier::install_beast2()

This will download and extract BEAST2 to:

OS Full path
Linux ~/.local/share/beast/bin/beast.jar
Windows C:/Users/<username>/Local/beast/bin/beast.jar

FAQ

See FAQ.

There is a feature I miss

See CONTRIBUTING, at Submitting use cases

I want to collaborate

See CONTRIBUTING, at 'Submitting code'

I think I have found a bug

See CONTRIBUTING, at 'Submitting bugs'

There's something else I want to say

Sure, just add an Issue. Or send an email.

Package dependencies

Package Travis CI logo Codecov logo
babette Build Status codecov.io
beautier Build Status codecov.io
beastier Build Status codecov.io
phangorn Build Status codecov.io
pirouette Build Status codecov.io
tracerer Build Status codecov.io

External links

References

  • [1] Bilderbeek, Richel JC, and Rampal S. Etienne. "The error when inferring phylogenies with incipient species by a birth-death model.". First registration at the Open Source Framework on 2018-06-14. Second registration at the Open Source Framework on 2018-09-03.

First registration of the article

raket's People

Contributors

j-damhuis avatar richelbilderbeek avatar tomdkkr avatar neves-p avatar

Watchers

James Cloos avatar  avatar  avatar

raket's Issues

Exceeded step memory limit

p230198@peregrine:scripts cat process_1_873377.log
slurmstepd: error: Unable to create TMPDIR [/local/873123]: Permission denied
slurmstepd: error: Setting TMPDIR to /tmp
flechette::create_output_file("4028.RDa","out_4028.RDa")
sh: line 1: 190440 Killed                  java -jar ~/Programs/beast/lib/beast.jar -seed 1 -statefile /local/873377/RtmpG2I6Vm/beast2_2e6d9686c7a80.xml.state -overwrite /local/873377/RtmpG2I6Vm/beast2_2e6d91e4d62bc.xml > /dev/null 2> /dev/null
Error: exit_code == 0 is not TRUE
Execution halted
slurmstepd: error: Exceeded step memory limit at some point.

Prerequisite: phangorn::simSeq is correct

I cannot predict the number of non-silent mutations by phangorn::simSeq. Whatever I do, my prediction is off. I assume the problem is at my side, but it would be great to use another tool (seq-gen) to verify phangorn is correct (and hopefully I will find out what I overlooked in the process).

---
title: 'raket: mutation rate'
author: "Richel J.C. Bilderbeek"
date: "April 18, 2018"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggtree)

Problem

  1. What is the best mutation rate to pick for a crown age of 15M years?

Here, we assume to use the full DNA, but modify the fraction of DNA
actually used, f_dna_used, below:

f_dna_used <- 1.0
testit::assert(f_dna_used >= 0.0 && f_dna_used <= 1.0)
  1. What should the alignment length be to have one mutation per 1K years?

Hypothesis

  1. Once per 15M years
crown_age <- 15 # million years
mutation_rate <- f_dna_used * 1.0 / crown_age # chance per nucleotide per million years 
  1. 1 nucleotide has a resolution of 15M years.
    2 nucleotides have a resolution of 7.5M years
    15 nucleotides have a resolution of 1M years
    15K nucleotides have a resolution of 1K years
sequence_length <- 15000 # base pairs

Methods

1. Simulate a phylogeny with only two branches and the correct crown age

#set.seed(1)
sum_edge_length <- crown_age * 2

# Simulate a tree with two taxa and the desired summed edge length 
while (1) {
  tree <- TreeSim::sim.bd.age(age = crown_age, numbsim = 1, lambda = 1.0 / crown_age, mu = 0.0, frac = 1.0, mrca = TRUE, complete = FALSE)[[1]]
  if (length(tree$tip.label) == 2) break
}
testit::assert(length(tree$tip.label) == 2) # 2 taxa
testit::assert(sum(tree$edge.length) == sum_edge_length)
ggtree(tree) + geom_treescale(x = 0, width = crown_age, color = "red", offset = 0.01)

2. Predict the number of non-silent mutations

On average all nuceotides will change.
Of these mutations, each one has a 25% chance to be silent,
for example, to go from adenine to adenine.

The number of expected observable mutations is then:

expected_n_diffs_naive <- mutation_rate * sequence_length * crown_age * 0.75
print(expected_n_diffs_naive)

However, there will be some nucleotides that will be picked twice or more,
as there will be nucleotides that will never be picked.

Here we run a simple simulation to add this to our expectation:

calc_exp_n_diffs <- function(sequence_length, f_dna_used) {
  # Create an alignment of zeroes
  nucleotides <- rep(0, sequence_length)
  # One mutation per base pair
  n_mutations <- sequence_length * f_dna_used
  # Pick the indices that will have a mutation
  random_indices <- 1 + (sort(floor(runif(min = 0, max = sequence_length, n = n_mutations))))
  # Put a random base pair there
  testit::assert(all(random_indices >= 1))
  testit::assert(all(random_indices <= length(nucleotides)))
  for (random_index in random_indices) {
    nucleotides[random_index] <- sample(x = 0:3, size = 1)
  }
  # Return the number of non-silent nucleotides
  testit::assert(sum(nucleotides != 0) > 0)
  sum(nucleotides != 0)
}
expected_n_diffs_sim <- mean(replicate(n = 100, calc_exp_n_diffs(sequence_length, f_dna_used)))
print(expected_n_diffs_sim)

From C++ I expect 7111.68 (code is at appendix):

expected_n_diffs_cpp_naive <- 7111.68 # Expects 15K mutations
expected_n_diffs_cpp_smart <- 7112.42 # Number of mutations also follows an exponential distribution

From Wikipedia:

# http://treethinkers.org/jukes-cantor-model-of-dna-substitution/
expected_n_diffs_tt <- (3/4)*(1 - exp(-8 * mutation_rate * crown_age / 2)) * sequence_length
print(expected_n_diffs_tt)

3. Simulate some alignments

At the root of the tree, we put a sequence of just as.

root_sequence <- rep("a", sequence_length)
n_replicates <- 100
diffs_1 <- rep(NA, n_replicates)
diffs_2 <- rep(NA, n_replicates)

for (i in seq(1, n_replicates))
{
  set.seed(i)
  data <- phangorn::simSeq(
    tree, 
    l = sequence_length, 
    rootseq = root_sequence, 
    type= "DNA", 
    rate = mutation_rate
  )
  diffs_1[i] <- sum(as.character(data)[1, ] != root_sequence)
  diffs_2[i] <- sum(as.character(data)[2, ] != root_sequence)
}
n_diffs_measured <- (sum(diffs_1) + sum(diffs_2)) / (n_replicates * 2)
print(n_diffs_measured)

Here is the last alignment:

ape::image.DNAbin(ape::as.DNAbin(data), col = c("white", "red", "green", "blue"))

4. Plot the results

library(ggplot2)

ggplot(
  data = data.frame(n = c(diffs_1, diffs_2)),
  aes(x = n)
) + geom_histogram(binwidth = 100) + 
  geom_vline(xintercept = n_diffs_measured, col = "black") + 
  geom_vline(xintercept = expected_n_diffs_naive, col = "pink") + 
  geom_vline(xintercept = expected_n_diffs_sim, col = "red") + 
  geom_vline(xintercept = expected_n_diffs_cpp, col = "green") + 
  geom_vline(xintercept = expected_n_diffs_tt, col = "yellow") + 
  scale_x_continuous(limits = c(0, sequence_length))

Appendix

C++ Naive

#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>
#include <random>

int calc_non_silent(std::mt19937& rng_engine)
{
    const int sequence_lenghth{15000};
    std::vector<int> nucleotides(sequence_lenghth, 0);

    //Use -1, as distr is inclusive
    std::uniform_int_distribution<int> index_distr(0, sequence_lenghth - 1);

    std::uniform_int_distribution<int> nucleotide_distr(0, 3);

    const int n_mutations{sequence_lenghth};
    for (int i=0; i!=n_mutations; ++i)
    {
      const int index = index_distr(rng_engine);
      nucleotides.at(index) = nucleotide_distr(rng_engine);
    }
    return std::count_if(
      std::begin(nucleotides),
      std::end(nucleotides),
      [](const int x) { return x != 0; }
    );
}

int main()
{
  std::mt19937 rng_engine;

  const int n{1000};
  std::vector<int> v;
  for (int i=0; i!=n; ++i)
  {
    v.push_back(calc_non_silent(rng_engine));
  }
  const int sum = std::accumulate(std::begin(v), std::end(v), 0);
  const double average = static_cast<double>(sum) / static_cast<double>(n);
  std::cout << average << '\n';
}

C++ smart

#include <algorithm>
#include <iostream>
#include <iterator>
#include <numeric>
#include <vector>
#include <random>

int pick_n_mutations(
  std::mt19937& rng_engine,
  const double mut_rate)
{
  std::exponential_distribution<double> mut_distr(mut_rate);
  double t{0.0};
  int i{0};
  while (1)
  {
    //std::cout << i << ' ' << t << '\n';
    const double dt{mut_distr(rng_engine)};
    t += dt;
    ++i;
    if (t > 15.0) break;
  }
  return i;
}

int calc_non_silent(
  std::mt19937& rng_engine,
  const double mut_rate
)
{
    const int n_mutations = pick_n_mutations(rng_engine, mut_rate);

    const int sequence_lenghth{15000};
    std::vector<int> nucleotides(sequence_lenghth, 0);

    //Use -1, as distr is inclusive
    std::uniform_int_distribution<int> index_distr(0, sequence_lenghth - 1);

    std::uniform_int_distribution<int> nucleotide_distr(0, 3);

    for (int i=0; i!=n_mutations; ++i)
    {
      const int index = index_distr(rng_engine);
      nucleotides.at(index) = nucleotide_distr(rng_engine);
    }
    return std::count_if(
      std::begin(nucleotides),
      std::end(nucleotides),
      [](const int x) { return x != 0; }
    );
}


int main()
{
  std::mt19937 rng_engine;
  const double p_mutation = 1.0 / 15.0;
  const double mut_rate = 15000.0 * p_mutation;

  const int n{1000};
  std::vector<int> v;
  for (int i=0; i!=n; ++i)
  {
    v.push_back(calc_non_silent(rng_engine, mut_rate));
  }
  std::copy(std::begin(v), std::end(v), std::ostream_iterator<int>(std::cout, " "));
  const int sum = std::accumulate(std::begin(v), std::end(v), 0);
  const double average = static_cast<double>(sum) / static_cast<double>(n);
  std::cout << average << '\n';
}

Are sequential seeds correlated?

Run this code:

# If a random number if drawn with one seed,
# how long until it is drawn by differerently seeded
# RNG?
set.seed(2)
value <- runif(n = 1)
set.seed(1)
i <- 0
while (value != runif(n = 1)) {
  i <- i + 1
}  
print(i)

Results:

First seed Second seed Value printed
1 1 0
1 2 Takes too long
2 1 Takes too long

Conclusion: seeds are independent

Create parameters that create their files in the right folder

Currently, each raket_params stores all its file in a Peregrine-friendly temporary folder, which is great for testing, but useless on Peregrine:

if a file with path ~/raket_werper/data/1/parameters.RDa is created, all its files should be put in that same folder.

Set crown age

Rscript -e raket::run_raket_from_file("./data/1/parameters.RDa")
Error in becosys::bco_pbd_sim(pbd_params = raket_params$pbd_params, crown_age = raket_params$inference_params$mrca_prior$mrca_distr$mean$value) : 
  'crown_age' or 'stem_age' must be set
Calls: <Anonymous> -> run_raket -> <Anonymous>
Execution halted

Put files in the right folders

p230198@peregrine:data find . 
.
./10
./10/parameters.RDa
./5
./5/parameters.RDa
./1
./1/parameters.RDa
./9
./9/parameters.RDa
./4
./4/parameters.RDa
./12
./12/parameters.RDa
./3
./3/parameters.RDa
./11
./11/parameters.RDa
./6
./6/parameters.RDa
./8
./8/parameters.RDa
./7
./7/parameters.RDa
./2
./2/parameters.RDa

Check for Peregrine friendly filenames

There are Peregrine-unfriendly filenames in the raket parameters, e.g. (from log below) /local/tmp/RtmpcVLdF8/file1cc08453580ed.log. Check for these.

From a Peregrine run:

Rscript -e raket::run_raket_from_file("./data/10/parameters.RDa")
Registered S3 method overwritten by 'geiger':
  method            from
  unique.multiPhylo ape 
Error in value[[3L]](cond) : Could not estimate the marginal likelihood. 
Error message: File 'output_log_filename' not found. Could not find file with path '/local/tmp/RtmpcVLdF8/file1cc08453580ed.log'
site_model$name: JC69
clock_model$name: strict
tree_prior$name: birth_death
epsilon: 1e-12
rng_seed: NA
verbose: FALSE
beast2_working_dir: /home/p230198/.cache/file1cae877f27cff/beast2_tmp_folder1cae84973ab1d
beast2_bin_path: /home/p230198/.local/share/beast/bin/beast
Calls: <Anonymous> ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
In addition: Warning messages:
1: In file.rename(from = actual_log_filename, to = output_log_filename) :
  cannot rename file '/home/p230198/.cache/file1cae877f27cff/beast2_tmp_folder1cae84973ab1d/pbd_evidence.log' to '/local/tmp/RtmpcVLdF8/file1cc08453580ed.log', reason 'Invalid cross-device link'
2: In file.rename(from = actual_trees_filename, to = to) :
  cannot rename file '/home/p230198/.cache/file1cae877f27cff/beast2_tmp_folder1cae84973ab1d/pbd_evidence.trees' to '/local/tmp/RtmpcVLdF8/file1cc0815a59692.trees', reason 'Invalid cross-device link'
Execution halted

Problems installing Rtools on R 3.5.0 on Windows 7

From @Giappo:

I am trying to install raket but I keep on getting the error message
complaining about Rtools version like:

Or, using some workaround found on internet, the best I can get is:

Please download and install Rtools 3.5 from
http://cran.r-project.org/bin/windows/Rtools/.
Installation failed: Could not find build tools necessary to build tracerer"

Do you know how to solve this?

Clock model and site model usage

Je zegt "Making the strict clock explicity is ignored by BEAST2, thus I
will not set it explicitly myself either.

Ik begrijp niet wat je bedoelt. Kun je dit nader uitleggen? Ik denk dat
het goed is om zowel een strict clock als een relaxed clock te proberen.
Hetzelfde geldt voor JC en GTR. De reden hiervoor is als volgt: vanuit
een theoretisch perspectief wil je precies weten wat de fout is die
gemaakt wordt door BEAST2 door de verkeerde prior te nemen. Daarvoor is
het dan nodig een strict clock en een JC substitutiemodel te gebruiken
als je dat ook gedaan hebt bij het simuleren van de data. Vanuit een
empirisch perspectief zou je echter kunnen zeggen dat het min of meer
standaard is om een relaxed clock en een GTR substitutiemodel te nemen,
omdat we niet weten wat het onderliggende model is bij echte data. Door
hier "verkeerde" keuzes te maken zou je misschien de error die je maakt
door een verkeerde prior te nemen (birth-death in plaats van PBD) kunnen
VERKLEINEN! Je geeft de inferentie dus meer flexibiliteit, en misschien
resulteert dat in een betere boom, ondanks dat de modellen die je erin
stopt niet overeen komen met wat de echte genererende modellen zijn. Ik
weet niet zeker of het waar is, maar het is sowieso de moeite waard om
te bekijken. Ook als de error GROTER wordt in plaats van kleiner, kun je
nog steeds zeggen dat het een empirisch perspectief is, en dat de
empiricus in dat geval een nog grotere fout maakt dan de theoreticus.

Safeguard against CBS

 raket_params <- create_test_raket_params()
>   raket_params$sampling_method <- "random"
>   out <- rkt_run(raket_params = raket_params)
Error in pirouette::pir_run(phylogeny = true_phylogeny, pir_params = raket_params$pir_params) : 
  Too few taxa to use a Coalescent Bayesian Skyline tree prior
Called from: pirouette::pir_run(phylogeny = true_phylogeny, pir_params = raket_params$pir_params)
Browse[1]> 

It takes too long to create the full sampling data set

PBD::pbd_sim takes on average a couple of seconds to complete for the highest diversification rate of 0.4. This causes the tests to last longer than 10 mins, which is over the time Travis CI allows (without doing workarounds).

For which parameter settings are less than 1000 taxons expected?

AFAIK, there is no equation of the expected number of extant species for the PBD model [*], i.e.:

pbd_expected_n_extant <- function(crown_age, scr, sirg, siri, erg, eri) { }

I suggest to write this function for the PBD package (hence the pbd_ prefix) by simulating PBD trees and averaging. Clumsy, but AFAIK, the best we can do.

[*] To be precise, [1] claims it is too hard, and I could not find it in later papers.

References

  • [1] Etienne, Rampal S., and James Rosindell. "Prolonging the past counteracts the pull of the present: protracted speciation can explain observed slowdowns in diversification." Systematic Biology 61.2 (2012): 204-213.

See also the cer_raket issue

mcbette: calculate weight for models

Currently, mcbette gives the marginal likelihoods for each of the models investigates. @Giappo suggested not to use the model with highest likelihood, but show the relative weights.

Add this to mcbette and use it in both raket and razzo ๐Ÿ‘

Fix test

The last build breaks: the documentation was not completed.

@J-Damhuis: can I assume you will also fix that? I know there are a lot of 'trivial' errors on your plate now, but, well, that helps you develop to a pro developer ๐Ÿ‘

If yes, you can assign yourself at the right of this text, at 'Assignees'.

No NA's in est_evidences table

test_that("use, random", {

  if (!beastier::is_on_travis()) return()

  skip("Sum of evidences deviates too much from one")
  raket_params <- create_test_raket_params()
  raket_params$sampling_method <- "random"
  expect_silent(run_raket(raket_params))
})
Error in est_evidences(fasta_filename = fasta_filename, experiments = pir_params$experiments,  : 
  Sum of evidences (aka marginal likelihoods) deviates too much from  one 
Sum: 0
Tolerance: 0.1

Simplify PBD taxon names

The PBD package generates trees with x-x-y taxon labels. Instead of using x-x, just use x. It makes things look simpler.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.