I cannot predict the number of non-silent mutations by phangorn::simSeq
. Whatever I do, my prediction is off. I assume the problem is at my side, but it would be great to use another tool (seq-gen) to verify phangorn
is correct (and hopefully I will find out what I overlooked in the process).
---
title: 'raket: mutation rate'
author: "Richel J.C. Bilderbeek"
date: "April 18, 2018"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggtree)
Problem
- What is the best mutation rate to pick for a crown age of 15M years?
Here, we assume to use the full DNA, but modify the fraction of DNA
actually used, f_dna_used
, below:
f_dna_used <- 1.0
testit::assert(f_dna_used >= 0.0 && f_dna_used <= 1.0)
- What should the alignment length be to have one mutation per 1K years?
Hypothesis
- Once per 15M years
crown_age <- 15 # million years
mutation_rate <- f_dna_used * 1.0 / crown_age # chance per nucleotide per million years
- 1 nucleotide has a resolution of 15M years.
2 nucleotides have a resolution of 7.5M years
15 nucleotides have a resolution of 1M years
15K nucleotides have a resolution of 1K years
sequence_length <- 15000 # base pairs
Methods
1. Simulate a phylogeny with only two branches and the correct crown age
#set.seed(1)
sum_edge_length <- crown_age * 2
# Simulate a tree with two taxa and the desired summed edge length
while (1) {
tree <- TreeSim::sim.bd.age(age = crown_age, numbsim = 1, lambda = 1.0 / crown_age, mu = 0.0, frac = 1.0, mrca = TRUE, complete = FALSE)[[1]]
if (length(tree$tip.label) == 2) break
}
testit::assert(length(tree$tip.label) == 2) # 2 taxa
testit::assert(sum(tree$edge.length) == sum_edge_length)
ggtree(tree) + geom_treescale(x = 0, width = crown_age, color = "red", offset = 0.01)
2. Predict the number of non-silent mutations
On average all nuceotides will change.
Of these mutations, each one has a 25% chance to be silent,
for example, to go from adenine to adenine.
The number of expected observable mutations is then:
expected_n_diffs_naive <- mutation_rate * sequence_length * crown_age * 0.75
print(expected_n_diffs_naive)
However, there will be some nucleotides that will be picked twice or more,
as there will be nucleotides that will never be picked.
Here we run a simple simulation to add this to our expectation:
calc_exp_n_diffs <- function(sequence_length, f_dna_used) {
# Create an alignment of zeroes
nucleotides <- rep(0, sequence_length)
# One mutation per base pair
n_mutations <- sequence_length * f_dna_used
# Pick the indices that will have a mutation
random_indices <- 1 + (sort(floor(runif(min = 0, max = sequence_length, n = n_mutations))))
# Put a random base pair there
testit::assert(all(random_indices >= 1))
testit::assert(all(random_indices <= length(nucleotides)))
for (random_index in random_indices) {
nucleotides[random_index] <- sample(x = 0:3, size = 1)
}
# Return the number of non-silent nucleotides
testit::assert(sum(nucleotides != 0) > 0)
sum(nucleotides != 0)
}
expected_n_diffs_sim <- mean(replicate(n = 100, calc_exp_n_diffs(sequence_length, f_dna_used)))
print(expected_n_diffs_sim)
From C++ I expect 7111.68
(code is at appendix):
expected_n_diffs_cpp_naive <- 7111.68 # Expects 15K mutations
expected_n_diffs_cpp_smart <- 7112.42 # Number of mutations also follows an exponential distribution
From Wikipedia:
# http://treethinkers.org/jukes-cantor-model-of-dna-substitution/
expected_n_diffs_tt <- (3/4)*(1 - exp(-8 * mutation_rate * crown_age / 2)) * sequence_length
print(expected_n_diffs_tt)
3. Simulate some alignments
At the root of the tree, we put a sequence of just a
s.
root_sequence <- rep("a", sequence_length)
n_replicates <- 100
diffs_1 <- rep(NA, n_replicates)
diffs_2 <- rep(NA, n_replicates)
for (i in seq(1, n_replicates))
{
set.seed(i)
data <- phangorn::simSeq(
tree,
l = sequence_length,
rootseq = root_sequence,
type= "DNA",
rate = mutation_rate
)
diffs_1[i] <- sum(as.character(data)[1, ] != root_sequence)
diffs_2[i] <- sum(as.character(data)[2, ] != root_sequence)
}
n_diffs_measured <- (sum(diffs_1) + sum(diffs_2)) / (n_replicates * 2)
print(n_diffs_measured)
Here is the last alignment:
ape::image.DNAbin(ape::as.DNAbin(data), col = c("white", "red", "green", "blue"))
4. Plot the results
library(ggplot2)
ggplot(
data = data.frame(n = c(diffs_1, diffs_2)),
aes(x = n)
) + geom_histogram(binwidth = 100) +
geom_vline(xintercept = n_diffs_measured, col = "black") +
geom_vline(xintercept = expected_n_diffs_naive, col = "pink") +
geom_vline(xintercept = expected_n_diffs_sim, col = "red") +
geom_vline(xintercept = expected_n_diffs_cpp, col = "green") +
geom_vline(xintercept = expected_n_diffs_tt, col = "yellow") +
scale_x_continuous(limits = c(0, sequence_length))
Appendix
C++ Naive
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>
#include <random>
int calc_non_silent(std::mt19937& rng_engine)
{
const int sequence_lenghth{15000};
std::vector<int> nucleotides(sequence_lenghth, 0);
//Use -1, as distr is inclusive
std::uniform_int_distribution<int> index_distr(0, sequence_lenghth - 1);
std::uniform_int_distribution<int> nucleotide_distr(0, 3);
const int n_mutations{sequence_lenghth};
for (int i=0; i!=n_mutations; ++i)
{
const int index = index_distr(rng_engine);
nucleotides.at(index) = nucleotide_distr(rng_engine);
}
return std::count_if(
std::begin(nucleotides),
std::end(nucleotides),
[](const int x) { return x != 0; }
);
}
int main()
{
std::mt19937 rng_engine;
const int n{1000};
std::vector<int> v;
for (int i=0; i!=n; ++i)
{
v.push_back(calc_non_silent(rng_engine));
}
const int sum = std::accumulate(std::begin(v), std::end(v), 0);
const double average = static_cast<double>(sum) / static_cast<double>(n);
std::cout << average << '\n';
}
C++ smart
#include <algorithm>
#include <iostream>
#include <iterator>
#include <numeric>
#include <vector>
#include <random>
int pick_n_mutations(
std::mt19937& rng_engine,
const double mut_rate)
{
std::exponential_distribution<double> mut_distr(mut_rate);
double t{0.0};
int i{0};
while (1)
{
//std::cout << i << ' ' << t << '\n';
const double dt{mut_distr(rng_engine)};
t += dt;
++i;
if (t > 15.0) break;
}
return i;
}
int calc_non_silent(
std::mt19937& rng_engine,
const double mut_rate
)
{
const int n_mutations = pick_n_mutations(rng_engine, mut_rate);
const int sequence_lenghth{15000};
std::vector<int> nucleotides(sequence_lenghth, 0);
//Use -1, as distr is inclusive
std::uniform_int_distribution<int> index_distr(0, sequence_lenghth - 1);
std::uniform_int_distribution<int> nucleotide_distr(0, 3);
for (int i=0; i!=n_mutations; ++i)
{
const int index = index_distr(rng_engine);
nucleotides.at(index) = nucleotide_distr(rng_engine);
}
return std::count_if(
std::begin(nucleotides),
std::end(nucleotides),
[](const int x) { return x != 0; }
);
}
int main()
{
std::mt19937 rng_engine;
const double p_mutation = 1.0 / 15.0;
const double mut_rate = 15000.0 * p_mutation;
const int n{1000};
std::vector<int> v;
for (int i=0; i!=n; ++i)
{
v.push_back(calc_non_silent(rng_engine, mut_rate));
}
std::copy(std::begin(v), std::end(v), std::ostream_iterator<int>(std::cout, " "));
const int sum = std::accumulate(std::begin(v), std::end(v), 0);
const double average = static_cast<double>(sum) / static_cast<double>(n);
std::cout << average << '\n';
}