macroevolution / bamm Goto Github PK

A program for multimodel inference on speciation and trait evolution

License: GNU General Public License v2.0

CMake 0.32% C++ 99.57% Makefile 0.10%

bamm's Introduction

BAMM

A program for multimodel inference on speciation and trait evolution. Please see the project's website (http://bamm-project.org) for the full documentation.

Requirements

In order to compile BAMM, you need CMake and a C++11 compiler. You also need a Unix shell (e.g., bash) to run the following commands.

Installation

In the project's root directory, create a new directory called build and go into it:

mkdir build
cd build

To compile BAMM, run the following commands:

cmake ..
make -j

The final executable will be named bamm. You may run bamm from this directory, or you may install it in a more permanent location. To do this, run the following command within the build directory:

sudo make install

You may now run bamm from any directory in your system.

bamm's People

Contributors

Stargazers

Watchers

Forkers

jonchang bomeara sun-yanbo altingia yochju yuzhenpeng shiyi-pan sreetamabhadra changwanseo lzhangss

bamm's Issues

log file stream

appears that output stream to runinfofile is hanging.

mySettings.printCurrentSettings(runInfoFile)

does not appear to complete writing to file until the file is closed after the MCMC finishes. If I look at runinfofile during an MCMC run, it looks like it fails to write a number of parameters…but then it appears that they get written at the end of main() .

fastSimulatePrior class

I just implemented a new class, FastPriorSimulation. It simulates the distribution of the number of events under the prior only, which is important for evaluating model fit after a run is done. This class is many orders of magnitude faster than the previous way of doing sampleFromPrior only. I would like to have this implemented by default at the end of a run but didn't want to screw up anything going on with MCMC or TraitMCMC.

I think class MCMC should always (by default) create the class and do the prior simulation immediately after the MCMC simulation. It should generate an output file automatically that has some sort of label indicating that it is from the prior.

We leave the old "sampleFromPriorOnly" option in the code, but hide it so it doesn't appear in the control file anymore.

Then we always run with the new class at the end of each run, which generates a new file with 3 columns:

generation
N_shifts
eventRate

There is no log-likelihood or log-prior actually calculated for this class which is part of why it is vastly faster but it still generates a prior distribution on the number of rate shifts.

we could output to this file at the same frequency as mcmcWriteFreq (thus using the same param), and always do the prior simulation for the same number of generations as the main analysis.

Implementing it in class MCMC is trivial: this is all you need to do:

FastPriorSimulation fp(&randomptr, &settingsptr);

for (int i = 0; i < _NGENS; i++) {
fp.updateState();

if ((i % _mcmcWriteFreq) == 0) {
    writePriorStateToFile( &fp );
}

}

The only addition is that you will need a function WritePriorStateToFile in classes MCMC and TraitMCMC, which generates the output file.

Check tree for same tip node names

bamm traits segmentation fault when deleting event from tree

On mac os x 10.9, bamm traits.

Program received signal SIGSEGV, Segmentation fault.
TraitModel::forwardSetBranchHistories (this=0x7fff5fbfcd10, x=0x1002000b0) at TraitModel.cpp:1815
1815 } else if (x == myNode->getTraitBranchHistory()->getLastEvent()) {
(gdb) where
#0 TraitModel::forwardSetBranchHistories (this=0x7fff5fbfcd10, x=0x1002000b0)

at TraitModel.cpp:1815

#1 0x000000010002bf29 in TraitModel::deleteEventFromTree (this=0x1002000b0, be=)

at TraitModel.cpp:750

#2 0x000000010002c51b in TraitModel::changeNumberOfEventsMH (this=0x1002000b0)

at TraitModel.cpp:970

#3 0x00000001000301e0 in TraitMCMC::updateState (this=0x1000000000000000, parm=1606405392)

at TraitMCMC.cpp:232

#4 0x000000010002f4a1 in TraitMCMC::TraitMCMC (this=0x1002000b0, ran=,

mymodel=0x10004c5e0 <vtable for std::__1::basic_ifstream<char, std::__1::char_traits<char> >+64>, sp=<optimized out>) at TraitMCMC.cpp:134

#5 0x00000001000020db in main (argc=, argv=) at main.cpp:148

plot shifts on phylogenies

OK, I thought of another thing that would be useful (I'm writing some documentation and it's easy for me to see how some of this would be very helpful for the user).

A new function:

plotShiftsOnTree

that is just like nodelabels, but shows not the nodes of shifts, but puts the specified shifts exactly at their position on a branch. And should be able to work with plots generated by plot.phylo or plot.dtrates. As an argument it could just take a vector of nodes, like nodelabels. But needs to put them on branches, not nodes.

Here's what I have in mind: it will be very helpful to guide user intuition to say: let's plot the rates and shift configuration for a single sample from the posterior, to get a feeling for it.

mydata <- subsetEventData(eventdata, index = 1)
plot.dtrates(mydata) #here is rates plot
shiftnodes <- getShiftNodesFromIndex(eventdata, index=1)
plotShiftsOnTree(node=shiftnodes, cex=2, col="red")

or whatever.

Support overwrite arguments

By default, do not overwrite files. In controlfile, allow arguments to force overwriting.

Diverging runs even when providing random number seed (mac 10.9)

Providing a seed for the RNG should produce perfectly replicated runs. This works for non-10.9 mac machines (including Ubuntu 13.10). However, on at least one 10.9 machine (JWB) runs diverge after a few thousand generations. This is for both diversification and trait analyses.

This does not appear to be an issue with the RNG itself. JWB tested uniformRV() across 2 OSs (10.8.5 and 10.9) up to 10 million values, and things were consistent.

make traitcontrol_template.txt file

See divcontrol_template.txt file, which is heavily annotated/commented file that users can look at and know quickly what each parameter means. Need to do the same for traitcontrol using only traitcontrol specific parameters.

Debugging the dtRates function

The dtRates function used to generate the rates for coloring branches with a gradient needs improvement. When a segment used for approximating an instantaneous rate overlaps an event shift point on a branch the shift point is currently ignored when calculating the rates of that approximating segment. Some segments also seem to not get assigned a rate and they remain at 0 where they were initialized.

Need consistent output: BAMM diversification versus trait evolution

changes to code eg headers on output files do not appear to have been made to the bamm traits analyses. We should check all of these to make sure everything is consistent (also in event data files etc).

plotRateThroughTime

Have arg ephy take either bamm-data or bamm-ratematrix datattypes. should check, then (if bamm-data) call getRateThroughTimeMatrix. otherwise, use the already-computed rate-through-time matrix. Users will want to just compute the rtt matrix once, as it is time consuming, then move on to plotting multiple times while exploring options. So we can speed it up for them by allowing them to just compute the matrix once and keep sending it to plotRateThroughTime which they can then plot much more quickly.

Also suggest using round(seq(x, y, length.out = z)) in axis calls to ensure the right number of tick marks for different kinds of datasets.

C code for plotting dynamic rates

I moved the function that calculates the rates used for plotting dynamic rates changes on a phylogeny to C and interfaced with R using .Call. It seems to be working but those more comfortable with C and R extensions should take a look at it to be sure nothing weird might be happening. For now the source code is in the Rcode folder in a file called dtrates.c and the R interface is a function called dtRates in the plot.dtrates.R file, which also has the code for plotting.

Test of refactored classes

I've been doing a lot of restructuring of classes to avoid code duplication. I've been working in a separate branch, called "mcmc," which I've uploaded. I'd appreciate it if anyone could test the program for correctness before I merge the branch with the master. Please test the program on a variety of data sets and verify that the results make sense. You won't be able to compare the results with a previous program using the same seed because the restructuring has caused the random number sequence to be different.

To download the mcmc branch, do a "git pull", then "git checkout mcmc". To go back to the master, do "git checkout master".

Thanks!

Build fails on Joseph's Mac OS X v. 10.9.1

See d4ff0bc#commitcomment-4986595

BAMM should check for duplicate parameters when parsing control files

Not done

Only "final" functions for BAMMtools in Rcode/BAMMtools

Move other stuff to Rcode/devel or equivalent. Just flagging this issue for now, we can talk about what should be included.

Also, have someone test your code and see what suggestions they have for additional arguments etc.

dtrates plotting

dtrates works very well but some users will be a little confused by the plotting, especially the generation of multiple plot windows and the fact that they don't automatically clear. e.g., second call to the function automatically plots to a quartz window (on my machine) that is the same size as the mini hrates box.

not sure the best way to do this. One option is to have the following options:

plot href as pdf by default (and not plot in the display) but have argument to allow plotting in separate graphics window. e.g., asPDF_hrates = TRUE arg.

Have args hrates_height and hrates_width that define the size of the pdf, so the pdf is
pdf(height = …, width = …)
do plot stuff
dev.off()

or equivalent. The function could then print, as it plots, a message like:

"A histogram of rates and color key was written to file X", where X is the string from concatenating the file name for output plus the directory it was written to (I assume you could just write this to the current working directory). e.g., getwd() function.

Non- R-savvy users get confused by directories, so if you are writing a pdf "invisibly" like this, they won't know where their file is, since they (often) won't know where they are! Hence, printing out message would be good.

Could also have a similar option for the main rate tree, but maybe by default it is plotted in a graphics window.

I think pdf(…) might be platform independent? and as such, would be good as users could specify sizes of target plot in a way that gets mucked up on different systems when plotting directly with quartz.options or x11.options. otherwise, the function could just do the main tree plot in a window of whatever size the user already defined. So the help on the function could document that the user should call quartz.options or x11.options or whatever to set the size of the graphics window before calling the plotdtrates function.

Code duplication in MCMC and TraitMCMC

MCMC and TraitMCMC share a lot of code that can go into a common class. Only those things unique to MCMC or TraitMCMC should go in their own class.

Update post_processing R script(s)

Lots of R functions have changed in the past week. Time to update the "walkthrough" we have to make sure everything is still compatible. The graduate students are on it...

Make sure Makefile is up to date

Not clear whether all dependencies are up-to-date.

Deprecating parameters from speciationextinction analyses

To DO:

Remove support for:

acceptRateOutfile
lambdaNodeOutfile
updateRateNumberTimeVariablePartitions

from all classes. These are to be deprecated.
I have deleted from the controlfile
but not from MCMC, Settings, and Model classes (I don't think they are used anywhere else).

REMOVE FROM CONTROLFILES
BUT KEEP SUPPORTING:

(we do not want users to access these options at this time):

Make multithreading an option

I know the idea is to go eventually with Metropolis coupling, but an easy and quick option is to use multithreading with the current single-chain implementation. I've been playing with this, and it scales nicely; I just need to fix a synchronization problem. I will probably create a separate branch for this. The way I have been working is that there is a separate Makefile for the MT compilation, but this could take the form of 1) a flag during compilation, or 2) a single compilation, where MT is turned on (or off) via the control file.

Currently does not compile on Ubuntu 13.10

Haven't had a chance to try it on 13.9 but it appears the current (as of ~2 PM) version of the code no longer makes on my Linux machine. Yesterday afternoon's was fine. Error messages and log as follows:

enteril@Noctilio:~/Documents/Noctilio/Programs/bamm$ make
cd src && make bamm
make[1]: Entering directory /home/enteril/Documents/Noctilio/Programs/bamm/src' g++ -Wall -c -m64 -O3 -funroll-loops -g -DGIT_COMMIT_ID=\"43b1786793366b9be0d9df0776ddf84dae5b70cb\" main.cpp main.cpp: In function ‘void exitWithMessageUsage()’: main.cpp:170:5: error: ‘exit’ is not a member of ‘std’ std::exit(0); ^ main.cpp:170:5: note: suggested alternative: In file included from MCMC.h:12:0, from main.cpp:10: /usr/include/stdlib.h:542:13: note: ‘exit’ extern void exit (int __status) __THROW __attribute__ ((__noreturn__)); ^ main.cpp: In function ‘void exitWithErrorUnknownArgument(const string&)’: main.cpp:177:5: error: ‘exit’ is not a member of ‘std’ std::exit(1); ^ main.cpp:177:5: note: suggested alternative: In file included from MCMC.h:12:0, from main.cpp:10: /usr/include/stdlib.h:542:13: note: ‘exit’ extern void exit (int __status) __THROW __attribute__ ((__noreturn__)); ^ main.cpp: In function ‘void exitWithErrorNoControlFile()’: main.cpp:185:5: error: ‘exit’ is not a member of ‘std’ std::exit(1); ^ main.cpp:185:5: note: suggested alternative: In file included from MCMC.h:12:0, from main.cpp:10: /usr/include/stdlib.h:542:13: note: ‘exit’ extern void exit (int __status) __THROW __attribute__ ((__noreturn__)); ^ make[1]: *** [main.o] Error 1 make[1]: Leaving directory/home/enteril/Documents/Noctilio/Programs/bamm/src'
make: *** [bamm] Error 2

finish R functions for trait evolution

R functions for summarizing rates etc for speciationextinction model but I don't think they are yet finished for trait evolution.

Generate RUN_INFO.txt or equivalent

BAMM should , by default, write a RUN_INFO.txt file or equivalent.
This should include the random number seed that is used for any given run. It could also include info about the time and date of the run, etc. Anything people may ultimately want to go back and check, or that may be useful for bug fixes. Ultimately, when we have better exception handling, we will have errors generate enough information that we can re-create them using this seed.

R code speedups with C

Just flagging these to address at some point.getEventData and getRateThroughTimeMatrix could really benefit from calling C. Probably rateThroughTimeMatrix will be (vastly) faster. Not so sure about getEventData but it's painful with very large trees.

Write theoretical background

BAMM's theoretical background in the documentation needs to be written.

Deprecating params from BAMM traits

More params to deprecate, now from all the Trait-associated classes (plus Settings):

acceptrateOutfile
nodeStateOutfile
updateRateNumberTimeVariablePartitions

I have no idea what this param does and don't think it is being used:

rootPrior

I think it's leftover from very early developmental stage.

All removed from traitcontrol.txt

Write R package documentation

sphinx documentation

I'm just flagging this as an issue, but sphinx doc is not finished and this should stay open until it is done.
Also, even when done, we will need to go through all the example code and make sure it can be run in R, since - unlike sweave - there is no guarantee that the example code is functional, which ccould be maddening to new users.

The installation, quickstart, configurations, FAQ and setting page are all pretty open for completion. i can work on theory section and bammtools (but would be good to see what others think about bammtools section). if you have some pretty cool figs we can think about putting them in the graph gallery as well.

maximumShiftCredibilityTree( ) problem

In maximumShiftCredibilityTree( ), a probability is calculated for each sample in the posterior, and the best probability is chosen. However, in my tests, many samples have identical probabilities, making the "best" sample actually a vector of samples, which appears to break the function output.
My question is: Will samples with identical probabilities always have the same starting nodes that define the processes? If that's true, then any of the "best" could then be used to get the result.

package skeleton

unless someone has already started documenting things I'd suggest we hold off on package skeleton. I also think we should talk about who documents what and how to document things consistently.

Finish writing BAMMtools documentation

Some sections of the documentation on using BAMMtools needs to be finished.

Write Quick-start guide to BAMM

Helper functions for BAMM incomplete sampling input

Add helper fxn to BAMMtools to help users set up files with clade- or species-specific sampling probability.

Also test it

Make screen output user-friendly

Currently the output sent to the screen is not user-friendly. Would be nice to clean it up.

Update Configuration & Settings pages

Some parameters still need to be documented. There might also be old parameters not used any more or new parameters that need to be newly described.

Support for constrained (negative) phenotypic rate change model

I added a macro variable in TraitModel.cpp that specifies whether the betaShift parameter should be constrained to be negative. This can be commented out to have a fully unconstrained analysis (eg trait rates can increase and decrease through time). However, Mike found that there are real problems with this (not a programming issue, but real problems with the model itself when rates of trait evolution can increase exponentially through time).
Need to add user support to have easy on/off toggling of this constraint. This will involve changing TraitModel.cpp, TraitModel.h, Settings.cpp, and Settings.h

Class bammdata

Thoughts on changing the class of the object returned by getEventData to "bammdata" only? One advantage would be that a number of functions could dispatch methods on the bammdata object directly. So, for example,

x <- getEventData(phy, eventdata);
class(x);
[1] "bammdata" ##rather than [1] "phylo" "bamm-data"
plot(x) ##rather than calling plot.dtrates(x);
would immediately plot the dynamic rates tree if plot.dtrates is renamed to plot.bammdata. Because BAMMtools has the function as.phylo.bammdata is there a compelling reason to have two class names attached to the bammdata object?

Check for polytomies and non-ultrametric trees

Quit and warn if trees contain polytomies and/or are non-ultrametric.

Checks prior to running BAMM

In the course of running my datasets it seems to me there should be slightly more informative error messages on some basic tree issues so that people can fix them. The three things that have come up most often are: non-bifurcating trees, negative branch lengths (or 0 branch lengths), and malformed trees with internal nodes. Right now they give pretty uninformative error messages before aborting (or segmentation faulting), and it seems maybe these should be checked in the tree structure before actually attempting to run BAMM. Is it possible to do that within C? I know it's fairly easy to check in R first.

Use of system()

Linux complains while compiling when encountering the use of system(). I believe this is not fully compatible across all OSs. Maybe replace with something more portable.

autoconf

Use autoconf tools for configuring and compiling.

More example datasets needed

Need 2 example datasets with time calibrated trees and phenotypic data. Morphometric data on whales that can be paired with cetacean tree? See G Slater et al. Proc Soc B 2010. Current example dataset on skinks not OK as it isn't published yet.

Also need example dataset that can show workflow to set up incomplete sampling file.

segmentation fault for BAMMtraits on mac os x 10.9

I'm getting a seg fault with bamm traits again. Here is output from gdb.

pascals-mbp:test pascaltitle$ gdb ./bamm
GNU gdb (GDB) 7.6.1
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-apple-darwin13.0.0".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /Users/pascaltitle/bamm/test/bamm...
warning: `/Users/pascaltitle/bamm/src/Autotune.o': file time stamp mismatch.

warning: `/Users/pascaltitle/bamm/src/BranchEvent.o': file time stamp mismatch.

warning: `/Users/pascaltitle/bamm/src/BranchHistory.o': file time stamp mismatch.

warning: `/Users/pascaltitle/bamm/src/MCMC.o': file time stamp mismatch.

warning: `/Users/pascaltitle/bamm/src/MbRandom.o': file time stamp mismatch.

warning: `/Users/pascaltitle/bamm/src/Model.o': file time stamp mismatch.

warning: `/Users/pascaltitle/bamm/src/Node.o': file time stamp mismatch.

warning: `/Users/pascaltitle/bamm/src/Settings.o': file time stamp mismatch.

warning: `/Users/pascaltitle/bamm/src/TraitBranchEvent.o': file time stamp mismatch.

warning: `/Users/pascaltitle/bamm/src/TraitBranchHistory.o': file time stamp mismatch.

warning: `/Users/pascaltitle/bamm/src/TraitMCMC.o': file time stamp mismatch.

warning: `/Users/pascaltitle/bamm/src/TraitModel.o': file time stamp mismatch.

warning: `/Users/pascaltitle/bamm/src/Tree.o': file time stamp mismatch.

warning: `/Users/pascaltitle/bamm/src/main.o': file time stamp mismatch.
(no debugging symbols found)...done.
(gdb) run -control traitcontrol.txt
Starting program: /Users/pascaltitle/bamm/test/./bamm -control traitcontrol.txt

Initializing BAMM using control file <<traitcontrol.txt>>
Reading control file <<traitcontrol.txt>>
Read a total of <<40>> parameter settings from control file
Using clock to seed random number generator
trait
Initializing phenotypic evolution (trait) model

TRAIT DIVERSIFICATION BAMM

Reading tree from file <skinks183.tre>
1 tree read with 183 taxa
Tree ctor: event nodes: 0
Reading phenotypes from file <<skinks_shape.txt>>
Read a total of 183 species w trait data
Missing data for < 0 > species.
These will be treated as latent variables in analysis

count of FIXED nodes in getPhenotypesMissingLatent: 183
count of VARIABLE nodes in getPhenotypesMissingLatent: 182

Setting initial trait values at internal nodes
Initializing model and MCMC chain

Initializing model object....
Min and max phenotype limits set using observed data:
Min: -4.17108 Max: 8.60668
Root beta: 0.5 0.5 Shift: 0
Model object successfully initialized.
Initial log-likelihood: -609.742

Initializing Trait MCMC object...
Output file for MCMC data exists:
overwriting bmcmcout.txt
Output file for beta exists:
overwriting bbetarates.txt
Output file for node states exists:
overwriting bnodestate.txt
Output file for event data:
overwriting beventdata.txt
MCMC object successfully initialized.

Running MCMC chain for 1000000 generations.

Generation lnLik N_shifts LogPrior acceptRate
1 -609.828 0 2.07466 Accp: 1

Program received signal SIGSEGV, Segmentation fault.
0x0000000100028753 in TraitModel::deleteRandomEventFromTree() ()
(gdb) where
#0 0x0000000100028753 in TraitModel::deleteRandomEventFromTree() ()
#1 0x0000000100028ac1 in TraitModel::changeNumberOfEventsMH() ()
#2 0x000000010002c770 in TraitMCMC::updateState(int) ()
#3 0x000000010002ba31 in TraitMCMC::TraitMCMC(MbRandom_, TraitModel_, Settings*) ()
#4 0x00000001000017ab in main ()

dt rate trees

Hi MIke- I think the calculations in the dt rates functions are dependent on the ordering coming in in eventBranchSegs. Might be good to make the calculations either independent, or to internally re-order the eventBranchSegs. I had some issues when I was speeding up getEventData but when I order eventBranchSegs by node number it appears to be fine. Otherwise works great.

Check and update comments for control files

The new control files are much more organized and well-commented. However, there are places where the comments are not very descriptive or are missing altogether. I have tried to fix this, but there are parameters I'm not sure how to comment. If you know what the parameter does and how it should be used (and know a good range of values), please add a comment to this parameter. It would also be helpful if you could make the corresponding update to the online documentation (in the doc directory).

Support user-supplied random number seed

For replication purposes. Default to clock (if provided seed == -1), otherwise use user-provided seed.

lambda, mu, and beta outfiles

We should make lambda, mu, and beta outfiles optional. All the information to reconstruct those is in the event data file. They are just a shortcut to getting results from the event data file. Now, for users who have large trees, they can potentially generate gigabytes of output that they will not want if they output those files too frequently. This will be a real problem in some cases. those output files can be so large that they'll crash a system and are unreadable with a standard text editor.

So - we do not want to generate those files automatically.

Use user-supplied prefix to name output files

Instead of providing names for all output files in the control file, provide a single prefix (with bamm providing the appropriate suffix).