choderalab / pymbar

Python implementation of the multistate Bennett acceptance ratio (MBAR)

Home Page: http://pymbar.readthedocs.io

License: MIT License

Languages: Python 98.61%, Shell 0.42%, PowerShell 0.59%, Batchfile 0.36%, HTML 0.02%
Topics: pymbar, free-energies, python, thermodynamic-states, equilibrium, molecular-dynamics-simulations, single-molecule-pulling, mbar, multistate-bennett-acceptance-ratio, bennett-acceptance-ratio

pymbar's Introduction


pymbar

Python implementation of the multistate Bennett acceptance ratio (MBAR) method for estimating expectations and free energy differences from equilibrium samples from multiple probability densities. See our docs.

Installation

The easiest way to install the pymbar release is via conda:

conda install -c conda-forge pymbar

which comes with JAX to speed up the code. To get the non-JAX-accelerated version instead:

conda install -c conda-forge pymbar-core

You can also install JAX accelerated pymbar from the Python package index using pip:

pip install pymbar[jax]

or the non-JAX-accelerated version with

pip install pymbar

Whether you install the JAX-accelerated or non-JAX-accelerated version does not change any calls or how the code is run. The non-JAX version has a smaller footprint on disk because of its lighter dependencies, but may not run as fast.

The development version can be installed directly from github via pip:

# Get the compressed tarball
pip install https://github.com/choderalab/pymbar/archive/master.tar.gz
# Or obtain a temporary clone of the repo with git
pip install git+https://github.com/choderalab/pymbar.git

Usage

Basic usage involves importing pymbar and constructing an MBAR object from the reduced potentials of simulation or experimental data.

Suppose we sample a 1D harmonic oscillator from a few thermodynamic states:

>>> from pymbar import testsystems
>>> x_n, u_kn, N_k, s_n = testsystems.HarmonicOscillatorsTestCase().sample()

This gives the nsamples sampled oscillator positions x_n (with samples from all states concatenated), the reduced potentials in the (nstates, nsamples) matrix u_kn, the number of samples drawn from each state in the length-nstates array N_k, and the indices s_n denoting which thermodynamic state each sample was drawn from.
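
As a quick sanity check of that layout (a minimal sketch; the assertions only restate the shapes described above, not specific values):

>>> assert u_kn.shape == (len(N_k), len(x_n))  # one row per state, one column per sample
>>> assert len(s_n) == len(x_n)                # one state index per sample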

To analyze this data, we first import and initialize the MBAR object:

>>> from pymbar import MBAR
>>> mbar = MBAR(u_kn, N_k)

Estimating dimensionless free energy differences between the sampled thermodynamic states and their associated uncertainties (standard errors) simply requires a call to compute_free_energy_differences():

>>> results = mbar.compute_free_energy_differences()

Here results is a dictionary with keys Deltaf_ij, dDeltaf_ij, and Theta. Deltaf_ij[i,j] is the matrix of dimensionless free energy differences f_j - f_i, dDeltaf_ij[i,j] is the matrix of standard errors in these estimates, and Theta is a covariance matrix that can be used to propagate error into quantities derived from the free energies.
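
For example, the free energy of the last sampled state relative to the first, with its standard error, can be read off directly (a sketch using the result keys named above):

>>> df = results["Deltaf_ij"][0, -1]    # f_last - f_first, dimensionless
>>> ddf = results["dDeltaf_ij"][0, -1]  # standard error of that estimate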

Expectations and associated uncertainties can easily be estimated for observables A(x) for all states:

>>> A_n = x_n  # use the position of the harmonic oscillator as the observable
>>> results = mbar.compute_expectations(A_n)

where results is a dictionary with keys mu, sigma, and Theta: mu[i] is the estimate of the average of the observable in state i, sigma[i] is the estimated standard deviation of that estimate, and Theta[i,j] is the covariance matrix of the log weights.

See the docstring help for these individual methods for more information on exact usage; in Python or IPython, you can view the docstrings with help().
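
For example:

>>> help(mbar.compute_free_energy_differences)
>>> help(mbar.compute_expectations)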

JAX needs 64-bit mode

pymbar needs 64-bit floats to provide reliable answers; JAX uses 32-bit (single-precision) floats by default. pymbar therefore turns on JAX's 64-bit mode, which may cause issues for other uses of JAX in the same program as pymbar, such as existing neural network (NN) models for machine learning.
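
For reference, 64-bit mode is controlled through JAX's standard configuration mechanism (this is plain JAX, not a pymbar API); if other JAX code in the same program assumes 32-bit defaults, it will see this setting flipped:

import jax

# Enable double precision globally; pymbar requires this for reliable answers.
jax.config.update("jax_enable_x64", True)

# Equivalently, set the environment variable JAX_ENABLE_X64=1 before importing JAX.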

Authors

References

  • Please cite the original MBAR paper:

    Shirts MR and Chodera JD. Statistically optimal analysis of samples from multiple equilibrium states. J. Chem. Phys. 129:124105 (2008). DOI

  • Some timeseries algorithms can be found in the following reference:

    Chodera JD, Swope WC, Pitera JW, Seok C, and Dill KA. Use of the weighted histogram analysis method for the analysis of simulated and parallel tempering simulations. J. Chem. Theor. Comput. 3(1):26-41 (2007). DOI

  • The automatic equilibration detection method provided in pymbar.timeseries.detectEquilibration() is described here:

    Chodera JD. A simple method for automated equilibration detection in molecular simulations. J. Chem. Theor. Comput. 12:1799, 2016. DOI

License

pymbar is free software and is licensed under the MIT license.

Thanks

We would especially like to thank a large number of people for helping us identify issues and ways to improve pymbar, including Tommy Knotts, David Mobley, Himanshu Paliwal, Zhiqiang Tan, Patrick Varilly, Todd Gingrich, Aaron Keys, Anna Schneider, Adrian Roitberg, Nick Schafer, Thomas Speck, Troy van Voorhis, Gupreet Singh, Jason Wagoner, Gabriel Rocklin, Yannick Spill, Ilya Chorny, Greg Bowman, Vincent Voelz, Peter Kasson, Dave Caplan, Sam Moors, Carl Rogers, Josua Adelman, Javier Palacios, David Chandler, Andrew Jewett, Stefano Martiniani, and Antonia Mey.

Notes

pymbar's People

Contributors

badisa, bdice, chayast, chrisjonesbsu, dotsdl, ijpulidos, jaimergp, jchodera, kyleabeauchamp, lnaden, mattwthompson, maxentile, mikemhenry, mrshirts, richardjgowers, rmcgibbo, schuhmc, smcantab, tuckerburgin


pymbar's Issues

Create pymbar2.0 branch

So we should consider maintaining separate branches for 1.0 and 2.0, with future pull requests being merged towards the 2.0 branch.

Documentation improvements

I wanted to start a thread to discuss what additions/changes to the excellent documentation Kyle has started [http://pymbar.readthedocs.org] we might want now or in the future.

Some quick things:

  • I think I'll need to update the version available via the PyPI, last updated 11 Aug 2013: https://pypi.python.org/pypi/pymbar/2.0.1-beta
    Any thoughts on what sort of numbering/naming/revision scheme we should be using, since each version on PyPI must be a unique version number?
  • I think we can mention on the Getting Started page that easy_install pymbar should also work.
  • We should add some tutorials illustrating the application to the example systems we have in pymbar-examples that will eventually form the foundation for what we've been colloquially referring to as the "MBAR for Dummies" page.
    @mrshirts : What do you think we should focus on first, and how should the tutorials be structured?
    @kyleabeauchamp : Do you want to be the lead author on that paper, since this would also be a nice way for you to get credit for what you've done here? The notes for that paper are here in the old svn: https://simtk.org/websvn/wsvn/pymbar/manuscripts/#_manuscripts_
  • Rework main README.md to also include reference to the online documentation or potentially just direct the user there (to remove redundancy).

Avoid caching values except when necessary

So a lot of intermediate calculations are being cached as member variables in the various classes.

Various functions are then implemented as member functions that proceed to lookup the cached values.

I would suggest that we avoid caching values except when necessary for performance reasons (e.g. in the iterative solver).

I would also suggest that each "calculation" should have a function--not member function--that lists all parameters as arguments and has a clear docstring of what the arguments are.

The reason I suggest this is that I think it really makes it clearer what we are calculating and what the calculation depends on.

Obviously, there will be cases where we have to do some caching for performance reasons, but I think we're over-using it right now.
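
As a hypothetical illustration of the proposed style, a stand-alone function that takes all of its inputs explicitly and documents them (names and contents are made up for the example, not pymbar internals):

import numpy as np
from scipy.special import logsumexp

def mbar_log_denominators(u_kn, f_k, N_k):
    """Return log sum_k N_k exp(f_k - u_kn[k, n]) for every sample n.

    Parameters
    ----------
    u_kn : ndarray, shape (K, N)
        Reduced potentials of all N samples evaluated at all K states.
    f_k : ndarray, shape (K,)
        Current dimensionless free energy estimates.
    N_k : ndarray, shape (K,)
        Number of samples drawn from each state.
    """
    return logsumexp(f_k[:, np.newaxis] - u_kn, b=N_k[:, np.newaxis], axis=0)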

Move BAR to separate module

I think code would be easier to maintain and read if BAR, MBAR, and ExpGauss were all separate python files.

Create drop-in replacement for pymbar 2.0 MBAR object

This "legacy" interface could facilitate compatibility in existing codebases.

As a reminder, the key differences between 1.0 and 2.0 are probably:

  1. Member function names use Python naming convention
  2. Inputs are U_kn rather than U_kln
  3. Minor algorithmic and function argument differences.

The legacy interface might initially just implement the most important behaviors, such as calculating f_k and computeExpectations().
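
A minimal sketch of what such a shim could look like (hypothetical class and mapping; a real shim would also have to accept u_kln-shaped inputs and convert them):

from pymbar import MBAR

class LegacyMBAR(MBAR):
    """Hypothetical 1.0-style wrapper exposing camelCase method names."""

    def computeExpectations(self, A_n, **kwargs):
        # Forward the old-style call to the renamed method.
        return self.compute_expectations(A_n, **kwargs)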

pandas support?

Is there a way to only require pandas if those utilities are used? I noticed that it's not installed in the Python environment on my cluster, and most functionality doesn't need it.
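
One common way to make such a dependency optional (a sketch of the general pattern, not necessarily how pymbar resolves it) is to defer the import to the code paths that actually need it:

def to_dataframe(results):
    """Hypothetical utility that needs pandas only when it is actually called."""
    try:
        import pandas as pd
    except ImportError as err:
        raise ImportError("pandas is required for this utility; "
                          "install it with 'pip install pandas'") from err
    return pd.DataFrame(results)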

Fix long imports

So it seems that right now, our __init__.py has one too many levels of depth because of how we import:

In [1]: import pymbar

In [2]: pymbar.pymbar.[...]

We can instead use from pymbar import xyz in __init__.py to reduce the tree.
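
For example, the package-level __init__.py can re-export the public names (a sketch assuming the class lives in pymbar/pymbar.py, as the pymbar.pymbar path above suggests):

# pymbar/__init__.py
from pymbar.pymbar import MBAR  # exposes pymbar.MBAR instead of pymbar.pymbar.MBAR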

NumExpr and Cython

These are two possible things that could help code maintainability and performance:

I think Cython makes significantly cleaner code than, e.g., weave. For example, here is an RMSD wrapper written in Cython. The key is that all pointers are handled by numpy array indices. (https://github.com/rmcgibbo/mdtraj/blob/master/MDTraj/IRMSD/rmsd_wrap.pyx).

I also recently found out that NumExpr can dramatically accelerate simple arithmetic expressions involving transcendental functions. For example, here is a simple implementation of LogSumExp that speeds things up 10X. I know that this operation is probably not rate limiting, but I suspect that similar types of expressions are rate limiting...

import numpy as np
import numexpr as ne
import scipy.misc

x = np.random.normal(size=100000)
%timeit y = np.log(ne.evaluate("sum(exp(x))"))
%timeit y = scipy.misc.logsumexp(x)

Here are results:

1000 loops, best of 3: 341 us per loop

and

100 loops, best of 3: 2.15 ms per loop

Feel free to close this issue immediately; I just want to get this down for the record.

Move tests

Can we move the pymbar/pymbar/tests/ to pymbar/tests/?

We can keep pymbar/testsystems/ where it is.

Int vs. Float for N_k

Does it make sense to use float everywhere for N_k as a way to avoid silly integer arithmetic issues / casting issues? I suppose there is a slight reduction in resolution when doing this, but I don't think it's relevant--the resolution of a float64 is 1E-15...

removing (or moving) files in the 'old' directory

The statistical inefficiency code is in tests now, and after discussing with Kyle, it would make sense to move the other two to pymbar-examples. This would also remove the matplotlib requirement from anything in pymbar.

BAR errors from overflow should be caught or worked around

Right now, there are occasional errors with BAR. It ends up giving up and returning zero, but this error should be caught earlier.

pymbar.py:308: RuntimeWarning: overflow encountered in exp
log_f_R = - max_arg_R - numpy.log( numpy.exp(-max_arg_R) + numpy.exp(exp_arg_R - max_arg_R) ) - w_R
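
One way to sidestep the overflow (a sketch of an alternative formulation, not necessarily the fix adopted) is to let numpy.logaddexp do the exponentiation safely; the expression above is algebraically equal to -log(1 + exp(exp_arg_R)) - w_R:

import numpy

# numpy.logaddexp(0, x) computes log(1 + exp(x)) without overflowing for large x.
log_f_R = -numpy.logaddexp(0.0, exp_arg_R) - w_R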

GPU acceleration?

At one point, I wrote a CUDA GPU-accelerated kernel. Is this something we should include in a future version?

Error in computePMF example

I think I've spotted an error in the examples section of the MBAR.computePMF function, but it could also be me getting confused by the indices.

>>> from pymbar import oldtestsystems
>>> [x_kn, u_kln, N_k] = oldtestsystems.HarmonicOscillatorsSample(N_k=[100,100,100])
>>> mbar = MBAR(u_kln, N_k)
>>> u_kn = u_kln[0,:,:]

If $u_{kln} = u_l(x_n^{(k)})$, shouldn't it be

>>> u_kn=u_kln[:, 0, :]

in the 3rd line?

In other words, don't you use the potential energy at a fixed state $l_0$ for data generated at different thermodynamic states $k$ to estimate the PMF at the desired state $l_0$?

Create Branch for 1.XX

Before we start merging changes for the 2.0 release, we should probably create a branch for the last 1.XX release.

We have two options:

  1. Take the current git and copy it to a separate branch
  2. Copy the contents of the release tarball to a separate branch.

The problem with (1) is that I suspect the current codebase doesn't perfectly correspond to any particular released version.

The problem with (2) is that I think this "breaks" the Git model of code changes (fork and merge).

Not sure what the best option is, but I think we have to pick one of these two before we proceed on other merges.

Release PyPI package

This mainly involves doing a python setup.py upload after we update the package.

Differences between pseudoinverse and np.linalg.pinv

So in my refactor, I switched from our custom pseudoinverse to np.linalg.pinv. I noticed some differences in output for "svd-ev", which I tracked down to the tolerance inputs.

np.linalg.pinv defaults to 1E-15, while our custom pinv defaulted to 1E-12.

Switching np.linalg.pinv to use an rcond of 1E-12 eliminated the differences and restored agreement with pymbar 1.0.
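
For the record, the change amounts to passing the looser cutoff explicitly (a sketch; the matrix name is a placeholder):

import numpy as np

# Match the old custom pseudoinverse by relaxing the singular-value cutoff
# from numpy's default to 1e-12.
W_pinv = np.linalg.pinv(W, rcond=1e-12)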

Preserving svn history?

There's a lot of information in the pymbar svn repository -- for example, I'm now going back to figure out where we need to keep weights stored in logarithmic form by diffing against different versions of the log.

So, is there any way to preserve this history?

Convert to N x K storage for U_kln matrix

For data sets where we have different numbers of samples at different states, rather than storing the energies in a K x K x N_each matrix, it would be more efficient to store the data as a K x N_tot matrix, where N_tot is the total number of samples collected and K is the number of states at which we are evaluating the energies. This will take a bit of extra work, however.
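
A sketch of the conversion (hypothetical helper, assuming u_kln[k, l, n] holds the reduced potential of sample n drawn from state k evaluated at state l, with only the first N_k[k] entries along the last axis valid):

import numpy as np

def ukln_to_ukn(u_kln, N_k):
    """Flatten a (K, K, N_max) array into a (K, N_tot) array of real samples only."""
    K = len(N_k)
    N_tot = int(np.sum(N_k))
    u_kn = np.zeros((K, N_tot))
    start = 0
    for k in range(K):
        stop = start + int(N_k[k])
        # samples drawn from state k, evaluated at every state l
        u_kn[:, start:stop] = u_kln[k, :, : int(N_k[k])]
        start = stop
    return u_kn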

Switch license to LGPL 2.1

What would you all think about switching the license to LGPL? If the goal is maximal usage, maybe a less restrictive license would help?

Variable length function outputs

This is a minor style comment, but a lot of our functions look like this:

[...]

return returns

where returns is a list whose length varies depending on a number of optional arguments to the function.

To me, this style is unclear. Whenever possible, I would prefer our functions to return a fixed number of variables, returned by name:

return u_kn, f_i, kitchen_sink
