rinikerlab / reeds Goto Github PK

This pipeline is executing a RE-EDS run for relative free energy calculations. It can be used for example for calculation of hydration free energies or ligand binding free energies.

Home Page: https://rinikerlab.github.io/reeds/

License: MIT License

Python 99.98% Shell 0.02%

drug-design free-energy-calculations gromos python hpc replica-exchange enhanced-sampling eds workflow

reeds's Introduction

REEDS

Replica Exchange - Enveloping Distribution Sampling (RE-EDS) is a method to calculate the relative free energy of multiple states in a system. It can be applied to calculate relative solvation free energies or relative binding free energies with Gromos. One advantage of this method is that no alchemical transition path from one end state to another end state is required thanks to Enveloping Distribution Sampling (EDS) by Christ et al. . EDS allows to explore the end state graph during the simulation with multiple different alchemical transition paths.

The enhanced sampling method Replica Exchange was added by Sidler et al. to speed up the sampling and ease the choice of parameters. Additionally, multiple modules were described by Sidler to allow an automatization of the pipeline. In this repository, we now combined these approaches to an automatic scheme for RE-EDS.

The repository aims to make the RE-EDS pipeline accessible to everyone!

For more on RE-EDS checkout:

Structure

This Project contains the code for the RE-EDS workflow of the rinikerlab. In the example folder, you can find the system input data for the pipeline, a template script folder, and several jupyter notebooks covering and explaining certain aspects of the pipeline. The reeds folder contains some template files, a function library, and ready-to-use scripts for the RE-EDS steps.

Additional: The submodule folder contains PyGromosTools, which is used extensively in the RE-EDS code. The subfolder tests are used for the automatic code testing mainly applied to the energy offset estimation and the roundtrip optimizers.

Code

The code is written in python3 and requires a compiled version of Gromos. At the current stage, an LSF queue is required to submit the simulation jobs. The required python packages are listed in devtools/conda-envs/full_env.yaml.

Install

Using this repository, clone it (like the command below) into a directory on your machine and add the path to the repo to your python path.

git clone --recurse-submodules <repo url>

Make sure you have the required python packages from devtools/conda-envs/full_env.yaml. You can install the provided env with Anaconda like:

conda env create -f devtools/conda-envs/full_env.yaml

If you want to update the code of the PyGromos submodule, you can do this:

git submodule init
git submodule update

If you're writing code for this repository, please develop it on your own branch first.

 git branch mybranch    #generate your branch
 git checkout mybranch  #switch to your branch
 git merge main   #for adding new features from main to your branch

Copyright

Acknowledgments

Project based on the Computational Molecular Science Python Cookiecutter version 1.3.

reeds's People

Contributors

Stargazers

Watchers

Forkers

riesben shiyx409 tsenapathi lgtm-migrator chloebazin akosny candidechamp awooner1

reeds's Issues

remove equilibration period from undersampling detection

I found that for some systems and initial coordinates, there can be a few exceptionally high energy values in the first few steps of the simulation of the lower bound search for the lowest s-values. This can lead to an overestimation of the undersampling threshold, with the consequence that undersampling is detected at an s-value which is too high. This in turn means that longer simulations are required for the free energies to converge.

The straightforward solution is to remove the first e.g. 10% of the sampled energies from the undersampling threshold detection. This is implemented in https://github.com/rinikerlab/reeds/tree/undersampling_detection_equilibration

new repdat + imd format

In the new parallel GromosXX RE-EDS code, the format of the repdat output file changed slightly. When a new repdat is parsed with the pygromos file parser, I get the following error message:

Traceback (most recent call last):
  File "job_analysis.py", line 76, in <module>
    do_Reeds_analysis(in_folder=in_folder, out_folder=out_folder, gromos_path=gromos_path, topology=topology, in_ene_ana_lib=in_ene_ana_lib, in_imd=in_imd, optimized_eds_state_folder=optimized_eds_state_folder, state_undersampling_occurrence_potential_threshold=state_undersampling_occurrence_potential_threshold, state_physical_occurrence_potential_threshold=state_physical_occurrence_potential_threshold, undersampling_frac_thresh=undersampling_frac_thresh, grom_file_prefix=grom_file_prefix, title_prefix=title_prefix, n_processors=n_processors, verbose=verbose, control_dict=control_dict, )
  File "/cluster/work/igc/wsalome/code/reeds/reeds/function_libs/pipeline/worker_scripts/analysis_workers/RE_EDS_general_analysis.py", line 303, in do_Reeds_analysis
    boundary_conditions=boundary_conditions)
  File "/cluster/work/igc/wsalome/code/reeds/reeds/function_libs/file_management/file_management.py", line 1266, in reeds_project_concatenation
    repdat_file_paths=repdat_file_paths, verbose=verbose)
  File "/cluster/work/igc/wsalome/code/reeds/reeds/function_libs/file_management/file_management.py", line 208, in thread_worker_concat_repdat
    repdat_file = repdat.Repdat(repdat_file_paths.pop(0))  # repdat Class
  File "/cluster/work/igc/wsalome/code/reeds/reeds/submodules/pygromos/pygromos/files/repdat.py", line 48, in __init__
    system, data = parser.read_repdat(input_path, Vj_header=True)
  File "/cluster/work/igc/wsalome/code/reeds/reeds/submodules/pygromos/pygromos/files/_basics/parser.py", line 608, in read_repdat
    df = df.astype({'ID': 'int32', 'partner': 'int32', 'run': 'int32','s': 'int32'})
  File "/cluster/work/igc/wsalome/anaconda3/envs/py37/lib/python3.7/site-packages/pandas/core/generic.py", line 5526, in astype
    "Only a column name can be used for the "
KeyError: 'Only a column name can be used for the key in a dtype mappings argument.'

The issue can be solved by doing the renaming df.rename(columns={"coord_ID": "ID"}, inplace=True).

What I don't understand is line df.rename(columns={"exch": "s"}, inplace=True). In the old repdat format, there is a column called exch, which contains booleans that encode whether the exchange between two replicas was accepted or rejected. The same column is still present in the new format, and it doesn't correspond to the s-values. @schroederb do you know what the idea behind this is?

old functions (ported from gitlab)

by @SalomeRonja :

I'm currently working on the documentation for the analysis. In reeds/reeds/function_libs/analysis/analysis.py there is a function called fetch_lig_atom which is not called from anywhere in the reeds or pygromos code. I'm also not sure what the original purpose of the function was. Do we still need this function @schroederb ?

missing state_potential_treshold in sampling analysis of lower bound analysis

in the function call of reeds.function_libs.analysis.sampling.sampling_analysis(...) in reeds/function_libs/pipeline/worker_scripts/analysis_workers/RE_EDS_explore_lowerBound_analysis.py there is a missing required positional argument state_potential_treshold.

If I understand correctly, this is the list which should contain the physical sampling thresholds from step a, right @schroederb ? If that's the case, I would quickly add the code needed to read those in from the analysis directory of step a.

two versions of optimize_energy_offsets (ported from gitlab)

There are two functions called optimize_energy_offsets, one in reeds/function_libs/optimization/src/energy_offset_estimator.py and one in reeds/function_libs/optimization/src/energy_offset_optimization. The first one is a class member function, and the second one is an independent function. Only the second one is called during the calculation of the energy offsets in reeds/function_libs/optimization/eds_energy_offsets.py, but I'm not sure if the first one is still needed somewhere else?

add fraction threshold to global settings file

see #18:

Personally, I would prefer to have a fixed default value, and adapt it from the input scripts - I think it makes it easier to have a consistent main branch and not accidentally merge other default values from out own branches without other users noticing.

If you don't like adapting it in multiple places, which is understandable, we could add an option to add it to the global definition file, so you only have to change it in one place and then the other submission scripts take the value from there.

But if you both think it's nicer to change the threshold directly in the function, that's of course also fine for me :)

Originally posted by @SalomeRonja in #18 (comment)

remove/adapt historical V_tresh

This is low priority, but the submission scripts still contain the variable pot_tresh, and V_tresh is still reported in a few plots, e.g. the sampling_hist plots. We should make sure that we remove and/or replace these occurrences appropriately.

energy offset plot - sampling_timeseries - error (ported from gitlab)

by @candidechamp :

The plots called sampling_timeseries_sXXX.png which are generated by the function plot_t_statepres() (found in visualisation.py) show incorrect values for the x-axis of the plots.

These plots are saved in the /data/eoff directory after a run.

an example can be found in /cluster/home/cchampion/work/PNMT/eoffs_default/analysis/eoff where the correct values for the x-axis are used in the timeseries of the potential energies: /cluster/home/cchampion/work/PNMT/c_eoffs_default/analysis/plots

Make Pipeline executable locally and eulerisch (ported from gitlab)

by @schroederb

The idea here is to make the pipeline executable on multiple different systems (or at least, 'quickly' transferable to)
The goal for a start would be now to have:

Dummy: prints out all commands does nothing
LOCAL: executes all commands without queueing
LS

F: Uses the LSF Queuing system by IBM, like on Euler.

(- SLURM: another frequently used queuing system)

Dependency : This depends on the new PyGromos branch, that is bringing 2 more execution systems -> we would need to update PyGromos and test them.

colors in plot_style.py

In reeds/function_libs/visualization/plots_style.py there are several color lists which are used for the visualization, e.g. gradient_blue_list or candide_colors.

In these lists, the colors are listed explicitly. This causes issues for systems that consist of more ligands than there are colors, which will happen more often in the future as we start applying RE-EDS to larger sets, like my benzene test set consisting of 26 ligands. It also gets more and more tedious to choose the colors manually, as the system sizes grow.

Hence, I would suggest that we start using a non-discrete color map where the colors are chosen according to the number of ligands in the system. E.g. the cycler package using something like cycler('color', plt.cm.jet(np.linspace(0,1,num_ligs))) , where num_ligs is the number of ligands in the system.

@schroederb @candidechamp what do you think? If you agree, I would add this change to the visualization scripts.

Input coordinates for step d should be taken from d_sopt - input (ported from gitlab)

by @candidechamp :

Adjust the paths to the correct input.

also for step a

Random Cluster Bugs

Hi @candidechamp,
the new copying back from the cluster scratch, did fail for me!
The run did not copy all files back to the work folder (cnfs missing).
It looks, like somehow the copying after run fails/ the scratch folder is not found anymore?

Does that also happen for you?

Here I attached the output file:
CHK1_nd5_enr3_complex_prod_1SS_21r_3_sopt4_rb3_max8_md.txt

analysis of step c - error (ported from gitlab)

By @candidechamp:

I get an error in my analysis of step c (energy offsets), related to the plotting of the plots in the "s_optimization" folder.

The data was generated with the pipeline's default parameter, and the run can be found in:
/cluster/home/cchampion/work/PNMT/eoffs_default

The exact error message reads:

Traceback (most recent call last):
  File "job_analysis.py", line 22, in <module>
    do_Reeds_analysis(in_folder=in_folder, out_folder=out_folder, gromos_path=gromos_path, topology=topology, in_ene_ana_lib=in_ene_ana_lib, in_imd=in_imd, optimized_eds_state_folder=optimized_eds_state_folder, pot_tresh=pot_tresh, frac_tresh=frac_tresh, grom_file_prefix=grom_file_prefix, title_prefix=title_prefix, n_processors=n_processors, verbose=verbose )
  File "/cluster/home/cchampion/programs/reeds/reeds/function_libs/pipeline/worker_scripts/analysis_workers/RE_EDS_general_analysis.py", line 373, in do_Reeds_analysis
    analysis.get_s_optimization_transitions(out_dir=out_dir, rep_dat=in_file, title_prefix=title_prefix)
  File "/cluster/home/cchampion/programs/reeds/reeds/function_libs/analysis/analysis.py", line 596, in get_s_optimization_transitions
    vis.plot_replica_transitions_min_states(single_transition_trace, s_values=old_svals,
  File "/cluster/home/cchampion/programs/reeds/reeds/function_libs/analysis/visualisation.py", line 1619, in plot_replica_transitions_min_states
    y_axis_for_s_plots(ax=ax, s_values=s_values, yBond=yBond, ammount_of_y_labels=ammount_of_y_labels)
  File "/cluster/home/cchampion/programs/reeds/reeds/function_libs/analysis/visualisation.py", line 116, in y_axis_for_s_plots
    ax.set_yticklabels([nice_y[ind]  for ind in range(len(y_range[::step_size]))][::-1])
  File "/cluster/home/cchampion/programs/reeds/reeds/function_libs/analysis/visualisation.py", line 116, in <listcomp>
    ax.set_yticklabels([nice_y[ind]  for ind in range(len(y_range[::step_size]))][::-1])
IndexError: list index out of range

I would imagine the bug comes from one of the new additions in the latest merge requests.

lowest s-value in step b is 1 for vacuum simulations

The new version of choosing the lowest s-value leads to the s-values being cut at 1 in my vacuum simulations

The s-values are already cut at 1. This leads to other problems in the pipeline, because no intermediate s-values can be added. I think we should add a criterion to say that the cut cannot be made higher than e.g. 0.01. @schroederb & @candidechamp what do you think?

lost files during copying from scratch

With the new parallel code, and using more nodes per simulation, the issue of files being lost during the copy-back from Euler's scratch seems to have increased. @epbarros and I both experienced this issue. I lowered the threshold on the s-values for using a local work directory on the node instead of working on scratch in #48.

@epbarros @schroederb @candidechamp if you're all happy with this, it can be merged.

SolveTheMergeQueue

Let's solve the pile of changes

Merge Order:

add potential_tresholds
New Eoff Code #18
fixing small bug in sopt vis #19
Energy offset input prep following ssm approach #20

SSM approach not working on eoff rebalance branch

I just noticed that for my newest simulations using the eoff rebalancing branch #51 , the ssm approach was not working (i.e. not multiple replicas with s = 1 added for parameter optimization).

I'm currently investigating if the problem is related to my input scripts/setup. @epbarros @candidechamp @schroederb if you're currently running simulations on the eoff rebalancing branch with SSM turned on it would be great if you could let me know if you're also experiencing this issue. Thanks a lot!

try to minimize Unit_Test duration (ported from gitlab)

by @schroederb

the unit tests for the modules takes ages, therefore the CI crashes from time to time....

RE-Statistics

Hi @candidechamp,

I just checked out your branch exchanges, as I was cleaning up some branches. Just for your information, if you want to further get information on the RE-statistics.

You can check out this Jupyter-Notebook, as I mentioned before: https://github.com/rinikerlab/reeds/blob/main/examples/RE_EDS_s_optimization_RTO.ipynb

Here you can see how to the code calculates RoundTrips, replica-flows etc.

number of roundtrips

As we discussed recently, the final sopt analysis summary plot is not ideal at the moment, namely the "Total number of roundtrips" plot. The simulation time of the sopt runs increases with each iteration (500ps, 1000ps, 1500ps, 1500ps, ...) and this is not taken into account in the total number of roundtrips plot.

I had a look at the code today, and what the plot actually shows is not the total number of roundtrips, but the total number of replicas which perform at least one roundtrip. As I see it there are two options:

leave the plot as it is now, but change the title to something like "number of replicas performing at least one roundtrip"
change the plot so it shows the total number of roundtrips per ns

@schroederb @candidechamp what do you think? If you let me know which solution you prefer (or you have a different idea for a metric that would be interesting for the summary plot), I can make the corresponding alterations to the code.

Introduce a global settings file (ported from gitlab)

by @schroederb :

For the pipeline it would be useful to have a global setting File/Data base.
This would allow us to easily communicate between the different jobs, transfering intermediate results, provide paths, checking pipeline status (which job was successful) etc.
I would suggest to generate/modify this as a file automatically during the pipeline and not required as user input.
It does not need to be human readable.

Change - Preparation of s-optimization input (ported from gitlab)

by @candidechamp :

The code converting the output of the energy offset analysis seems to produce .imd files which don't have the proper number of s-values for s-optimization step.

Right now, the code ensures that if you had 11 s-values at the end of the energy offset estimation (uniformly distributed on a log scale), then you should also have 11 s-values for the first step of the s-optimization.

However, now that we are using the SSM approach, it will enforce that we have N s = 1 replicas (where N is the number of end states). The code therefore redistributes the 11 - N other s-values uniformely, which means that the s-distribution of lower s-values is smaller. (In my specific example with CHK1, N = 5, so we are left with only 6 replicas with s != 1).

I think it would be better to add the replicas with s = 1 such that we keep all of the other s-values just as they are. We would therefore keep the same s-distribution, and add (N-1) replicas for s = 1.

What do you think @schroederb @SalomeRonja @epecora ? This is what seems intiutive to me (and that's why I file it as a bug), but maybe you think differently? Reducing the number of s-values also in principle means we need to do even more s-optimization.

I will fix this I you give me the thumbs up that it is the functionality that we want to have!
(I had planned to change this part of the code anyways, as this s-distribution would have to change if we find out that using a different initial s-distribution could lead to a shorter s-optimization phase).

Bug - Potential thresholds are not correctly used.

I went over the code again. and the way, we are using the potential threshold in the new pipeline is not how I did it in the past work.
We need to change it as the way it is done right now it is wrong in this context.

@candidechamp : we can not determine the thersholds in the eoff. they should be determined before.
@SalomeRonja : we need to find a way to pass the different types of pot_tresh. (physical - undersampling)

I will take care of it.

state sampling thresholds in vacuum

I just ran the single state optimization for one of my vacuum system, and I noticed that some of the physical sampling thresholds are nans. The function which determines the sampling thresholds gives a nice warning for this:

"A state potential threshold was NaN -> This hints on that you did not sample the state as a dominating one! Please check your simulatigons!"

In case of the vacuum simulations, even with biased energy offsets, I always get the same dominating states:

0
State e1 - Domination state sampling fraction  0.0
State e2 - Domination state sampling fraction  0.1715
State e3 - Domination state sampling fraction  0.8185
State e4 - Domination state sampling fraction  0.0
State e5 - Domination state sampling fraction  0.0
State e6 - Domination state sampling fraction  0.0
[nan, -124.07254812657001, -101.61868126581481, nan, nan, nan]
/cluster/work/igc/wsalome/code/reeds/reeds/function_libs/analysis/sampling.py:198: UserWarning: A state potential threshold was NaN -> This hints on that you did not sample the state as a dominating one! Please check your simulatigons!
  warnings.warn("A state potential threshold was NaN -> This hints on that you did not sample the state as a dominating one! Please check your simulatigons!")
1
State e1 - Domination state sampling fraction  0.0
State e2 - Domination state sampling fraction  0.1748
State e3 - Domination state sampling fraction  0.8152
State e4 - Domination state sampling fraction  0.0
State e5 - Domination state sampling fraction  0.0
State e6 - Domination state sampling fraction  0.0
[nan, -126.3725692960204, -100.62956354681052, nan, nan, nan]
2
State e1 - Domination state sampling fraction  0.0
State e2 - Domination state sampling fraction  0.0
State e3 - Domination state sampling fraction  0.99
State e4 - Domination state sampling fraction  0.0
State e5 - Domination state sampling fraction  0.0
State e6 - Domination state sampling fraction  0.0
[nan, nan, -180.5018617695226, nan, nan, nan]
3
State e1 - Domination state sampling fraction  0.0
State e2 - Domination state sampling fraction  0.1733
State e3 - Domination state sampling fraction  0.8167
State e4 - Domination state sampling fraction  0.0
State e5 - Domination state sampling fraction  0.0
State e6 - Domination state sampling fraction  0.0
[nan, -124.43284366532423, -100.37143183639859, nan, nan, nan]
/cluster/work/igc/wsalome/code/reeds/reeds/function_libs/analysis/sampling.py:198: UserWarning: A state potential threshold was NaN -> This hints on that you did not sample the state as a dominating one! Please check your simulatigons!
  warnings.warn("A state potential threshold was NaN -> This hints on that you did not sample the state as a dominating one! Please check your simulatigons!")
4
State e1 - Domination state sampling fraction  0.0
State e2 - Domination state sampling fraction  0.174
State e3 - Domination state sampling fraction  0.816
State e4 - Domination state sampling fraction  0.0
State e5 - Domination state sampling fraction  0.0
State e6 - Domination state sampling fraction  0.0
[nan, -124.66017732804731, -100.683085612474, nan, nan, nan]
/cluster/work/igc/wsalome/code/reeds/reeds/function_libs/analysis/sampling.py:198: UserWarning: A state potential threshold was NaN -> This hints on that you did not sample the state as a dominating one! Please check your simulatigons!
  warnings.warn("A state potential threshold was NaN -> This hints on that you did not sample the state as a dominating one! Please check your simulatigons!")
5
State e1 - Domination state sampling fraction  0.0
State e2 - Domination state sampling fraction  0.1738
State e3 - Domination state sampling fraction  0.8162
State e4 - Domination state sampling fraction  0.0
State e5 - Domination state sampling fraction  0.0
State e6 - Domination state sampling fraction  0.0
[nan, -124.65177166175785, -100.61310471674739, nan, nan, nan]

This is low priority, but something to keep in mind. I'll have to see what happens when I use a higher bias for the energy offsets, or handle the threshold detection differently in case of vacuum.

Bug: Physical Sampling threshold detection

There is a bug in the code used to detect the thresholds which we use to determine whether a state is sampled or not (step A of the pipeline).

In short, the function physical_occurence_potential_threshold_distribution_based() found in reeds/reeds/function_libs/analysis/sampling.py does a pre-filtering of the data based on the maximally contributing sampling definition, but does not include the energy offsets in the calculation which means the states that are said to be maximally contributing are not well chosen.

TO DO: The way to solve the problem is very simple, give as additional input the energy offsets and include them in the calculation.

I do think however that it would be nice to spend a bit of time to clean up the way code calculating "sampling" is handled in the pipeline (i.e. having a single function to determine things to avoid having two pieces of code doing the same thing slightly differently at two places like here).

Bug - Job Preparation lower bound analysis - MULTIBATH (ported from gitlab)

by @candidechamp

I got a bug in the input preparation of the lower bound analysis of my PNMT runs in complex (which also include a cofactor).

The bug is that the way the MULTIBATH block created was wrong. As a temporary way to bypass this, I just copy pasted the values we want in that block into the "template_eds.imd" file. I took the correct values from my optimized_state step (as the imd was generated fine with the cofactor there).

The code that is supposed to assign the correct values in the imd file is in the following file: reeds/reeds/function_libs/pipeline/module_functions.py.

I commented out the line number #176 imd_file.MULTIBATH.adapt_multibath(last_atoms_bath=temp_baths)
which basically didn't overwrite what is in the template.

But it would be good to fix the code which sets the variable temp_baths to the correct values.
This code is just above in the file ~ lines 140-160. But there are various if/else statements so I'm not exactly sure what all the different cases could be here, and I don't want to introduce another bug by trying to fix it.

Thanks.

edit: The code preparing the input file for the optimized states did the right thing, so maybe its possible to tweak the code for step b in the same way?

rinikerlab / reeds Goto Github PK

reeds's Introduction

REEDS

Structure

Code

Install

Copyright

Acknowledgments

reeds's People

Contributors

Stargazers

Watchers

Forkers

reeds's Issues

Recommend Projects

Recommend Topics

Recommend Org