mobleylab / benchmarksets
Benchmark sets for binding free energy calculations: Perpetual review paper, discussion, datasets, and standards
License: BSD 3-Clause "New" or "Revised" License
Can we add Markdown tables for the CB7 and GDCC sets like we have for the CD sets? Those are really helpful.
I realize this information is in the manuscript itself, but when setting up calculations on the entire set of systems, it's much easier to use the Markdown (or just csv) tables than the PDF. For example, for each file I'm processing, I can fairly easily write a function that parses the tables, returns the host and guest, and stores the experimental binding affinity for later analysis. Even better would be to list host and guest residue names, along with charge. (I should be able to get the charge from the SMILES without too much difficulty, but having it listed directly would avoid dependencies on e.g. OpenEye or other chemistry-parsing code and help ensure everyone starts from the same exact state.)
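To make the use case concrete, here is a minimal sketch of the kind of table-parsing function described above. The table layout, column names, and values are hypothetical placeholders, not the repo's actual tables.

```python
def parse_markdown_table(text):
    """Parse a simple pipe-delimited Markdown table into a list of dicts.

    Assumes the first row is the header and the second row is the
    ``|---|---|`` separator; column names here are illustrative only.
    """
    lines = [ln.strip().strip("|") for ln in text.strip().splitlines()]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for line in lines[2:]:  # skip the separator row
        cells = [cell.strip() for cell in line.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows

# Hypothetical fragment of a host-guest table.
table = """
| Host | Guest | dG (kcal/mol) |
|------|-------|---------------|
| CB7  | G1    | -14.1         |
| CB7  | G2    | -9.9          |
"""
records = parse_markdown_table(table)
```

Each record then carries host, guest, and affinity together, which is enough to drive an automated setup loop over the whole set.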
I also realize I could submit a PR myself -- and it's on my to-do list -- but by listing it here, someone might take a stab at it before it surfaces to the top for me.
The CB7 and GDCC benchmarks do not have a README providing data or references to other computational works. We should provide these since (a) it's the right thing to do, and (b) we want to be consistent with the other data sets.
(This was mentioned in a comment on #47, but I'm creating an issue so we don't forget.)
I think we should probably move towards a model where all ligands (or guests) in each benchmark set have an appropriate, unique, paper-specific numerical compound ID, rather than the current model where this is dependent on what set we're looking at. For example:
@GHeinzelmann @nhenriksen - thoughts? My preference I think is to make sure every set has a unique numerical compound ID in the tables and that this is used for all of the relevant files.
@nhenriksen - we just got Germano's bromodomain stuff merged in, so would now be a good time for you to proceed towards getting your cyclodextrin stuff merged in as well? It seems like it will likely become important to do so fairly soon since it's potentially being used for the Yank paper (cc #41 ). Since we're done working on the bromodomain stuff now it should be possible to proceed towards merging without having to resolve conflicts multiple times.
Or, are there other things you need to do first?
Since the community is welcome to edit/propose changes to the paper, I perhaps should split out the major sections into separate TeX files to make it easier to deal with multiple changes at once without editing clashes.
On the other hand, maybe this will make it harder since people will have to figure out which file they need to edit.
Consider using Zenodo to make releases of this get permanent DOIs, as per https://guides.github.com/activities/citable-code/
The reference for the binding free energy of catechol to T4 L99A/M102Q in Table VIII is indicated to be [224]. I took a look at the original paper to figure out buffer conditions, but I couldn't find a reference to ITC measurements with catechol (see Table 3 in [224]). Also, in the benchmark sets paper, catechol is the only compound coming from [224] with an error associated with the experimental binding free energy.
Is it possible that this is a typo? It could be that the value comes from [19] instead.
From Nascimento on Biorxiv:
It seems that there is a mismatch in one of the lysozyme T4 (M102Q) complexes cited in Table VI. The crystal structure 2RBO does not contain n-phenylglycinonitrile as a ligand. Instead, 2-nitrothiophene is the binder there. So, the PDB code should be 2RBN to correctly point to the complex between T4 lysozyme M102Q and n-phenylglycinonitrile. Just my 2c for this very interesting paper!
Gilson got BindingDB updated to include the HG data in this set, so we should update with the link: http://bindingdb.org/bind/HostGuest.jsp
What formats?
Currently it is in LaTeX tables, but we should probably also provide .csv and .json.
Basically, everything we provide that anyone might want ought to be easily available in convenient electronic formats.
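As a sketch of how cheap the conversion is once the data is in csv, the snippet below turns a csv table into the equivalent .json with the standard library. The column names and values are illustrative, not the paper's actual headers.

```python
import csv
import io
import json

# Hypothetical csv export of one benchmark table.
csv_text = """host,guest,dG_kcal_mol
bCD,cyclopentanol,-2.5
bCD,1-butanol,-1.6
"""

# Each csv row becomes one JSON object keyed by the header row.
rows = list(csv.DictReader(io.StringIO(csv_text)))
json_text = json.dumps(rows, indent=2)
```

Keeping csv as the source of truth and generating json from it would avoid the two formats drifting apart.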
This is the computational analog of #2 -- one would like to make it easy to do new studies on existing benchmark systems for a variety of possible tests as detailed in the paper, including things like:
etc.
We need to plan how to facilitate this. We'll need to sort out how to make available computational materials - structures, input files, etc. Ultimately, we will likely even want a way to specify specific order parameters to analyze for convergence, etc. (e.g., something machine-handleable which can tell automated analysis to be sure to check sampling of Val103 in lysozyme L99A).
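As one possible shape for such a machine-handleable record, here is a sketch of a JSON "sampling checks" entry that an automated analysis tool could consume. All field names here are invented for illustration; no schema has been agreed on.

```python
import json

# Hypothetical machine-readable sampling check for a benchmark system.
check = {
    "system": "lysozyme-L99A",
    "order_parameters": [
        {
            "name": "Val103_sidechain",  # degree of freedom to monitor
            "type": "dihedral",
            "residue": "VAL103",
            "expected_states": 2,  # analysis should verify both are visited
        }
    ],
}
serialized = json.dumps(check, indent=2)
```

A small library of such records, one per benchmark system, would let automated convergence analysis know what to check without a human reading the paper.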
@nhenriksen @GHeinzelmann - I notice both of you reference (in your markdown files, which are great!) specific table numbers in the paper. I wonder if we should be referencing tables by title rather than by number. Otherwise, if we change things in the paper such that tables auto-renumber, the table references will all be wrong and someone will have to fix them. If we just refer to them by title, we won't have to remember to change them.
Thoughts?
An interesting question is whether it is possible to facilitate new experiments on existing benchmark systems. Specifically, could we make it easy to access the necessary materials for new experiments? For example, one could imagine being able to lay out for host-guest systems that one should buy this host and this guest from this supplier.
Perhaps there are vendors who would participate in this, or perhaps even NIST could provide standard reference materials?
I'm dropping a note here to mention that reading in cyclic molecules with non-unique atom names can cause an error with ff.createSystem, as detailed here. I'll update this issue once we decide on a robust solution.
To move this in the direction of helping people benchmark, we should provide calculated values from gold standard calculations with the provided files, when available. These should be in a markdown file in the relevant directory, I think.
@nhenriksen - is this something you're able to add? I think you have values for all of the files you've deposited?
@GHeinzelmann - I think you may not?
@Janeyin600 - do you?
At some point we'll actually need to repeat the lysozyme calculations (or another group will) and get input files for those, and calculated values, in here as well.
We will probably want to provide some additional data to accompany the existing benchmarks already noted in the paper in order to facilitate new science beyond the benchmarks proposed:
The CB7 and GDCC guest input files do not have coordinates which correspond to a bound state in the host.
Per Niel:
They are close [to bound], but clearly not a plausible bound state. Jane made these files, and I don't see a way to fix this without manually setting them up or extracting conformations from the equilibrated prmtop/rst7 files.
I now have a Jupyter notebook I've prepared for SAMPL6 which can dock guests to hosts, so we should be able to re-generate these files from compound isomeric SMILES strings. It'll just take me a bit of time to get to that.
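For reference, generating an initial 3D conformer from an isomeric SMILES string can be sketched as below using RDKit (the notebook mentioned above may well use different tooling, and the SMILES and random seed here are just illustrative). Note that this only yields a gas-phase conformer; actually placing the guest in a bound pose within the host still requires the docking step.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "C1CCCC1O"  # cyclopentanol, as an example guest
mol = Chem.MolFromSmiles(smiles)
mol = Chem.AddHs(mol)                      # explicit hydrogens before embedding
AllChem.EmbedMolecule(mol, randomSeed=42)  # generate a 3D conformer
AllChem.MMFFOptimizeMolecule(mol)          # quick force-field cleanup
mol_block = Chem.MolToMolBlock(mol)        # coordinates, ready to write out
```

Starting from SMILES also guarantees everyone regenerates the same chemical identity (stereochemistry, protonation as encoded), even if the coordinates differ.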
I need to adjust the paper text to mention the availability of the host-guest data sets added in #22 .
Dear Mobley's team,
This is not really an issue, more of a request. I am using the host-guest systems for my research and I need experimental Delta H results. I realized that Delta H results are missing for gdcc-set2, CB set1, and CB set2, and I was wondering if you would be able to provide them. I appreciate your support on this.
Best regards,
Sahar
In a recent round of edits on bromodomain ligands, Mike Gilson suggested:
Would it make sense to add some info to the table providing the references for the computational papers for each ligand to date? On the other hand, this would deviate from what we are doing in the other benchmark data tables...
@GHeinzelmann - what do you think of this? Not for BRD4(1) in specific, but should we be perhaps compiling supplemental data (perhaps in markdown files that other people can easily edit) for each benchmark set that lists all of the studies of each ligand, perhaps by DOI, maybe also with a spot where people can remark on key insights from each study?
This would provide a way for the community to effectively add notes to this repo on what they think has been shown in the literature; then we could link to it from the paper but it wouldn't be part of the paper itself.
@nhenriksen - we were trying to work with the cyclopentanol (guest 4) example from CD set 1, and noticed that the AMBER prmtop file has mixed 1-4 scaling factors (SCEE). Was this intended and, if so, can you explain why?
This makes conversion into some code bases (GROMACS for example) impossible, and I've also never seen it before, so I am very curious where this came from/why it's done here.
Thanks.
cc @elkhoury
Hi all, we'd like to use the CD input files to run YANK calculations. In particular, we'd like to start from the .mol2 files currently in the nieldev branch to prepare our solvation boxes in TIP4P-EW waters. A couple of questions (tagging @nhenriksen who is working on the branch):
- Do the mol2 files already have the same protonation state/charges that were used in the reference calculations?
- Could the coordinates in the mol2 be made the same as those in the final rst7 file, so that the guest will be in the binding pocket? I can work on this myself in case you don't have time but you are still interested. I'll have to do it anyway in the next couple of days to set up my simulations.

Ultimately, we want to have computational data available for benchmark systems to make it easy for new researchers to reproduce and then learn from (by building on or deviating from) the work of previous researchers. To facilitate this, we need to sort out more guidance on how such computational data should be made available. Ideally, it would be made available in a way such that if you wanted to begin studies on my system, you could do it automatically from my archived data files, without even needing a human being to read a set of README files.
To make this possible, we need to decide what data we would provide and how.
At one point I started a Google Doc for discussion of how we could make this happen, and I need to resurrect that and get discussion going again here and elsewhere.
#22 added an extensive set of host-guest input files for the host-guest sets from the paper, courtesy of Jian Yin from the Gilson lab. I need to adjust the README she kindly provided into a README.md, and add a manifest of what files were added and how they were organized.
As suggested by @slochower, I should label the residues in the T4 binding site figure for the discussion on p. 11, right column.
Include a set of bullet points describing what should be documented about a benchmark system and how to document it
Originally, I'd planned to write material on what new benchmark systems are most needed (i.e. what attributes they should have or problems they should exemplify -- water sampling problems, for example?) but I ran out of space before getting to this.
Perhaps this should still be done; input will be helpful.
The BRD4-*.pdb files in this directory are missing the required CONECT records for the small molecule ligands. As a result, we need to add CONECT records for the nonstandard residues.

I just want to make sure you've noticed this issue: ParmEd/ParmEd#898. Briefly, manipulating the cyclodextrin mol2 files with parmed results in a ring breaking. A work-around would be assigning a single residue number to all cyclodextrin atoms (currently 7 for beta-CD and 6 for alpha-CD).
@davidlmobley, if somebody in your group has run cyclodextrin calculations with YANK using non-OpenEye charges, this bug surely affected the setup.
For example, butylammonium: https://github.com/MobleyLab/benchmarksets/blob/master/input_files/cd-set1/sdf/guest-1.sdf
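The suggested work-around (one residue number for all cyclodextrin atoms) can be sketched as a plain rewrite of the substructure columns in the mol2 ATOM records, following the TRIPOS mol2 layout (atom_id, name, x, y, z, type, subst_id, subst_name, charge). The function name and the merged residue name "MGO" are hypothetical placeholders.

```python
def merge_mol2_residues(text, subst_id="1", subst_name="MGO"):
    """Assign a single substructure (residue) ID/name to every atom
    in a mol2 file, leaving all other sections untouched."""
    out, in_atoms = [], False
    for line in text.splitlines():
        if line.startswith("@<TRIPOS>"):
            in_atoms = line.strip() == "@<TRIPOS>ATOM"
        elif in_atoms and line.strip():
            cols = line.split()
            cols[6], cols[7] = subst_id, subst_name  # subst_id, subst_name
            line = " ".join(cols)
        out.append(line)
    return "\n".join(out)

# Toy two-atom example with two residues, merged into one.
mol2_text = """@<TRIPOS>MOLECULE
bCD
2 1 7
@<TRIPOS>ATOM
1 C1 0.0 0.0 0.0 C.3 1 RES1 0.10
2 C2 1.0 0.0 0.0 C.3 2 RES2 -0.10
@<TRIPOS>BOND
1 1 2 1
"""
merged = merge_mol2_residues(mol2_text)
```

Editing the text directly, rather than round-tripping through a parser, sidesteps the ring-breaking bug entirely.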
It would be good to develop a set of criteria a benchmark system should typically meet in terms of data quality, structure availability, etc.
Originally, I thought that we would be able to develop a universal set of such criteria (i.e. high quality structures of such-and-such a resolution, ITC or SPR binding affinities, etc.), but then as the paper developed we realized that different types of data are needed depending on the purpose of a test, as in Section II.A ("hard" and "soft" benchmarks). So, it may not be that we can provide a universal set of criteria -- but it would be good to discuss criteria that might apply in the different categories.