mobleylab / benchmarksets
Benchmark sets for binding free energy calculations: Perpetual review paper, discussion, datasets, and standards
License: BSD 3-Clause "New" or "Revised" License
Can we add Markdown tables for the CB7 and GDCC sets like we have for the CD sets? Those are really helpful.
I realize this information is in the manuscript itself, but when setting up calculations on the entire set of systems, it's much easier to use the Markdown (or just csv) tables than the PDF. For example, for each file I'm processing, I can fairly easily write a function that parses the tables, returns the host and guest, and stores the experimental binding affinity for later analysis. Even better would be to list host and guest residue names, along with charge. (I should be able to get the charge from the SMILES without too much difficulty, but having it listed directly would avoid dependencies on e.g. OpenEye or other chemistry-parsing code and help ensure everyone starts from the same exact state.)
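To make the use case concrete, here is a minimal sketch of the kind of table-parsing function described above. The table layout, column names, and values are hypothetical placeholders, not the repo's actual tables.

```python
def parse_markdown_table(text):
    """Parse a simple pipe-delimited Markdown table into a list of dicts.

    Assumes the first row is the header and the second row is the
    ``|---|---|`` separator; column names here are illustrative only.
    """
    lines = [ln.strip().strip("|") for ln in text.strip().splitlines()]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for line in lines[2:]:  # skip the separator row
        cells = [cell.strip() for cell in line.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows

# Hypothetical fragment of a host-guest table.
table = """
| Host | Guest | dG (kcal/mol) |
|------|-------|---------------|
| CB7  | G1    | -14.1         |
| CB7  | G2    | -9.9          |
"""
records = parse_markdown_table(table)
```

Each record then carries host, guest, and affinity together, which is enough to drive an automated setup loop over the whole set.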
I also realize I could submit a PR myself -- and it's on my to-do list -- but by listing it here, someone might take a stab at it before it surfaces to the top for me.
The CB7 and GDCC benchmarks do not have a README providing data or references to other computational works. We should provide these since (a) it's the right thing to do, and (b) we want to be consistent with the other data sets.
(This was mentioned in a comment on #47, but I'm creating an issue so we don't forget.)
I think we should probably move towards a model where all ligands (or guests) in each benchmark set have an appropriate, unique, paper-specific numerical compound ID, rather than the current model where this is dependent on what set we're looking at. For example:
@GHeinzelmann @nhenriksen - thoughts? My preference I think is to make sure every set has a unique numerical compound ID in the tables and that this is used for all of the relevant files.
@nhenriksen - we just got Germano's bromodomain stuff merged in, so would now be a good time for you to proceed towards getting your cyclodextrin stuff merged in as well? It seems like it will likely become important to do so fairly soon since it's potentially being used for the Yank paper (cc #41 ). Since we're done working on the bromodomain stuff now it should be possible to proceed towards merging without having to resolve conflicts multiple times.
Or, are there other things you need to do first?
Since the community is welcome to edit/propose changes to the paper, I perhaps should split out the major sections into separate TeX files to make it easier to deal with multiple changes at once without editing clashes.
On the other hand, maybe this will make it harder since people will have to figure out which file they need to edit.
Consider using Zenodo to make releases of this get permanent DOIs, as per https://guides.github.com/activities/citable-code/
The reference for the binding free energy of catechol to T4 L99A/M102Q in Table VIII is indicated to be [224]. I took a look at the original paper to figure out buffer conditions, but I couldn't find a reference to ITC measurements with catechol (see Table 3 in [224]). Also, in the benchmark sets paper, catechol is the only compound coming from [224] with an error associated with the experimental binding free energy.
Is it possible that this is a typo? It could be that the value comes from [19] instead.
From Nascimento on Biorxiv:
It seems that there is a mismatch in one of the lysozyme T4 (M102Q) complexes cited in Table VI. The crystal structure 2RBO does not contain n-phenylglycinonitrile as a ligand. Instead, 2-nitrothiophene is the binder there. So, the PDB code should be 2RBN to correctly point to the complex between T4 lysozyme M102Q and n-phenylglycinonitrile. Just my 2c for this very interesting paper!
Gilson got BindingDB updated to include the HG data in this set, so we should update with the link: http://bindingdb.org/bind/HostGuest.jsp
What formats?
Currently it is in LaTeX tables, but we should probably also provide .csv and .json.
Basically, everything we provide that anyone might want ought to be easily available in convenient electronic formats.
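As a sketch of how cheap the conversion is once the data is in csv, the snippet below turns a csv table into the equivalent .json with the standard library. The column names and values are illustrative, not the paper's actual headers.

```python
import csv
import io
import json

# Hypothetical csv export of one benchmark table.
csv_text = """host,guest,dG_kcal_mol
bCD,cyclopentanol,-2.5
bCD,1-butanol,-1.6
"""

# Each csv row becomes one JSON object keyed by the header row.
rows = list(csv.DictReader(io.StringIO(csv_text)))
json_text = json.dumps(rows, indent=2)
```

Keeping csv as the source of truth and generating json from it would avoid the two formats drifting apart.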
This is the computational analog of #2 -- one would like to make it easy to do new studies on existing benchmark systems for a variety of possible tests as detailed in the paper, including things like:
etc.
We need to plan how to facilitate this. We'll need to sort out how to make available computational materials - structures, input files, etc. Ultimately, we will likely even want a way to specify specific order parameters to analyze for convergence, etc. (e.g., something machine-handleable which can tell automated analysis to be sure to check sampling of Val103 in lysozyme L99A).
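As one possible shape for such a machine-handleable record, here is a sketch of a JSON "sampling checks" entry that an automated analysis tool could consume. All field names here are invented for illustration; no schema has been agreed on.

```python
import json

# Hypothetical machine-readable sampling check for a benchmark system.
check = {
    "system": "lysozyme-L99A",
    "order_parameters": [
        {
            "name": "Val103_sidechain",  # degree of freedom to monitor
            "type": "dihedral",
            "residue": "VAL103",
            "expected_states": 2,  # analysis should verify both are visited
        }
    ],
}
serialized = json.dumps(check, indent=2)
```

A small library of such records, one per benchmark system, would let automated convergence analysis know what to check without a human reading the paper.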
@nhenriksen @GHeinzelmann - I notice both of you reference (in your markdown files, which are great!) specific table numbers in the paper. I wonder if we should be referencing tables by title rather than by number. Otherwise, if we change things in the paper such that tables auto-renumber, the table references will all be wrong and someone will have to fix them. If we just refer to them by title, we won't have to remember to change them.
Thoughts?
An interesting question is whether it is possible to facilitate new experiments on existing benchmark systems. Specifically, could we make it easy to access the necessary materials for new experiments? For example, one could imagine being able to lay out for host-guest systems that one should buy this host and this guest from this supplier.
Perhaps there are vendors who would participate in this, or perhaps even NIST could provide standard reference materials?
I'm dropping a note here to mention that reading in cyclic molecules with non-unique atom names can cause an error with ff.createSystem, as detailed here. I'll update this issue once we decide on a robust solution.
To move this in the direction of helping people benchmark, we should provide calculated values from gold standard calculations with the provided files, when available. These should be in a markdown file in the relevant directory, I think.
@nhenriksen - is this something you're able to add? I think you have values for all of the files you've deposited?
@GHeinzelmann - I think you may not?
@Janeyin600 - do you?
At some point we'll actually need to repeat the lysozyme calculations (or another group will) and get input files for those, and calculated values, in here as well.
We will probably want to provide some additional data to accompany the existing benchmarks already noted in the paper in order to facilitate new science beyond the benchmarks proposed:
The CB7 and GDCC guest input files do not have coordinates which correspond to a bound state in the host.
Per Niel:
They are close [to bound], but clearly not a plausible bound state. Jane made these files, and I don't see a way to fix this without manually setting them up or extracting conformations from the equilibrated prmtop/rst7 files.
I now have a Jupyter notebook I've prepared for SAMPL6 which can dock guests to hosts, so we should be able to re-generate these files from compound isomeric SMILES strings. It'll just take me a bit of time to get to that.
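For reference, generating an initial 3D conformer from an isomeric SMILES string can be sketched as below using RDKit (the notebook mentioned above may well use different tooling, and the SMILES and random seed here are just illustrative). Note that this only yields a gas-phase conformer; actually placing the guest in a bound pose within the host still requires the docking step.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "C1CCCC1O"  # cyclopentanol, as an example guest
mol = Chem.MolFromSmiles(smiles)
mol = Chem.AddHs(mol)                      # explicit hydrogens before embedding
AllChem.EmbedMolecule(mol, randomSeed=42)  # generate a 3D conformer
AllChem.MMFFOptimizeMolecule(mol)          # quick force-field cleanup
mol_block = Chem.MolToMolBlock(mol)        # coordinates, ready to write out
```

Starting from SMILES also guarantees everyone regenerates the same chemical identity (stereochemistry, protonation as encoded), even if the coordinates differ.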
I need to adjust the paper text to mention the availability of the host-guest data sets added in #22 .
Dear Mobley's team,
This is not really an issue, more of a request. I am using the host-guest systems for my research and I need experimental Delta H results. I realized that Delta H results are missing for gdcc-set2, CB set1, and CB set2, and I was wondering if you would be able to provide them. I appreciate your support on this.
Best regards,
Sahar
In a recent round of edits on bromodomain ligands, Mike Gilson suggested:
Would it make sense to add some info to the table providing the references for the computational papers for each ligand to date? On the other hand, this would deviate from what we are doing in the other benchmark data tables...
@GHeinzelmann - what do you think of this? Not for BRD4(1) in specific, but should we be perhaps compiling supplemental data (perhaps in markdown files that other people can easily edit) for each benchmark set that lists all of the studies of each ligand, perhaps by DOI, maybe also with a spot where people can remark on key insights from each study?
This would provide a way for the community to effectively add notes to this repo on what they think has been shown in the literature; then we could link to it from the paper but it wouldn't be part of the paper itself.
@nhenriksen - we were trying to work with the cyclopentanol (guest 4) example from CD set 1, and noticed that the AMBER prmtop file has mixed 1-4 scaling factors (SCEE). Was this intended and, if so, can you explain why?
This makes conversion into some code bases (GROMACS for example) impossible, and I've also never seen it before, so I am very curious where this came from/why it's done here.
Thanks.
cc @elkhoury
Hi all, we'd like to use the CD input files to run YANK calculations. In particular, we'd like to start from the .mol2 files currently in the nieldev branch to prepare our solvation boxes in TIP4P-EW waters. A couple of questions (tagging @nhenriksen who is working on the branch):
- Do the mol2 files already have the same protonation state/charges that were used in the reference calculations?
- Could the coordinates in the mol2 be made the same as those in the final rst7 file, so that the guest will be in the binding pocket? I can work on this myself in case you don't have time but you are still interested. I'll have to do it anyway in the next couple of days to set up my simulations.

Ultimately, we want to have computational data available for benchmark systems to make it easy for new researchers to reproduce and then learn from (by building on or deviating from) the work of previous researchers. To facilitate this, we need to sort out more guidance on how such computational data should be made available. Ideally, it would be made available in a way such that if you wanted to begin studies on my system, you could do it automatically from my archived data files, without even needing a human being to read a set of README files.
To make this possible, we need to decide what data we would provide and how.
At one point I started a Google Doc for discussion of how we could make this happen, and I need to resurrect that and get discussion going again here and elsewhere.
#22 added an extensive set of host-guest input files for the host-guest sets from the paper, courtesy of Jian Yin from the Gilson lab. I need to adjust the README she kindly provided into a README.md, and add a manifest of what files were added and how they were organized.
As suggested by @slochower, I should label the residues in the T4 binding site figure for the discussion on p. 11, right column.
Include a set of bullet points describing what should be documented about a benchmark system and how to document it
Originally, I'd planned to write material on what new benchmark systems are most needed (i.e. what attributes they should have or problems they should exemplify -- water sampling problems, for example?) but I ran out of space before getting to this.
Perhaps this should still be done; input will be helpful.
The BRD4-*.pdb files in this directory are missing the required CONECT records for the small molecule ligands. As a result, we need to add CONECT records for the nonstandard residues.

I just want to make sure you've noticed this issue: ParmEd/ParmEd#898. Briefly, manipulating the cyclodextrin mol2 files with parmed results in a ring breaking. A work-around would be assigning a single residue number to all cyclodextrin atoms (currently 7 for beta-CD and 6 for alpha-CD).
@davidlmobley, if somebody in your group has run cyclodextrin calculations with YANK using non-OpenEye charges, this bug surely affected the setup.
For example, butylammonium: https://github.com/MobleyLab/benchmarksets/blob/master/input_files/cd-set1/sdf/guest-1.sdf
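The suggested work-around (one residue number for all cyclodextrin atoms) can be sketched as a plain rewrite of the substructure columns in the mol2 ATOM records, following the TRIPOS mol2 layout (atom_id, name, x, y, z, type, subst_id, subst_name, charge). The function name and the merged residue name "MGO" are hypothetical placeholders.

```python
def merge_mol2_residues(text, subst_id="1", subst_name="MGO"):
    """Assign a single substructure (residue) ID/name to every atom
    in a mol2 file, leaving all other sections untouched."""
    out, in_atoms = [], False
    for line in text.splitlines():
        if line.startswith("@<TRIPOS>"):
            in_atoms = line.strip() == "@<TRIPOS>ATOM"
        elif in_atoms and line.strip():
            cols = line.split()
            cols[6], cols[7] = subst_id, subst_name  # subst_id, subst_name
            line = " ".join(cols)
        out.append(line)
    return "\n".join(out)

# Toy two-atom example with two residues, merged into one.
mol2_text = """@<TRIPOS>MOLECULE
bCD
2 1 7
@<TRIPOS>ATOM
1 C1 0.0 0.0 0.0 C.3 1 RES1 0.10
2 C2 1.0 0.0 0.0 C.3 2 RES2 -0.10
@<TRIPOS>BOND
1 1 2 1
"""
merged = merge_mol2_residues(mol2_text)
```

Editing the text directly, rather than round-tripping through a parser, sidesteps the ring-breaking bug entirely.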
It would be good to develop a set of criteria a benchmark system should typically meet in terms of data quality, structure availability, etc.
Originally, I thought that we would be able to develop a universal set of such criteria (i.e. high quality structures of such-and-such a resolution, ITC or SPR binding affinities, etc.), but then as the paper developed we realized that different types of data are needed depending on the purpose of a test, as in Section II.A ("hard" and "soft" benchmarks). So, it may not be that we can provide a universal set of criteria -- but it would be good to discuss criteria that might apply in the different categories.