I opened this issue to discuss if and how to split the new Lomap code developed so far

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Can you elaborate, <a class="user-mention notranslate" data-hovercard-type="user" data

Code splitting,about mobleylab/lomap

Comments (23)

halx commented on July 23, 2024

You may not want to hear this but I suggest a complete redesign. The old version had a few weaknesses but overall it seemed to have a useful design.

What you need is a simple container for the individual molecules, the molecules can be held together in a simple list. The central data type should be holding the pair-wise data like MCS (this may need some extra thoughts regarding parallelisation). This could be done with a numpy array holding references to a simple struct like container.

The central piece of the code is the score computations for the similarity matrix. The scores are determined by a set of rules. While a rules engine can be (and was) helpful here it appears that the rules needed can basically be applied in sequence. A pattern that can be use here is the chain of responsibility (see e.g the simple Python version at https://en.wikibooks.org/wiki/Computer_Science_Design_Patterns/Chain_of_responsibility). The crucial point here is that the sequence of rules have to be set explicitly by the programmer (rather than the hard-coded setup we have now) and it is immediately obvious what the rules are. If necessary the rules could also be set by the user. New rules can be added with ease. Knock-out rules like the charge rule can also easily short-cut the sequence.

Does this make sense?

from lomap.

davidlmobley commented on July 23, 2024

@halx - we discussed this and it makes some sense, especially given that we'll likely want to customize the rules, etc., in the future. However, Gaetano DID just finish rewriting it from scratch from the previous version which was much more cumbersome and a lot longer. So he's not super enthusiastic about sitting down and completely rewriting it again, even though it makes sense to do so. :)

So, it seems a key question now is how much involvement you want to have. If we're able to lay out what needs to be done overall to rewrite/modularize/etc and you want to take a reasonable chunk of the work, then Gaetano can probably be talked into doing another reasonable chunk of it with more enthusiasm than if we try to get him to do the entire thing.

Thoughts?

Thanks.

from lomap.

halx commented on July 23, 2024

I will meet Julien tomorrow but our thinking so far was that Lomap is very important to us. So I expect to have time to work on this and be able to take a large part on me. The one thing that you should make sure is that the logic of the rules is sane. No matter what the precise new implementation or a future one will be, the rules code will basically stay the same and only be wrapped into something new.

from lomap.

davidlmobley commented on July 23, 2024

@halx

So I expect to have time to work on this and be able to take a large part on me.

Sounds great! I agree that it's very important not just for you guys but for the community, and we should be thinking about a framework that will easily be extensible to more complicated scenarios in the future.

The one thing that you should make sure is that the logic of the rules is sane.

Did you have concerns about any of it? We've gone over these rules rather a lot, in that this also formed the basis of what Schrodinger did internally in what became FEPMapper, so in some sense the rules are extremely well validated, though that of course doens't guarantee anything.

from lomap.

halx commented on July 23, 2024

I have now sketched out a prototype to see if my initial ideas are workable, The code is still very rough and has a few ugly spots but I am getting there.

There was only one concern about the rules but that has been sorted out in #5. I just still have to think how to best approach the strict vs loose rule. What is your current thinking about that? In the ring breaking paper you seem to say that rings should be broken, at least under the circumstances you described there?

from lomap.

davidlmobley commented on July 23, 2024

@halx - do you want to submit a PR with a title like "[WIP] Prototype of splitting code up as in #3" so we can have a look at what you're doing? (The "WIP" means "work in progress", i.e. you're not done yet.)

In terms of strict vs loose - we do NOT currently want to break rings at all, as doing so yields relative free energies which are no longer rigorously correct in the limit of adequate sampling, which is BAD. However, our work suggests that errors introduced by breaking rings are not that large except in the case of bridged ring systems. So this means that yes, we always prefer planning via the "strict" algorithm, which does not allow ring breaking, but when absolutely necessary in order to have a connected graph, we can fall back to the loose scheme. So it's not that rings SHOULD be broken, but that we CAN break them when we have to.

We're experimenting with some ideas for how to do ring breaking in a rigorous way (i.e. using morse potentials) but we do not have these working yet.

from lomap.

halx commented on July 23, 2024

The new code is in the refactor branch. I think a PR makes only sense when we have decided that the new approach is really of advantage. I do not have a problem throwing the code away and start over ;-)

Have you ever thought about AMBER's approach to this? AMBER doesn't need explicit dummies and so there's "nothing" that can distort geometries.

from lomap.

davidlmobley commented on July 23, 2024

Can you elaborate, @halx ? I have a lot of issues with what AMBER does in general for free energy calculations but I'm probably missing some virtues.

The way we usually use a [WIP] pull request is to trigger discussion of changes which are being made somewhere which might or might not actually get incorporated - i.e. it's just a way to link discussion to proposed/possible code changes permanently, a bit of a GitHub trick if you will. Right now if we want to comment on something in your branch there is no issue or pull request which links our commentary to your code changes...

from lomap.

halx commented on July 23, 2024

The AMBER implementation is principally a dual topology one which means that all atoms that need to be transformed need to be present in duplicate and tagged as "TI region" in the input file. Softcore atoms must be tagged too. The difference of those two sets are the atoms that should be directly and linearly transformed (must be in the same order and have same coordinates). This happens by mixing the energy as (1-lambda).V0 + lambda.V1 (as expected) which results in a single force/atoms so basically making that single topology. Bonded terms involving at least one softcore atoms are effectively discarded (well, you get some energies computing from those but they are not used for anything beyond). Otherwise, I guess, softcore potentials/gradients are computed as expected.

from lomap.

davidlmobley commented on July 23, 2024

I'm not sure I totally understand that. I've always found the AMBER way of doing these things very confusing. Do you know if this is described in detail, with cartoons, etc. anywhere?

from lomap.

halx commented on July 23, 2024

I do not know of any documentation other than the AMBER manual. In AMBER speak there are two approaches: "dummy" and "softccore".

The "dummy" approach is basically the "classical" single-topolgy one as introduced into AMBER by DAP 25+ year ago (but implementation is dual-topology with energy-mixing). Dummy atoms are hardcore atoms with all the problems to be expected at the end-state. There is some non-linear scaling possible though, basically by raising lambda to a power k.

The "softcore" approach means that a softcore potential is used for "appearing"/"disappearing" atoms but with the "twist" that there is no counter-part in the other state (they call those "unique atoms"). Thus contributions from bonds, angles, dihedrals and 1-4 terms involving at least one softcore atom are simply ignored. I guess, this is like saying that if you did it with explicit dummy atoms, any dummy contributions would perfectly vanish in the thermodynamic cycle (that's why I am wondering what this would mean for ring-breaking).

The "dummy" and the "softcore" approach can be combined which then means running the simulation like Gromacs does. In fact, I have tested this and do see differences of up to 0.3 kcal/mol in the reproducibility study but I have to review the results to be sure there are not other causes for this.

In any case, it would be very interesting to see what happens with ring-breaking in the AMBER approach. But I guess we are going a bit off-topic now...

from lomap.

nividic commented on July 23, 2024

I was checking the code spitting and I have some comments.

(1) The starting implementation was defining a class container (DBMolecules) which was holding all the molecules. On this container was possible to send messages like: compute the similarity matrices, build the graph etc. Although this class was a little bit overloaded of work, I liked more that design than the new one where there is just a normal python list as container to hold the molecules. Operations on the list are then performed by using other classes e.g. the similarity score is calculated by using the class Rule and other operations will be presumably performed by other classes. In my opinion, it was semantically more correct and easier the original code. The class Rules and the other classes that will need to manage the database look to me be unlinked from the database apart from passing it as function argument. In my opinion if we carry on in this way the code will be too fragmented again, but may be I’m wrong and I’m still biased from my original design.

(2) A possible way to clean up the original code (DBMolecules) could be for example for the rule part to define a method where you add rule objects to the container i.e. DBMolecules.addRule(rule name) and define all the rules in a separate file like the current rules.py. We could define a Rule virtual superclass and then define sub classes for the specialised rules.

(3) We should also add a mechanism, which stops the rule evaluations if for one of them the rule score is zero. This makes a lot of difference in terms of computational time. In my code a put out the charge rule for this reason and there was a lot of difference in terms of computational time processing large molecule data set containing molecules with different charges.

(4) To hold the similarity score I defined a symmetric matrix container derived from numpy. This was saving space and to the user was possible to use it like a normal matrix. Effectively I don’t know if we should keep it, usually we do not process a lot of compounds and to maintain the class is a little bit mathematically tricky to understand, so if you want to remove it I’m ok.

from lomap.

halx commented on July 23, 2024

I think you describe the problem with your design quite well. Your thinking appears to be that the problem at hand is to keep a set of molecules in a database and as such attempt to structure all code around it. The most obvious problem for me as a programmer is that I have to access almost all important functionality through a single class. How am I going to extend this? Where is the flexibility in this design e.g. if I want to exchange a certain key part of the code?

My approach is to dissect this into individual components which can then be conveniently be replaced where needed (separation of concerns, single responsibility, encapsulation, et.c) without affecting other parts of the code. The basic tasks are reading molecules, build a similarity matrix through a set of rules (the central part of the code but should be separate from the matrix) and compute the MST/SPT. The molecule reader is an example, the rules engine still needs to get there. The choice of simple data types like lists is because that is understandably to even a beginner. Other bits of my code need some rethinking esp. decoupling, of course. But what I have sketched out so far is to show the principle.

This all is not "fragmenting" the code but rather basic design principles. The old code followed those principles as well except for a few flaws and oddities.

from lomap.

davidlmobley commented on July 23, 2024

One thought we had originally when working on the first version of this is that certain aspects should be kept separate from dealing with molecules - i.e., a computer scientist who wants to get involved with the graph theory aspects relating to planning which nodes should be connected should not have to know anything about molecules other than that we have some measure of molecular similarity.

@halx - are you saying there are other issues that can cleanly be separated from knowing about the molecules?

And @nividic - do we still have that separation of the graph planning part into something which can operate independently of knowing about molecules?

from lomap.

halx commented on July 23, 2024

What I am trying to say is that no code should depend on a database. The rules engine should only get two molecules (which currently are RDKit mols) from "somewhere" and report a score and also the MCS because we need it later. Molecules will need a form of ID for labelling in the planning graph (see discussion in #7) and possibly other uses downstreams.

from lomap.

nividic commented on July 23, 2024

@halax

I would like to have a central container of molecules that "coordinates" the different things to do e.g. create matrix, build graph etc. This does not mean that we cannot "split" this operations by using other classes. How the user should use the new APIs? Reading the molecules using one class, then use another class for the rules, then another class for the graph etc passing the molecule list as argument. Having one container allow the user to send messages just in one place.

@davidlmobley

The graph part is totally separated but it needs to know about the molecule database because it points to the strict and loose matrices

from lomap.

halx commented on July 23, 2024

@nividic : I disagree. The API user will want to decide how to coordinate individual components. Your design still suffers from tight coupling and all the other things I have mentioned. The graph code is obviously not "totally separated" because it needs to have intimate knowleded of how the database class works.

Do you actually know what all these terms I have used so far mean and why they matter? If you want to argue design we will have to do it with that language.

from lomap.

davidlmobley commented on July 23, 2024

@halx :

What I am trying to say is that no code should depend on a database. The rules engine should only get two molecules (which currently are RDKit mols) from "somewhere" and report a score and also the MCS because we need it later. Molecules will need a form of ID for labelling in the planning graph (see discussion in #7) and possibly other uses downstreams.

I think I am OK with this, @halx , though the other way that makes sense to me is for the rules engine to be something which makes updates to a score matrix based on interacting with the pairs of molecules corresponding to the entries in the matrix. Is there a strong reason to prefer the former?

I do agree though that it probably doesn't make sense for the rules engine to be interacting with all of the molecules at once, though, since each operation only involves a pair of molecules. But again remember I'm saying this without knowing anything about the details of the current code.

@nividic :

The graph part is totally separated but it needs to know about the molecule database because it points to the strict and loose matrices

If it needs to know about the molecule database, then it's not totally separated. The graph part should ONLY need to know about the score matrices which should be entirely separate from the database of molecules. Again, imagine I was working with someone in CS who thought they had a better way of dealing with the graph theory here. The only things I should need to give them are:

Final similarity score matrix for loose case
Final similarity score matrix for strict case
possibly 3) Labels corresponding to each entry in the matrix (for convenience)

They shouldn't need to even know these things correspond to molecules, what a molecule IS, or understand anything about chemistry. It's a pure graph problem. If they have to interact with a molecule database to retrieve anything then it's not totally separated, and no longer a pure graph problem.

I don't know that much about software design, but I do know that we want logical units to not get coupled together in ways that require info they don't really need.

from lomap.

nividic commented on July 23, 2024

@davidlmobley

Indeed the graph part needs just (1), (2) and (3) info (you also need cutoff etc.) but in the implementation I'm extracting them from the database. No info about rdkit molecules are required. I did not modified more or less anything from the original code here...just adapted

from lomap.

nividic commented on July 23, 2024

In this case the main container needs to do three things: read, calculate matrices and calculate the graph. From a user point of view in my opinion it is more easy to create one objects sending it messages than more objects that need to cooperate together. It could be not immediate to understand the role of the different objects...may be I'm wrong but for a so small problem I would prefer this approach.

from lomap.

davidlmobley commented on July 23, 2024

Indeed the graph part needs just (1), (2) and (3) info (you also need cutoff etc.) but in the implementation I'm extracting them from the database. No info about rdkit molecules are required. I did not modified more or less anything from the original code here...just adapted

OK, so I agree that these should not be extracted from a molecular database, just passed in via some simple structure (arrays and a list, most likely). Otherwise the internals of the graph part need to know about the database class structure, which is bad.

In this case the main container needs to do three things: read, calculate matrices and calculate the graph. From a user point of view in my opinion it is more easy to create one objects sending it messages than more objects that need to cooperate together. It could be not immediate to understand the role of the different objects... may be I'm wrong but for a so small problem I would prefer this approach.

I think the key point is in bold above. Right now this is a very small problem, but it's going to become a whole research area, so we need to modularize things enough that they work independently and can be tested independently.

So I think I'm with Hannes here.

from lomap.

halx commented on July 23, 2024

@halx :

What I am trying to say is that no code should depend on a database. The rules engine should only get two molecules (which currently are RDKit mols) from "somewhere" and report a score and also the MCS because we need it later. Molecules will need a form of ID for labelling in the planning graph (see discussion in #7) and possibly other uses downstreams.

I think I am OK with this, @halx , though the other way that makes sense to me is for the rules engine to be something which makes updates to a score matrix based on interacting with the pairs of molecules corresponding to the entries in the matrix. Is there a strong reason to prefer the former?

Not sure, I understand the question but can we discuss this in #8 ?

from lomap.

ppxasjsm commented on July 23, 2024

Can this issue be closed? @nividic @davidlmobley

from lomap.

Code splitting about lomap HOT 23 CLOSED

Comments (23)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent