loryruta / molgena

An attempt at molecule generation

Home Page: http://loryruta.dev/molgena/

License: GNU General Public License v3.0

Python 99.36% Shell 0.64%
chemistry deep-learning generative-model graph-neural-networks molecule-generation

molgena's Introduction

MOLGENA ([Mole]cular [gena]rator) 🚧

Check the technical report (dated 19 Jun 2024):

https://loryruta.dev/molgena/

Chemical properties to optimize:

  • logP (Octanol-water Partition Coefficient): measures the lipophilicity of a compound, i.e. how it partitions between octanol and water
  • QED (Quantitative Estimate of Drug-likeness)
  • SA (Synthetic Accessibility): how hard or easy it is to synthesize the molecule
  • MW (Molecular Weight)

Metrics

  • Fréchet ChemNet Distance (FCD, used in MoLeR): measures how realistic the generated molecules are
  • Tanimoto similarity coefficient (Jaccard index): used to measure similarity between two molecules

Concepts

Molecular fingerprint

A molecular fingerprint is a bit-vector that encodes the structural features of a molecule. It is used, for example, to compare molecules for similarity. Many fingerprint encoding algorithms are available; the "best one" strongly depends on the dataset and on the task.

Most encoding algorithms extract features from the molecule, hash them, and use the hash to compute the bit-vector position to set.
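The extract, hash, and set-bit scheme described above can be sketched in a few lines. This is only an illustration: `toy_fingerprint` is a hypothetical name, the feature strings are made up, and real algorithms (such as Morgan fingerprints) derive their features from the molecular graph rather than from hand-written strings.

```python
import hashlib


def toy_fingerprint(features, n_bits=64):
    """Fold a collection of feature strings into a fixed-length bit-vector."""
    bits = [0] * n_bits
    for feature in features:
        # Use hashlib for a stable hash: Python's built-in hash() of a str
        # is salted per process, so it would not be reproducible.
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        # The hash value, reduced modulo the vector length, picks the bit to set.
        position = int.from_bytes(digest[:8], "big") % n_bits
        bits[position] = 1
    return bits


# Hypothetical features; different features may collide on the same bit.
fp = toy_fingerprint(["C-C", "C-O", "C=O"])
print(sum(fp), "bits set out of", len(fp))
```

Because distinct features can hash to the same position, a fingerprint is a lossy encoding: a longer bit-vector reduces collisions at the cost of memory.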


Tanimoto similarity

Tanimoto similarity, also known as Jaccard index, computes the similarity between two molecules.

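Given the Morgan fingerprints of the two molecules, it is evaluated as the number of bits set in both fingerprints divided by the number of bits set in at least one of them:

sim(fp1, fp2) = |intersect(fp1, fp2)| / |union(fp1, fp2)|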
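As a minimal sketch of the formula, assuming fingerprints are given as plain lists of 0/1 values (the function name `tanimoto` and the example vectors are illustrative, and returning 0.0 for two empty fingerprints is a convention chosen here):

```python
def tanimoto(fp1, fp2):
    """Tanimoto (Jaccard) similarity between two equal-length bit-vectors."""
    on1 = {i for i, bit in enumerate(fp1) if bit}
    on2 = {i for i, bit in enumerate(fp2) if bit}
    union = on1 | on2
    if not union:
        # Assumption: two fingerprints with no bits set are given similarity 0.
        return 0.0
    return len(on1 & on2) / len(union)


a = [1, 1, 0, 1, 0, 0, 0, 0]
b = [1, 0, 0, 1, 1, 0, 0, 0]
print(tanimoto(a, b))  # 2 shared bits / 4 bits set in either = 0.5
```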

Kekule structure

A Kekulé structure represents a molecule's structure like a Lewis structure, but without lone pairs and formal charges (electrons hadn't been discovered yet!).

GNN


Reinforcement learning:

RDKit

Task we're interested in:

  • Constrained molecule optimization: very useful in drug discovery, where the generation of new drugs usually starts from known molecules (such as existing drugs). The objective of this task is to generate, starting from an initial molecule, a novel molecule with improved chemical properties.

Benchmarks

GuacaMol is an evaluation framework based on a suite of standardised benchmarks for de novo molecular design. It is designed to assess both classical methods and neural-model-based methods.

Two benchmarks:

  • assess_distribution_learning: ability to generate molecules similar to those in a training set
@abstractmethod
def generate(self, number_samples: int) -> List[str]:
    """
    Samples SMILES strings from a molecule generator.

    Args:
        number_samples: number of molecules to generate

    Returns:
        A list of SMILES strings.
    """
    pass
  • assess_goal_directed_generation: ability to generate molecules that achieve a high score for a given scoring function
@abstractmethod
def generate_optimized_molecules(self, scoring_function: ScoringFunction, number_molecules: int,
                                 starting_population: Optional[List[str]] = None) -> List[str]:
    """
    Given an objective function, generate molecules that score as high as possible.

    Args:
        scoring_function: scoring function
        number_molecules: number of molecules to generate
        starting_population: molecules to start the optimization from (optional)

    Returns:
        A list of SMILES strings for the generated molecules.
    """
    pass

Implementation examples can be found at https://github.com/BenevolentAI/guacamol_baselines.
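To make the two interfaces concrete, here is a self-contained sketch. The two abstract classes are local stand-ins that mirror the method signatures quoted above; they are not imported from the guacamol package. `RandomSampler` and `BestOfPopulation` are hypothetical toy generators, and the scoring function is simplified to a plain callable on a SMILES string.

```python
import random
from abc import ABC, abstractmethod
from typing import Callable, List, Optional


class DistributionMatchingGenerator(ABC):
    """Local stand-in for the distribution-learning interface."""

    @abstractmethod
    def generate(self, number_samples: int) -> List[str]:
        ...


class GoalDirectedGenerator(ABC):
    """Local stand-in for the goal-directed interface."""

    @abstractmethod
    def generate_optimized_molecules(
        self,
        scoring_function: Callable[[str], float],  # simplification: a plain callable
        number_molecules: int,
        starting_population: Optional[List[str]] = None,
    ) -> List[str]:
        ...


class RandomSampler(DistributionMatchingGenerator):
    """Toy baseline: resamples SMILES from a fixed set (no learning involved)."""

    def __init__(self, smiles: List[str], seed: int = 0):
        self.smiles = list(smiles)
        self.rng = random.Random(seed)

    def generate(self, number_samples: int) -> List[str]:
        return [self.rng.choice(self.smiles) for _ in range(number_samples)]


class BestOfPopulation(GoalDirectedGenerator):
    """Toy baseline: returns the top-scoring molecules of the starting population."""

    def generate_optimized_molecules(self, scoring_function, number_molecules,
                                     starting_population=None):
        population = starting_population or []
        ranked = sorted(population, key=scoring_function, reverse=True)
        return ranked[:number_molecules]


training = ["C", "CC", "CCO", "c1ccccc1"]
samples = RandomSampler(training).generate(5)
# Hypothetical score: the length of the SMILES string, for illustration only.
best = BestOfPopulation().generate_optimized_molecules(len, 2, training)
print(samples, best)
```

A real generator would replace the body of `generate` or `generate_optimized_molecules` with model sampling or an optimization loop while keeping the same signatures.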


MOSES: a dataset obtained by filtering ZINC, used for benchmarking. It can be used to assess the overall quality of generated molecules.

Measured metrics:

  • Uniqueness (↑)
  • Validity (↑)
  • Fragment similarity (Frag) (↑): cosine similarity between the vectors of fragment frequencies of the generated and test sets
  • Scaffold similarity (Scaff) (↑): cosine similarity between the vectors of scaffold frequencies of the generated and test sets
  • Nearest neighbor similarity (SNN) (↑): average similarity of each generated molecule to its nearest molecule in the test set
  • Internal diversity (IntDiv) (↑): one minus the average pairwise similarity of the generated molecules
  • Fréchet ChemNet Distance (FCD) (↓): difference in distributions of last layer activations of ChemNet
  • Novelty (↑): fraction of unique valid generated molecules not present in the training set
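Some of these metrics reduce to simple set and vector arithmetic. The sketch below covers uniqueness, novelty, and the cosine similarity used by Frag and Scaff. It assumes the SMILES strings are already valid and canonicalized, so that string equality means molecule equality, and that fragment or scaffold counts are supplied as dictionaries. The function names and the fragment labels in the example are illustrative.

```python
from collections import Counter
from math import sqrt


def uniqueness(valid_smiles):
    """Fraction of valid generated molecules that are distinct."""
    if not valid_smiles:
        return 0.0
    return len(set(valid_smiles)) / len(valid_smiles)


def novelty(valid_smiles, training_smiles):
    """Fraction of unique valid generated molecules absent from the training set."""
    unique = set(valid_smiles)
    if not unique:
        return 0.0
    return len(unique - set(training_smiles)) / len(unique)


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two frequency vectors stored as dicts."""
    keys = set(counts_a) | set(counts_b)
    dot = sum(counts_a.get(k, 0) * counts_b.get(k, 0) for k in keys)
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


generated = ["C", "C", "CC", "CCO"]
print(uniqueness(generated))      # 3 distinct out of 4
print(novelty(generated, ["C"]))  # 2 of the 3 distinct molecules are new

# Hypothetical fragment counts for a generated set and a test set.
gen_frags = Counter({"fragA": 3, "fragB": 1})
test_frags = Counter({"fragA": 2, "fragB": 2})
print(cosine_similarity(gen_frags, test_frags))
```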

TODO (UNDERSTAND): To compare molecular properties: Wasserstein-1 distance between distributions of molecules in the generated and test set
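For one-dimensional properties such as logP or MW, the Wasserstein-1 distance between two empirical distributions equals the area between their cumulative distribution functions. A minimal sketch under that definition, with a hypothetical function name and made-up property values:

```python
from bisect import bisect_right


def wasserstein_1(xs, ys):
    """Wasserstein-1 distance between two 1-D samples (area between their CDFs)."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(xs + ys)
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Empirical CDFs are constant on [left, right), so evaluate them at `left`.
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


# Made-up property values for a generated set and a test set.
generated_values = [0.0, 2.0]
test_values = [1.0, 3.0]
print(wasserstein_1(generated_values, test_values))  # every value is shifted by 1, so 1.0
```

A distance of zero means the two samples have identical empirical distributions for that property; larger values mean the generated property distribution drifts further from the test set.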
