
License: BSD 3-Clause "New" or "Revised" License


Mol2vec

Mol2vec - an unsupervised machine learning approach to learn vector representations of molecular substructures


Requirements

Installation

pip install git+https://github.com/samoturk/mol2vec

Documentation

Read the documentation on Read the Docs.

To build the documentation, install sphinx, numpydoc and sphinx_rtd_theme, then run make html in the docs directory.

Usage

As python module

from mol2vec import features
from mol2vec import helpers

The first line imports functions to generate "sentences" from molecules and to train the model; the second imports functions useful for depictions. Check the examples directory for more details, and the Mol2vec notebooks repository for visualisations made to run easily in Binder.

Command line tool

Mol2vec is an unsupervised machine learning approach to learn vector representations of molecular substructures. The command line application has subcommands to prepare a corpus from molecular data (SDF or SMILES), to train a Mol2vec model, and to featurize new samples.

Subcommand 'corpus'

Generates a corpus to train the Mol2vec model. It generates Morgan identifiers (up to the selected radius), which represent the words (molecules are the sentences). Words are ordered in the sentence according to the atom order in the canonical SMILES (generated during corpus generation), starting at each atom with the identifier at radius 0.
The corpus subcommand can also optionally replace rare identifiers with a selected string (e.g. UNK), which can later be used to represent completely new substructures (i.e. at the featurization step). NOTE: the corpus with replaced uncommon identifiers is saved in a separate file ending in "_{selected string to replace uncommon}". Since this is an unsupervised method, we recommend using as many molecules as possible (e.g. the complete ZINC database).
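The replacement of uncommon identifiers can be illustrated with a small plain-Python sketch (a toy illustration only, with made-up identifier strings; the actual implementation lives in mol2vec.features):

```python
from collections import Counter

# Toy corpus: each "sentence" is a list of Morgan-identifier "words".
corpus = [
    ["2246728737", "864662311", "2246728737"],
    ["864662311", "98513984"],
    ["2246728737", "864662311"],
]

threshold = 1  # identifiers appearing <= threshold times get replaced
counts = Counter(word for sentence in corpus for word in sentence)

replaced = [
    [word if counts[word] > threshold else "UNK" for word in sentence]
    for sentence in corpus
]
# "98513984" occurs only once here, so it is replaced by "UNK".
```

At featurization time, any identifier that never made it into the vocabulary can then be mapped to the UNK vector.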

Performance:

Corpus generation using 20M compounds with replacement of uncommon identifiers takes 6 hours on 4 cores.

Example:

To prepare a corpus using radius 1 and 4 cores, replacing uncommon identifiers that appear <= 3 times with 'UNK', run:

mol2vec corpus -i mols.smi -o mols.cp -r 1 -j 4 --uncommon UNK --threshold 3

Subcommand 'train'

Trains Mol2vec model using previously prepared corpus.

Performance:

Training the model on 20M sentences takes ~2 hours on 4 cores.

Example:

To train a Mol2vec model on the corpus with replaced uncommon identifiers, using skip-gram with window size 10 and generating 300-dimensional vectors on 4 cores, run:

mol2vec train -i mols.cp_UNK -o model.pkl -d 300 -w 10 -m skip-gram --threshold 3 -j 4

Subcommand 'featurize'

Featurizes new samples using a pre-trained Mol2vec model. It saves the result in a CSV file with columns for molecule identifiers, canonical SMILES (generated during featurization) and all potential SD fields from the input SDF file, followed by columns mol2vec-0 to mol2vec-(n-1), where n is the dimensionality of the embeddings in the model.
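The resulting column layout can be sketched with the standard csv module (toy values only; the "ID" and "Smiles" column names and the 4-dimensional vectors are illustrative, and the default pretrained model produces 300 dimensions):

```python
import csv
import io

n_dim = 4  # dimensionality of the toy embeddings (real models often use 300)
rows = [
    ("mol-1", "CCO", [0.1, -0.2, 0.3, 0.0]),
    ("mol-2", "c1ccccc1", [0.5, 0.1, -0.4, 0.2]),
]

# Identifier and SMILES columns first, then mol2vec-0 ... mol2vec-(n-1).
header = ["ID", "Smiles"] + ["mol2vec-%d" % i for i in range(n_dim)]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)
for mol_id, smiles, vec in rows:
    writer.writerow([mol_id, smiles] + vec)

csv_text = buf.getvalue()
```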

Example:

To featurize new samples using pre-trained embeddings, with the vector trained on uncommon samples representing new substructures, run:

mol2vec featurize -i new.smi -o new.csv -m model.pkl -r 1 --uncommon UNK

For more details on an individual subcommand, run: mol2vec $sub-command --help

How to cite?

@article{doi:10.1021/acs.jcim.7b00616,
author = {Jaeger, Sabrina and Fulle, Simone and Turk, Samo},
title = {Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition},
journal = {Journal of Chemical Information and Modeling},
volume = {58},
number = {1},
pages = {27-35},
year = {2018},
doi = {10.1021/acs.jcim.7b00616},
URL = {https://doi.org/10.1021/acs.jcim.7b00616}
}

Sponsor info

Initial development was supported by BioMed X Innovation Center, Heidelberg.

mol2vec's People

Contributors

samoturk


mol2vec's Issues

Feature size

Thanks for providing the implementation. I have a question from running the code: with different input molecules, the mol2alt_sentence function outputs encodings of varying lengths, and the final DfVec thus contains multiple 300-dimensional features. I am wondering how to aggregate them?

I am just using the standard pipeline but with different input molecules:
sentence = mol2alt_sentence(mol, radius=1)
sentence_obj = MolSentence(sentence)
mol_vec = DfVec(sentences2vec(sentence_obj, model, unseen='UNK'))

Thanks!

Feng
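For reference, sentences2vec performs the aggregation itself: it sums the vectors of the words in a sentence, so each molecule always ends up as one fixed-length vector regardless of sentence length. A minimal NumPy sketch of that aggregation (toy 4-dimensional vectors and made-up identifiers):

```python
import numpy as np

# Toy "model": identifier -> 4-dimensional vector (real models use e.g. 300).
vectors = {
    "1016841875": np.array([1.0, 0.0, 0.0, 0.0]),
    "198706261":  np.array([0.0, 1.0, 0.0, 0.0]),
    "UNK":        np.array([0.0, 0.0, 0.0, 1.0]),
}

def sentence_to_vec(sentence, vectors, unseen="UNK"):
    # Sum the word vectors; the sentence length does not matter, the
    # result is always a single vector of the embedding dimensionality.
    return sum(vectors.get(word, vectors[unseen]) for word in sentence)

short = sentence_to_vec(["1016841875", "198706261"], vectors)
longer = sentence_to_vec(["1016841875", "198706261", "198706261", "999"], vectors)
# Both are 4-dimensional; the unseen identifier "999" maps to UNK.
```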

filter criteria

Hi, first, thanks for making this great OSS library, much appreciated.

In the article, it is indicated that only the following elements are allowed to appear in the SMILES of a molecule. Will lowercase (aromatic) letters also be included, such as c, o, h, n?

[screenshot of the allowed-element criteria from the article]

I can't download ZINC15. Could you provide a way to download it?

Convert a Base64-encoded ROMol string back into an rdkit.Chem.rdchem.Mol object

I have a Base64-encoded string of an ROMol which needs to be converted back to its original object form. I am using the Base64-encoded string to write the data to a text file, but after reading the text file I need to convert it back into an object so that I can generate embeddings via the function "mol2alt_sentence".

I am only able to save the Base64-encoded ROMol as a string instead of a serialized object, so the functionality of converting the string back into the object is required. Please suggest any alternative solution for this issue.
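One way to round-trip an RDKit Mol through a Base64 text representation is via Mol.ToBinary() and the Chem.Mol constructor, which accepts the binary pickle (a sketch, assuming RDKit is installed; the CCO example molecule is arbitrary):

```python
import base64

from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")

# Serialize the Mol to its binary pickle, then Base64-encode for text storage.
encoded = base64.b64encode(mol.ToBinary()).decode("ascii")

# Decode the Base64 text and rebuild a Mol object from the raw bytes.
restored = Chem.Mol(base64.b64decode(encoded))
```

The restored object can then be passed to mol2alt_sentence as usual.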

Transfer learning

Hi, first, thanks for making this great OSS library, much appreciated.

I'm interested in taking the pretrained ZINC-based model and training it further with sentences from my own data set. The reason for this approach is that my dataset is small, only a few thousand SMILES, so I'm trying to take a page from the CNN book, where you can take a pretrained image recognition model and further specialize it for your use case.

Is there any way to achieve that within your library?

Error when loading model

Hi, thanks for putting together the notebook to explore mol2vec. However, I get the following error when loading the model using model = KeyedVectors.load('model_300dim.pkl'):

AttributeError: 'Word2Vec' object has no attribute 'vocabulary'

I am using gensim v 3.3.0.

Also, I noticed there are two versions of the 'model_300dim.pkl' file, one around 25 MB and another around 74 MB. Which one should be used? I have tried both versions and see the same error. Thanks for any help!

update sentences2vec function for gensim 4.0

```python
import numpy as np


def sentences2vec(sentences, model, unseen=None):
    """Generate vectors for each sentence (list) in a list of sentences.
    A vector is simply the sum of the vectors of the individual words.

    Parameters
    ----------
    sentences : list, array
        List with sentences
    model : word2vec.Word2Vec
        Gensim word2vec model
    unseen : None, str
        Keyword for unseen words. If None, those words are skipped.
        https://stats.stackexchange.com/questions/163005/how-to-set-the-dictionary-for-text-analysis-using-neural-networks/163032#163032

    Returns
    -------
    np.array
    """
    # gensim 4.x: model.wv.vocab was replaced by model.wv.key_to_index
    keys = set(model.wv.key_to_index)
    vec = []

    if unseen:
        unseen_vec = model.wv.get_vector(unseen)

    for sentence in sentences:
        if unseen:
            vec.append(sum([model.wv.get_vector(y) if y in set(sentence) & keys
                            else unseen_vec for y in sentence]))
        else:
            vec.append(sum([model.wv.get_vector(y) for y in sentence
                            if y in set(sentence) & keys]))
    return np.array(vec)
```
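The updated function can be checked without a trained model by mocking the minimal gensim 4 interface it touches (wv.key_to_index and wv.get_vector); the function body is repeated here so the sketch runs on its own, and the FakeKeyedVectors/FakeModel classes are only test scaffolding:

```python
import numpy as np

def sentences2vec(sentences, model, unseen=None):
    # Same logic as the snippet above, repeated so this sketch is self-contained.
    keys = set(model.wv.key_to_index)
    vec = []
    if unseen:
        unseen_vec = model.wv.get_vector(unseen)
    for sentence in sentences:
        if unseen:
            vec.append(sum(model.wv.get_vector(y) if y in keys else unseen_vec
                           for y in sentence))
        else:
            vec.append(sum(model.wv.get_vector(y) for y in sentence if y in keys))
    return np.array(vec)

class FakeKeyedVectors:
    """Mimics the two gensim 4 KeyedVectors members used above."""
    def __init__(self, vectors):
        self.key_to_index = {k: i for i, k in enumerate(vectors)}
        self._vectors = vectors
    def get_vector(self, key):
        return self._vectors[key]

class FakeModel:
    def __init__(self, vectors):
        self.wv = FakeKeyedVectors(vectors)

model = FakeModel({
    "a":   np.array([1.0, 0.0]),
    "b":   np.array([0.0, 1.0]),
    "UNK": np.array([0.5, 0.5]),
})

# The unknown word "x" falls back to the UNK vector: [1,0] + [0,1] + [0.5,0.5]
vecs = sentences2vec([["a", "b", "x"]], model, unseen="UNK")
```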

Convert sentences to mol

After applying mol2alt_sentence to get the molecular sentence, is there any way to convert this back to the Mol object?

E.g. I have the sentence ['1016841875', '198706261', '2245384272', '2909042096', '2245384272', '2909042096', '1016841875', '198706261'] - can I convert this back to an rdkit.Chem.rdchem.Mol object?

I have found the object mol2vec.helpers.IdentifierTable but I'm unsure what it's used for or whether it's helpful here.

PS: the mol2vec project is a great implementation and very helpful for my research so far!

How Do You Convert RDKIT molecule to your fingerprint key ?

Hi !

We are very excited about using your project. We have the notebook samples working and we want to try on our own RDKIT molecules.

Unfortunately, we don't know how to convert RDKit molecules into the Morgan fingerprint identifiers that you use as keys into the embedding dictionary. We can convert RDKit molecules to bit vectors, but can't seem to match RDKit molecules to your non-bit-vector representation (integer identifiers?).

Please advise !

Thanks !
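The "words" come from RDKit's unfolded (count-based) Morgan fingerprint rather than the folded bit vector; its integer substructure identifiers are what the corpus stores as strings. A sketch of how to obtain them (assuming RDKit is installed; CCO is an arbitrary example molecule):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCO")

# The unfolded Morgan fingerprint maps integer substructure identifiers to
# counts; bitInfo additionally records the (atom index, radius) of each hit.
info = {}
AllChem.GetMorganFingerprint(mol, 1, bitInfo=info)
identifiers = sorted(info)
```

mol2alt_sentence in mol2vec.features then orders these identifiers per atom and radius to form the sentence.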

Type Error when running with Python 3.8

When running with Python 3.8, I get the following error message:

```
Featurizing molecules.
Traceback (most recent call last):
  File "/Users/dkazempour/opt/anaconda3/bin/mol2vec", line 8, in <module>
    sys.exit(run())
  File "/Users/dkazempour/opt/anaconda3/lib/python3.8/site-packages/mol2vec/app/mol2vec.py", line 165, in run
    args.func(args)
  File "/Users/dkazempour/opt/anaconda3/lib/python3.8/site-packages/mol2vec/app/mol2vec.py", line 25, in do_featurize
    features.featurize(args.in_file, args.out_file, args.model, args.radius, args.uncommon)
  File "/Users/dkazempour/opt/anaconda3/lib/python3.8/site-packages/mol2vec/features.py", line 465, in featurize
    word2vec_model[uncommon]
TypeError: 'Word2Vec' object is not subscriptable
```

Which library is causing this issue?

Update: I recognized that my observation is related to the other issue titled "update sentences2vec function for gensim 4.0" by Maledive.

The library at fault is gensim: something changed in the 4.x versions that triggers the error above.

A temporary 'fix' (really a quick-and-dirty hack) is to pin the older version: pip install -Iv gensim==3.8.2
Afterwards I could run mol2vec successfully again.
