
License: BSD 3-Clause "New" or "Revised" License


Mol2vec

Mol2vec - an unsupervised machine learning approach to learn vector representations of molecular substructures


Requirements

Installation

pip install git+https://github.com/samoturk/mol2vec

Documentation

Read the documentation on Read the Docs.

To build the documentation, install sphinx, numpydoc and sphinx_rtd_theme, then run make html in the docs directory.

Usage

As python module

from mol2vec import features
from mol2vec import helpers

The first line imports functions to generate "sentences" from molecules and to train the model; the second imports functions useful for depictions. Check the examples directory for more details, and the Mol2vec notebooks repository for visualisations made to run easily in Binder.

Command line tool

Mol2vec is an unsupervised machine learning approach to learn vector representations of molecular substructures. The command line application has subcommands to prepare a corpus from molecular data (SDF or SMILES), to train a Mol2vec model, and to featurize new samples.

Subcommand 'corpus'

Generates a corpus to train the Mol2vec model. It generates Morgan identifiers (up to the selected radius), which represent the words (molecules are the sentences). Words are ordered in the sentence according to the atom order in the canonical SMILES (generated during corpus generation), starting at each atom with the identifier at radius 0.
The corpus subcommand can also optionally replace rare identifiers with a selected string (e.g. UNK), which can later be used to represent completely new substructures (i.e. at the featurization step). NOTE: the corpus with replaced uncommon identifiers is saved in a separate file ending in "_{selected string to replace uncommon}". Since this is an unsupervised method, we recommend using as many molecules as possible (e.g. the complete ZINC database).
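The replacement of uncommon identifiers can be illustrated with a small plain-Python sketch (a toy illustration only, with made-up identifier strings; the actual implementation lives in mol2vec.features):

```python
from collections import Counter

# Toy corpus: each "sentence" is a list of Morgan-identifier "words".
corpus = [
    ["2246728737", "864662311", "2246728737"],
    ["864662311", "98513984"],
    ["2246728737", "864662311"],
]

threshold = 1  # identifiers appearing <= threshold times get replaced
counts = Counter(word for sentence in corpus for word in sentence)

replaced = [
    [word if counts[word] > threshold else "UNK" for word in sentence]
    for sentence in corpus
]
# "98513984" occurs only once here, so it is replaced by "UNK".
```

At featurization time, any identifier that never made it into the vocabulary can then be mapped to the UNK vector.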

Performance:

Corpus generation using 20M compounds with replacement of uncommon identifiers takes 6 hours on 4 cores.

Example:

To prepare a corpus using radius 1 and 4 cores, replacing uncommon identifiers that appear <= 3 times with 'UNK', run:

mol2vec corpus -i mols.smi -o mols.cp -r 1 -j 4 --uncommon UNK --threshold 3

Subcommand 'train'

Trains Mol2vec model using previously prepared corpus.

Performance:

Training the model on 20M sentences takes ~2 hours on 4 cores.

Example:

To train a Mol2vec model on the corpus with replaced uncommon identifiers, using skip-gram with window size 10 and generating 300-dimensional vectors on 4 cores, run:

mol2vec train -i mols.cp_UNK -o model.pkl -d 300 -w 10 -m skip-gram --threshold 3 -j 4

Subcommand 'featurize'

Featurizes new samples using a pre-trained Mol2vec model. It saves the result in a CSV file with columns for molecule identifiers, canonical SMILES (generated during featurization) and all potential SD fields from the input SDF file, followed by columns mol2vec-0 to mol2vec-(n-1), where n is the dimensionality of the embeddings in the model.
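The resulting column layout can be sketched with the standard csv module (toy values only; the "ID" and "Smiles" column names and the 4-dimensional vectors are illustrative, and the default pretrained model produces 300 dimensions):

```python
import csv
import io

n_dim = 4  # dimensionality of the toy embeddings (real models often use 300)
rows = [
    ("mol-1", "CCO", [0.1, -0.2, 0.3, 0.0]),
    ("mol-2", "c1ccccc1", [0.5, 0.1, -0.4, 0.2]),
]

# Identifier and SMILES columns first, then mol2vec-0 ... mol2vec-(n-1).
header = ["ID", "Smiles"] + ["mol2vec-%d" % i for i in range(n_dim)]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)
for mol_id, smiles, vec in rows:
    writer.writerow([mol_id, smiles] + vec)

csv_text = buf.getvalue()
```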

Example:

To featurize new samples using pre-trained embeddings, with the vector trained on uncommon samples representing new substructures, run:

mol2vec featurize -i new.smi -o new.csv -m model.pkl -r 1 --uncommon UNK

For more details on an individual subcommand, run: mol2vec $sub-command --help

How to cite?

@article{doi:10.1021/acs.jcim.7b00616,
author = {Jaeger, Sabrina and Fulle, Simone and Turk, Samo},
title = {Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition},
journal = {Journal of Chemical Information and Modeling},
volume = {58},
number = {1},
pages = {27-35},
year = {2018},
doi = {10.1021/acs.jcim.7b00616},
URL = {https://doi.org/10.1021/acs.jcim.7b00616}
}

Sponsor info

Initial development was supported by BioMed X Innovation Center, Heidelberg.

mol2vec's People

Contributors

samoturk


mol2vec's Issues

Feature size

Thanks for providing the implementation. I have a question from running the code: with different input molecules, the mol2alt_sentence function outputs encodings of varying lengths, and the final DfVec thus contains multiple 300-dimensional features. I am wondering how to aggregate them?

I am just using the standard pipeline but with different input molecules:
sentence = mol2alt_sentence(mol, radius=1)
sentence_obj = MolSentence(sentence)
mol_vec = DfVec(sentences2vec(sentence_obj, model, unseen='UNK'))

Thanks!

Feng
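For reference, sentences2vec performs the aggregation itself: it sums the vectors of the words in a sentence, so each molecule always ends up as one fixed-length vector regardless of sentence length. A minimal NumPy sketch of that aggregation (toy 4-dimensional vectors and made-up identifiers):

```python
import numpy as np

# Toy "model": identifier -> 4-dimensional vector (real models use e.g. 300).
vectors = {
    "1016841875": np.array([1.0, 0.0, 0.0, 0.0]),
    "198706261":  np.array([0.0, 1.0, 0.0, 0.0]),
    "UNK":        np.array([0.0, 0.0, 0.0, 1.0]),
}

def sentence_to_vec(sentence, vectors, unseen="UNK"):
    # Sum the word vectors; the sentence length does not matter, the
    # result is always a single vector of the embedding dimensionality.
    return sum(vectors.get(word, vectors[unseen]) for word in sentence)

short = sentence_to_vec(["1016841875", "198706261"], vectors)
longer = sentence_to_vec(["1016841875", "198706261", "198706261", "999"], vectors)
# Both are 4-dimensional; the unseen identifier "999" maps to UNK.
```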

filter criteria

Hi, first, thanks for making this great OSS library, much appreciated.

In the article, it is indicated that only the following elements are allowed to appear in the SMILES of a molecule. Will lowercase (aromatic) letters also be included, such as c, o, h, n?

[screenshot of the allowed-element criteria from the article]

I can't download ZINC15. Could you provide a way to download it?

Convert a Base64-encoded ROMol string back into an rdkit.Chem.rdchem.Mol object

I have a Base64-encoded string of an ROMol which needs to be converted back to its original object form. I am using the Base64-encoded string to write the data to a text file, but after reading the text file I need to convert it back into an object so that I can generate embeddings via the function "mol2alt_sentence".

I am only able to save the Base64-encoded ROMol as a string instead of a serialized object, so the functionality of converting the string back into the object is required. Please suggest any alternative solution for this issue.
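One way to round-trip an RDKit Mol through a Base64 text representation is via Mol.ToBinary() and the Chem.Mol constructor, which accepts the binary pickle (a sketch, assuming RDKit is installed; the CCO example molecule is arbitrary):

```python
import base64

from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")

# Serialize the Mol to its binary pickle, then Base64-encode for text storage.
encoded = base64.b64encode(mol.ToBinary()).decode("ascii")

# Decode the Base64 text and rebuild a Mol object from the raw bytes.
restored = Chem.Mol(base64.b64decode(encoded))
```

The restored object can then be passed to mol2alt_sentence as usual.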

Transfer learning

Hi, first, thanks for making this great OSS library, much appreciated.

I'm interested in taking the pretrained ZINC-based model and training it further with sentences from my own data set. The reason for this approach is that my dataset is small, only a few thousand SMILES, so I'm trying to take a page from the CNN book, where you can take a pretrained image recognition model and further specialize it for your use case.

Is there any way to achieve that within your library?

Error when loading model

Hi, thanks for putting together the notebook to explore mol2vec. However, I get the following error when loading the model using model = KeyedVectors.load('model_300dim.pkl'):

AttributeError: 'Word2Vec' object has no attribute 'vocabulary'

I am using gensim v 3.3.0.

Also, I noticed there are two versions of the 'model_300dim.pkl' file, one around 25 MB and another around 74 MB. Which one should be used? I have tried both versions and see the same error. Thanks for any help!

update sentences2vec function for gensim 4.0

```python
import numpy as np


def sentences2vec(sentences, model, unseen=None):
    """Generate vectors for each sentence (list) in a list of sentences.
    A vector is simply the sum of the vectors of the individual words.

    Parameters
    ----------
    sentences : list, array
        List with sentences
    model : word2vec.Word2Vec
        Gensim word2vec model
    unseen : None, str
        Keyword for unseen words. If None, those words are skipped.
        https://stats.stackexchange.com/questions/163005/how-to-set-the-dictionary-for-text-analysis-using-neural-networks/163032#163032

    Returns
    -------
    np.array
    """
    # gensim 4.x: model.wv.vocab was replaced by model.wv.key_to_index
    keys = set(model.wv.key_to_index)
    vec = []

    if unseen:
        unseen_vec = model.wv.get_vector(unseen)

    for sentence in sentences:
        if unseen:
            vec.append(sum([model.wv.get_vector(y) if y in set(sentence) & keys
                            else unseen_vec for y in sentence]))
        else:
            vec.append(sum([model.wv.get_vector(y) for y in sentence
                            if y in set(sentence) & keys]))
    return np.array(vec)
```
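The updated function can be checked without a trained model by mocking the minimal gensim 4 interface it touches (wv.key_to_index and wv.get_vector); the function body is repeated here so the sketch runs on its own, and the FakeKeyedVectors/FakeModel classes are only test scaffolding:

```python
import numpy as np

def sentences2vec(sentences, model, unseen=None):
    # Same logic as the snippet above, repeated so this sketch is self-contained.
    keys = set(model.wv.key_to_index)
    vec = []
    if unseen:
        unseen_vec = model.wv.get_vector(unseen)
    for sentence in sentences:
        if unseen:
            vec.append(sum(model.wv.get_vector(y) if y in keys else unseen_vec
                           for y in sentence))
        else:
            vec.append(sum(model.wv.get_vector(y) for y in sentence if y in keys))
    return np.array(vec)

class FakeKeyedVectors:
    """Mimics the two gensim 4 KeyedVectors members used above."""
    def __init__(self, vectors):
        self.key_to_index = {k: i for i, k in enumerate(vectors)}
        self._vectors = vectors
    def get_vector(self, key):
        return self._vectors[key]

class FakeModel:
    def __init__(self, vectors):
        self.wv = FakeKeyedVectors(vectors)

model = FakeModel({
    "a":   np.array([1.0, 0.0]),
    "b":   np.array([0.0, 1.0]),
    "UNK": np.array([0.5, 0.5]),
})

# The unknown word "x" falls back to the UNK vector: [1,0] + [0,1] + [0.5,0.5]
vecs = sentences2vec([["a", "b", "x"]], model, unseen="UNK")
```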

Convert sentences to mol

After applying mol2alt_sentence to get the molecular sentence, is there any way to convert this back to the Mol object?

E.g. I have the sentence ['1016841875', '198706261', '2245384272', '2909042096', '2245384272', '2909042096', '1016841875', '198706261'] - can I convert this back to an rdkit.Chem.rdchem.Mol object?

I have found the object mol2vec.helpers.IdentifierTable but I'm unsure what it's used for or whether it's helpful here.

PS: the mol2vec project is a great implementation and very helpful for my research so far!

How Do You Convert RDKIT molecule to your fingerprint key ?

Hi !

We are very excited about using your project. We have the notebook samples working and we want to try on our own RDKIT molecules.

Unfortunately, we don't know how to convert RDKit molecules into the Morgan fingerprint identifiers that you use as keys into the embedding dictionary. We can convert RDKit molecules to bit vectors, but can't seem to match RDKit molecules to your non-bit-vector representation (integer identifiers?).

Please advise !

Thanks !
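The "words" come from RDKit's unfolded (count-based) Morgan fingerprint rather than the folded bit vector; its integer substructure identifiers are what the corpus stores as strings. A sketch of how to obtain them (assuming RDKit is installed; CCO is an arbitrary example molecule):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCO")

# The unfolded Morgan fingerprint maps integer substructure identifiers to
# counts; bitInfo additionally records the (atom index, radius) of each hit.
info = {}
AllChem.GetMorganFingerprint(mol, 1, bitInfo=info)
identifiers = sorted(info)
```

mol2alt_sentence in mol2vec.features then orders these identifiers per atom and radius to form the sentence.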

Type Error when running with Python 3.8

When running with Python 3.8, I get the following error message:

```
Featurizing molecules.
Traceback (most recent call last):
  File "/Users/dkazempour/opt/anaconda3/bin/mol2vec", line 8, in <module>
    sys.exit(run())
  File "/Users/dkazempour/opt/anaconda3/lib/python3.8/site-packages/mol2vec/app/mol2vec.py", line 165, in run
    args.func(args)
  File "/Users/dkazempour/opt/anaconda3/lib/python3.8/site-packages/mol2vec/app/mol2vec.py", line 25, in do_featurize
    features.featurize(args.in_file, args.out_file, args.model, args.radius, args.uncommon)
  File "/Users/dkazempour/opt/anaconda3/lib/python3.8/site-packages/mol2vec/features.py", line 465, in featurize
    word2vec_model[uncommon]
TypeError: 'Word2Vec' object is not subscriptable
```

Which library is causing this issue?

Update: I recognized that my observation is related to the other issue titled "update sentences2vec function for gensim 4.0" by Maledive.

The library at fault is gensim: something changed in the 4.x versions that triggers the error above.

A temporary 'fix' (really a quick-and-dirty hack) is to pin the older version: pip install -Iv gensim==3.8.2
Afterwards I could run mol2vec successfully again.
