aspuru-guzik-group / group-selfies


License: Apache License 2.0

Python 2.30% Jupyter Notebook 97.70%

group-selfies's Issues

Some of the fragments generated when using `fragment_mols` are not valid

Hi,

When running the following (I can upload the large input file if needed):

from group_selfies import (
    fragment_mols, 
    Group, 
    MolecularGraph, 
    GroupGrammar
)


# extracting a set of reasonable groups using fragmentation
lotus = [x.strip() for i, x in enumerate(open("data/lotus_smiles.csv")) if i > 0]

subset = lotus
# import random
# random.seed(42)
# subset = random.sample(lotus, 1000)

fragments = fragment_mols(subset, convert=True, method="default") # use custom fragmentation technique
# Very slow
# fragments_mmpa = fragment_mols(subset, convert=True, method="mmpa") # use MMPA fragmentation

vocab_fragment = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments)])
# Very slow
# vocab_fragment_mmpa = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments_mmpa)])

grammar_fragment = GroupGrammar(vocab=vocab_fragment)
# Very slow
# grammar_fragment_mmpa = GroupGrammar(vocab=vocab_fragment_mmpa)

grammar_fragment.to_file("lotus_grammar.txt")
# Very slow
# grammar_fragment_mmpa.to_file("lotus_grammar_mmpa.txt")

the line

vocab_fragment = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments)])

fails because some of the fragments cannot be parsed into a Group, with the following error message:

[16:55:17] Explicit valence for atom # 5 C, 5, is greater than permitted
Traceback (most recent call last):
  File "/Users/adrutz/micromamba/envs/default/lib/python3.10/site-packages/group_selfies/utils/group_utils.py", line 73, in group_parser
    Chem.SanitizeMol(mapped, sanitizeOps = Chem.SanitizeFlags.SANITIZE_ALL ^ Chem.SanitizeFlags.SANITIZE_CLEANUPCHIRALITY)# ^ Chem.SanitizeFlags.SANITIZE_FINDRADICALS)
rdkit.Chem.rdchem.AtomValenceException: Explicit valence for atom # 5 C, 5, is greater than permitted
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/Users/adrutz/micromamba/envs/default/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-81-4bd9ed7a6acf>", line 1, in <module>
    vocab_fragment = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments)])
  File "<ipython-input-81-4bd9ed7a6acf>", line 1, in <listcomp>
    vocab_fragment = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments)])
  File "/Users/adrutz/micromamba/envs/default/lib/python3.10/site-packages/group_selfies/group_mol_graph.py", line 355, in __init__
    self.mol = group_parser(self.canonsmiles, sanitize=sanitize)
  File "/Users/adrutz/micromamba/envs/default/lib/python3.10/site-packages/group_selfies/utils/group_utils.py", line 77, in group_parser
    raise ValueError(f'Issue in parsing group {s}')
ValueError: Issue in parsing group C[C@@H]1C(*1)C[C@H]2(*1)[C@H](C(*1)C(*1)[C@]3(C)C(*1)(*1)CC(*1)[C@H]32*1)[C@]1(CCC*1)C*1

Therefore, I had to do the following to make it work:

from group_selfies import (
    fragment_mols, 
    Group, 
    MolecularGraph, 
    GroupGrammar
)


##
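def remove_problematic_fragments(fragments):
    """Keep only the fragments that Group can parse."""
    valid_fragments = []
    for fragment in fragments:
        try:
            # Group raises (a ValueError in the traceback above) when parsing fails
            Group("fragmentTest", fragment)
            valid_fragments.append(fragment)
        except Exception:
            print(f"Removing problematic fragment: {fragment}")
    return valid_fragments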
##

# extracting a set of reasonable groups using fragmentation
lotus = [x.strip() for i, x in enumerate(open("data/lotus_smiles.csv")) if i > 0]

subset = lotus
# import random
# random.seed(42)
# subset = random.sample(lotus, 1000)

fragments = fragment_mols(subset, convert=True, method="default") # use custom fragmentation technique
# Very slow
# fragments_mmpa = fragment_mols(subset, convert=True, method="mmpa") # use MMPA fragmentation

fragments_valid = remove_problematic_fragments(fragments)

vocab_fragment = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments_valid)])
# Very slow
# vocab_fragment_mmpa = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments_mmpa)])

grammar_fragment = GroupGrammar(vocab=vocab_fragment)
# Very slow
# grammar_fragment_mmpa = GroupGrammar(vocab=vocab_fragment_mmpa)

grammar_fragment.to_file("lotus_grammar.txt")
# Very slow
# grammar_fragment_mmpa.to_file("lotus_grammar_mmpa.txt")

I suppose this should not be the case?
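A variant of the workaround above can also record why each fragment was rejected, which may help when reporting the offending fragments. This is only a minimal sketch: `partition_fragments` and its `build` parameter are hypothetical names, not part of group_selfies, and `build` stands in for whatever constructor should be tried on each fragment.

```python
from typing import Callable, Iterable, List, Tuple


def partition_fragments(
    fragments: Iterable[str],
    build: Callable[[str], object],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split fragments into those `build` accepts and those it rejects.

    `build` is a hypothetical stand-in for a constructor such as
    lambda frag: Group("fragmentTest", frag).
    """
    valid: List[str] = []
    rejected: List[Tuple[str, str]] = []
    for fragment in fragments:
        try:
            build(fragment)
        except Exception as exc:  # broad on purpose: any failure rejects the fragment
            rejected.append((fragment, f"{type(exc).__name__}: {exc}"))
        else:
            valid.append(fragment)
    return valid, rejected
```

With the code in this issue it would presumably be called as `partition_fragments(fragments, lambda frag: Group("fragmentTest", frag))`, after which the second returned list holds each rejected fragment together with its error message.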

Incorrect Recovery of Groups in Molecule Decoding/Encoding Process

Hi,
I was playing with the nice tutorial provided in the repository. I have adapted the last cell of the notebook:

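import numpy as np
import random
from rdkit.Chem import Draw
from IPython.display import display

# grammar_fragment and INDEX_ALPHABET are assumed to come from earlier cells of the tutorial notebook
random.seed(42)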

# to generate new molecules by combining random groups
# start with a grammar of groups, in this case `grammar_fragment` is a grammar of groups

# assign random overload_idxs
for group_name in grammar_fragment.vocab:
    group = grammar_fragment.vocab[group_name]
    group.overload_idx = random.randint(0, 100)

group_names = list(grammar_fragment.vocab.keys())

# pick n random groups
n = 2

n_pop = 0 # add pop tokens for extra branching
random_groups = [random.choice(group_names) for _ in range(n)] + ['[pop]' for _ in range(n_pop)]
random.shuffle(random_groups)


used_groups = []
# combine them into a new group selfies string
new_gselfies = ''
for i, g in enumerate(random_groups):
    if g == '[pop]':
        new_gselfies += '[pop]'
        continue
    # give the group a high priority
    grammar_fragment.vocab[g].priority = 50000
    used_groups.append(grammar_fragment.vocab[g])

    n_attachment_points = len(grammar_fragment.vocab[g].attachment_points)-1
    start = random.randint(0, n_attachment_points)
    
    random_range = np.arange(0, n_attachment_points)    
    random_range = [e for e in random_range if e!=start]
    if not random_range or i == len(random_groups)-1:
        new_block = f"[:{start}{g}]"
    else:        
        end = random.choice(random_range)    
        new_block = f"[:{start}{g}]{INDEX_ALPHABET[end]}"

    mol = grammar_fragment.vocab[g].mol
    for atom in mol.GetAtoms():
        atom.SetProp('atomLabel', str(atom.GetIdx() + 1))
    img = Draw.MolToImage(mol)
    print(g)
    display(img)
    new_gselfies += new_block

print(f"Generated group selfies {new_gselfies}")

# Create a new GroupGrammar from the groups that were used
grammar = GroupGrammar(used_groups)
print(f"grammar vocab {grammar.vocab}")
out = grammar.decoder(new_gselfies)
display(out)

ex = grammar.extract_groups(out)
print("Extracted groups", ex)
valid = grammar.full_encoder(out, join=True)
print(f"{valid=}")
out = grammar.decoder(valid)
display(out)

This gives:
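One way to check the round trip programmatically, rather than by eye, is to compare the group names in the generated string with those in the re-encoded one. The sketch below uses only the standard library and assumes the token shape seen in this thread, `[:<start><name>]` with a name that begins with a non-digit; `group_names` and `missing_groups` are hypothetical helpers, not group_selfies functions, and the encoder may emit tokens this pattern does not cover.

```python
import re
from collections import Counter

# Assumed token shape, taken from the strings in this thread:
# "[:<start><name>]", e.g. "[:0histidine]" or "[:2frag17]".
GROUP_TOKEN = re.compile(r"\[:(\d+)([^\]\d][^\]]*)\]")


def group_names(gselfies: str) -> Counter:
    """Count how often each group name appears in a group SELFIES string."""
    return Counter(name for _start, name in GROUP_TOKEN.findall(gselfies))


def missing_groups(generated: str, reencoded: str) -> Counter:
    """Groups present in `generated` that were not recovered in `reencoded`."""
    return group_names(generated) - group_names(reencoded)
```

In the cell above, `missing_groups(new_gselfies, valid)` would return an empty `Counter` when every group that was put in is recovered by the encoder.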

How to get SMILES back from group-selfies?

Please add a license

Thank you for making this code available! Based on your arXiv paper's abstract, this repository is meant to be open-source. However, it is not open-source until a license is added. Please add a(n open-source) license.

GitHub provides information on choosing a license here: https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/licensing-a-repository#choosing-the-right-license

Most common substructure search

Hi, I would like to know whether this project could be used for the MCS problem.

Many thanks

Custom groups did not join as expected

In an attempt to bind an ion to histidine using custom groups, I encountered an unexpected issue.

from rdkit import Chem
from rdkit.Chem import AllChem
from group_selfies import (
    fragment_mols, 
    Group, 
    MolecularGraph, 
    GroupGrammar, 
    group_encoder
)

from rdkit.Chem.Draw import IPythonConsole

from IPython.display import display
from test_utils import *
from rdkit import RDLogger

RDLogger.DisableLog('rdApp.*') 


g = Group('ion', 'CCCCCCCCCCCCS(=O)(=O)(*1)') # create the ion group
amino = Group("histidine", "O=C([C@H](CC1=CN(*1)C=N1)N)O") # create the histidine group

grammar = GroupGrammar([g, amino])

new_selfies = "[:0histidine][:0ion]"

display(grammar.decoder(new_selfies)) # This may be a silent error and just displays histidine

new_selfies = "[:0histidine][:0ion][:0ion][:0histidine]"

display(grammar.decoder(new_selfies)) # This displays the desired molecule.
