group-selfies's People

Contributors

ahasson, auhcheng, loluwot

group-selfies's Issues

Some of the fragments generated when using `fragment_mols` are not valid

Hi,

When running the following (I can upload the large input file if needed):

from group_selfies import (
    fragment_mols, 
    Group, 
    MolecularGraph, 
    GroupGrammar
)


# extracting a set of reasonable groups using fragmentation
lotus = [x.strip() for i, x in enumerate(open("data/lotus_smiles.csv")) if i > 0]

subset = lotus
# import random
# random.seed(42)
# subset = random.sample(lotus, 1000)

fragments = fragment_mols(subset, convert=True, method="default") # use custom fragmentation technique
# Very slow
# fragments_mmpa = fragment_mols(subset, convert=True, method="mmpa") # use MMPA fragmentation

vocab_fragment = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments)])
# Very slow
# vocab_fragment_mmpa = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments_mmpa)])

grammar_fragment = GroupGrammar(vocab=vocab_fragment)
# Very slow
# grammar_fragment_mmpa = GroupGrammar(vocab=vocab_fragment_mmpa)

grammar_fragment.to_file("lotus_grammar.txt")
# Very slow
# grammar_fragment_mmpa.to_file("lotus_grammar_mmpa.txt")

the line

vocab_fragment = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments)])

fails because some of the fragments cannot be parsed as groups, with the following error message:

[16:55:17] Explicit valence for atom # 5 C, 5, is greater than permitted
Traceback (most recent call last):
  File "/Users/adrutz/micromamba/envs/default/lib/python3.10/site-packages/group_selfies/utils/group_utils.py", line 73, in group_parser
    Chem.SanitizeMol(mapped, sanitizeOps = Chem.SanitizeFlags.SANITIZE_ALL ^ Chem.SanitizeFlags.SANITIZE_CLEANUPCHIRALITY)# ^ Chem.SanitizeFlags.SANITIZE_FINDRADICALS)
rdkit.Chem.rdchem.AtomValenceException: Explicit valence for atom # 5 C, 5, is greater than permitted
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/Users/adrutz/micromamba/envs/default/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-81-4bd9ed7a6acf>", line 1, in <module>
    vocab_fragment = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments)])
  File "<ipython-input-81-4bd9ed7a6acf>", line 1, in <listcomp>
    vocab_fragment = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments)])
  File "/Users/adrutz/micromamba/envs/default/lib/python3.10/site-packages/group_selfies/group_mol_graph.py", line 355, in __init__
    self.mol = group_parser(self.canonsmiles, sanitize=sanitize)
  File "/Users/adrutz/micromamba/envs/default/lib/python3.10/site-packages/group_selfies/utils/group_utils.py", line 77, in group_parser
    raise ValueError(f'Issue in parsing group {s}')
ValueError: Issue in parsing group C[C@@H]1C(*1)C[C@H]2(*1)[C@H](C(*1)C(*1)[C@]3(C)C(*1)(*1)CC(*1)[C@H]32*1)[C@]1(CCC*1)C*1

Therefore, I had to do the following to make it work:

from group_selfies import (
    fragment_mols, 
    Group, 
    MolecularGraph, 
    GroupGrammar
)


##
def remove_problematic_fragments(fragments):
    """Keep only the fragments that Group() can parse."""
    valid_fragments = []
    for fragment in fragments:
        try:
            Group("fragmentTest", fragment)
            valid_fragments.append(fragment)
        except Exception:  # e.g. the ValueError raised by group_parser
            print(f"Removing problematic fragment: {fragment}")
    return valid_fragments
##

# extracting a set of reasonable groups using fragmentation
lotus = [x.strip() for i, x in enumerate(open("data/lotus_smiles.csv")) if i > 0]

subset = lotus
# import random
# random.seed(42)
# subset = random.sample(lotus, 1000)

fragments = fragment_mols(subset, convert=True, method="default") # use custom fragmentation technique
# Very slow
# fragments_mmpa = fragment_mols(subset, convert=True, method="mmpa") # use MMPA fragmentation

fragments_valid = remove_problematic_fragments(fragments)

vocab_fragment = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments_valid)])
# Very slow
# vocab_fragment_mmpa = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments_mmpa)])

grammar_fragment = GroupGrammar(vocab=vocab_fragment)
# Very slow
# grammar_fragment_mmpa = GroupGrammar(vocab=vocab_fragment_mmpa)

grammar_fragment.to_file("lotus_grammar.txt")
# Very slow
# grammar_fragment_mmpa.to_file("lotus_grammar_mmpa.txt")

I suppose `fragment_mols` should not return fragments that cannot be turned into groups?
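The workaround above constructs every group twice: once to test it and once to build the vocabulary. A minimal sketch of a single-pass variant follows. `build_vocab` and its `make_group` parameter are hypothetical names introduced here, not part of group_selfies; in the setting of this issue one would presumably pass `Group` as `make_group`. The sketch assumes that parse failures surface as `ValueError`, as in the traceback above, and uses a stand-in constructor so that it runs without RDKit.

```python
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

G = TypeVar("G")


def build_vocab(
    fragments: Iterable[str],
    make_group: Callable[[str, str], G],
    prefix: str = "frag",
) -> Tuple[Dict[str, G], List[str]]:
    """Build a name -> group mapping in one pass.

    Fragments for which make_group raises ValueError are collected in a
    separate list instead of aborting the whole build.
    """
    vocab: Dict[str, G] = {}
    rejected: List[str] = []
    for fragment in fragments:
        # Number groups by how many were accepted so far, so names stay contiguous.
        name = f"{prefix}{len(vocab)}"
        try:
            vocab[name] = make_group(name, fragment)
        except ValueError:
            rejected.append(fragment)
    return vocab, rejected


def fake_group(name: str, smiles: str) -> Tuple[str, str]:
    """Stand-in for a real group constructor (hypothetical): rejects strings containing '!'."""
    if "!" in smiles:
        raise ValueError(f"Issue in parsing group {smiles}")
    return (name, smiles)


vocab, rejected = build_vocab(["CC(*1)", "bad!", "c1ccccc1(*1)"], fake_group)
print(sorted(vocab), rejected)
```

Returning the rejected fragments rather than only printing them makes it easier to attach the offending SMILES to a bug report.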

Incorrect Recovery of Groups in Molecule Decoding/Encoding Process

Hi,
I was playing with the nice tutorial provided in the repository. I have adapted the last cell of the notebook:

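import random

import numpy as np
from IPython.display import display
from rdkit.Chem import Draw

from group_selfies import GroupGrammar

# grammar_fragment and INDEX_ALPHABET are assumed to be defined in earlier cells of the notebook
random.seed(42)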

# to generate new molecules by combining random groups
# start with a grammar of groups, in this case `grammar_fragment` is a grammar of groups

# assign random overload_idxs
for group_name in grammar_fragment.vocab:
    group = grammar_fragment.vocab[group_name]
    group.overload_idx = random.randint(0, 100)
    pass

group_names = list(grammar_fragment.vocab.keys())

# pick n random groups
n = 2

n_pop = 0 # add pop tokens for extra branching
random_groups = [random.choice(group_names) for _ in range(n)] + ['[pop]' for _ in range(n_pop)]
random.shuffle(random_groups)


used_groups = []
# combine them into a new group selfies string
new_gselfies = ''
for i, g in enumerate(random_groups):
    if g == '[pop]':
        new_gselfies += '[pop]'
        continue
    #set high prio
    grammar_fragment.vocab[g].priority=50000
    used_groups.append(grammar_fragment.vocab[g])

    n_attachment_points = len(grammar_fragment.vocab[g].attachment_points)-1
    start = random.randint(0, n_attachment_points)
    
    random_range = np.arange(0, n_attachment_points)    
    random_range = [e for e in random_range if e!=start]
    if not random_range or i == len(random_groups)-1:
        new_block = f"[:{start}{g}]"
    else:        
        end = random.choice(random_range)    
        new_block = f"[:{start}{g}]{INDEX_ALPHABET[end]}"

    mol = grammar_fragment.vocab[g].mol
    for atom in mol.GetAtoms():
        atom.SetProp('atomLabel', str(atom.GetIdx() + 1))
    img = Draw.MolToImage(mol)
    print(g)
    display(img)
    new_gselfies += new_block

print(f"Generated group selfies {new_gselfies}")

# Create a new GroupGrammar from the groups that were used
grammar = GroupGrammar(used_groups)   
print(f"grammar vocab {grammar.vocab}")
out = grammar.decoder(new_gselfies)
display(out)

ex = grammar.extract_groups(out)
print("Extracted groups", ex)
valid = grammar.full_encoder(out, join=True)
print(f"{valid=}")
out = grammar.decoder(valid)
display(out)

This gives:
[image: output of the cell above]
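Rather than comparing the two rendered images by eye, the round trip can be checked programmatically. The sketch below is only an illustration: `round_trip_preserved` is a hypothetical helper, and its `encode`, `decode` and `canonical` parameters are placeholders. In the cell above they would presumably be `lambda m: grammar.full_encoder(m, join=True)`, `grammar.decoder`, and RDKit's `Chem.MolToSmiles`. Toy string functions stand in for them here so that the example runs on its own.

```python
from typing import Callable, TypeVar

M = TypeVar("M")  # molecule type
S = TypeVar("S")  # encoded string type


def round_trip_preserved(
    mol: M,
    encode: Callable[[M], S],
    decode: Callable[[S], M],
    canonical: Callable[[M], str],
) -> bool:
    """Return True if decode(encode(mol)) has the same canonical form as mol."""
    return canonical(decode(encode(mol))) == canonical(mol)


def reverse(text: str) -> str:
    """Toy stand-in: reversing twice gives back the input, so it is lossless."""
    return text[::-1]


def lossy_decode(text: str) -> str:
    """Toy stand-in: reverses the string but drops its last character."""
    return text[::-1][:-1]


ok = round_trip_preserved("CCO", reverse, reverse, str)
broken = round_trip_preserved("CCO", reverse, lossy_decode, str)
print(ok, broken)
```

A `False` result for a real molecule would make the mismatch reproducible without relying on the images.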

Custom groups did not join as expected

In an attempt to bind an ion to histidine using custom groups, I encountered an unexpected issue.

from rdkit import Chem
from rdkit.Chem import AllChem
from group_selfies import (
    fragment_mols, 
    Group, 
    MolecularGraph, 
    GroupGrammar, 
    group_encoder
)

from rdkit.Chem.Draw import IPythonConsole

from IPython.display import display
from test_utils import *
from rdkit import RDLogger

RDLogger.DisableLog('rdApp.*') 


g = Group('ion', 'CCCCCCCCCCCCS(=O)(=O)(*1)') # create the ion group
amino = Group("histidine", "O=C([C@H](CC1=CN(*1)C=N1)N)O") # create the histidine group

grammar = GroupGrammar([g, amino])

new_selfies = "[:0histidine][:0ion]"

display(grammar.decoder(new_selfies)) # This may be a silent error and just displays histidine

new_selfies = "[:0histidine][:0ion][:0ion][:0histidine]"

display(grammar.decoder(new_selfies)) # This displays the desired molecule. 

