aspuru-guzik-group / group-selfies
License: Apache License 2.0
Hi,
When running the following (I can upload the large input file if needed):
from group_selfies import (
    fragment_mols,
    Group,
    MolecularGraph,
    GroupGrammar
)
# extracting a set of reasonable groups using fragmentation
lotus = [x.strip() for i, x in enumerate(open("data/lotus_smiles.csv")) if i > 0]
subset = lotus
# import random
# random.seed(42)
# subset = random.sample(lotus, 1000)
fragments = fragment_mols(subset, convert=True, method="default") # use custom fragmentation technique
# Very slow
# fragments_mmpa = fragment_mols(subset, convert=True, method="mmpa") # use MMPA fragmentation
vocab_fragment = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments)])
# Very slow
# vocab_fragment_mmpa = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments_mmpa)])
grammar_fragment = GroupGrammar(vocab=vocab_fragment)
# Very slow
# grammar_fragment_mmpa = GroupGrammar(vocab=vocab_fragment_mmpa)
grammar_fragment.to_file("lotus_grammar.txt")
# Very slow
# grammar_fragment_mmpa.to_file("lotus_grammar_mmpa.txt")
the line
vocab_fragment = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments)])
fails because some of the fragments cannot be parsed into groups, with the following error message:
[16:55:17] Explicit valence for atom # 5 C, 5, is greater than permitted
Traceback (most recent call last):
  File "/Users/adrutz/micromamba/envs/default/lib/python3.10/site-packages/group_selfies/utils/group_utils.py", line 73, in group_parser
    Chem.SanitizeMol(mapped, sanitizeOps = Chem.SanitizeFlags.SANITIZE_ALL ^ Chem.SanitizeFlags.SANITIZE_CLEANUPCHIRALITY)# ^ Chem.SanitizeFlags.SANITIZE_FINDRADICALS)
rdkit.Chem.rdchem.AtomValenceException: Explicit valence for atom # 5 C, 5, is greater than permitted

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/adrutz/micromamba/envs/default/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-81-4bd9ed7a6acf>", line 1, in <module>
    vocab_fragment = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments)])
  File "<ipython-input-81-4bd9ed7a6acf>", line 1, in <listcomp>
    vocab_fragment = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments)])
  File "/Users/adrutz/micromamba/envs/default/lib/python3.10/site-packages/group_selfies/group_mol_graph.py", line 355, in __init__
    self.mol = group_parser(self.canonsmiles, sanitize=sanitize)
  File "/Users/adrutz/micromamba/envs/default/lib/python3.10/site-packages/group_selfies/utils/group_utils.py", line 77, in group_parser
    raise ValueError(f'Issue in parsing group {s}')
ValueError: Issue in parsing group C[C@@H]1C(*1)C[C@H]2(*1)[C@H](C(*1)C(*1)[C@]3(C)C(*1)(*1)CC(*1)[C@H]32*1)[C@]1(CCC*1)C*1
Therefore, I had to do the following to make it work:
from group_selfies import (
    fragment_mols,
    Group,
    MolecularGraph,
    GroupGrammar
)
##
def remove_problematic_fragments(fragments):
    # Keep only the fragments that Group can parse
    valid_fragments = []
    for fragment in fragments:
        try:
            Group("fragmentTest", fragment)
            valid_fragments.append(fragment)
        except Exception:
            print(f"Removing problematic fragment: {fragment}")
    return valid_fragments
##
# extracting a set of reasonable groups using fragmentation
lotus = [x.strip() for i, x in enumerate(open("data/lotus_smiles.csv")) if i > 0]
subset = lotus
# import random
# random.seed(42)
# subset = random.sample(lotus, 1000)
fragments = fragment_mols(subset, convert=True, method="default") # use custom fragmentation technique
# Very slow
# fragments_mmpa = fragment_mols(subset, convert=True, method="mmpa") # use MMPA fragmentation
fragments_valid = remove_problematic_fragments(fragments)
vocab_fragment = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments_valid)])
# Very slow
# vocab_fragment_mmpa = dict([(f"frag{idx}", Group(f"frag{idx}", frag)) for idx, frag in enumerate(fragments_mmpa)])
grammar_fragment = GroupGrammar(vocab=vocab_fragment)
# Very slow
# grammar_fragment_mmpa = GroupGrammar(vocab=vocab_fragment_mmpa)
grammar_fragment.to_file("lotus_grammar.txt")
# Very slow
# grammar_fragment_mmpa.to_file("lotus_grammar_mmpa.txt")
I suppose this filtering step should not be necessary?
Hi,
I was playing with the nice tutorial provided in the repository. I have adapted the last cell of the notebook:
import numpy as np
import random
from rdkit.Chem import Draw
from IPython.display import display
# grammar_fragment, GroupGrammar and INDEX_ALPHABET are assumed to be in scope from earlier notebook cells
random.seed(42)
# to generate new molecules by combining random groups
# start with a grammar of groups, in this case `grammar_fragment` is a grammar of groups
# assign random overload_idxs
for group_name in grammar_fragment.vocab:
    group = grammar_fragment.vocab[group_name]
    group.overload_idx = random.randint(0, 100)
group_names = list(grammar_fragment.vocab.keys())
# pick n random groups
n = 2
n_pop = 0 # add pop tokens for extra branching
random_groups = [random.choice(group_names) for _ in range(n)] + ['[pop]' for _ in range(n_pop)]
random.shuffle(random_groups)
used_groups = []
# combine them into a new group selfies string
new_gselfies = ''
for i, g in enumerate(random_groups):
    if g == '[pop]':
        new_gselfies += '[pop]'
        continue
    # set a high priority
    grammar_fragment.vocab[g].priority = 50000
    used_groups.append(grammar_fragment.vocab[g])
    n_attachment_points = len(grammar_fragment.vocab[g].attachment_points)-1
    start = random.randint(0, n_attachment_points)
    random_range = np.arange(0, n_attachment_points)
    random_range = [e for e in random_range if e != start]
    if not random_range or i == len(random_groups)-1:
        new_block = f"[:{start}{g}]"
    else:
        end = random.choice(random_range)
        new_block = f"[:{start}{g}]{INDEX_ALPHABET[end]}"
    # draw the group with 1-based atom labels
    mol = grammar_fragment.vocab[g].mol
    for atom in mol.GetAtoms():
        atom.SetProp('atomLabel', str(atom.GetIdx() + 1))
    img = Draw.MolToImage(mol)
    print(g)
    display(img)
    new_gselfies += new_block
print(f"Generated group selfies {new_gselfies}")
# create a new GroupGrammar with the used groups
grammar = GroupGrammar(used_groups)
print(f"grammar vocab {grammar.vocab}")
out = grammar.decoder(new_gselfies)
display(out)
ex = grammar.extract_groups(out)
print("Extracted groups", ex)
valid = grammar.full_encoder(out, join=True)
print(f"{valid=}")
out = grammar.decoder(valid)
display(out)
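In the cell above, `start` is drawn with `random.randint(0, n_attachment_points)`, which includes the upper bound, while the candidates for `end` come from `np.arange(0, n_attachment_points)`, which excludes it. As a result the last attachment index can be chosen as `start` but never as `end`. If the intent is to pick two distinct attachment indices uniformly, a minimal sketch could look like this. The helper name `pick_start_end` is hypothetical, the code uses only the standard library, and whether uniform sampling over all indices matches the tutorial's intent is an assumption.

```python
import random
from typing import Optional, Tuple


def pick_start_end(n_points: int, rng: random.Random) -> Tuple[int, Optional[int]]:
    """Pick a start index and, if possible, a different end index.

    Both are drawn from range(n_points); end is None when the group has
    fewer than two attachment points.
    """
    if n_points < 1:
        raise ValueError("a group needs at least one attachment point")
    if n_points == 1:
        return 0, None
    start, end = rng.sample(range(n_points), 2)  # two distinct indices
    return start, end


rng = random.Random(42)
start, end = pick_start_end(3, rng)
print(start, end)
```

In the loop above, `n_points` would be `len(grammar_fragment.vocab[g].attachment_points)`, and a returned `end` of `None` corresponds to the branch that emits the block without an index token.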
Thank you for making this code available! Based on your arXiv paper's abstract, this repository is meant to be open-source. However, it is not open-source until a license is added. Please add an open-source license.
GitHub provides information on choosing a license here: https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/licensing-a-repository#choosing-the-right-license
Hi, I would like to know whether this project could be used for the MCS problem?
Many thanks
In an attempt to bind an ion to histidine using custom groups, I encountered an unexpected issue.
from rdkit import Chem
from rdkit.Chem import AllChem
from group_selfies import (
    fragment_mols,
    Group,
    MolecularGraph,
    GroupGrammar,
    group_encoder
)
from rdkit.Chem.Draw import IPythonConsole
from IPython.display import display
from test_utils import *
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')
g = Group('ion', 'CCCCCCCCCCCCS(=O)(=O)(*1)') # create the ion group
amino = Group("histidine", "O=C([C@H](CC1=CN(*1)C=N1)N)O") # create the histidine group
grammar = GroupGrammar([g, amino])
new_selfies = "[:0histidine][:0ion]"
display(grammar.decoder(new_selfies)) # Possibly a silent error: only histidine is displayed
new_selfies = "[:0histidine][:0ion][:0ion][:0histidine]"
display(grammar.decoder(new_selfies)) # This displays the desired molecule.