jacksonburns / astartes

Better Data Splits for Machine Learning

Home Page: https://jacksonburns.github.io/astartes/

License: MIT License

Python 100.00%
ai ml data-science machine-learning sampling python

astartes's Introduction

Welcome to my GitHub!

I dug this comment up while attempting to compile an ancient computational chemistry program written in Fortran 77:

The following notes are discoveries made while attempting to figure out how this mess works. They are here so that you, poor soul that you are, don't have to reinvent the wheel (well, at least not the whole thing).

Hopefully my projects don't cause you anywhere near this level of frustration.

Jackson Burns's GitHub stats

astartes's People

Contributors

himaghna, jacksonburns, kspieks


Forkers

himaghna hfooladi

astartes's Issues

[BUG]: Repository Configuration Issues

Describe the bug

Two small things:

  1. The "Question" issue template still says Chemprop in the title
  2. The Update JOSS Paper action should be run on pushes to main in addition to its current configuration

[BUG]: Scaffold sampler inaccessible from `train_(val_)test_split_molecules`

Describe the bug

The astartes.molecules functions for splitting automatically featurize the input SMILES, which raises exceptions when attempting to use the scaffold sampler.

Expected behavior

train_val_test_split_molecules should skip the AIMSim featurization step when the scaffold sampler is used.
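A minimal sketch of the proposed fix; featurize_with_aimsim is a hypothetical stand-in for the actual AIMSim featurization call, and the other names are illustrative rather than the real astartes internals:

from astartes import train_val_test_split

def train_val_test_split_molecules(smiles, sampler="random", **hopts):
    if sampler == "scaffold":
        # the scaffold sampler works on the molecules themselves, so skip featurization
        X = smiles
    else:
        X = featurize_with_aimsim(smiles)  # hypothetical AIMSim featurization helper
    return train_val_test_split(X, sampler=sampler, hopts=hopts)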

[FEATURE]: Add some randomness to split assignments of clusters

Is your feature request related to a problem? Please describe.

Currently the clusters resulting from an extrapolative sampling algorithm are sorted from smallest to largest and assigned into the testing, validation, and training sets without any randomness being possible in the assignments. This is fine, but not useful for stratified sampling.

Use-cases/examples of this new feature

Stratified sampling will require some randomness in order to function. This change would also let us make random_state a kwarg in train_val_test_split and train_test_split, which is nice for interoperability, since all samplers (except Kennard-Stone) would then use random_state.

Desired solution/workflow

Shuffle the small (less than the size of the test and validation splits) clusters and then assign them.
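A minimal sketch of the proposed assignment, assuming the clusters are available as a mapping from cluster label to member indices (the names are illustrative, not the actual astartes code):

import numpy as np

def order_clusters_for_assignment(cluster_to_idxs, n_test, n_val, random_state=None):
    rng = np.random.default_rng(random_state)
    # sort smallest-to-largest as before...
    ordered = sorted(cluster_to_idxs, key=lambda c: len(cluster_to_idxs[c]))
    # ...but shuffle the clusters small enough to land in the test/val splits
    small = [c for c in ordered if len(cluster_to_idxs[c]) <= n_test + n_val]
    large = [c for c in ordered if len(cluster_to_idxs[c]) > n_test + n_val]
    rng.shuffle(small)
    return small + large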

Potentially add factory

One idea to avoid this long chain of if statements is to add a factory. An example is the ESS adapter and factory in RMG-Py's Arkane: at the bottom of each adapter we add a line to register it, and then the dispatch becomes a lookup like _registered_adapters[sampler_name] rather than a long chain of if/elif/elif statements.
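A minimal sketch of the registry idea (class and function names are illustrative, not the actual astartes code): each sampler registers itself once, and dispatch becomes a dictionary lookup.

_registered_samplers = {}

def register_sampler(name):
    def decorator(cls):
        _registered_samplers[name] = cls
        return cls
    return decorator

@register_sampler("kennard_stone")
class KennardStone:
    def __init__(self, X, y, labels, hopts):
        self.X, self.y, self.labels, self.hopts = X, y, labels, hopts

def get_sampler(name, X, y, labels, hopts):
    try:
        return _registered_samplers[name](X, y, labels, hopts)
    except KeyError:
        raise ValueError(f"Sampler '{name}' is not registered.")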

[FEATURE]: Mahalanobis Distance Kennard-Stone (MDKS) Sampler

Is your feature request related to a problem? Please describe.

Better interpolative splits, for artificial neural networks in particular, using Mahalanobis Distance Kennard-Stone (see the linked paper).

Use-cases/examples of this new feature

See the linked paper for specific examples; the method is reported to generally provide better data splits for ANN applications.

Desired solution/workflow

Using the base implementation of the Kennard-Stone sampler, implement this.

Discussion

Unfortunately, without rewriting the Mahalanobis distance method to accept pre-computed pairwise distances, this method will scale at least O(n^3).
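A rough sketch of the core computation (illustrative only, not the proposed astartes implementation): pairwise Mahalanobis distances followed by a standard greedy Kennard-Stone pass.

import numpy as np
from scipy.spatial.distance import cdist

def mahalanobis_kennard_stone(X, n_select):
    VI = np.linalg.pinv(np.cov(X, rowvar=False))   # inverse covariance matrix
    D = cdist(X, X, metric="mahalanobis", VI=VI)   # O(n^2) memory for the full matrix
    # start from the two most distant points, then greedily add the point whose
    # minimum distance to the already-selected set is largest (classic KS)
    selected = list(np.unravel_index(np.argmax(D), D.shape))
    while len(selected) < n_select:
        remaining = [i for i in range(len(X)) if i not in selected]
        min_d = D[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining[int(np.argmax(min_d))])
    return selected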

README Update

  • remove section about JOSS
  • revise section about interfaces to be only about molecules
  • create a table of implemented methods, including the source repository each implementation was retrieved from or depends on (some may need to be manually copied into astartes if they are not installable via PyPI)

[FEATURE]: Put in rdkit objects directly

Is your feature request related to a problem? Please describe.

I have a bunch of adsorbed species and can turn them directly into RDKit objects fairly easily. However, to make them into astartes/AIMSim inputs, I have to convert them back into SMILES strings and pass those in.

Use-cases/examples of this new feature

A periodic DFT adsorbed structure can be approximated with a chunk of the surface the adsorbate sits on plus the adsorbate; this gives very complicated SMILES strings, and rdkit has to sanitize them.
I can get my geometries into rdkit format just the way I need, but converting them to SMILES strings is a roundabout way to do analytics on them.

Desired solution/workflow

Unfortunately I have no idea what pseudocode would do this, but being able to put in a list of rdkit objects as input would be massively helpful.
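In the meantime, a user-side workaround sketch (not part of astartes): featurize the RDKit Mol objects directly and hand the resulting array to the generic train_test_split, skipping the SMILES round trip entirely.

import numpy as np
from rdkit.Chem import AllChem
from astartes import train_test_split

def mols_to_morgan_fps(mols, radius=2, n_bits=2048):
    # convert each Mol to a Morgan fingerprint bit vector, then stack into an array
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius, nBits=n_bits) for m in mols]
    return np.array(fps, dtype=int)

# X = mols_to_morgan_fps(my_mols)
# X_train, X_test = train_test_split(X, sampler="kennard_stone")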

Discussion

Additional context

Thank you!

Add Tests for Random Sampler

The random sampler has already been implemented; it still needs unit tests to verify that it works. This will require fixing the random seed when calling the sampler.
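A sketch of the kind of test intended here (the exact keyword arguments may differ in the version under development): fix the seed and check that the random split is reproducible.

import unittest
import numpy as np
from astartes import train_test_split

class TestRandomSampler(unittest.TestCase):
    def test_random_split_is_reproducible(self):
        X = np.arange(100).reshape(50, 2)
        first = train_test_split(X, sampler="random", random_state=42, return_indices=True)
        second = train_test_split(X, sampler="random", random_state=42, return_indices=True)
        # identical seeds should give identical index arrays
        for a, b in zip(first, second):
            np.testing.assert_array_equal(a, b)

if __name__ == "__main__":
    unittest.main()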

[BUG]: Outdated short description in `pyproject.toml`

Describe the bug

The pyproject.toml still refers to support for molecules, images, and arbitrary vectors, but support for images has since been removed.

Expected behavior

Remove the mention of images so that the package description shows correctly on PyPI.

[BUG]: Error instantiating train_test_split_molecules()

Describe the bug

Using the example train_test_split_molecules() settings in the readme, but commenting out the y=y (just want to get out group splits), I get the following error:
TypeError: train_test_split_molecules() got an unexpected keyword argument 'splitter'

Example(s)

smiles = list(open("smiles.txt"))

train_test_split_molecules(
    smiles=smiles,
    # y=y,
    test_size=0.2,
    train_size=0.8,
    fingerprint="daylight_fingerprint",
    fprints_hopts={
        "minPath": 2,
        "maxPath": 5,
        "fpSize": 200,
        "bitsPerHash": 4,
        "useHs": 1,
        "tgtDensity": 0.4,
        "minSize": 64,
    },
    splitter="random",
    hopts={
        "random_state": 42,
        "shuffle": True,
    },
)
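If this pre-release actually expects the same keyword as the generic train_test_split (i.e. sampler rather than splitter), the following rewrite of the call may work; this is an assumption about the intended API, not confirmed behavior:

from astartes.molecules import train_test_split_molecules

with open("smiles.txt") as f:
    smiles = [line.strip() for line in f]

train_test_split_molecules(
    smiles=smiles,
    test_size=0.2,
    train_size=0.8,
    fingerprint="daylight_fingerprint",
    fprints_hopts={
        "minPath": 2,
        "maxPath": 5,
        "fpSize": 200,
        "bitsPerHash": 4,
        "useHs": 1,
        "tgtDensity": 0.4,
        "minSize": 64,
    },
    sampler="random",  # assumed keyword, matching the generic train_test_split
    hopts={"random_state": 42, "shuffle": True},
)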

Expected behavior

Expected the call to return the train/test split.


Environment

  • python version: 3.11
  • package versions: aimsim 2.0.2, astartes 1.0.0b2
  • OS: Windows

Checklist

  • the unit tests are not working: pytest -v reports an error collecting the test session


[FEATURE]: Set a `DEFAULT_RANDOM_SEED` in `main.py`

Is your feature request related to a problem? Please describe.

The use of a random seed is not unified across astartes.

Use-cases/examples of this new feature

With this in place, we could exactly reproduce the sampling results between subsequent calls of astartes. This would mean that even the non-deterministic samplers would, by default, do the same thing every time, just like the deterministic ones do.

Desired solution/workflow

Set DEFAULT_RANDOM_SEED = 42 in main.py, override it in train_val_test_split if the user provides a value, and have everywhere else in astartes that uses randomness do from astartes import DEFAULT_RANDOM_SEED.
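A sketch of the proposed pattern (the constant name comes from this issue; the surrounding code is illustrative rather than the actual astartes source):

import numpy as np

DEFAULT_RANDOM_SEED = 42  # defined once in main.py

def train_val_test_split(X, random_state=None, **kwargs):
    # fall back to the package-wide default so even non-deterministic samplers
    # behave identically between calls unless the user overrides the seed
    if random_state is None:
        random_state = DEFAULT_RANDOM_SEED
    rng = np.random.default_rng(random_state)
    ...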

[FEATURE]: Multi-Fidelity Sampling

Is your feature request related to a problem? Please describe.

Given a dataset with both high and low fidelity data points, partition the dataset based on the fidelity (extrapolatively).

Switch packaging to `pyproject.toml`

setuptools seems to be moving in this direction, so we might as well get ahead of the curve. With this issue, we can also add an optional pip extra, 'molecules', which will give the user access to the molecules interface. We want to avoid including it in the base install since its dependencies are very version-strict and may interfere with other development environments.

Fix PR Review Action

The PR review GitHub Action currently annoys the user after every push. Borrow from the corresponding action in py2sambvca so that it only leaves comments after the PR is marked ready for review.

Discussion about standardizing data

Typical best practices in data science (as well as this example from sklearn) indicate that feature vectors should normally be z-scored before entering the DBSCAN algorithm. This seems sensible, and more generally it should probably be done before any distance or similarity metric is applied to create clusters for subsequent data splitting. Perhaps a future PR could add the option to z-score the data.

However, I'm also fine if we leave this to the user, since the goal of this repo is just to create data splits. To make this generalizable beyond cheminformatics, we're intentionally offloading the choice of featurization to users who are subject matter experts in their field and probably know more than we do about how to best represent their problem; for example, maybe some features are more important and z-scoring would obfuscate that.
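For reference, the user-side version of this preprocessing is a one-liner with scikit-learn; a minimal sketch (shown here with the Kennard-Stone sampler, which is also distance-based, so the same reasoning applies):

import numpy as np
from sklearn.preprocessing import StandardScaler
from astartes import train_test_split

X = np.random.rand(200, 8)
# z-score the features so no single column dominates the distance metric
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test = train_test_split(X_scaled, sampler="kennard_stone")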

[BUG]: train_test_split raises KeyError when X is a pd.DataFrame

Describe the bug

train_test_split raises a KeyError when X is a pd.DataFrame.

Example(s)

from astartes import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    sampler="kennard_stone",
)

get:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File <timed exec>:4

File ~/mambaforge/lib/python3.11/site-packages/astartes/main.py:114, in train_test_split(X, y, labels, train_size, test_size, sampler, random_state, hopts, return_indices)
     87 def train_test_split(
     88     X: np.array,
     89     y: np.array = None,
   (...)
     96     return_indices: bool = False,
     97 ):
     98     """Deterministic train_test_splitting of arbitrary arrays.
     99 
    100     Args:
   (...)
    112         np.array: X, y, and labels train/val/test data, or indices.
    113     """
--> 114     return train_val_test_split(
    115         X,
    116         y,
    117         labels,
    118         train_size,
    119         0,
    120         test_size,
    121         sampler,
    122         random_state,
    123         hopts,
    124         return_indices,
    125     )

File ~/mambaforge/lib/python3.11/site-packages/astartes/main.py:69, in train_val_test_split(X, y, labels, train_size, val_size, test_size, sampler, random_state, hopts, return_indices)
     64 sampler_instance = sampler_factory.get_sampler(X, y, labels, hopts)
     66 if sampler in (*IMPLEMENTED_INTERPOLATION_SAMPLERS, "time_based"):
     67     # time_based does extrapolation but does not support random_state
     68     # because it always sorts in time order
---> 69     return _interpolative_sampling(
     70         sampler_instance,
     71         test_size,
     72         val_size,
     73         train_size,
     74         return_indices,
     75     )
     76 else:
     77     return _extrapolative_sampling(
     78         sampler_instance,
     79         test_size,
   (...)
     83         random_state,
     84     )

File ~/mambaforge/lib/python3.11/site-packages/astartes/main.py:228, in _interpolative_sampling(sampler_instance, test_size, val_size, train_size, return_indices)
    223 test_idxs = sampler_instance.get_sample_idxs(n_test_samples)
    225 _check_actual_split(
    226     train_idxs, val_idxs, test_idxs, train_size, val_size, test_size
    227 )
--> 228 return _return_helper(
    229     sampler_instance, train_idxs, val_idxs, test_idxs, return_indices
    230 )

File ~/mambaforge/lib/python3.11/site-packages/astartes/main.py:253, in _return_helper(sampler_instance, train_idxs, val_idxs, test_idxs, return_indices)
    240 """Convenience function to return the requested arrays appropriately.
    241 
    242 Args:
   (...)
    250     np.array: Either many arrays or indices in arrays.
    251 """
    252 out = []
--> 253 X_train = sampler_instance.X[train_idxs]
    254 out.append(X_train)
    255 if len(val_idxs):

File ~/mambaforge/lib/python3.11/site-packages/pandas/core/frame.py:3767, in DataFrame.__getitem__(self, key)
   3765     if is_iterator(key):
   3766         key = list(key)
-> 3767     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   3769 # take() does not accept boolean indexers
   3770 if getattr(indexer, "dtype", None) == bool:

File ~/mambaforge/lib/python3.11/site-packages/pandas/core/indexes/base.py:5876, in Index._get_indexer_strict(self, key, axis_name)
   5873 else:
   5874     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 5876 self._raise_if_missing(keyarr, indexer, axis_name)
   5878 keyarr = self.take(indexer)
   5879 if isinstance(key, Index):
   5880     # GH 42790 - Preserve name from an Index

File ~/mambaforge/lib/python3.11/site-packages/pandas/core/indexes/base.py:5935, in Index._raise_if_missing(self, key, indexer, axis_name)
   5933     if use_interval_msg:
   5934         key = list(key)
-> 5935     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   5937 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
   5938 raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index([   0,  106,  768, 1857, 1136,  925, 1276,  121, 1205, 1278,\n       ...\n        609,  893, 1738, 1661, 1590, 1630,  302, 1768, 1876,  952],\n      dtype='int64', length=1525)] are in the [columns]"

Expected behavior

The split goes well without error.
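A workaround sketch until DataFrames are handled natively (return shapes may differ between versions): pass the underlying numpy arrays instead of the DataFrame.

import numpy as np
import pandas as pd
from astartes import train_test_split

X = pd.DataFrame(np.random.rand(20, 3), columns=["a", "b", "c"])
y = pd.Series(np.random.rand(20))

# converting to numpy sidesteps the column-based indexing that raises the KeyError
X_train, X_test, y_train, y_test = train_test_split(
    X.to_numpy(),
    y.to_numpy(),
    sampler="kennard_stone",
)

Alternatively, return_indices=True plus DataFrame.iloc keeps the original index (see the related feature request further down).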


Environment

  • python version: 3.11
  • package versions: astartes 1.0.0
  • OS: Arch Linux

Checklist

  • all dependencies are satisfied: conda list shows the packages listed in the README
  • the unit tests are working: pytest -v reports no errors


AbstractSampler _sorted_cluster_counter

The KMeans sampler currently holds the implementation of the Counter sorting; this should probably be moved up into AbstractSampler, since it may be the same for other clustering approaches.
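A sketch of what the shared helper might look like (illustrative, not the current KMeans code): count cluster membership and sort clusters by size so the assignment step can fill test, validation, and training sets from smallest to largest.

from collections import Counter

def _sorted_cluster_counter(cluster_labels):
    counts = Counter(cluster_labels)
    return dict(sorted(counts.items(), key=lambda kv: kv[1]))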

[FEATURE]: Return pd.DataFrame if X is DataFrame after split

Is your feature request related to a problem? Please describe.

If X is a DataFrame with a meaningful index, X_train and X_test are returned as ndarrays without the original index.

Use-cases/examples of this new feature

Return X_train and X_test as pd.DataFrame, and y_train and y_test as pd.Series (or DataFrame, depending on what y is).

Desired solution/workflow

Just like this package: https://github.com/yu9824/kennard_stone
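One possible shape for the change, sketched under the assumption that the samplers already produce index arrays internally (names are illustrative, not the astartes implementation):

import pandas as pd

def _maybe_rewrap(original, split_array, idxs):
    # if the user passed pandas objects, slice them by position to keep the index
    if isinstance(original, (pd.DataFrame, pd.Series)):
        return original.iloc[idxs]
    return split_array  # plain numpy in, plain numpy out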

Discussion

Additional context
Thank you!

SPXY

Add tests and implementation

Implement train_val_split

Implement a function, likely in astartes.py, that first calls train_test_split and then calls it again on the resulting training data to generate a validation set. train_val_split should accept three floats for the training, validation, and testing split sizes.
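A minimal sketch of the two-pass approach described above (the eventual astartes API may differ; this assumes y is provided so four arrays come back from each call):

from astartes import train_test_split

def train_val_test_split(X, y, train_size=0.8, val_size=0.1, test_size=0.1, **kwargs):
    # first pass: carve off the test set
    X_tv, X_test, y_tv, y_test = train_test_split(
        X, y, train_size=train_size + val_size, test_size=test_size, **kwargs
    )
    # second pass: split the remainder into train and validation,
    # rescaling val_size relative to what is left
    rel_val = val_size / (train_size + val_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_tv, y_tv, train_size=1 - rel_val, test_size=rel_val, **kwargs
    )
    return X_train, X_val, X_test, y_train, y_val, y_test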

Add 'Rules of Engagement'

  • Contributor guide in README, including a tutorial for installing the developer version
  • PR Template
  • Issue Template
  • Add development requirements (linting) to a pip install extra for developers, e.g. a dev extra

Remove support for abstract featurization

Want to target cheminformatics in particular -- other users will already have a featurization pipeline and can pass their feature vectors through to train_test_split directly.

Move train_test_split_molecules up a level (potentially into astartes.py) and delete the interface directory.

[BUG]: from astartes import train_test_split raises ModuleNotFoundError: No module named 'astartes.samplers'

Describe the bug

Upon first startup, using the installation:
pip install --pre astartes

from astartes import train_test_split

Returns:
ModuleNotFoundError: No module named 'astartes.samplers'

Example(s)

Happens for both
from astartes import ...
and
from astartes.molecules import ...

Expected behavior

Expected the import to succeed.


Environment

  • python version
  • package versions: conda list or pip list
  • OS

Checklist

  • all dependencies are satisfied: conda list shows the packages listed in the README
  • the unit tests are working: pytest -v reports no errors

Additional context
Observed on Python 3.11 and 3.9; pytest -v does not pass.
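A common cause of "No module named 'astartes.samplers'" after a pip install is that subpackages were omitted from the built distribution. If that is the problem here, a setuptools configuration along these lines would include them; this is a sketch, not the actual astartes packaging:

from setuptools import setup, find_packages

setup(
    name="astartes",
    packages=find_packages(),  # picks up astartes.samplers, astartes.molecules, ...
)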

[QUESTION]: Way to get the inputs back out from train_test_split_molecules?

What are you trying to do?

Using train_test_split_molecules, I want to get back both the indices and the Morgan fingerprints.

I put in my list of SMILES strings, which get converted into molecular descriptors and split, and I get back X_train and X_test as two arrays of Morgan fingerprints.

Is there a flag I'm missing with which I can get the SMILES strings back?

Previous attempts

I can use return_indices=True to get the indices, but then I lose the Morgan fingerprint conversion.

Is there a flag I'm missing with which I can get both?
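A sketch of one approach, assuming train_test_split_molecules forwards return_indices like the generic train_test_split does (the exact return signature may vary by version): request the indices, then slice the original SMILES list with them; the fingerprints for exactly those molecules can be regenerated or kept from a prior featurization.

from astartes.molecules import train_test_split_molecules

smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCCC"]
train_idxs, test_idxs = train_test_split_molecules(
    smiles=smiles,
    sampler="random",
    return_indices=True,
)
# recover the SMILES corresponding to each split
smiles_train = [smiles[i] for i in train_idxs]
smiles_test = [smiles[i] for i in test_idxs]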
