jacksonburns / astartes

Better Data Splits for Machine Learning

Home Page: https://jacksonburns.github.io/astartes/

License: MIT License

Python 100.00%
ai ml data-science machine-learning sampling python

astartes's Introduction

Welcome to my GitHub!

I dug this comment up while attempting to compile an ancient computational chemistry program written in Fortran 77:

The following notes are discoveries made while attempting to figure out how this mess works. They are here so that you, poor soul that you are, don't have to reinvent the wheel (well, at least not the whole thing).

Hopefully my projects don't cause you anywhere near this level of frustration.

Jackson Burns's GitHub stats

astartes's People

Contributors

himaghna, jacksonburns, kspieks


Forkers

himaghna hfooladi

astartes's Issues

[BUG]: Repository Configuration Issues

Describe the bug

Two small things:

  1. The "Question" issue template still says Chemprop in the title
  2. The Update JOSS Paper action should be run on pushes to main in addition to its current configuration

[BUG]: Scaffold sampler inaccessible from `train_(val_)test_split_molecules`

Describe the bug

The astartes.molecules functions for splitting automatically featurize the input SMILES, which raises exceptions when attempting to use the scaffold sampler.

Expected behavior

train_val_test_split_molecules should skip the AIMSim featurization step when the scaffold sampler is used.
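A minimal sketch of the proposed fix; featurize_with_aimsim is a hypothetical stand-in for the actual AIMSim featurization call, and the other names are illustrative rather than the real astartes internals:

from astartes import train_val_test_split

def train_val_test_split_molecules(smiles, sampler="random", **hopts):
    if sampler == "scaffold":
        # the scaffold sampler works on the molecules themselves, so skip featurization
        X = smiles
    else:
        X = featurize_with_aimsim(smiles)  # hypothetical AIMSim featurization helper
    return train_val_test_split(X, sampler=sampler, hopts=hopts)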

[FEATURE]: Add some randomness to split assignments of clusters

Is your feature request related to a problem? Please describe.

Currently the clusters resulting from an extrapolative sampling algorithm are sorted from smallest to largest and assigned into the testing, validation, and training sets without any randomness being possible in the assignments. This is fine, but not useful for stratified sampling.

Use-cases/examples of this new feature

Stratified sampling will require some randomness in order to function. This change would also let us make random_state a kwarg in train_val_test_split and train_test_split, which is nice for interoperability, since all samplers (except Kennard-Stone) would then use random_state.

Desired solution/workflow

Shuffle the small (less than the size of the test and validation splits) clusters and then assign them.
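A minimal sketch of the proposed assignment, assuming the clusters are available as a mapping from cluster label to member indices (the names are illustrative, not the actual astartes code):

import numpy as np

def order_clusters_for_assignment(cluster_to_idxs, n_test, n_val, random_state=None):
    rng = np.random.default_rng(random_state)
    # sort smallest-to-largest as before...
    ordered = sorted(cluster_to_idxs, key=lambda c: len(cluster_to_idxs[c]))
    # ...but shuffle the clusters small enough to land in the test/val splits
    small = [c for c in ordered if len(cluster_to_idxs[c]) <= n_test + n_val]
    large = [c for c in ordered if len(cluster_to_idxs[c]) > n_test + n_val]
    rng.shuffle(small)
    return small + large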

Potentially add factory

One idea to avoid this long chain of if statements is to add a factory. An example is the ESS adapter and factory in RMG-Py's Arkane: at the bottom of each adapter we add a line to register it, and then the dispatch becomes a lookup like _registered_adapters[sampler_name] rather than a long chain of if/elif/elif statements.
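A minimal sketch of the registry idea (class and function names are illustrative, not the actual astartes code): each sampler registers itself once, and dispatch becomes a dictionary lookup.

_registered_samplers = {}

def register_sampler(name):
    def decorator(cls):
        _registered_samplers[name] = cls
        return cls
    return decorator

@register_sampler("kennard_stone")
class KennardStone:
    def __init__(self, X, y, labels, hopts):
        self.X, self.y, self.labels, self.hopts = X, y, labels, hopts

def get_sampler(name, X, y, labels, hopts):
    try:
        return _registered_samplers[name](X, y, labels, hopts)
    except KeyError:
        raise ValueError(f"Sampler '{name}' is not registered.")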

[FEATURE]: Mahalanobis Distance Kennard-Stone (MDKS) Sampler

Is your feature request related to a problem? Please describe.

Better interpolative splits, for artificial neural networks in particular, using Mahalanobis Distance Kennard-Stone (see the linked paper).

Use-cases/examples of this new feature

See the linked paper for specific examples; the method is reported to generally provide better data splits for ANN applications.

Desired solution/workflow

Using the base implementation of the Kennard-Stone sampler, implement this.

Discussion

Unfortunately, without rewriting the Mahalanobis distance method to accept pre-computed pairwise distances, this method will scale at least O(n^3).
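A rough sketch of the core computation (illustrative only, not the proposed astartes implementation): pairwise Mahalanobis distances followed by a standard greedy Kennard-Stone pass.

import numpy as np
from scipy.spatial.distance import cdist

def mahalanobis_kennard_stone(X, n_select):
    VI = np.linalg.pinv(np.cov(X, rowvar=False))   # inverse covariance matrix
    D = cdist(X, X, metric="mahalanobis", VI=VI)   # O(n^2) memory for the full matrix
    # start from the two most distant points, then greedily add the point whose
    # minimum distance to the already-selected set is largest (classic KS)
    selected = list(np.unravel_index(np.argmax(D), D.shape))
    while len(selected) < n_select:
        remaining = [i for i in range(len(X)) if i not in selected]
        min_d = D[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining[int(np.argmax(min_d))])
    return selected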

README Update

  • remove section about JOSS
  • revise section about interfaces to be only about molecules
  • create a table of implemented methods, including the source repository each implementation was retrieved from or depends on (some may need to be manually copied into astartes if they are not installable via PyPI)

[FEATURE]: Put in rdkit objects directly

Is your feature request related to a problem? Please describe.

I have a bunch of adsorbed species and can turn them directly into RDKit objects fairly easily. However, to make them into astartes/AIMSim inputs, I have to convert them back into SMILES strings and pass those in.

Use-cases/examples of this new feature

A periodic DFT adsorbed structure can be approximated with a chunk of the surface the adsorbate sits on plus the adsorbate; this gives very complicated SMILES strings, and rdkit has to sanitize them.
I can get my geometries into rdkit format just the way I need, but converting them to SMILES strings is a roundabout way to do analytics on them.

Desired solution/workflow

Unfortunately I have no idea what pseudocode would do this, but being able to put in a list of rdkit objects as input would be massively helpful.
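In the meantime, a user-side workaround sketch (not part of astartes): featurize the RDKit Mol objects directly and hand the resulting array to the generic train_test_split, skipping the SMILES round trip entirely.

import numpy as np
from rdkit.Chem import AllChem
from astartes import train_test_split

def mols_to_morgan_fps(mols, radius=2, n_bits=2048):
    # convert each Mol to a Morgan fingerprint bit vector, then stack into an array
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius, nBits=n_bits) for m in mols]
    return np.array(fps, dtype=int)

# X = mols_to_morgan_fps(my_mols)
# X_train, X_test = train_test_split(X, sampler="kennard_stone")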

Discussion

Additional context

Thank you!

Add Tests for Random Sampler

The random sampler has already been implemented; it still needs unit tests to verify that it works. This will require fixing the random seed when calling the sampler.
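A sketch of the kind of test intended here (the exact keyword arguments may differ in the version under development): fix the seed and check that the random split is reproducible.

import unittest
import numpy as np
from astartes import train_test_split

class TestRandomSampler(unittest.TestCase):
    def test_random_split_is_reproducible(self):
        X = np.arange(100).reshape(50, 2)
        first = train_test_split(X, sampler="random", random_state=42, return_indices=True)
        second = train_test_split(X, sampler="random", random_state=42, return_indices=True)
        # identical seeds should give identical index arrays
        for a, b in zip(first, second):
            np.testing.assert_array_equal(a, b)

if __name__ == "__main__":
    unittest.main()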

[BUG]: Outdated short description in `pyproject.toml`

Describe the bug

The pyproject.toml still refers to support for molecules, images, and arbitrary vectors, but support for images has since been removed.

Expected behavior

Remove the mention of images so that the package description shows correctly on PyPI.

[BUG]: Error instantiating train_test_split_molecules()

Describe the bug

Using the example train_test_split_molecules() settings in the readme, but commenting out the y=y (just want to get out group splits), I get the following error:
TypeError: train_test_split_molecules() got an unexpected keyword argument 'splitter'

Example(s)

smiles = list(open("smiles.txt"))

train_test_split_molecules(
    smiles=smiles,
    # y=y,
    test_size=0.2,
    train_size=0.8,
    fingerprint="daylight_fingerprint",
    fprints_hopts={
        "minPath": 2,
        "maxPath": 5,
        "fpSize": 200,
        "bitsPerHash": 4,
        "useHs": 1,
        "tgtDensity": 0.4,
        "minSize": 64,
    },
    splitter="random",
    hopts={
        "random_state": 42,
        "shuffle": True,
    },
)
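If this pre-release actually expects the same keyword as the generic train_test_split (i.e. sampler rather than splitter), the following rewrite of the call may work; this is an assumption about the intended API, not confirmed behavior:

from astartes.molecules import train_test_split_molecules

with open("smiles.txt") as f:
    smiles = [line.strip() for line in f]

train_test_split_molecules(
    smiles=smiles,
    test_size=0.2,
    train_size=0.8,
    fingerprint="daylight_fingerprint",
    fprints_hopts={
        "minPath": 2,
        "maxPath": 5,
        "fpSize": 200,
        "bitsPerHash": 4,
        "useHs": 1,
        "tgtDensity": 0.4,
        "minSize": 64,
    },
    sampler="random",  # assumed keyword, matching the generic train_test_split
    hopts={"random_state": 42, "shuffle": True},
)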

Expected behavior

Expected the call to return the train/test split.


Environment

  • python version: 3.11
  • package versions: aimsim 2.0.2, astartes 1.0.0b2
  • OS: Windows

Checklist

  • the unit tests are not working: pytest -v reports an error collecting the test session


[FEATURE]: Set a `DEFAULT_RANDOM_SEED` in `main.py`

Is your feature request related to a problem? Please describe.

The use of a random seed is not unified across astartes.

Use-cases/examples of this new feature

With this in place, we could exactly reproduce the sampling results between subsequent calls of astartes. This would mean that even the non-deterministic samplers would, by default, do the same thing every time, just like the deterministic ones do.

Desired solution/workflow

Set DEFAULT_RANDOM_SEED = 42 in main.py, override it in train_val_test_split if the user provides a value, and have everywhere else in astartes that uses randomness do from astartes import DEFAULT_RANDOM_SEED.
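A sketch of the proposed pattern (the constant name comes from this issue; the surrounding code is illustrative rather than the actual astartes source):

import numpy as np

DEFAULT_RANDOM_SEED = 42  # defined once in main.py

def train_val_test_split(X, random_state=None, **kwargs):
    # fall back to the package-wide default so even non-deterministic samplers
    # behave identically between calls unless the user overrides the seed
    if random_state is None:
        random_state = DEFAULT_RANDOM_SEED
    rng = np.random.default_rng(random_state)
    ...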

[FEATURE]: Multi-Fidelity Sampling

Is your feature request related to a problem? Please describe.

Given a dataset with both high and low fidelity data points, partition the dataset based on the fidelity (extrapolatively).

Switch packaging to `pyproject.toml`

setuptools seems to be moving in this direction, so we might as well get ahead of the curve. With this issue, we can also add an optional pip extra, 'molecules', which will give the user access to the molecules interface. We want to avoid including it in the base install since its dependencies are very version-strict and may interfere with other development environments.

Fix PR Review Action

The PR review GitHub Action currently annoys the user after every push. Borrow from the corresponding action in py2sambvca so that it only leaves comments after the PR is marked ready for review.

Discussion about standardizing data

Typical best practices in data science (as well as this example from sklearn) indicate that feature vectors should normally be z-scored before entering the DBSCAN algorithm. This seems sensible, and more generally it should probably be done before any distance or similarity metric is applied to create clusters for subsequent data splitting. Perhaps a future PR could add the option to z-score the data.

However, I'm also fine if we leave this to the user, since the goal of this repo is just to create data splits. To make this generalizable beyond cheminformatics, we're intentionally offloading the choice of featurization to users who are subject matter experts in their field and probably know more than we do about how to best represent their problem; for example, maybe some features are more important and z-scoring would obfuscate that.
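For reference, the user-side version of this preprocessing is a one-liner with scikit-learn; a minimal sketch (shown here with the Kennard-Stone sampler, which is also distance-based, so the same reasoning applies):

import numpy as np
from sklearn.preprocessing import StandardScaler
from astartes import train_test_split

X = np.random.rand(200, 8)
# z-score the features so no single column dominates the distance metric
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test = train_test_split(X_scaled, sampler="kennard_stone")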

[BUG]: train_test_split raises KeyError when X is a pd.DataFrame

Describe the bug

train_test_split raises a KeyError when X is a pd.DataFrame.

Example(s)

from astartes import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    sampler="kennard_stone",
)

get:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File <timed exec>:4

File ~/mambaforge/lib/python3.11/site-packages/astartes/main.py:114, in train_test_split(X, y, labels, train_size, test_size, sampler, random_state, hopts, return_indices)
     87 def train_test_split(
     88     X: np.array,
     89     y: np.array = None,
   (...)
     96     return_indices: bool = False,
     97 ):
     98     """Deterministic train_test_splitting of arbitrary arrays.
     99 
    100     Args:
   (...)
    112         np.array: X, y, and labels train/val/test data, or indices.
    113     """
--> 114     return train_val_test_split(
    115         X,
    116         y,
    117         labels,
    118         train_size,
    119         0,
    120         test_size,
    121         sampler,
    122         random_state,
    123         hopts,
    124         return_indices,
    125     )

File ~/mambaforge/lib/python3.11/site-packages/astartes/main.py:69, in train_val_test_split(X, y, labels, train_size, val_size, test_size, sampler, random_state, hopts, return_indices)
     64 sampler_instance = sampler_factory.get_sampler(X, y, labels, hopts)
     66 if sampler in (*IMPLEMENTED_INTERPOLATION_SAMPLERS, "time_based"):
     67     # time_based does extrapolation but does not support random_state
     68     # because it always sorts in time order
---> 69     return _interpolative_sampling(
     70         sampler_instance,
     71         test_size,
     72         val_size,
     73         train_size,
     74         return_indices,
     75     )
     76 else:
     77     return _extrapolative_sampling(
     78         sampler_instance,
     79         test_size,
   (...)
     83         random_state,
     84     )

File ~/mambaforge/lib/python3.11/site-packages/astartes/main.py:228, in _interpolative_sampling(sampler_instance, test_size, val_size, train_size, return_indices)
    223 test_idxs = sampler_instance.get_sample_idxs(n_test_samples)
    225 _check_actual_split(
    226     train_idxs, val_idxs, test_idxs, train_size, val_size, test_size
    227 )
--> 228 return _return_helper(
    229     sampler_instance, train_idxs, val_idxs, test_idxs, return_indices
    230 )

File ~/mambaforge/lib/python3.11/site-packages/astartes/main.py:253, in _return_helper(sampler_instance, train_idxs, val_idxs, test_idxs, return_indices)
    240 """Convenience function to return the requested arrays appropriately.
    241 
    242 Args:
   (...)
    250     np.array: Either many arrays or indices in arrays.
    251 """
    252 out = []
--> 253 X_train = sampler_instance.X[train_idxs]
    254 out.append(X_train)
    255 if len(val_idxs):

File ~/mambaforge/lib/python3.11/site-packages/pandas/core/frame.py:3767, in DataFrame.__getitem__(self, key)
   3765     if is_iterator(key):
   3766         key = list(key)
-> 3767     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   3769 # take() does not accept boolean indexers
   3770 if getattr(indexer, "dtype", None) == bool:

File ~/mambaforge/lib/python3.11/site-packages/pandas/core/indexes/base.py:5876, in Index._get_indexer_strict(self, key, axis_name)
   5873 else:
   5874     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 5876 self._raise_if_missing(keyarr, indexer, axis_name)
   5878 keyarr = self.take(indexer)
   5879 if isinstance(key, Index):
   5880     # GH 42790 - Preserve name from an Index

File ~/mambaforge/lib/python3.11/site-packages/pandas/core/indexes/base.py:5935, in Index._raise_if_missing(self, key, indexer, axis_name)
   5933     if use_interval_msg:
   5934         key = list(key)
-> 5935     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   5937 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
   5938 raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index([   0,  106,  768, 1857, 1136,  925, 1276,  121, 1205, 1278,\n       ...\n        609,  893, 1738, 1661, 1590, 1630,  302, 1768, 1876,  952],\n      dtype='int64', length=1525)] are in the [columns]"

Expected behavior

The split goes well without error.
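A workaround sketch until DataFrames are handled natively (return shapes may differ between versions): pass the underlying numpy arrays instead of the DataFrame.

import numpy as np
import pandas as pd
from astartes import train_test_split

X = pd.DataFrame(np.random.rand(20, 3), columns=["a", "b", "c"])
y = pd.Series(np.random.rand(20))

# converting to numpy sidesteps the column-based indexing that raises the KeyError
X_train, X_test, y_train, y_test = train_test_split(
    X.to_numpy(),
    y.to_numpy(),
    sampler="kennard_stone",
)

Alternatively, return_indices=True plus DataFrame.iloc keeps the original index (see the related feature request further down).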


Environment

  • python version: 3.11
  • package versions: astartes 1.0.0
  • OS: Arch Linux

Checklist

  • all dependencies are satisfied: conda list shows the packages listed in the README
  • the unit tests are working: pytest -v reports no errors


AbstractSampler _sorted_cluster_counter

The KMeans sampler currently holds the implementation of the Counter sorting; this should probably be moved up into AbstractSampler, since it may be the same for other clustering approaches.
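A sketch of what the shared helper might look like (illustrative, not the current KMeans code): count cluster membership and sort clusters by size so the assignment step can fill test, validation, and training sets from smallest to largest.

from collections import Counter

def _sorted_cluster_counter(cluster_labels):
    counts = Counter(cluster_labels)
    return dict(sorted(counts.items(), key=lambda kv: kv[1]))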

[FEATURE]: Return pd.DataFrame if X is DataFrame after split

Is your feature request related to a problem? Please describe.

If X is a DataFrame with a meaningful index, X_train and X_test are returned as ndarrays without the original index.

Use-cases/examples of this new feature

Return X_train and X_test as pd.DataFrame, and y_train and y_test as pd.Series (or DataFrame, depending on what y is).

Desired solution/workflow

Just like this package: https://github.com/yu9824/kennard_stone
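One possible shape for the change, sketched under the assumption that the samplers already produce index arrays internally (names are illustrative, not the astartes implementation):

import pandas as pd

def _maybe_rewrap(original, split_array, idxs):
    # if the user passed pandas objects, slice them by position to keep the index
    if isinstance(original, (pd.DataFrame, pd.Series)):
        return original.iloc[idxs]
    return split_array  # plain numpy in, plain numpy out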

Discussion

Additional context
Thank you!

SPXY

Add tests and implementation

Implement train_val_split

Implement a function, likely in astartes.py, that first calls train_test_split and then calls it again on the resulting training data to generate a validation set. train_val_split should accept three floats for the training, validation, and testing split sizes.
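A minimal sketch of the two-pass approach described above (the eventual astartes API may differ; this assumes y is provided so four arrays come back from each call):

from astartes import train_test_split

def train_val_test_split(X, y, train_size=0.8, val_size=0.1, test_size=0.1, **kwargs):
    # first pass: carve off the test set
    X_tv, X_test, y_tv, y_test = train_test_split(
        X, y, train_size=train_size + val_size, test_size=test_size, **kwargs
    )
    # second pass: split the remainder into train and validation,
    # rescaling val_size relative to what is left
    rel_val = val_size / (train_size + val_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_tv, y_tv, train_size=1 - rel_val, test_size=rel_val, **kwargs
    )
    return X_train, X_val, X_test, y_train, y_val, y_test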

Add 'Rules of Engagement'

  • Contributor guide in README, including a tutorial for installing the developer version
  • PR Template
  • Issue Template
  • Add development requirements (linting) to a pip install extra for developers, e.g. a dev extra

Remove support for abstract featurization

Want to target cheminformatics in particular -- other users will already have a featurization pipeline and can pass their feature vectors through to train_test_split directly.

Move train_test_split_molecules up a level (potentially into astartes.py) and delete the interface directory.

[BUG]: from astartes import train_test_split raises ModuleNotFoundError: No module named 'astartes.samplers'

Describe the bug

Upon first startup, using the installation:
pip install --pre astartes

from astartes import train_test_split

Returns:
ModuleNotFoundError: No module named 'astartes.samplers'

Example(s)

Happens for both
from astartes import ...
and
from astartes.molecules import ...

Expected behavior

Expected the import to succeed.


Environment

  • python version
  • package versions: conda list or pip list
  • OS

Checklist

  • all dependencies are satisfied: conda list shows the packages listed in the README
  • the unit tests are working: pytest -v reports no errors

Additional context
Observed on Python 3.11 and 3.9; pytest -v does not pass.
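A common cause of "No module named 'astartes.samplers'" after a pip install is that subpackages were omitted from the built distribution. If that is the problem here, a setuptools configuration along these lines would include them; this is a sketch, not the actual astartes packaging:

from setuptools import setup, find_packages

setup(
    name="astartes",
    packages=find_packages(),  # picks up astartes.samplers, astartes.molecules, ...
)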

[QUESTION]: Way to get the inputs back out from train_test_split_molecules?

What are you trying to do?

Using train_test_split_molecules, I want to get back both the indices and the Morgan fingerprints.

I put in my list of SMILES strings, which get converted into molecular descriptors and split, and I get back X_train and X_test as two arrays of Morgan fingerprints.

Is there a flag I'm missing with which I can get the SMILES strings back?

Previous attempts

I can use return_indices=True to get the indices, but then I lose the Morgan fingerprint conversion.

Is there a flag I'm missing with which I can get both?
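A sketch of one approach, assuming train_test_split_molecules forwards return_indices like the generic train_test_split does (the exact return signature may vary by version): request the indices, then slice the original SMILES list with them; the fingerprints for exactly those molecules can be regenerated or kept from a prior featurization.

from astartes.molecules import train_test_split_molecules

smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCCC"]
train_idxs, test_idxs = train_test_split_molecules(
    smiles=smiles,
    sampler="random",
    return_indices=True,
)
# recover the SMILES corresponding to each split
smiles_train = [smiles[i] for i in train_idxs]
smiles_test = [smiles[i] for i in test_idxs]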
