kjappelbaum / mofdscribe

An ecosystem for digital reticular chemistry

Home Page: https://mofdscribe.readthedocs.io/en/latest/

License: MIT License

Python 68.98% HTML 31.02%

artificial-intelligence descriptors featurization machine-learning mof porous-materials benchmark data-science metrics reticular-chemistry

mofdscribe's Introduction

Featurizing metal-organic frameworks (MOFs) made simple! This package builds on the power of matminer to make featurization of MOFs as easy as possible. Now, you can use features that are mostly used for porous materials in the same way as all other matminer featurizers. mofdscribe additionally includes routines that help with model validation.

💪 Getting Started

from mofdscribe.featurizers.chemistry import RACS
from pymatgen.core import Structure

structure = Structure.from_file("my_cif.cif")  # path to your own CIF file
featurizer = RACS()
racs_features = featurizer.featurize(structure)

🚀 Installation

We are still working on making mofdscribe run on all operating systems (we are waiting for conda recipes to be merged). For now, installation is not easy on Windows, and there might be issues on ARM-based Macs. For this reason, we recommend installing mofdscribe on a UNIX machine.

To install in development mode, use the following:

git clone https://github.com/kjappelbaum/mofdscribe.git
cd mofdscribe
pip install -e .

If you want to use all utilities, you can use the all extra: pip install -e ".[all]"

We depend on many other external tools. Most external tools are automatically installed if you install mofdscribe via conda:

conda install -c conda-forge mofdscribe

👐 Contributing

Contributions, whether filing an issue, making a pull request, or forking, are appreciated. See CONTRIBUTING.rst for more information on getting involved.

👋 Attribution

⚖️ License

The code in this package is licensed under the MIT License.

📖 Citation

See the ChemRxiv preprint.

@article{Jablonka_2022,
    doi = {10.26434/chemrxiv-2022-4g7rx},
    url = {https://doi.org/10.26434%2Fchemrxiv-2022-4g7rx},
    year = 2022,
    month = {sep},
    publisher = {American Chemical Society ({ACS})},
    author = {Kevin Maik Jablonka and Andrew S. Rosen and Aditi S. Krishnapriyan and Berend Smit},
    title = {An ecosystem for digital reticular chemistry}
}

💰 Funding

The research was supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement 666983, MaGic), by the NCCR-MARVEL, funded by the Swiss National Science Foundation, and by the Swiss National Science Foundation (SNSF) under Grant 200021_172759.

🍪 Cookiecutter

This package was created with @audreyfeldroy's cookiecutter package using @cthoyt's cookiecutter-snekpack template.

🛠️ For Developers

See developer instructions

This final section of the README is for those who want to get involved by making a code contribution.

❓ Testing

After cloning the repository and installing tox with pip install tox, the unit tests in the tests/ folder can be run reproducibly with:

tox

Additionally, these tests are automatically re-run with each commit in a GitHub Action.

📦 Making a Release

After installing the package in development mode and installing tox with pip install tox, the commands for making a new release are contained within the finish environment in tox.ini. Run the following from the shell:

tox -e finish

This script does the following:

Uses BumpVersion to switch the version number in setup.cfg and src/mofdscribe/version.py to not have the -dev suffix
Packages the code in both a tar archive and a wheel
Uploads to PyPI using twine. Be sure to have a .pypirc file configured to avoid the need for manual input at this step
Pushes to GitHub. You'll need to make a release that goes with the commit where the version was bumped.
Bumps the version to the next patch. If you made big changes and want to bump the version by minor, you can run tox -e bumpversion minor afterwards.


mofdscribe's Issues

datasets are currently quite small - probably limited by the dates

how to handle datasets that might not fit into memory

If I implement the splitting methods in the base class, we assume (using sklearn's train/test split) that the datasets fit into memory.

However, this might not be the case for all datasets in the future.

One option would be for those future datasets to override the methods that might run out of memory. Another option might be to have different base classes, with the default one being an InMemoryDataset.
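A minimal sketch of the second option, assuming that splitting only needs structure indices and never the structures themselves. Apart from InMemoryDataset, every name here (AbstractDataset, iter_indices, get_structures) is hypothetical and not part of the package.

```python
import random
from abc import ABC, abstractmethod
from typing import Iterator, List, Sequence, Tuple


class AbstractDataset(ABC):
    """Hypothetical base class: declares only what every dataset must offer."""

    @abstractmethod
    def __len__(self) -> int:
        ...

    @abstractmethod
    def iter_indices(self) -> Iterator[int]:
        """Yield structure indices without loading any structure."""

    def train_test_split(
        self, test_fraction: float = 0.2, seed: int = 0
    ) -> Tuple[List[int], List[int]]:
        # Only the (small) list of indices is held in memory here.
        indices = list(self.iter_indices())
        random.Random(seed).shuffle(indices)
        n_test = int(round(test_fraction * len(indices)))
        return indices[n_test:], indices[:n_test]


class InMemoryDataset(AbstractDataset):
    """Default case: all structures fit into memory."""

    def __init__(self, structures: Sequence[object]):
        self._structures = list(structures)

    def __len__(self) -> int:
        return len(self._structures)

    def iter_indices(self) -> Iterator[int]:
        return iter(range(len(self._structures)))

    def get_structures(self, indices: Sequence[int]) -> List[object]:
        return [self._structures[i] for i in indices]


# Strings stand in for real structures in this toy example.
dataset = InMemoryDataset([f"mof_{i}" for i in range(10)])
train_idx, test_idx = dataset.train_test_split(test_fraction=0.2, seed=42)
```

An out-of-memory dataset would then implement the same interface but load structures lazily (e.g., from disk) in its own get_structures.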

implement point-cloud descriptors

there has been a lot of work on hand-crafting features for 3D point clouds.
Needs a bit of thought on how best to apply them to periodic systems (but in the worst case, consistent pre-processing could fix it).

e.g. https://python-pcl-fork.readthedocs.io/en/rc_patches4/tutorial/features.html

release mol-tda in pypi

add hashes and basenames under consistent properties to datasets

this would be important to exclude structures from a training set

implement widom Henry coefficient and HoA

implement periodic TDA

instead of the Euclidean distance we can use the periodic image distance?
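To make the idea concrete, here is a sketch of a minimum-image distance. It assumes an orthorhombic cell and fractional coordinates; a general triclinic cell needs more care. The function name is made up for illustration.

```python
import math
from typing import Sequence


def minimum_image_distance(
    frac_a: Sequence[float],
    frac_b: Sequence[float],
    cell_lengths: Sequence[float],
) -> float:
    """Distance between two sites using the nearest periodic image.

    Assumes an orthorhombic cell, so each axis can be wrapped independently.
    """
    total = 0.0
    for fa, fb, length in zip(frac_a, frac_b, cell_lengths):
        delta = fa - fb
        delta -= round(delta)  # wrap the fractional difference into [-0.5, 0.5]
        total += (delta * length) ** 2
    return math.sqrt(total)


# Two sites close to opposite faces of a 10 x 10 x 10 cell:
# far apart in the Euclidean sense, but neighbors across the boundary.
periodic_distance = minimum_image_distance(
    (0.05, 0.5, 0.5), (0.95, 0.5, 0.5), (10.0, 10.0, 10.0)
)
```

A full pairwise matrix of such distances could then serve as a precomputed distance matrix for the persistent homology step, provided the TDA backend accepts one.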

add regression tests

for featurizers where there is an official implementation we should add a regression test to make sure we get the same results
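Such a regression test could boil down to a helper like the one below. This is only a sketch: the helper name and the tolerances are assumptions, and the reference values would have to come from running the official implementation once and storing its output.

```python
import math
from typing import Sequence


def assert_features_match(
    computed: Sequence[float],
    reference: Sequence[float],
    rel_tol: float = 1e-6,
    abs_tol: float = 1e-9,
) -> None:
    """Raise AssertionError if computed features deviate from stored references."""
    if len(computed) != len(reference):
        raise AssertionError(
            f"length mismatch: {len(computed)} computed vs {len(reference)} reference"
        )
    for index, (value, expected) in enumerate(zip(computed, reference)):
        if not math.isclose(value, expected, rel_tol=rel_tol, abs_tol=abs_tol):
            raise AssertionError(
                f"feature {index} deviates: got {value}, expected {expected}"
            )


# Toy numbers; real references would be generated by the official code.
assert_features_match([1.0, 2.0000001], [1.0, 2.0])
```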

implement eigencages

https://pubs.acs.org/doi/full/10.1021/acscentsci.8b00638

implement site dataset

perform better input validation

make sure structures are structures
make sure that iterables are iterables
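A sketch of what both checks could look like in one helper. The helper name is hypothetical; in real use expected_type would presumably be pymatgen's Structure, while the toy class below only keeps the example self-contained.

```python
from collections.abc import Iterable
from typing import List


def validate_structures(structures, expected_type: type) -> List[object]:
    """Check that the input is a (non-string) iterable of expected_type objects."""
    # Strings are iterable but almost certainly a user error here (e.g., a path).
    if isinstance(structures, (str, bytes)) or not isinstance(structures, Iterable):
        raise TypeError(
            f"expected an iterable of {expected_type.__name__}, "
            f"got {type(structures).__name__}"
        )
    validated = list(structures)
    for index, item in enumerate(validated):
        if not isinstance(item, expected_type):
            raise TypeError(
                f"item {index} is a {type(item).__name__}, "
                f"not a {expected_type.__name__}"
            )
    return validated


class ToyStructure:
    """Stand-in for a real structure class."""


validated = validate_structures([ToyStructure(), ToyStructure()], ToyStructure)
```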

running phvect on all CoRE causes issues

it skips some rows

do not rely on dict order

in some cases we rely on the dict order to have matching features and feature_labels.

we should remove this implicit assumption, e.g., by simply calling the entries in the dict using the output of feature_labels. This would make sure that they are really matching.
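The fix suggested above could look roughly like this. The helper name is made up; the point is only that the labels, not the dict, determine the order.

```python
from typing import Dict, List, Sequence


def features_in_label_order(
    feature_dict: Dict[str, float], feature_labels: Sequence[str]
) -> List[float]:
    """Return feature values ordered by feature_labels, not by dict insertion order."""
    missing = [label for label in feature_labels if label not in feature_dict]
    if missing:
        raise KeyError(f"features missing for labels: {missing}")
    return [feature_dict[label] for label in feature_labels]


# The dict was filled in a different order than the labels are reported.
computed = {"max_pore_diameter": 7.2, "density": 0.9}
labels = ["density", "max_pore_diameter"]
ordered = features_in_label_order(computed, labels)
```

As a side effect, a label without a matching feature now fails loudly instead of silently shifting columns.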

refactor `get_neighbors_at_distance`

The code I have at the moment is stupid.

atom-centered persistent homology

inspired by https://doi.org/10.1038/s41524-021-00493-w

the difficulty here is that if we want to go via the vectors or images we would need to fit

(PYL-W0622) Re-definition found for builtin function

Description

Defining a local variable or function with the same name as a built-in object makes the built-in object unusable within the current scope and makes the code prone to bugs.

Occurrences

There is 1 occurrence of this issue in the repository.

See all occurrences on DeepSource → deepsource.io/gh/kjappelbaum/mofdscribe/issue/PYL-W0622/occurrences/

Basic architecture / requirements

Random dump of thoughts:

I think the library should be centered around a MOF class that has different descriptors as cached properties.
Many descriptors rely on structure graphs, those should only be computed once.
There should also be a basic CLI to featurize one or a folder of CIFs
We should not re-implement anything that is already implemented in matminer, and we should also not couple to it so tightly that we need to update the package every time they change the matminer API
We at least want to have the following descriptors built in:
- RACs, but we should not rely on molsimplify as it is hard to install and we can probably just extract the main ideas into this package (problem is also that it is GPL licensed)
- Pore properties with zeo++ (that means we need to also make it available via conda)
- persistent homology fingerprints: images and barcodes
- energy histogram from bucior/snurr
- property labeled RDF
- the local-structure order parameters
- basic summary of chemistry (perhaps also split into linker/node/...)
- SOAP
The architecture should also keep in mind that we might want to add descriptor based on the building blocks at some point, so the design should allow for making this easy
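The first two points of this dump (a central MOF class, structure graphs computed only once) could be sketched with functools.cached_property. Everything here is illustrative: the graph builder is a toy callable and n_edges is a placeholder descriptor, not a proposal for the real API.

```python
from functools import cached_property
from typing import Callable, List, Tuple

Edge = Tuple[int, int]


class MOF:
    """Sketch of a MOF class whose expensive intermediates are cached."""

    def __init__(self, structure: object, graph_builder: Callable[[object], List[Edge]]):
        self.structure = structure
        self._graph_builder = graph_builder

    @cached_property
    def structure_graph(self) -> List[Edge]:
        # Computed on first access, then stored on the instance.
        return self._graph_builder(self.structure)

    @cached_property
    def n_edges(self) -> int:
        # Placeholder descriptor that depends on the structure graph.
        return len(self.structure_graph)


graph_builder_calls = []


def toy_graph_builder(structure: object) -> List[Edge]:
    """Stand-in for an expensive bonding analysis; records how often it runs."""
    graph_builder_calls.append(structure)
    return [(0, 1), (1, 2)]


mof = MOF("toy structure", toy_graph_builder)
first = mof.n_edges
second = mof.n_edges
graph = mof.structure_graph
```

Several descriptors can then share one structure graph without each of them recomputing it.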

implement diversity metrics

might be nice to implement basic routines to score the "diversity" of datasets under different featurizations
-> would also nicely highlight how the diversity depends on the featurization
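As one very simple example of such a score, the sketch below uses the mean pairwise Euclidean distance between feature vectors. This is just one possible choice among many, not a metric the package is known to implement, and the feature values are toy numbers.

```python
import math
from itertools import combinations
from typing import Sequence


def mean_pairwise_distance(feature_matrix: Sequence[Sequence[float]]) -> float:
    """Average Euclidean distance over all pairs of feature vectors."""
    pairs = list(combinations(feature_matrix, 2))
    if not pairs:
        return 0.0  # fewer than two structures: no diversity to measure
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# The same three structures under two hypothetical featurizations.
featurization_a = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
featurization_b = [[1.0], [1.0], [1.0]]

diversity_a = mean_pairwise_distance(featurization_a)
diversity_b = mean_pairwise_distance(featurization_b)
```

The dataset looks diverse under the first featurization and not at all under the second, which is exactly the dependence on the featurization mentioned above. In practice the features would need scaling before the two numbers can be compared.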

include some helpers for proper train/test split?

implement CLI

implement partialcharge histogram

energy grid

probably, we should not implement it ourselves in python for now?
maybe we just write a parser for RASPA -- then we could also use a short run of widom insertions as descriptor at some point.

good thing is that there seems to be a conda channel for raspa

when we fit `PHVect` do we want to skip empty persistence diagrams?

implement PH image

add user guide for datasets

check that energygrid uses local forcefield

implement `fit` and `fit_transform` on `energygrid`

the motivation for doing this would be to get the extrema for the histograms or, in some other way, get a better binning
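A sketch of what fitting could mean here, assuming each energy grid is just a flat list of energies. The class name and the equal-width binning are assumptions for illustration and not the actual energygrid implementation.

```python
from typing import Iterable, List, Optional, Sequence


class FittedHistogram:
    """Histogram featurizer whose bin range is learned from the data."""

    def __init__(self, n_bins: int = 4):
        self.n_bins = n_bins
        self.lower: Optional[float] = None
        self.upper: Optional[float] = None

    def fit(self, energy_grids: Iterable[Sequence[float]]) -> "FittedHistogram":
        values = [energy for grid in energy_grids for energy in grid]
        if not values:
            raise ValueError("cannot fit on empty energy grids")
        # The extrema over all structures define the histogram range.
        self.lower, self.upper = min(values), max(values)
        return self

    def featurize(self, grid: Sequence[float]) -> List[int]:
        if self.lower is None or self.upper is None:
            raise RuntimeError("call fit (or fit_transform) first")
        width = (self.upper - self.lower) / self.n_bins or 1.0
        counts = [0] * self.n_bins
        for energy in grid:
            index = int((energy - self.lower) / width)
            # Clamp so the maximum and out-of-range values land in the edge bins.
            index = min(max(index, 0), self.n_bins - 1)
            counts[index] += 1
        return counts

    def fit_transform(self, energy_grids: Iterable[Sequence[float]]) -> List[List[int]]:
        grids = list(energy_grids)
        self.fit(grids)
        return [self.featurize(grid) for grid in grids]


histograms = FittedHistogram(n_bins=4).fit_transform([[-10.0, -5.0], [0.0, 10.0]])
```

Since the sensible range depends on the guest molecule, fitting once per guest would avoid hand-tuning the bounds.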

implement site featurizers

perhaps the one used for the oximachine

logo and name

implement pore-centered PH featurizer

in principle, this shouldn't be too hard:

We get the atoms that form the pore(s) - at the moment, it is not so clear if, for instance, zeo++ could directly return this
If needed, we unfold the PBC - the challenge here is, of course, that for channels we could replicate forever. But to get a signal, one full periodic replica should be enough
Then run the TDA

PH stats fingerprint (not atom-centered)

There is no reason why we should only compute statistics to derive a fixed-length descriptor from atom-centered persistent homology analysis
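For illustration, a fixed-length fingerprint from one persistence diagram could be as simple as the sketch below. The choice of statistics and the all-zero vector for empty diagrams are assumptions, not what the package does.

```python
import statistics
from typing import List, Sequence, Tuple


def persistence_statistics(diagram: Sequence[Tuple[float, float]]) -> List[float]:
    """Summarize a persistence diagram by statistics of its lifetimes.

    The diagram is a sequence of (birth, death) pairs; the output always
    has four entries: min, max, mean, and standard deviation.
    """
    lifetimes = [death - birth for birth, death in diagram]
    if not lifetimes:
        # One possible convention for empty diagrams.
        return [0.0, 0.0, 0.0, 0.0]
    return [
        min(lifetimes),
        max(lifetimes),
        statistics.mean(lifetimes),
        statistics.pstdev(lifetimes),
    ]


# Toy diagram for the whole structure rather than for one atom environment.
fingerprint = persistence_statistics([(0.0, 1.0), (0.5, 2.5)])
```

The same function works whether the diagram comes from an atom-centered environment or from the full structure.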

perhaps run all reference structures through `manage_crystal`?

address why not PR(s) to matminer at this moment

allow to interface to RDKit to derive BuildingBlockFeatures

it should probably be a BBFeaturizer that takes in which bbs one wants to featurize (linkers/nodes) and that takes a list of tuples of feature names and functions, [([feature names], function(rdkit_mol))], as input.

The featurize method then just calls the fragmentor and the featurization functions, and can then aggregate the results in different ways.
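A rough sketch of that design. Plain strings stand in for RDKit molecules so the example runs without RDKit, and the fragmentor, the feature function, and the default aggregations are all toy assumptions.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional, Sequence, Tuple

FeatureFunction = Tuple[Sequence[str], Callable[[object], Sequence[float]]]


class BBFeaturizer:
    """Featurize building blocks with user-supplied functions, then aggregate."""

    def __init__(
        self,
        fragmentor: Callable[[object], List[object]],
        feature_functions: Sequence[FeatureFunction],
        aggregations: Optional[Dict[str, Callable[[List[float]], float]]] = None,
    ):
        self.fragmentor = fragmentor
        self.feature_functions = list(feature_functions)
        self.aggregations = aggregations or {"mean": mean, "max": max}

    def feature_labels(self) -> List[str]:
        return [
            f"{name}_{aggregation}"
            for names, _ in self.feature_functions
            for name in names
            for aggregation in self.aggregations
        ]

    def featurize(self, structure: object) -> List[float]:
        building_blocks = self.fragmentor(structure)
        features = []
        for names, function in self.feature_functions:
            per_block = [function(block) for block in building_blocks]
            for column in range(len(names)):
                values = [row[column] for row in per_block]
                for aggregate in self.aggregations.values():
                    features.append(aggregate(values))
        return features


# Toy stand-ins: a "structure" is a dot-separated string of fragments.
featurizer = BBFeaturizer(
    fragmentor=lambda structure: structure.split("."),
    feature_functions=[(["n_chars", "n_C"], lambda mol: [len(mol), mol.count("C")])],
)
bb_labels = featurizer.feature_labels()
bb_features = featurizer.featurize("CCO.CCCC")
```

Selecting linkers or nodes would then be the job of the fragmentor that is passed in.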

add maintainer documentation on how to update datasets

the first step is to have some docs that describe the current workflow.

after that, we should automate as much as possible and provide scripts for updating the datasets

implement racs

implement some clever stratification automation

something like this would probably be the main advantage of a train/test split function implemented in this package
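One simple form of such automation is stratifying on a continuous target by quantile bins, sketched below. The function name, the binning scheme, and the defaults are assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(
    targets: Sequence[float],
    n_bins: int = 4,
    test_fraction: float = 0.25,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Split indices so that every quantile bin of the target is in the test set."""
    order = sorted(range(len(targets)), key=lambda index: targets[index])
    rng = random.Random(seed)
    train, test = [], []
    for bin_index in range(n_bins):
        # Contiguous chunks of the sorted order act as quantile bins.
        start = bin_index * len(order) // n_bins
        stop = (bin_index + 1) * len(order) // n_bins
        chunk = order[start:stop]
        rng.shuffle(chunk)
        n_test = round(test_fraction * len(chunk))
        test.extend(chunk[:n_test])
        train.extend(chunk[n_test:])
    return train, test


# Toy target: 16 structures with increasing property values.
toy_targets = [float(i) for i in range(16)]
strat_train, strat_test = stratified_split(toy_targets, n_bins=4, test_fraction=0.25)
```

A purely random split could, by chance, miss the tails of the target distribution, while this one draws test points from every bin.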

implement PH Barcode

implement the `fit` and `fit_transform` methods for the energygrid

the optimum bounds depend quite a bit on the guest, we should allow to fit and fit_transform

implement graph-spectral features

(PTC-W0030) Test APRDF

Description

Do not pollute your project with empty files. It is recommended to either delete this file or drop some documentation in it explaining why it is there.

Occurrences

There is 1 occurrence of this issue in the repository.

See all occurrences on DeepSource → deepsource.io/gh/kjappelbaum/mofdscribe/issue/PTC-W0030/occurrences/

zeo++ and raspa on m1 mac

add more datasets

see https://github.com/SimonEnsemble/porous-material-AI-gym/commits/main

we can easily add the Curated COFs
perhaps also add some about the synthesis conditions

record versions of dependencies

I just realized that, depending on the zeo++ version, one can get quite different results.
The library should help the user keep track of that.
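A minimal sketch of such bookkeeping using only the standard library. The function name is hypothetical. For external binaries it records only the resolved path, because there is no generic way to query their version; the executable name in the example is a placeholder to replace with whatever your zeo++ binary is called.

```python
import platform
import shutil
from importlib import metadata
from typing import Dict, Optional, Sequence


def collect_versions(
    packages: Sequence[str], executables: Sequence[str] = ()
) -> Dict[str, Optional[str]]:
    """Record Python package versions and the paths of external executables."""
    record: Dict[str, Optional[str]] = {"python": platform.python_version()}
    for name in packages:
        try:
            record[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            record[name] = None  # package is not installed in this environment
    for executable in executables:
        # Path only: the version of an arbitrary binary cannot be queried generically.
        record[f"{executable}_path"] = shutil.which(executable)
    return record


version_record = collect_versions(
    packages=["mofdscribe", "pymatgen"],
    executables=["my_zeopp_binary"],  # placeholder executable name
)
```

Storing such a record next to the computed features would make it possible to trace differences back to a changed dependency.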

implement zeo++ ray features

(PYL-E0102) Function or method is being redefined

Description

A function, method or class is being redefined in the same scope. This would override the original definition, and doing this is strongly discouraged. Please verify that this is something that you intended to do.

Occurrences

There is 1 occurrence of this issue in the repository.

function already defined line 91

src/mofdscribe/topology/_tda_helpers.py

See all occurrences on DeepSource → deepsource.io/gh/kjappelbaum/mofdscribe/issue/PYL-E0102/occurrences/

refactor RACs featurizer to site-based featurizer

the aggregations are not a must - we should have the flexibility to use the different parts (linker, atom/centered) in individual models, e.g., for a Behler-Parrinello architecture

vectorize atom-centered PH using images and/or Gaussian mixtures

currently, we only featurize via simple statistics. One could clearly generalize this.

how do we handle `fit_transform` for `featurize_many`

I think that matminer just throws an error if one step that needs to be fitted is not fitted.
In their case, however, "fitting" is often cheaper as it involves only looking up what elements occur (even though this is quite some I/O cost).
Here, we already need to do much of the work for fitting - therefore, fit_transform is much more useful

expose cifs with charges

in case people want to compute their own labels
