kjappelbaum / mofdscribe

An ecosystem for digital reticular chemistry

Home Page: https://mofdscribe.readthedocs.io/en/latest/

License: MIT License

Python 68.98% HTML 31.02%
artificial-intelligence descriptors featurization machine-learning mof porous-materials benchmark data-science metrics reticular-chemistry

mofdscribe's Introduction

Featurizing metal-organic frameworks (MOFs) made simple! This package builds on the power of matminer to make featurization of MOFs as easy as possible. Now, you can use features that are mostly used for porous materials in the same way as all other matminer featurizers. mofdscribe additionally includes routines that help with model validation.

💪 Getting Started

from mofdscribe.featurizers.chemistry import RACS
from pymatgen.core import Structure

structure = Structure.from_file("my_cif.cif")  # replace with the path to your CIF file
featurizer = RACS()
racs_features = featurizer.featurize(structure)

🚀 Installation

We are still working on making mofdscribe run on all operating systems (we're waiting for conda recipes to be merged). For now, installation is not easy on Windows, and there might be issues on ARM-based Macs. For this reason, we recommend installing mofdscribe on a UNIX machine.

To install in development mode, use the following:

git clone https://github.com/kjappelbaum/mofdscribe.git
cd mofdscribe
pip install -e .

If you want to use all utilities, install the all extra instead: pip install -e ".[all]"

We depend on many other external tools. Most external tools are automatically installed if you install mofdscribe via conda:

conda install -c conda-forge mofdscribe

👐 Contributing

Contributions, whether filing an issue, making a pull request, or forking, are appreciated. See CONTRIBUTING.rst for more information on getting involved.

👋 Attribution

⚖️ License

The code in this package is licensed under the MIT License.

📖 Citation

See the ChemRxiv preprint.

@article{Jablonka_2022,
    doi = {10.26434/chemrxiv-2022-4g7rx},
    url = {https://doi.org/10.26434%2Fchemrxiv-2022-4g7rx},
    year = 2022,
    month = {sep},
    publisher = {American Chemical Society ({ACS})},
    author = {Kevin Maik Jablonka and Andrew S. Rosen and Aditi S. Krishnapriyan and Berend Smit},
    title = {An ecosystem for digital reticular chemistry}
}

💰 Funding

The research was supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement 666983, MaGic), by the NCCR-MARVEL, funded by the Swiss National Science Foundation, and by the Swiss National Science Foundation (SNSF) under Grant 200021_172759.

🍪 Cookiecutter

This package was created with @audreyfeldroy's cookiecutter package using @cthoyt's cookiecutter-snekpack template.

🛠️ For Developers

See developer instructions

The final section of the README is for those who want to get involved by making a code contribution.

❓ Testing

After cloning the repository and installing tox with pip install tox, the unit tests in the tests/ folder can be run reproducibly with:

tox

Additionally, these tests are automatically re-run with each commit in a GitHub Action.

📦 Making a Release

After installing the package in development mode and installing tox with pip install tox, the commands for making a new release are contained within the finish environment in tox.ini. Run the following from the shell:

tox -e finish

This script does the following:

  1. Uses BumpVersion to switch the version number in the setup.cfg and src/mofdscribe/version.py to not have the -dev suffix
  2. Packages the code in both a tar archive and a wheel
  3. Uploads to PyPI using twine. Be sure to have a .pypirc file configured to avoid the need for manual input at this step
  4. Pushes to GitHub. You'll need to create a release for the commit where the version was bumped.
  5. Bumps the version to the next patch. If you made big changes and want to bump the minor version instead, you can run tox -e bumpversion minor afterwards.

mofdscribe's People

Contributors

andrew-s-rosen, deepsource-autofix[bot], deepsourcebot, fmcil, kjappelbaum, ml-evs

mofdscribe's Issues

how to handle datasets that might not fit into memory

if i implement the splitting methods in the base class we assume (using sklearn's train/test split) that the datasets fit into memory.

However, this might not be the case for all datasets in the future.

One option would be that those future datasets override the methods that might run out of memory. Another option might be to have different base classes, where the default one would be an InMemoryDataset.
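
A rough sketch of the second option, assuming a hypothetical BaseDataset whose splitting only ever touches integer indices, so that a subclass can load structures lazily. All class and method names are made up for illustration, and the shuffle uses the standard library instead of sklearn's train/test split:

```python
import random
from abc import ABC, abstractmethod
from typing import List, Sequence, Tuple


class BaseDataset(ABC):
    """Hypothetical base class: splits are computed on indices, never on the data."""

    @abstractmethod
    def __len__(self) -> int:
        ...

    @abstractmethod
    def get_item(self, index: int):
        ...

    def train_test_split(
        self, test_fraction: float = 0.2, seed: int = 0
    ) -> Tuple[List[int], List[int]]:
        # Only a list of integers is held in memory here.
        indices = list(range(len(self)))
        random.Random(seed).shuffle(indices)
        n_test = int(round(test_fraction * len(indices)))
        return indices[n_test:], indices[:n_test]


class InMemoryDataset(BaseDataset):
    """Default case: everything fits into memory."""

    def __init__(self, items: Sequence):
        self._items = list(items)

    def __len__(self) -> int:
        return len(self._items)

    def get_item(self, index: int):
        return self._items[index]


dataset = InMemoryDataset([f"mof_{i}" for i in range(10)])
train_idx, test_idx = dataset.train_test_split(test_fraction=0.3, seed=42)
```

A dataset that does not fit into memory would then only need its own get_item (for example, one that reads a file per index) and could reuse the splitting code unchanged.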

add regression tests

for featurizers where there is an official implementation we should add a regression test to make sure we get the same results
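
A minimal sketch of what such a test could look like. The featurizer and the reference numbers are stand-ins invented for this example. In practice the reference values would be produced once by the official implementation and stored with the tests:

```python
import math
from typing import Dict, Sequence


def mean_and_range(values: Sequence[float]) -> Dict[str, float]:
    """Stand-in for the featurizer under test (hypothetical)."""
    return {"mean": sum(values) / len(values), "range": max(values) - min(values)}


# Stand-in for values obtained once from the official implementation.
REFERENCE = {"mean": 2.0, "range": 2.0}


def test_matches_reference() -> None:
    features = mean_and_range([1.0, 2.0, 3.0])
    # Same feature names, and every value within a tight tolerance.
    assert features.keys() == REFERENCE.keys()
    for name, expected in REFERENCE.items():
        assert math.isclose(features[name], expected, rel_tol=1e-6), name


test_matches_reference()
```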

do not rely on dict order

in some cases we rely on the dict order to have matching features and feature_labels.

we should remove this implicit assumption, e.g., by simply calling the entries in the dict using the output of feature_labels. This would make sure that they are really matching.
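
A small sketch of the idea, using a hypothetical ToyFeaturizer (none of these names are mofdscribe's actual API). The intermediate dict is deliberately built in a different order than feature_labels, and featurize still returns matching values because it looks each entry up by label:

```python
from typing import Dict, List


class ToyFeaturizer:
    """Hypothetical featurizer, only meant to illustrate the lookup pattern."""

    def feature_labels(self) -> List[str]:
        # Single source of truth for the feature order.
        return ["density", "n_atoms", "volume"]

    def _compute(self, structure: Dict[str, float]) -> Dict[str, float]:
        # Insertion order differs from feature_labels() on purpose.
        return {
            "volume": structure["volume"],
            "density": structure["mass"] / structure["volume"],
            "n_atoms": structure["n_atoms"],
        }

    def featurize(self, structure: Dict[str, float]) -> List[float]:
        results = self._compute(structure)
        # Index by label; a missing or misspelled key raises KeyError.
        return [results[label] for label in self.feature_labels()]


features = ToyFeaturizer().featurize({"mass": 10.0, "volume": 4.0, "n_atoms": 2})
```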

Basic architecture / requirements

Random dump of thoughts:

  • I think the library should be centered around a MOF class that has different descriptors as cached properties.

  • Many descriptors rely on structure graphs, those should only be computed once.

  • There should also be a basic CLI to featurize one or a folder of CIFs

  • We should not re-implement anything that is already implemented in matminer, and we should also not couple to it so tightly that we need to update the package every time they change the matminer API

  • We at least want to have the following descriptors built in:

    • RACs, but we should not rely on molsimplify as it is hard to install and we can probably just extract the main ideas into this package (problem is also that it is GPL licensed)
    • Pore properties with zeo++ (that means we need to also make it available via conda)
    • persistent homology fingerprints: images and barcodes
    • energy histogram from bucior/snurr
    • property labeled RDF
    • the local-structure order parameters
    • basic summary of chemistry (perhaps also split into linker/node/...)
    • SOAP
  • The architecture should also keep in mind that we might want to add descriptor based on the building blocks at some point, so the design should allow for making this easy
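
The first two points could be sketched with functools.cached_property. The MOF class below is a toy stand-in, not the design that was eventually implemented: atoms are plain tuples, the "structure graph" is a naive distance-based neighbour list, and periodic boundaries are ignored.

```python
from functools import cached_property
from math import dist
from typing import Dict, List, Tuple

Atom = Tuple[str, float, float, float]  # element and Cartesian coordinates


class MOF:
    """Hypothetical container with descriptors as cached properties."""

    def __init__(self, atoms: List[Atom], cutoff: float = 2.0):
        self.atoms = atoms
        self.cutoff = cutoff
        self.graph_builds = 0  # only here to show that the graph is built once

    @cached_property
    def structure_graph(self) -> Dict[int, List[int]]:
        self.graph_builds += 1
        graph: Dict[int, List[int]] = {i: [] for i in range(len(self.atoms))}
        for i in range(len(self.atoms)):
            for j in range(i + 1, len(self.atoms)):
                if dist(self.atoms[i][1:], self.atoms[j][1:]) <= self.cutoff:
                    graph[i].append(j)
                    graph[j].append(i)
        return graph

    @cached_property
    def coordination_numbers(self) -> List[int]:
        return [len(self.structure_graph[i]) for i in range(len(self.atoms))]

    @cached_property
    def mean_coordination(self) -> float:
        return sum(self.coordination_numbers) / len(self.atoms)


mof = MOF([("Zn", 0.0, 0.0, 0.0), ("O", 1.9, 0.0, 0.0), ("C", 5.0, 0.0, 0.0)])
descriptors = (mof.coordination_numbers, mof.mean_coordination)
```

Both descriptors use the structure graph, but graph_builds stays at 1 because the second access hits the cache.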

implement diversity metrics

might be nice to implement basic routines to score the "diversity" of datasets under different featurizations
-> would also nicely highlight how the diversity depends on the featurization
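
One possible starting point, as a sketch only: the mean pairwise Euclidean distance between feature vectors. The choice of metric and the toy feature values are assumptions made for this example.

```python
from itertools import combinations
from math import dist
from typing import Sequence


def mean_pairwise_distance(features: Sequence[Sequence[float]]) -> float:
    """Average Euclidean distance over all pairs of feature vectors."""
    pairs = list(combinations(features, 2))
    if not pairs:
        return 0.0
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


# The same three hypothetical structures under two made-up featurizations.
featurization_a = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
featurization_b = [[1.0], [1.0], [1.0]]

diversity_a = mean_pairwise_distance(featurization_a)
diversity_b = mean_pairwise_distance(featurization_b)
```

Here the structures look diverse under the first featurization and identical under the second, which is the dependence on the featurization mentioned above.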

energy grid

probably, we should not implement it ourselves in python for now?
maybe we just write a parser for RASPA -- then we could also use a short run of widom insertions as descriptor at some point.

good thing is that there seems to be a conda channel for raspa

implement pore-centered PH featurizer

in principle, this shouldn't be too hard:

  1. We get the atoms that form the pore(s) - atm, not so clear if, for instance, zeo++ could directly return this
  2. If needed, we unfold the PBC - challenge here is, of course that for channels we could replicate forever. But to get a signal, one full periodic replica should be enough
  3. Then run the TDA
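
Step 2 could look roughly like the sketch below. It assumes that the pore atoms are already available as fractional coordinates and reads "one full periodic replica" as one extra copy along each lattice direction. The function name is hypothetical, and steps 1 and 3 (zeo++ and the TDA) are not covered.

```python
from itertools import product
from typing import List, Sequence, Tuple

Coord = Tuple[float, ...]


def unfold(
    frac_coords: Sequence[Coord], replicas: Tuple[int, int, int] = (2, 2, 2)
) -> List[Coord]:
    """Copy every atom into each periodic image of the requested supercell.

    Coordinates stay in units of the original cell, so along each axis
    they run from 0 up to the number of replicas.
    """
    unfolded = []
    for shift in product(*(range(n) for n in replicas)):
        for coord in frac_coords:
            unfolded.append(tuple(c + s for c, s in zip(coord, shift)))
    return unfolded


pore_atoms = [(0.1, 0.5, 0.5), (0.9, 0.5, 0.5)]
unfolded = unfold(pore_atoms)
```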

allow to interface to RDKit to derive BuildingBlockFeatures

it should probably be a BBFeaturizer that takes in which bbs one wants to featurize (linker/nodes) and a list of tuples of feature names and featurization functions, [([feature names], function(rdkit_mol))], as input.

The featurizer then just calls the fragmentor and the featurization functions, and can then aggregate the results in different ways.
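
A sketch of how that interface might look. To keep it self-contained, the fragmentation step is skipped and SMILES strings stand in for RDKit molecules, so the featurization function here just counts characters. The constructor arguments and output naming are assumptions, not a settled design.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

# ([feature names], function(mol)) as described above
FeatureSpec = Tuple[List[str], Callable[[object], Sequence[float]]]


class BBFeaturizer:
    """Hypothetical sketch; building blocks are passed in already fragmented."""

    def __init__(self, bb_type: str, feature_specs: List[FeatureSpec]):
        self.bb_type = bb_type  # e.g. "linkers" or "nodes"
        self.feature_specs = feature_specs
        self.aggregations = {"mean": mean, "max": max}

    def featurize(self, building_blocks: Dict[str, list]) -> Dict[str, float]:
        mols = building_blocks[self.bb_type]
        features = {}
        for names, func in self.feature_specs:
            rows = [func(mol) for mol in mols]  # one row per building block
            for column, name in enumerate(names):
                values = [row[column] for row in rows]
                for agg_name, agg in self.aggregations.items():
                    features[f"{self.bb_type}_{name}_{agg_name}"] = agg(values)
        return features


# Stand-in featurization function operating on SMILES strings instead of RDKit mols.
specs = [(["n_chars", "n_aromatic_c"], lambda smiles: (len(smiles), smiles.count("c")))]
featurizer = BBFeaturizer("linkers", specs)
bb_features = featurizer.featurize({"linkers": ["c1ccccc1", "CC"], "nodes": ["[Zn]"]})
```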

record versions of dependencies

I just realized that depending on the zeo++ version one can get quite different results.
The library should help the user keep track of that.
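
For Python dependencies, a sketch like the one below could be a starting point. It only uses importlib.metadata from the standard library, and the function name and package list are assumptions. External binaries such as zeo++ are not Python packages, so their version would have to be recorded separately, which is not covered here.

```python
import platform
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable


def collect_versions(packages: Iterable[str]) -> Dict[str, str]:
    """Return the installed version of each package, plus the Python version."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            # Keep the entry so that missing dependencies are visible in the record.
            report[name] = "not installed"
    return report


versions = collect_versions(["pymatgen", "matminer", "surely-not-a-real-package-name"])
```

Such a dict could be stored next to the computed features so that results can be traced back to the environment that produced them.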

(PYL-E0102) Function or method is being redefined

Description

A function, method or class is being redefined in the same scope. This would override the original definition, and doing this is strongly discouraged. Please verify that this is something that you intended to do.

Occurrences

There is 1 occurrence of this issue in the repository.

function already defined line 91

src/mofdscribe/topology/_tda_helpers.py

See all occurrences on DeepSource → deepsource.io/gh/kjappelbaum/mofdscribe/issue/PYL-E0102/occurrences/

how do we handle `fit_transform` for `featurize_many`

I think that matminer just throws an error if one step that needs to be fitted is not fitted.
In their case, however, "fitting" is often cheaper as it involves only looking up what elements occur (even though this is quite some I/O cost).
Here, we already need to do much of the work for fitting - therefore, fit_transform is much more useful
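
To make the trade-off concrete, here is a toy sketch in which fitting and featurizing share an expensive step. With fit followed by featurize_many the step would run twice per structure, while fit_transform runs it once. The class is hypothetical and structures are just lists of element symbols.

```python
from collections import Counter
from typing import List, Optional, Sequence


class ElementCountFeaturizer:
    """Hypothetical featurizer that needs fitting to learn which elements occur."""

    def __init__(self) -> None:
        self.elements_: Optional[List[str]] = None
        self.expensive_calls = 0  # counts how often the costly step runs

    def _expensive_step(self, structure: Sequence[str]) -> Counter:
        self.expensive_calls += 1
        return Counter(structure)

    def _to_vector(self, counts: Counter) -> List[int]:
        return [counts.get(element, 0) for element in self.elements_]

    def fit(self, structures: Sequence[Sequence[str]]) -> "ElementCountFeaturizer":
        counts = [self._expensive_step(s) for s in structures]
        self.elements_ = sorted({el for c in counts for el in c})
        return self

    def featurize_many(self, structures: Sequence[Sequence[str]]) -> List[List[int]]:
        if self.elements_ is None:
            raise RuntimeError("Featurizer must be fitted first")
        return [self._to_vector(self._expensive_step(s)) for s in structures]

    def fit_transform(self, structures: Sequence[Sequence[str]]) -> List[List[int]]:
        # Run the costly step once and reuse it for fitting and featurizing.
        counts = [self._expensive_step(s) for s in structures]
        self.elements_ = sorted({el for c in counts for el in c})
        return [self._to_vector(c) for c in counts]


featurizer = ElementCountFeaturizer()
vectors = featurizer.fit_transform([["Zn", "O", "O"], ["Cu", "O"]])
```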
