cmlkit's Introduction

Hello! 👋

I'm Marcel, a postdoc in the COSMO Lab at EPFL in Lausanne, Switzerland. Previously, I was a doctoral student at the Fritz Haber Institute (in the NOMAD Laboratory) and TU Berlin (in the Machine Learning group).

I work on applying machine learning to materials science problems. Recently, I've particularly focused on machine-learning potentials and the computation of thermal conductivities with the Green-Kubo method. To this end, I've written the glp and gkx packages, both using jax.

Thanks for stopping by 🚀

cmlkit's People

Contributors

sirmarcel

cmlkit's Issues

Drop yaml dumper for son, replace with json

JSON is comically faster. Here is a benchmark for dumping/loading 2000 small dicts (3 repeats):

JSON:
{'times': array([0.38685818, 0.34857985, 0.34886317]), 'mean': 0.36143373133333334, 'min': 0.3485798470000001, 'max': 0.3868581750000001}
Yaml:
{'times': array([ 9.26613551, 10.8498244 ,  9.88484843]), 'mean': 10.000269449333334, 'min': 9.266135509, 'max': 10.849824405}

This really hurts when doing Run.checkout().
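A benchmark of this kind can be reproduced with a few lines of standard-library Python. The following is only a sketch, not the script that produced the numbers above: the `benchmark` helper and the contents of the test dicts are made up for illustration, and "dumping/loading" is assumed to mean one dump-then-load round trip per dict.

```python
import json
import timeit


def benchmark(dump, load, items, repeats=3):
    """Time a full dump-then-load pass over `items`, repeated `repeats` times."""

    def roundtrip():
        for item in items:
            load(dump(item))

    # One pass per measurement, `repeats` independent measurements.
    times = timeit.repeat(roundtrip, number=1, repeat=repeats)
    return {
        "times": times,
        "mean": sum(times) / len(times),
        "min": min(times),
        "max": max(times),
    }


# 2000 small dicts, as in the benchmark above; the contents are invented.
items = [{"id": i, "loss": i * 0.5, "tags": ["a", "b"]} for i in range(2000)]

json_result = benchmark(json.dumps, json.loads, items)
```

With PyYAML installed, the same helper can be called as `benchmark(yaml.safe_dump, yaml.safe_load, items)` to get the YAML side of the comparison.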

Edge case in Dataset.compute_dataset_info

This is a minor concern, especially if Dataset is being phased out. When a Dataset is created from single-atom geometries only, dists is a list of empty arrays, so the min() and max() calls that compute min_distance and max_distance receive an empty sequence and raise a ValueError. A real-world example of this might be energy-volume curves with FCC primitive cells.

dists = [
qmmlpack.lower_triangular_part(qmmlpack.distance_euclidean(rr), -1) for rr in r
]
i["min_distance"] = min([min(d) for d in dists if len(d) > 0])
i["max_distance"] = max([max(d) for d in dists if len(d) > 0])

Toy example:

from ase import Atoms
from cmlkit.dataset import Dataset

geometries = [Atoms('Au', cell=[3, 3, 3], pbc=True),
              Atoms('Au', cell=[4, 4, 4], pbc=True),
              Atoms('Au', cell=[5, 5, 5], pbc=True),]
Dataset.from_Atoms(geometries)

ValueError                                Traceback (most recent call last)
<ipython-input-64-25473c66c8a8> in <module>
      5               Atoms('Au', cell=[3, 3, 3], pbc=True),
      6               Atoms('Au', cell=[4, 4, 4], pbc=True),]
----> 7 Dataset.from_Atoms(geometries)

~/PycharmProjects/cmlkit/cmlkit/dataset/dataset.py in from_Atoms(cls, atoms, p, name, desc, splits)
    288             name=name,
    289             desc=desc,
--> 290             splits=splits,
    291         )
    292 

~/PycharmProjects/cmlkit/cmlkit/dataset/dataset.py in __init__(self, z, r, b, p, name, desc, splits, _info, _hash, _geom_hash)
    169             self.info = _info
    170         else:
--> 171             self.info = self.get_info()
    172 
    173         # compute auxiliary info that we need to convert properties

~/PycharmProjects/cmlkit/cmlkit/dataset/dataset.py in get_info(self)
    218     def get_info(self):
    219         """Compute information on dataset."""
--> 220         return compute_dataset_info(self)
    221 
    222     def get_hash(self):

~/PycharmProjects/cmlkit/cmlkit/dataset/dataset.py in compute_dataset_info(dataset)
    549         qmmlpack.lower_triangular_part(qmmlpack.distance_euclidean(rr), -1) for rr in r
    550     ]
--> 551     i["min_distance"] = min([min(d) for d in dists if len(d) > 0])
    552     i["max_distance"] = max([max(d) for d in dists if len(d) > 0])
    553 

ValueError: min() arg is an empty sequence
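One possible guard is sketched below in plain Python, independent of the real cmlkit code. `distance_range` is a hypothetical helper, and returning None when no pairwise distances exist is an assumption; whatever consumes min_distance and max_distance would have to accept that value.

```python
def distance_range(dists):
    """Return (min, max) over all pairwise distances, or (None, None) if there are none.

    `dists` holds one sequence of pairwise distances per geometry; single-atom
    geometries contribute empty sequences.
    """
    non_empty = [d for d in dists if len(d) > 0]
    if not non_empty:
        # Assumption: None marks "no pairwise distances in this dataset".
        return None, None
    return min(min(d) for d in non_empty), max(max(d) for d in non_empty)
```

For the toy example above, every entry of dists is empty, so this returns (None, None) instead of raising.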

Run context should also be on tape?

title says it all.

Upsides:

  • Once you have run prepare(), restore and run will always do (approximately) the same thing
  • If you encounter a tape.son in the wild, the information in it is slightly more complete

Downsides:

  • Need additional functionality to overwrite the context in restore if required
  • More complex restore

Also, it might be worth thinking about including more general (write-only) metadata on the tape.
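To make the trade-off concrete, here is a rough sketch of what "context on tape" with an override on restore could look like. It uses JSON lines as a stand-in for the actual tape format, and `write_tape`, `read_tape`, and the context keys are all invented for illustration; none of this is existing cmlkit API.

```python
import json


def write_tape(context, entries):
    """Serialise a run as JSON lines, with the context as the first record."""
    lines = [json.dumps({"context": context})]
    lines.extend(json.dumps({"entry": e}) for e in entries)
    return lines


def read_tape(lines, context_override=None):
    """Rebuild (context, entries); keys in `context_override` win over the tape."""
    records = [json.loads(line) for line in lines]
    context = dict(records[0]["context"])
    context.update(context_override or {})
    entries = [r["entry"] for r in records[1:]]
    return context, entries


# Hypothetical context keys, only to show the override behaviour.
tape = write_tape({"max_workers": 4, "timeout": 30}, [{"trial": 0}, {"trial": 1}])
context, entries = read_tape(tape, context_override={"max_workers": 1})
```

The override argument is the "additional functionality" listed under the downsides: without it, a restored run is locked to whatever context was recorded at prepare time.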

Hash stability failure in `Dataset` with newer `numpy`

test_hash_stable in tests/test_dataset.py has started failing when numpy is upgraded somewhere above 1.16.5. Most likely this is due to subtly different treatment of object arrays. Luckily, they should become obsolete once we transition to the new Data backend, so I'll close this issue then. Until then, it'll remain open in case anyone else notices!

Undefined behaviour in Dataset when filename is specified but not name

When a Dataset is created with no name but saved with a filename, references to that filename are ignored by certain other Components like tune.Run using a tune.TuneEvaluatorHoldout that was initialized with filenames for the train/test kwargs. Interestingly, EvaluatorHoldout.evaluate() does not suffer, while Run.run() does.

`tid`s for `quippy` interface appear to not be always unique

Occasionally execution halts with a "file already exists" error because the tid of the quippy scratch folder is not unique. This should not happen. It Should Be Tested whether this is actually due to too little entropy when generating the tids combined with identical timing, or whether there is a more fundamental problem (which seems a lot more likely).

If it's actually a randomness problem, fixes could be:

  • Switching to time_ns
  • Adding geom_hash
  • Adding a context-supplied random seed
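If it does turn out to be a randomness problem, the three ideas above could be combined roughly as follows. This is a sketch only: `make_tid` is a hypothetical name, the tid layout is invented, and the way the seed reaches the function (here a seeded `random.Random`) is an assumption about how a context-supplied seed might be used.

```python
import random
import time


def make_tid(geom_hash, rng):
    """Build a scratch-folder id from a timestamp, a geometry hash, and a random token."""
    token = rng.getrandbits(64)
    return f"{time.time_ns()}-{geom_hash}-{token:016x}"


rng = random.Random(1234)  # seed would come from the run context (assumption)
tids = [make_tid("abc123", rng) for _ in range(1000)]
```

Note that two workers sharing the same seed draw the same token sequence, so in that case uniqueness still rests on the timestamp and the geometry hash; the seed would need to differ per worker for the token to help.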
