mala-project / mala

Materials Learning Algorithms. A framework for machine learning materials properties from first-principles data.

Home Page: https://mala-project.github.io/mala/

License: BSD 3-Clause "New" or "Revised" License

Dockerfile 0.09% Python 96.27% Shell 0.20% Fortran 3.43%
machine-learning dft density-functional-theory electronic-structure neural-network

mala's Introduction


MALA


MALA (Materials Learning Algorithms) is a data-driven framework to generate surrogate models of density functional theory calculations based on machine learning. Its purpose is to enable multiscale modeling by bypassing computationally expensive steps in state-of-the-art density functional simulations.

MALA is designed as a modular and open-source Python package. It enables users to perform the entire modeling toolchain using only a few lines of code. MALA is jointly developed by Sandia National Laboratories (SNL) and the Center for Advanced Systems Understanding (CASUS). See Contributing for contributing code to the repository.

This repository is structured as follows:

├── examples : contains useful examples to get you started with the package
├── install : contains scripts for setting up this package on your machine
├── mala : the source code itself
├── test : test scripts used during development, will hold tests for CI in the future
└── docs : Sphinx documentation folder

Installation

WARNING: Even if you install MALA via PyPI, please consult the full installation instructions afterwards. External modules (like the QuantumESPRESSO bindings) are not distributed via PyPI!

Please refer to Installation of MALA.

Running

You can familiarize yourself with the usage of this package by running the examples in the examples/ folder.

Contributors

MALA is jointly maintained by SNL and CASUS.

A full list of contributors can be found here.

Citing MALA

If you publish work which uses or mentions MALA, please cite the following paper:

J. A. Ellis, L. Fiedler, G. A. Popoola, N. A. Modine, J. A. Stephens, A. P. Thompson, A. Cangi, S. Rajamanickam (2021). Accelerating Finite-temperature Kohn-Sham Density Functional Theory with Deep Neural Networks. Phys. Rev. B 104, 035120 (2021)

alongside this repository.

mala's People

Contributors

acangi, athomps, danielkotik, dytnvgl, elcorto, ellisja, franzpoeschel, gapopoo, htahmasbi, jadamstephens, johelli, joshrackers, kyledmiller, msverma101, nils-hoffmann, omarhexa, randomdefaultuser, srajama1, szabo137, timcallow, vlad-oles, zevrap-81


mala's Issues

Add an .xml interface for parameters

In GitLab by @RandomDefaultUser on Dec 15, 2020, 10:14

It would be very useful to have an XML interface for the parameter class. As the entire workflow is controlled by one central parameter class, being able to save and load it would make portability much easier; one could conveniently store data and metadata in a couple of files.
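A minimal sketch of what such an interface could look like, using only the standard library; the Parameters class and its attributes here are illustrative stand-ins, not MALA's actual API:

```python
# Hypothetical sketch of XML (de)serialization for a flat parameter class.
# Attribute names are illustrative, not MALA's actual Parameters API.
import xml.etree.ElementTree as ET

class Parameters:
    def __init__(self):
        self.learning_rate = 0.001
        self.max_epochs = 100
        self.network_type = "feed-forward"

    def save_xml(self, path):
        root = ET.Element("parameters")
        for name, value in vars(self).items():
            child = ET.SubElement(root, name)
            child.set("type", type(value).__name__)  # remember the type for loading
            child.text = str(value)
        ET.ElementTree(root).write(path)

    @classmethod
    def load_xml(cls, path):
        params = cls()
        for child in ET.parse(path).getroot():
            caster = {"int": int, "float": float, "str": str}[child.get("type")]
            setattr(params, child.tag, caster(child.text))
        return params

p = Parameters()
p.max_epochs = 250
p.save_xml("params.xml")
restored = Parameters.load_xml("params.xml")
```

Storing the type alongside each value lets the loader restore both values and types, which is what the data/metadata use case needs.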

Add a checkpoint functionality

In GitLab by @RandomDefaultUser on Feb 16, 2021, 10:16

As it is anticipated that we will run longer training runs pretty soon and GPU jobs are limited to 48h on hemera, we should implement a "checkpointing" functionality that saves the training results e.g. every 5 iterations.
Ideally, this would extend to the hyperparameter optimization as well. I am not completely sure how to do this though...
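A sketch of the resume-from-checkpoint pattern this suggests; in MALA the state would hold the network's and optimizer's state_dict() and be written with torch.save, but the logic is the same. All names and the file path are illustrative:

```python
# Minimal sketch of epoch-based checkpointing with resume support.
# In practice the state would be torch state_dicts saved via torch.save.
import os
import pickle

CHECKPOINT_EVERY = 5  # save every 5 iterations, as suggested above

def train(max_epochs, checkpoint_path="demo_checkpoint.pkl"):
    # Resume from a previous run if a checkpoint exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"epoch": 0, "loss_history": []}

    for epoch in range(state["epoch"], max_epochs):
        loss = 1.0 / (epoch + 1)          # stand-in for a real training step
        state["loss_history"].append(loss)
        state["epoch"] = epoch + 1
        if state["epoch"] % CHECKPOINT_EVERY == 0:
            with open(checkpoint_path, "wb") as f:
                pickle.dump(state, f)
    return state

state = train(12)
```

If the 48h job is killed, the next job picks up at the last saved epoch instead of epoch 0.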

Integrate Horovod into code

In GitLab by @RandomDefaultUser on Dec 21, 2020, 09:32

Student project: Integrate Horovod into the workflow. This includes (tasks not necessarily in optimal order):

  • Create and checkout your own branch (if your name is Firstname Lastname: YYMMDD_FL_HorovodIntegration)
  • Install Horovod
  • Add Horovod installation to installation guide
  • Add Horovod commands to Trainer class
    • Make usage of Horovod optional by adding a parameter in ParametersTraining
  • Add an example_*.py file to showcase Horovod integration and test it locally with downsized, preprocessed data
  • Contact @fiedle09 to set up a training session on the hemera cluster and test it
  • Create a merge request, assign @fiedle09 - Done!

Clean up examples

In GitLab by @RandomDefaultUser on Feb 10, 2021, 13:28

ex01 and ex04 are essentially identical, since the new DataLoader does not support snapshot splitting anymore. But the other examples could also benefit from some cleanup.

Add new hyperparameter optimization

In GitLab by @RandomDefaultUser on Dec 21, 2020, 09:14

Student project: Add new hyperparameter optimization to the workflow. This includes (tasks not necessarily in optimal order):

  • Create and checkout your own branch (if your name is Firstname Lastname: YYMMDD_FL_NewHyperparameterOptimization)
  • Read the paper for Orthogonal Array Tuning (https://arxiv.org/abs/1907.13359, https://github.com/xiangzhang1015/OATM)
  • Add a new HyperparameterOptimization class to the network/ folder (e.g. HyperparameterOptimizerOAT)
  • Implement perform_study(self, data_handler) and set_optimal_parameters(self) for this new class:
    • perform_study: identifies the optimal hyperparameters
    • set_optimal_parameters: writes these parameters back to the parameter object
  • Implement the Orthogonal Array Tuning method in perform_study
    • You can use the Objective classes if you want, but you don't have to
    • You only need to implement it for feedforward network architectures
    • Control all parameters for the tuning by adding parameters to ParametersHyperparameterOptimization
  • Preferably use the .hlist object from ParametersHyperparameterOptimization as a user interface
    • You don't have to use the OptunaParameter class with it if you don't want to!
  • Add an example_*.py file to showcase the new hyperparameter optimization and test it locally with downsized, preprocessed data
  • Contact @fiedle09 to set up a training session on the hemera cluster and test it
  • Create a merge request, assign @fiedle09 - Done!

Optional tasks, if you have a lot of time:

  • Add a HyperparameterOptimizationBase class
    • Refactor the existing classes into HyperparameterOptimizationOptuna and HyperparameterOptimizationOAT and let them inherit from base class
  • If you used an Objective class for the implementation of OAT, make sure correct inheritance is guaranteed

Test LDOS parser QE

In GitLab by @RandomDefaultUser on Dec 15, 2020, 10:08

Test the QE LDOS parser with multiple energy grids with <100 points. It should work, but it would be good to test it.

make data repository

In GitLab by @RandomDefaultUser on Jan 20, 2021, 14:11

So far, all the data needed for the examples is pushed directly into this repo. That feels excessive; we should make a separate data repo.

Make LAMMPS interface more stable

In GitLab by @RandomDefaultUser on Dec 15, 2020, 10:02

LAMMPS is called using its Python API and one of their "python" run scripts. So far the implementation requires a disk operation, i.e. the data is written out and read back via ASE, and even though this is only atomic data, it is still unnecessary. Furthermore, LAMMPS crashes if a relative path is provided; this needs to be investigated.

Segfault when running some examples

In GitLab by @RandomDefaultUser on Jan 13, 2021, 00:48

$ python3 ex01_run_singleshot.py
[....]
--- Central debugging parameters. Can be used
    to e.g. reduce number of data. ---
        grid_dimensions: []
zsh: segmentation fault  python3 ex01_run_singleshot.py

Found with

  • ex01_run_singleshot.py
  • ex02_hyperparameter_optimization.py
  • ex04_snapshot_splitting.py

ex05 not tested (no special lammps version installed ATM).

Make MNIST executable again

In GitLab by @RandomDefaultUser on Dec 15, 2020, 10:11

The MNIST class was the first thing in this repo; it has not been maintained since the first, preliminary tests. It is not really necessary (since it is a classification problem, whereas we actually want to approximate a function) but it would nonetheless be nice to have it at hand.

Unify units

In GitLab by @RandomDefaultUser on Jan 5, 2021, 14:54

We should use consistent units throughout the code. I would suggest using eV and Å and adding converters upon input.

Lazy loading

In GitLab by @RandomDefaultUser on Jan 20, 2021, 14:54

So far, all the data is read into RAM. While this is fast, it is very restrictive in terms of the architectures on which FESL can run. Add a convenient way to use lazy loading.
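One low-effort way to get lazy loading for .npy snapshots is numpy's memory mapping, sketched here with a synthetic snapshot file (file name and array shape are illustrative):

```python
# Sketch of lazy loading via numpy memory mapping: the snapshot file
# stays on disk and only the rows actually indexed are read into RAM.
import numpy as np

snapshot = np.random.rand(1000, 94).astype(np.float32)
np.save("snapshot0.npy", snapshot)

# mmap_mode="r" opens the file without reading it into memory.
lazy = np.load("snapshot0.npy", mmap_mode="r")
batch = np.asarray(lazy[0:64])  # only this slice is pulled from disk
```

This keeps the familiar array indexing interface, so a Dataset wrapper around it would look the same as the RAM-based one.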

Clean up length units

In GitLab by @RandomDefaultUser on Jan 14, 2021, 10:57

I just realized that out of convenience I left all length units in Bohr. So while energy units are following ASE notations by being in eV, length units do not. We should make the switch to Angstrom soon.
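The input-side conversion could be as simple as the following sketch; the CODATA value of the Bohr radius is hard-coded here, but ase.units.Bohr provides the same constant if ASE is available. Function and argument names are illustrative:

```python
# Sketch of input-side unit conversion to Angstrom (the ASE convention).
BOHR_TO_ANGSTROM = 0.529177210903  # CODATA Bohr radius in Angstrom

def to_angstrom(length, unit="Angstrom"):
    """Convert an input length to Angstrom, the internal unit."""
    if unit == "Bohr":
        return length * BOHR_TO_ANGSTROM
    if unit == "Angstrom":
        return length
    raise ValueError(f"Unknown length unit: {unit}")

cell_length = to_angstrom(7.653, unit="Bohr")
```

With converters like this at every input boundary, the internal code never needs to know which unit the user supplied.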

Prepare a pre-alpha release

In GitLab by @RandomDefaultUser on Dec 15, 2020, 09:57

Since some new people are maybe coming on board for this project and the code so far has only been run on Lenz' laptop, the repo and code need to be prepared for easy installation and access. This includes good setup routines (or at least guides!) and descriptive examples.

Find horovod bug

In GitLab by @RandomDefaultUser on Feb 15, 2021, 21:31

There seems to be some kind of bug in our horovod implementation, or maybe in the way we access it. I was trying to recreate the 298K example in the Sandia paper, and I ran into the following behavior (see attached Figure_1):

In this example two snapshots and a "small" net are used. Horovod for some reason provides no speedup whatsoever. I don't know why this is so different from the tests made in the context of #21. A possible explanation would be that we got lucky with the amount of data we used or something like that.
As far as I can tell the problem seems to be related to the overhead horovod naturally introduces for the allreduce operations. My assumption would be that the speedup from multiple GPUs only evens out the overhead if either the GPU processing is rather slow, which might not be the case here, or if the overhead is not too big; the latter could hint that the horovod installation on hemera is not correctly configured. I would not find that unlikely; it was a real problem setting it up in the first place.

Anyway, this is imho the most important issue at the moment. Without this fixed we cannot really go to big training runs. For the 298K example I can run with 1 GPU; ~3 minutes/epoch is alright and should ensure convergence in a reasonable time. But for anything bigger this will become prohibitive fast.

Evaluate inference variance w.r.t. random training init

In GitLab by @RandomDefaultUser on Jan 20, 2021, 22:53

As a follow-up to #31, we should investigate the variance of the FESL results. I assume by default we don't fix the random init of the optimizer before training? If so, then it would be important to know what influence that has e.g. train 10 times (more is better), what's the variance/standard deviation/appropriate statistical measure of FESL etot and how does it compare to the difference mean(FESL) - DFT_reference.
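The suggested analysis boils down to a few lines; here the ten "training runs" are replaced by synthetic energies drawn around a made-up DFT reference, so all numbers are placeholders, not real FESL results:

```python
# Sketch of the proposed statistics: train N times with different random
# inits, collect the predicted total energies, and compare spread vs.
# bias against the DFT reference. Numbers are synthetic placeholders.
import numpy as np

dft_reference = -105.31                    # hypothetical DFT total energy (eV)
rng = np.random.default_rng(0)
# Stand-in for "train 10 times and predict etot each time":
fesl_etot = dft_reference + rng.normal(loc=0.02, scale=0.05, size=10)

mean_etot = fesl_etot.mean()
std_etot = fesl_etot.std(ddof=1)           # sample standard deviation
bias = mean_etot - dft_reference           # mean(FESL) - DFT_reference
```

Comparing std_etot (run-to-run variance) against bias (systematic offset) tells us whether fixing the random init even matters.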

UPDATE: Extend/Improve CI

In GitLab by @RandomDefaultUser on Dec 17, 2020, 09:44

  • Enable GPU support
  • Add LAMMPS/QE installation to the yaml file, so that we can test the total energy module and SNAP descriptor creation as well
  • Generally add more tests
  • (optional) reduce build size of the tests; currently one CI run of the tests takes about 8-10 minutes, which is quite a bit but still ok in my opinion, but the big question is how this scales with an increased number of tests

Add training visualization

In GitLab by @RandomDefaultUser on Dec 21, 2020, 08:44

Student project: Add training data visualization to the workflow. This includes (tasks not necessarily in optimal order):

  • Create and checkout your own branch (if your name is Firstname Lastname: YYMMDD_FL_AddTrainingVisualization)
  • Identify and install the visualization framework of your choice (e.g. WandB, TensorBoard)
  • Add installation of this framework to the installation guide
  • Add a Visualizer class as visualizer.py in the network/ subdirectory
  • Add a new Parameter subclass ParametersVisualization in parameter.py and make sure it gets called in the constructor of the Parameter class
    • Use this Parameter subclass to bundle all parameters you need during implementation
  • Setup visualization in the Trainer class
    • either pass the visualization object in the constructor of Trainer or instantiate it in the constructor
    • visualize relevant data (training loss, validation loss, test loss)
    • add options for visualization to be passed by the ParametersVisualization class (such as "visualize_training_loss = False")
  • Add an example_*.py file to showcase visualization and test it locally with downsized, preprocessed data
  • Contact @fiedle09 to set up a training session on the hemera cluster and test it
  • Create a merge request, assign @fiedle09 - Done!
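A minimal sketch of the proposed Visualizer; a real implementation would forward these calls to e.g. TensorBoard's SummaryWriter, and all class and option names here are illustrative, not an existing MALA API:

```python
# Minimal Visualizer sketch: collects per-epoch losses reported by the
# Trainer and writes them to CSV (stand-in for a TensorBoard backend).
import csv

class Visualizer:
    def __init__(self, visualize_training_loss=True):
        # Option mirroring the suggested ParametersVisualization flag.
        self.visualize_training_loss = visualize_training_loss
        self.records = []

    def log(self, epoch, training_loss, validation_loss):
        if self.visualize_training_loss:
            self.records.append((epoch, training_loss, validation_loss))

    def save(self, path):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["epoch", "training_loss", "validation_loss"])
            writer.writerows(self.records)

vis = Visualizer()
for epoch in range(3):
    vis.log(epoch, training_loss=1.0 / (epoch + 1),
            validation_loss=1.2 / (epoch + 1))
vis.save("losses.csv")
```

The Trainer would call log() once per epoch; swapping the CSV backend for SummaryWriter later would not change that interface.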

Optimize network

In GitLab by @RandomDefaultUser on Dec 15, 2020, 10:06

So far, only preliminary tests have been made, and most time has been spent in setting up the code rather than actually testing the performance. This issue implies the installation on an HPC cluster as well.

Job Failed #231375

In GitLab by @RandomDefaultUser on Feb 21, 2021, 11:39

Job #231375 failed for d9ba788:

I have now used a new yml file fesl_cpu_environment.yml created from fesl_cpu_base_environment.yml according to the commit description, and now this error arises in test-basic-functions. Hint: this yml file includes neither torchvision nor torchaudio, as those packages do not seem to be used at all.

File "fesl_tests.py", line 19, in <module>
    if test_tensor_memory(data_path+"Al_debug_2k_nr0.in.npy", standard_accuracy):
  File "/builds/multiscale-wdm/surrogate-models/fesl/fesl/test/tensor_memory.py", line 34, in test_tensor_memory
    test1 = torch.abs(torch.sum(torch_tensor-loaded_array[0:index1]))
TypeError: sub(): argument 'other' (position 1) must be Tensor, not numpy.ndarray

Update requirements.txt / environment.yaml

In GitLab by @RandomDefaultUser on Feb 11, 2021, 11:26

Related to #38.
The environment.yaml is currently outdated (and I believe the requirements.txt too). The former does not include oapackage and the latter does not install swig (I think; I am not sure and I can't test it at the moment). Furthermore, it seems that we now have a restriction on which Python version to use, at least on the cluster. The total energy package there is built with a specific Python version (it was originally python3.6.5, Mani might be changing it). Maybe we should talk about how we proceed here in our next meeting.

Clear up Scaling confusion

In GitLab by @RandomDefaultUser on Feb 16, 2021, 16:51

Data scaling has to be done using different algorithms in the lazy loading and the RAM case, since in the former snapshots are processed one after another. This leads to "incremental" scaling. For normalization, both algorithms yield the same result. For standardization, the incremental code is based on Sandia code and does not exactly reproduce the RAM code. This leads to loss values that are not fully comparable. We have to investigate this difference and either fix the incremental code or use the incremental code for the RAM case as well.
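For standardization, a Welford-style incremental update reproduces the full-data mean and variance exactly (up to floating-point error), which would make the lazy-loading and RAM paths comparable. A sketch with synthetic snapshots:

```python
# Sketch of incremental standardization statistics (Welford's algorithm):
# processing snapshots one by one should match mean/variance computed
# over all data at once. Array contents are synthetic.
import numpy as np

def incremental_mean_var(snapshots):
    n, mean, m2 = 0, 0.0, 0.0
    for snap in snapshots:
        for x in snap.ravel():
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
    return mean, m2 / n  # population variance, like np.var(ddof=0)

rng = np.random.default_rng(42)
snapshots = [rng.normal(size=100) for _ in range(3)]

inc_mean, inc_var = incremental_mean_var(snapshots)
all_data = np.concatenate(snapshots)
```

If the Sandia-based incremental code deviates from this, that deviation would explain the non-comparable loss values.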

ex07: unstable integration

In GitLab by @RandomDefaultUser on Jan 13, 2021, 00:51

$ python3 ex07_dos_analysis.py                                                                                                                                
Welcome to FESL.
Running ex07_dos_analysis.py
/home/elcorto/work/gitlab/multiscale-wdm/surrogate-models/fesl/fesl/fesl/targets/dos.py:134: IntegrationWarning: The occurrence of roundoff error is detected, which prevents 
  the requested tolerance from being achieved.  The error may be 
  underestimated.
  number_of_electrons, abserr = integrate.quad(
/home/elcorto/work/gitlab/multiscale-wdm/surrogate-models/fesl/fesl/fesl/targets/dos.py:165: IntegrationWarning: The occurrence of roundoff error is detected, which prevents 
  the requested tolerance from being achieved.  The error may be 
  underestimated.
  band_energy, abserr = integrate.quad(
/home/elcorto/work/gitlab/multiscale-wdm/surrogate-models/fesl/fesl/fesl/targets/calculation_helpers.py:47: RuntimeWarning: overflow encountered in exp
  return 1.0 / (1.0 + np.exp((energy_ev - fermi_energy_ev) / (kB * temperature_K)))
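The overflow in the last warning can be avoided with an algebraically equivalent, numerically stable form of the Fermi function, since 1/(1 + exp(x)) = 0.5 * (1 - tanh(x/2)) and tanh never overflows. A sketch (variable names follow the traceback above; this is not necessarily how dos.py should be patched):

```python
# Numerically stable Fermi-Dirac function: the naive 1/(1 + exp(x))
# overflows for large positive x, the tanh form does not.
import numpy as np

kB = 8.617333262e-5  # Boltzmann constant in eV/K

def fermi_function(energy_ev, fermi_energy_ev, temperature_K):
    x = (energy_ev - fermi_energy_ev) / (kB * temperature_K)
    return 0.5 * (1.0 - np.tanh(0.5 * x))

# Arguments far from the Fermi energy overflow exp() but are fine here:
occ = fermi_function(np.array([-50.0, 0.0, 50.0]), 0.0, 298.0)
```

This would silence the RuntimeWarning without changing any computed occupations.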

ex03: path hard-coded

In GitLab by @RandomDefaultUser on Jan 13, 2021, 00:49

FileNotFoundError: [Errno 2] No such file or directory: '/home/fiedlerl/data/test_fp_snap/2.699gcc/Al_fp_200x200x200grid_94comps_snapshot0.npy'

New and modified files after linking with data repo and performing tests

In GitLab by @RandomDefaultUser on Feb 18, 2021, 14:08

  1. After linking the fesl repo with the data repo, three new files are generated (see below). It may be a good idea to put those files in .gitignore of the fesl repo.
  2. After executing python ex99_verify_all_examples.py four files in examples/data/ got modified (see below). This puzzles me; files in the repo should normally not be modified by programs without the intention to commit those changes. We'll need to discuss how to handle this issue.
❯ git status
On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   examples/data/ex08_iscaler.pkl
        modified:   examples/data/ex08_network.pth
        modified:   examples/data/ex08_oscaler.pkl
        modified:   examples/data/ex08_params.pkl

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        examples/data_repo_path.py
        install/data_repo_link/data_repo_path.py
        test/data_repo_path.py

Adapt docstrings to Numpy Style

In GitLab by @RandomDefaultUser on Feb 6, 2021, 11:08

Up to now only a few docstrings have been adapted to the Numpy Style.

All modules/submodules need to be revised in this regard:

  • common
  • datahandling
  • descriptors
  • network
  • targets

Please check Numpy Style validity with pydocstyle --convention=numpy foo.py

Do more horovod benchmarks

In GitLab by @RandomDefaultUser on Feb 4, 2021, 11:34

As discussed in #21 there is still a small oddity when using 2 GPUs per node. It is not prohibitive for our current investigations, but it should not be forgotten. What would be really nice would be a comprehensive study of which parallelization strategy works best for a real-world example, so that we may use it for all upcoming calculations. This could be a good task for student assistants.

Add better inference handling

In GitLab by @RandomDefaultUser on Feb 10, 2021, 16:50

Just like we have an option to train without specifying test data we should have an option to just add test data and do an inference. I had something similar before redoing the DataHandler, but even that would not have worked.
What we need is a class that also interfaces the post processing.

CUDA stability

In GitLab by @RandomDefaultUser on Jan 7, 2021, 15:49

The CUDA implementation is minimal and error-prone. There is no check whether or not a network is actually on CUDA, etc. Improve this.

post-processing recap

In GitLab by @RandomDefaultUser on Jan 13, 2021, 09:30

I believe some of the following items have already been checked by @fiedle09.

Check the numerical integration routines on QE data:

  • Integrate the QE density (Al.dens) over spatial grid. Does this yield the correct number of electrons?
  • Integrate QE LDOS over energy grid. Does this yield the correct density (when compared to Al.dens)?
  • Integrate the QE LDOS over the spatial grid. Does this yield the correct DOS (when compared to Al.dos)?
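The first check can be sketched with the trapezoidal rule on a synthetic 1D density standing in for Al.dens (the grid and electron count here are made up for illustration):

```python
# Sketch of the density integration check: integrating the density over
# the grid should recover the expected number of electrons.
import numpy as np

n_electrons_expected = 3.0  # e.g. the valence electrons of one Al atom
grid = np.linspace(-10.0, 10.0, 2001)
# Normalized Gaussian scaled to integrate to n_electrons_expected:
density = n_electrons_expected * np.exp(-grid**2) / np.sqrt(np.pi)

# Composite trapezoidal rule over the grid:
n_electrons = np.sum(0.5 * (density[1:] + density[:-1]) * np.diff(grid))
```

The same pattern applies to the LDOS checks, just with the integration axis swapped (energy grid vs. spatial grid).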

Add postprocessing

In GitLab by @RandomDefaultUser on Dec 15, 2020, 10:15

The value of the loss function is not actually the metric we are interested in. What we are actually interested in is the energy. Add postprocessing functionalities to access this metric, which should help training/optimization.

Clean up requirement and environment files

In GitLab by @RandomDefaultUser on Feb 5, 2021, 19:15

I'd love to have a separate requirements.txt (and maybe a conda yaml file) for the installation requirements of the Sphinx documentation system. To be more precise, I recommend:

  • rename requirements.txt to requirements_fesl.txt
  • rename environment.yaml to environment_fesl.yaml
  • add new files requirements_docs.txt and environment_docs.yaml
  • put/move all these files inside install folder
  • update .gitlab-ci.yaml (make use of above files), setup.py and documentation referring to those files

@cangi21 @fiedle09 @schmer52 What do you think about it?

Make data rescaling more efficient

In GitLab by @RandomDefaultUser on Dec 15, 2020, 10:12

At the moment, data rescaling always requires additional memory. This is somewhat unavoidable, as it transforms the data rather than just reslicing it. But maybe we can still save some memory here, e.g. by discarding the raw data.

Make classes more "self-sufficient"

In GitLab by @RandomDefaultUser on Dec 15, 2020, 10:09

At the moment, it is necessary for the user to create target/descriptor objects just to call a datahandler. In theory, the HandlerInterface could create these objects as well. Refactor the code so that it does.

Use horovod for lazy-loading-like functionality

In GitLab by @RandomDefaultUser on Feb 8, 2021, 15:53

I've read a little bit about horovod, and from what I understand now horovod reads all the data on each process unless specified otherwise. This is why we need both horovod AND lazy loading. This is far from optimal, since it means training is very slow for bigger datasets (because we have to use lazy loading) or we waste a lot of RAM (if we can fit the entire data set into memory, we do so multiple times across each node, which is equally bad).
Ideally our parallelization strategy would look like this:

  1. Find out if we have enough processes so that lazy loading is not needed
  2. If that is the case, split the dataset evenly and THEN load it into memory
  3. If 1. is not the case, use lazy loading; at this point, no splitting of the dataset is needed. We are loading it into memory anyway.

For a first implementation one could put 1. in the hands of the user - automatic assessment of the system might be a little bit over-engineered. But 2. and 3. would bring massive performance gains if I am not mistaken.
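Step 2 above could be sketched as follows; with horovod the rank and size would come from hvd.rank() and hvd.size(), here they are plain arguments and the file names are made up:

```python
# Sketch of rank-based data sharding: split the list of snapshot files
# evenly across processes BEFORE loading, so each rank only reads its
# own shard into memory.
import numpy as np

def shard_snapshots(snapshot_files, rank, size):
    """Return the subset of snapshot files this rank should load."""
    shards = np.array_split(snapshot_files, size)
    return list(shards[rank])

files = [f"snapshot{i}.npy" for i in range(10)]
my_files = shard_snapshots(files, rank=1, size=4)
```

Every file ends up in exactly one shard, so across all ranks the full dataset is covered once instead of size times.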
