mala-project / mala

Materials Learning Algorithms. A framework for machine learning materials properties from first-principles data.

Home Page: https://mala-project.github.io/mala/

License: BSD 3-Clause "New" or "Revised" License

Dockerfile 0.09% Python 96.27% Shell 0.20% Fortran 3.43%
machine-learning dft density-functional-theory electronic-structure neural-network

mala's Introduction


MALA


MALA (Materials Learning Algorithms) is a data-driven framework to generate surrogate models of density functional theory calculations based on machine learning. Its purpose is to enable multiscale modeling by bypassing computationally expensive steps in state-of-the-art density functional simulations.

MALA is designed as a modular and open-source Python package. It enables users to perform the entire modeling toolchain using only a few lines of code. MALA is jointly developed by Sandia National Laboratories (SNL) and the Center for Advanced Systems Understanding (CASUS). See Contributing for contributing code to the repository.

This repository is structured as follows:

├── examples : contains useful examples to get you started with the package
├── install : contains scripts for setting up this package on your machine
├── mala : the source code itself
├── test : test scripts used during development, will hold tests for CI in the future
└── docs : Sphinx documentation folder

Installation

WARNING: Even if you install MALA via PyPI, please consult the full installation instructions afterwards. External modules (like the QuantumESPRESSO bindings) are not distributed via PyPI!

Please refer to Installation of MALA.

Running

You can familiarize yourself with the usage of this package by running the examples in the examples/ folder.

Contributors

MALA is jointly maintained by SNL and CASUS.

A full list of contributors can be found here.

Citing MALA

If you publish work which uses or mentions MALA, please cite the following paper:

J. A. Ellis, L. Fiedler, G. A. Popoola, N. A. Modine, J. A. Stephens, A. P. Thompson, A. Cangi, S. Rajamanickam (2021). Accelerating Finite-temperature Kohn-Sham Density Functional Theory with Deep Neural Networks. Phys. Rev. B 104, 035120 (2021)

alongside this repository.

mala's People

Contributors

acangi, athomps, danielkotik, dytnvgl, elcorto, ellisja, franzpoeschel, gapopoo, htahmasbi, jadamstephens, johelli, joshrackers, kyledmiller, msverma101, nils-hoffmann, omarhexa, randomdefaultuser, srajama1, szabo137, timcallow, vlad-oles, zevrap-81


mala's Issues

Add an .xml interface for parameters

In GitLab by @RandomDefaultUser on Dec 15, 2020, 10:14

It would be very useful to have an XML interface for the parameter class. As the entire workflow is controlled by one central parameter class, being able to save and load it would make portability much easier; one could conveniently store data and metadata in a couple of files.
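A minimal sketch of what such an interface could look like, using only the standard library; the Parameters class and its attributes here are illustrative stand-ins, not MALA's actual API:

```python
# Hypothetical sketch of XML (de)serialization for a flat parameter class.
# Attribute names are illustrative, not MALA's actual Parameters API.
import xml.etree.ElementTree as ET

class Parameters:
    def __init__(self):
        self.learning_rate = 0.001
        self.max_epochs = 100
        self.network_type = "feed-forward"

    def save_xml(self, path):
        root = ET.Element("parameters")
        for name, value in vars(self).items():
            child = ET.SubElement(root, name)
            child.set("type", type(value).__name__)  # remember the type for loading
            child.text = str(value)
        ET.ElementTree(root).write(path)

    @classmethod
    def load_xml(cls, path):
        params = cls()
        for child in ET.parse(path).getroot():
            caster = {"int": int, "float": float, "str": str}[child.get("type")]
            setattr(params, child.tag, caster(child.text))
        return params

p = Parameters()
p.max_epochs = 250
p.save_xml("params.xml")
restored = Parameters.load_xml("params.xml")
```

Storing the type alongside each value lets the loader restore both values and types, which is what the data/metadata use case needs.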

Add a checkpoint functionality

In GitLab by @RandomDefaultUser on Feb 16, 2021, 10:16

As it is anticipated that we will run longer training runs pretty soon and GPU jobs are limited to 48h on hemera, we should implement a "checkpointing" functionality that saves the training results e.g. every 5 iterations.
Ideally, this would extend to the hyperparameter optimization as well. I am not completely sure how to do this though...
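A sketch of the resume-from-checkpoint pattern this suggests; in MALA the state would hold the network's and optimizer's state_dict() and be written with torch.save, but the logic is the same. All names and the file path are illustrative:

```python
# Minimal sketch of epoch-based checkpointing with resume support.
# In practice the state would be torch state_dicts saved via torch.save.
import os
import pickle

CHECKPOINT_EVERY = 5  # save every 5 iterations, as suggested above

def train(max_epochs, checkpoint_path="demo_checkpoint.pkl"):
    # Resume from a previous run if a checkpoint exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"epoch": 0, "loss_history": []}

    for epoch in range(state["epoch"], max_epochs):
        loss = 1.0 / (epoch + 1)          # stand-in for a real training step
        state["loss_history"].append(loss)
        state["epoch"] = epoch + 1
        if state["epoch"] % CHECKPOINT_EVERY == 0:
            with open(checkpoint_path, "wb") as f:
                pickle.dump(state, f)
    return state

state = train(12)
```

If the 48h job is killed, the next job picks up at the last saved epoch instead of epoch 0.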

Integrate Horovod into code

In GitLab by @RandomDefaultUser on Dec 21, 2020, 09:32

Student project: Integrate Horovod into the workflow. This includes (tasks not necessarily in optimal order):

  • Create and checkout your own branch (if your name is Firstname Lastname: YYMMDD_FL_HorovodIntegration)
  • Install Horovod
  • Add Horovod installation to installation guide
  • Add Horovod commands to Trainer class
    • Make usage of Horovod optional by adding a parameter in ParametersTraining
  • Add an example_*.py file to showcase Horovod integration and test it locally with downsized, preprocessed data
  • Contact @fiedle09 to set up a training session on the hemera cluster and test it
  • Create a merge request, assign @fiedle09 - Done!

Clean up examples

In GitLab by @RandomDefaultUser on Feb 10, 2021, 13:28

ex01 and ex04 are essentially identical, since the new DataLoader does not support snapshot splitting anymore. But the other examples could also benefit from some cleanup.

Add new hyperparameter optimization

In GitLab by @RandomDefaultUser on Dec 21, 2020, 09:14

Student project: Add new hyperparameter optimization to the workflow. This includes (tasks not necessarily in optimal order):

  • Create and checkout your own branch (if your name is Firstname Lastname: YYMMDD_FL_NewHyperparameterOptimization)
  • Read the paper for Orthogonal Array Tuning (https://arxiv.org/abs/1907.13359, https://github.com/xiangzhang1015/OATM)
  • Add a new HyperparameterOptimization class to the network/ folder (e.g. HyperparameterOptimizerOAT)
  • Implement perform_study(self, data_handler) and set_optimal_parameters(self) for this new class:
    • perform_study: identifies the optimal hyperparameters
    • set_optimal_parameters: writes these parameters back to the parameter object
  • Implement the Orthogonal Array Tuning method in perform_study
    • You can use the Objective classes if you want, but you don't have to
    • You only need to implement it for feedforward network architectures
    • Control all parameters for the tuning by adding parameters to ParametersHyperparameterOptimization
  • Preferably use the .hlist object from ParametersHyperparameterOptimization as a user interface
    • You don't have to use the OptunaParameter class with it if you don't want to!
  • Add an example_*.py file to showcase the new hyperparameter optimization and test it locally with downsized, preprocessed data
  • Contact @fiedle09 to set up a training session on the hemera cluster and test it
  • Create a merge request, assign @fiedle09 - Done!

Optional tasks, if you have a lot of time:

  • Add a HyperparameterOptimizationBase class
    • Refactor the existing classes into HyperparameterOptimizationOptuna and HyperparameterOptimizationOAT and let them inherit from base class
  • If you used an Objective class for the implementation of OAT, make sure correct inheritance is guaranteed

Test LDOS parser QE

In GitLab by @RandomDefaultUser on Dec 15, 2020, 10:08

Test the QE LDOS parser with multiple energy grids with <100 points. It should work, but it would be good to test it.

make data repository

In GitLab by @RandomDefaultUser on Jan 20, 2021, 14:11

So far, all the data needed for the examples is pushed directly into this repo. That feels excessive; we should make a separate data repo.

Make LAMMPS interface more stable

In GitLab by @RandomDefaultUser on Dec 15, 2020, 10:02

LAMMPS is called using its Python API and one of their "python" run scripts. So far the implementation requires a disk operation, i.e. the data is written out and read back via ASE, and even though this is only atomic data, it is still unnecessary. Furthermore, LAMMPS crashes if a relative path is provided; this needs to be investigated.

Segfault when running some examples

In GitLab by @RandomDefaultUser on Jan 13, 2021, 00:48

$ python3 ex01_run_singleshot.py
[....]
--- Central debugging parameters. Can be used
    to e.g. reduce number of data. ---
        grid_dimensions: []
zsh: segmentation fault  python3 ex01_run_singleshot.py

Found with

  • ex01_run_singleshot.py
  • ex02_hyperparameter_optimization.py
  • ex04_snapshot_splitting.py

ex05 not tested (no special lammps version installed ATM).

Make MNIST executable again

In GitLab by @RandomDefaultUser on Dec 15, 2020, 10:11

The MNIST class was the first thing in this repo; it has not been maintained since the first, preliminary tests. It is not really necessary (since it is a classification problem, whereas we actually want to approximate a function) but it would nonetheless be nice to have it at hand.

Unify units

In GitLab by @RandomDefaultUser on Jan 5, 2021, 14:54

We should use consistent units throughout the code. I would suggest using eV and Å and adding converters upon input.

Lazy loading

In GitLab by @RandomDefaultUser on Jan 20, 2021, 14:54

So far, all the data is read into RAM. While this is fast, it is very restrictive in terms of the architectures on which FESL can run. Add a convenient way to use lazy loading.
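One low-effort way to get lazy loading for .npy snapshots is numpy's memory mapping, sketched here with a synthetic snapshot file (file name and array shape are illustrative):

```python
# Sketch of lazy loading via numpy memory mapping: the snapshot file
# stays on disk and only the rows actually indexed are read into RAM.
import numpy as np

snapshot = np.random.rand(1000, 94).astype(np.float32)
np.save("snapshot0.npy", snapshot)

# mmap_mode="r" opens the file without reading it into memory.
lazy = np.load("snapshot0.npy", mmap_mode="r")
batch = np.asarray(lazy[0:64])  # only this slice is pulled from disk
```

This keeps the familiar array indexing interface, so a Dataset wrapper around it would look the same as the RAM-based one.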

Clean up length units

In GitLab by @RandomDefaultUser on Jan 14, 2021, 10:57

I just realized that out of convenience I left all length units in Bohr. So while energy units are following ASE notations by being in eV, length units do not. We should make the switch to Angstrom soon.
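The input-side conversion could be as simple as the following sketch; the CODATA value of the Bohr radius is hard-coded here, but ase.units.Bohr provides the same constant if ASE is available. Function and argument names are illustrative:

```python
# Sketch of input-side unit conversion to Angstrom (the ASE convention).
BOHR_TO_ANGSTROM = 0.529177210903  # CODATA Bohr radius in Angstrom

def to_angstrom(length, unit="Angstrom"):
    """Convert an input length to Angstrom, the internal unit."""
    if unit == "Bohr":
        return length * BOHR_TO_ANGSTROM
    if unit == "Angstrom":
        return length
    raise ValueError(f"Unknown length unit: {unit}")

cell_length = to_angstrom(7.653, unit="Bohr")
```

With converters like this at every input boundary, the internal code never needs to know which unit the user supplied.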

Prepare a pre-alpha release

In GitLab by @RandomDefaultUser on Dec 15, 2020, 09:57

Since some new people are maybe coming on board for this project and the code so far has only been run on Lenz' laptop, the repo and code need to be prepared for easy installation and access. This includes good setup routines (or at least guides!) and descriptive examples.

Find horovod bug

In GitLab by @RandomDefaultUser on Feb 15, 2021, 21:31

There seems to be some kind of bug in our horovod implementation, or maybe in the way we access it. I was trying to recreate the 298K example in the Sandia paper, and I ran into the following behavior (see attached Figure_1):

In this example two snapshots and a "small" net are used. Horovod for some reason provides no speedup whatsoever. I don't know why this is so different from the tests made in the context of #21. A possible explanation would be that we got lucky with the amount of data we used or something like that.
As far as I can tell the problem seems to be related to the overhead horovod naturally introduces for the allreduce operations. My assumption would be that the speedup from multiple GPUs only evens out the overhead if either the GPU processing is rather slow, which might not be the case here, or if the overhead is not too big; the latter could hint that the horovod installation on hemera is not correctly configured. I would not find that unlikely; it was a real problem setting it up in the first place.

Anyway, this is imho the most important issue at the moment. Without this fixed we cannot really go to big training runs. For the 298K example I can run with 1 GPU; ~3 minutes/epoch is alright and should ensure convergence in a reasonable time. But for anything bigger this will become prohibitive fast.

Evaluate inference variance w.r.t. random training init

In GitLab by @RandomDefaultUser on Jan 20, 2021, 22:53

As a follow-up to #31, we should investigate the variance of the FESL results. I assume by default we don't fix the random init of the optimizer before training? If so, then it would be important to know what influence that has e.g. train 10 times (more is better), what's the variance/standard deviation/appropriate statistical measure of FESL etot and how does it compare to the difference mean(FESL) - DFT_reference.
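The suggested analysis boils down to a few lines; here the ten "training runs" are replaced by synthetic energies drawn around a made-up DFT reference, so all numbers are placeholders, not real FESL results:

```python
# Sketch of the proposed statistics: train N times with different random
# inits, collect the predicted total energies, and compare spread vs.
# bias against the DFT reference. Numbers are synthetic placeholders.
import numpy as np

dft_reference = -105.31                    # hypothetical DFT total energy (eV)
rng = np.random.default_rng(0)
# Stand-in for "train 10 times and predict etot each time":
fesl_etot = dft_reference + rng.normal(loc=0.02, scale=0.05, size=10)

mean_etot = fesl_etot.mean()
std_etot = fesl_etot.std(ddof=1)           # sample standard deviation
bias = mean_etot - dft_reference           # mean(FESL) - DFT_reference
```

Comparing std_etot (run-to-run variance) against bias (systematic offset) tells us whether fixing the random init even matters.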

UPDATE: Extend/Improve CI

In GitLab by @RandomDefaultUser on Dec 17, 2020, 09:44

  • Enable GPU support
  • Add LAMMPS/QE installation to the yaml file, so that we can test the total energy module and SNAP descriptor creation as well
  • Generally add more tests
  • (optional) reduce build size of the tests; currently one CI run of the tests takes about 8-10 minutes, which is quite a bit but still ok in my opinion, but the big question is how this scales with an increased number of tests

Add training visualization

In GitLab by @RandomDefaultUser on Dec 21, 2020, 08:44

Student project: Add training data visualization to the workflow. This includes (tasks not necessarily in optimal order):

  • Create and checkout your own branch (if your name is Firstname Lastname: YYMMDD_FL_AddTrainingVisualization)
  • Identify and install the visualization framework of your choice (e.g. WandB, TensorBoard)
  • Add installation of this framework to the installation guide
  • Add a Visualizer class as visualizer.py in the network/ subdirectory
  • Add a new Parameter subclass ParametersVisualization in parameter.py and make sure it gets called in the constructor of the Parameter class
    • Use this Parameter subclass to bundle all parameters you need during implementation
  • Setup visualization in the Trainer class
    • either pass the visualization object in the constructor of Trainer or instantiate it in the constructor
    • visualize relevant data (training loss, validation loss, test loss)
    • add options for visualization to be passed by the ParametersVisualization class (such as "visualize_training_loss = False")
  • Add an example_*.py file to showcase visualization and test it locally with downsized, preprocessed data
  • Contact @fiedle09 to set up a training session on the hemera cluster and test it
  • Create a merge request, assign @fiedle09 - Done!
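A minimal sketch of the proposed Visualizer; a real implementation would forward these calls to e.g. TensorBoard's SummaryWriter, and all class and option names here are illustrative, not an existing MALA API:

```python
# Minimal Visualizer sketch: collects per-epoch losses reported by the
# Trainer and writes them to CSV (stand-in for a TensorBoard backend).
import csv

class Visualizer:
    def __init__(self, visualize_training_loss=True):
        # Option mirroring the suggested ParametersVisualization flag.
        self.visualize_training_loss = visualize_training_loss
        self.records = []

    def log(self, epoch, training_loss, validation_loss):
        if self.visualize_training_loss:
            self.records.append((epoch, training_loss, validation_loss))

    def save(self, path):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["epoch", "training_loss", "validation_loss"])
            writer.writerows(self.records)

vis = Visualizer()
for epoch in range(3):
    vis.log(epoch, training_loss=1.0 / (epoch + 1),
            validation_loss=1.2 / (epoch + 1))
vis.save("losses.csv")
```

The Trainer would call log() once per epoch; swapping the CSV backend for SummaryWriter later would not change that interface.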

Optimize network

In GitLab by @RandomDefaultUser on Dec 15, 2020, 10:06

So far, only preliminary tests have been made, and most time has been spent in setting up the code rather than actually testing the performance. This issue implies the installation on an HPC cluster as well.

Job Failed #231375

In GitLab by @RandomDefaultUser on Feb 21, 2021, 11:39

Job #231375 failed for d9ba788:

I have now used a new yml file fesl_cpu_environment.yml created from fesl_cpu_base_environment.yml according to the commit description, and now this error arises in test-basic-functions. Hint: this yml file includes neither torchvision nor torchaudio, as those packages do not seem to be used at all.

File "fesl_tests.py", line 19, in <module>
    if test_tensor_memory(data_path+"Al_debug_2k_nr0.in.npy", standard_accuracy):
  File "/builds/multiscale-wdm/surrogate-models/fesl/fesl/test/tensor_memory.py", line 34, in test_tensor_memory
    test1 = torch.abs(torch.sum(torch_tensor-loaded_array[0:index1]))
TypeError: sub(): argument 'other' (position 1) must be Tensor, not numpy.ndarray

Update requirements.txt / environment.yaml

In GitLab by @RandomDefaultUser on Feb 11, 2021, 11:26

Related to #38.
The environment.yaml is currently outdated (and I believe the requirements.txt too). The former does not include oapackage and the latter does not install swig (I think; I am not sure and I can't test it at the moment). Furthermore, it seems that we now have a restriction on which Python version to use, at least on the cluster. The total energy package there is built with a specific Python version (it was originally python3.6.5, Mani might be changing it). Maybe we should talk about how we proceed here in our next meeting.

Clear up Scaling confusion

In GitLab by @RandomDefaultUser on Feb 16, 2021, 16:51

Data scaling has to be done using different algorithms in the lazy loading and the RAM case, since in the former snapshots are processed one after another. This leads to "incremental" scaling. For normalization, both algorithms yield the same result. For standardization, the incremental code is based on Sandia code and does not exactly reproduce the RAM code. This leads to loss values that are not fully comparable. We have to investigate this difference and either fix the incremental code or use the incremental code for the RAM case as well.
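For standardization, a Welford-style incremental update reproduces the full-data mean and variance exactly (up to floating-point error), which would make the lazy-loading and RAM paths comparable. A sketch with synthetic snapshots:

```python
# Sketch of incremental standardization statistics (Welford's algorithm):
# processing snapshots one by one should match mean/variance computed
# over all data at once. Array contents are synthetic.
import numpy as np

def incremental_mean_var(snapshots):
    n, mean, m2 = 0, 0.0, 0.0
    for snap in snapshots:
        for x in snap.ravel():
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
    return mean, m2 / n  # population variance, like np.var(ddof=0)

rng = np.random.default_rng(42)
snapshots = [rng.normal(size=100) for _ in range(3)]

inc_mean, inc_var = incremental_mean_var(snapshots)
all_data = np.concatenate(snapshots)
```

If the Sandia-based incremental code deviates from this, that deviation would explain the non-comparable loss values.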

ex07: unstable integration

In GitLab by @RandomDefaultUser on Jan 13, 2021, 00:51

$ python3 ex07_dos_analysis.py                                                                                                                                
Welcome to FESL.
Running ex07_dos_analysis.py
/home/elcorto/work/gitlab/multiscale-wdm/surrogate-models/fesl/fesl/fesl/targets/dos.py:134: IntegrationWarning: The occurrence of roundoff error is detected, which prevents 
  the requested tolerance from being achieved.  The error may be 
  underestimated.
  number_of_electrons, abserr = integrate.quad(
/home/elcorto/work/gitlab/multiscale-wdm/surrogate-models/fesl/fesl/fesl/targets/dos.py:165: IntegrationWarning: The occurrence of roundoff error is detected, which prevents 
  the requested tolerance from being achieved.  The error may be 
  underestimated.
  band_energy, abserr = integrate.quad(
/home/elcorto/work/gitlab/multiscale-wdm/surrogate-models/fesl/fesl/fesl/targets/calculation_helpers.py:47: RuntimeWarning: overflow encountered in exp
  return 1.0 / (1.0 + np.exp((energy_ev - fermi_energy_ev) / (kB * temperature_K)))
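The overflow in the last warning can be avoided with an algebraically equivalent, numerically stable form of the Fermi function, since 1/(1 + exp(x)) = 0.5 * (1 - tanh(x/2)) and tanh never overflows. A sketch (variable names follow the traceback above; this is not necessarily how dos.py should be patched):

```python
# Numerically stable Fermi-Dirac function: the naive 1/(1 + exp(x))
# overflows for large positive x, the tanh form does not.
import numpy as np

kB = 8.617333262e-5  # Boltzmann constant in eV/K

def fermi_function(energy_ev, fermi_energy_ev, temperature_K):
    x = (energy_ev - fermi_energy_ev) / (kB * temperature_K)
    return 0.5 * (1.0 - np.tanh(0.5 * x))

# Arguments far from the Fermi energy overflow exp() but are fine here:
occ = fermi_function(np.array([-50.0, 0.0, 50.0]), 0.0, 298.0)
```

This would silence the RuntimeWarning without changing any computed occupations.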

ex03: path hard-coded

In GitLab by @RandomDefaultUser on Jan 13, 2021, 00:49

FileNotFoundError: [Errno 2] No such file or directory: '/home/fiedlerl/data/test_fp_snap/2.699gcc/Al_fp_200x200x200grid_94comps_snapshot0.npy'

New and modified files after linking with data repo and performing tests

In GitLab by @RandomDefaultUser on Feb 18, 2021, 14:08

  1. After linking the fesl repo with the data repo, three new files are generated (see below). It may be a good idea to put those files in .gitignore of the fesl repo.
  2. After executing python ex99_verify_all_examples.py four files in examples/data/ got modified (see below). This puzzles me; files in the repo should normally not be modified by programs without the intention to commit those changes. We'll need to discuss how to handle this issue.
❯ git status
On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   examples/data/ex08_iscaler.pkl
        modified:   examples/data/ex08_network.pth
        modified:   examples/data/ex08_oscaler.pkl
        modified:   examples/data/ex08_params.pkl

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        examples/data_repo_path.py
        install/data_repo_link/data_repo_path.py
        test/data_repo_path.py

Adapt docstrings to Numpy Style

In GitLab by @RandomDefaultUser on Feb 6, 2021, 11:08

Up to now only a few docstrings have been adapted to the Numpy Style.

All modules/submodules need to be revised in this regard:

  • common
  • datahandling
  • descriptors
  • network
  • targets

Please check Numpy Style validity with pydocstyle --convention=numpy foo.py

Do more horovod benchmarks

In GitLab by @RandomDefaultUser on Feb 4, 2021, 11:34

As discussed in #21 there is still a small oddity when using 2 GPUs per node. It is not prohibitive for our current investigations, but it should not be forgotten. What would be really nice would be a comprehensive study of which parallelization strategy works best for a real-world example, so that we may use it for all upcoming calculations. This could be a good task for student assistants.

Add better inference handling

In GitLab by @RandomDefaultUser on Feb 10, 2021, 16:50

Just like we have an option to train without specifying test data we should have an option to just add test data and do an inference. I had something similar before redoing the DataHandler, but even that would not have worked.
What we need is a class that also interfaces the post processing.

CUDA stability

In GitLab by @RandomDefaultUser on Jan 7, 2021, 15:49

The CUDA implementation is minimal and error-prone. There is no check whether or not a network is actually on CUDA, etc. Improve this.

post-processing recap

In GitLab by @RandomDefaultUser on Jan 13, 2021, 09:30

I believe some of the following items have already been checked by @fiedle09.

Check the numerical integration routines on QE data:

  • Integrate the QE density (Al.dens) over spatial grid. Does this yield the correct number of electrons?
  • Integrate QE LDOS over energy grid. Does this yield the correct density (when compared to Al.dens)?
  • Integrate the QE LDOS over the spatial grid. Does this yield the correct DOS (when compared to Al.dos)?
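The first check can be sketched with the trapezoidal rule on a synthetic 1D density standing in for Al.dens (the grid and electron count here are made up for illustration):

```python
# Sketch of the density integration check: integrating the density over
# the grid should recover the expected number of electrons.
import numpy as np

n_electrons_expected = 3.0  # e.g. the valence electrons of one Al atom
grid = np.linspace(-10.0, 10.0, 2001)
# Normalized Gaussian scaled to integrate to n_electrons_expected:
density = n_electrons_expected * np.exp(-grid**2) / np.sqrt(np.pi)

# Composite trapezoidal rule over the grid:
n_electrons = np.sum(0.5 * (density[1:] + density[:-1]) * np.diff(grid))
```

The same pattern applies to the LDOS checks, just with the integration axis swapped (energy grid vs. spatial grid).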

Add postprocessing

In GitLab by @RandomDefaultUser on Dec 15, 2020, 10:15

The value of the loss function is not actually the metric we are interested in. What we are actually interested in is the energy. Add postprocessing functionalities to access this metric, which should help training/optimization.

Clean up requirement and environment files

In GitLab by @RandomDefaultUser on Feb 5, 2021, 19:15

I'd love to have a separate requirements.txt (and maybe a conda yaml file) for the installation requirements of the Sphinx documentation system. To be more precise, I recommend:

  • rename requirements.txt to requirements_fesl.txt
  • rename environment.yaml to environment_fesl.yaml
  • add new files requirements_docs.txt and environment_docs.yaml
  • put/move all these files inside install folder
  • update .gitlab-ci.yaml (make use of above files), setup.py and documentation referring to those files

@cangi21 @fiedle09 @schmer52 What do you think about it?

Make data rescaling more efficient

In GitLab by @RandomDefaultUser on Dec 15, 2020, 10:12

At the moment, data rescaling always requires additional memory. This is somewhat unavoidable, as it transforms the data rather than just reslicing it. But maybe we can still save some memory here, e.g. by discarding the raw data.

Make classes more "self-sufficient"

In GitLab by @RandomDefaultUser on Dec 15, 2020, 10:09

At the moment, it is necessary for the user to create target/descriptor objects just to call a datahandler. In theory, the HandlerInterface could create these objects as well. Refactor the code so that it does.

Use horovod for lazy-loading-like functionality

In GitLab by @RandomDefaultUser on Feb 8, 2021, 15:53

I've read a little bit about horovod, and from what I understand now horovod reads all the data on each process unless specified otherwise. This is why we need both horovod AND lazy loading. This is far from optimal, since it means training is very slow for bigger datasets (because we have to use lazy loading) or we waste a lot of RAM (if we can fit the entire data set into memory, we do so multiple times across each node, which is equally bad).
Ideally our parallelization strategy would look like this:

  1. Find out if we have enough processes so that lazy loading is not needed
  2. If that is the case, split the dataset evenly and THEN load it into memory
  3. If 1. is not the case, use lazy loading; at this point, no splitting of the dataset is needed. We are loading it into memory anyway.

For a first implementation one could put 1. in the hands of the user - automatic assessment of the system might be a little bit over-engineered. But 2. and 3. would bring massive performance gains if I am not mistaken.
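Step 2 above could be sketched as follows; with horovod the rank and size would come from hvd.rank() and hvd.size(), here they are plain arguments and the file names are made up:

```python
# Sketch of rank-based data sharding: split the list of snapshot files
# evenly across processes BEFORE loading, so each rank only reads its
# own shard into memory.
import numpy as np

def shard_snapshots(snapshot_files, rank, size):
    """Return the subset of snapshot files this rank should load."""
    shards = np.array_split(snapshot_files, size)
    return list(shards[rank])

files = [f"snapshot{i}.npy" for i in range(10)]
my_files = shard_snapshots(files, rank=1, size=4)
```

Every file ends up in exactly one shard, so across all ranks the full dataset is covered once instead of size times.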
