nvidia / modulus

Open-source deep-learning framework for building, training, and fine-tuning deep learning models using state-of-the-art Physics-ML methods

Home Page: https://developer.nvidia.com/modulus

License: Apache License 2.0

Python 99.45% Shell 0.02% Makefile 0.10% Dockerfile 0.43%
deep-learning machine-learning nvidia-gpu physics pytorch

modulus's Introduction

Modulus (Beta)

Project Status: Active. The project has reached a stable, usable state and is being actively developed. Code style: black

Modulus is an open source deep-learning framework for building, training, and fine-tuning deep learning models using state-of-the-art Physics-ML methods.

Whether you are exploring neural operators such as Fourier Neural Operators, are interested in physics-informed neural networks, or want a hybrid approach in between, Modulus provides an optimized stack that enables you to train your models at real-world scale.

This package is the core module that provides the core algorithms, network architectures and utilities that cover a broad spectrum of physics-constrained and data-driven workflows to suit the diversity of use cases in the science and engineering disciplines.

Detailed information on features and capabilities can be found in the Modulus documentation.

Modulus

Modulus Packages

  • Modulus (Beta): Open-source deep-learning framework for building, training, and fine-tuning deep learning models using state-of-the-art Physics-ML methods.
  • Modulus Symbolic (Beta): Framework providing pythonic APIs, algorithms and utilities to be used with Modulus core to physics inform model training as well as higher level abstraction for domain experts.

Domain Specific Packages

  • Earth-2 MIP (Beta): Python framework to enable climate researchers and scientists to explore and experiment with AI models for weather and climate.

Installation

PyPi

The recommended method for installing the latest version of Modulus is using PyPi:

pip install nvidia-modulus

The installation can be verified by running a simple python code snippet as shown below:

python
>>> import torch
>>> from modulus.models.mlp.fully_connected import FullyConnected
>>> model = FullyConnected(in_features=32, out_features=64)
>>> input = torch.randn(128, 32)
>>> output = model(input)
>>> output.shape
torch.Size([128, 64])

Optional dependencies

Modulus has many optional dependencies that are used in specific components. When using pip, all dependencies used in Modulus can be installed with pip install nvidia-modulus[all]. If you are developing Modulus, developer dependencies can be installed using pip install nvidia-modulus[dev]. Otherwise, additional dependencies can be installed on a case-by-case basis. Detailed information on installing the optional dependencies can be found in the Getting Started Guide.

NVCR Container

The recommended Modulus docker image can be pulled from the NVIDIA Container Registry:

docker pull nvcr.io/nvidia/modulus/modulus:24.04

Inside the container, you can clone the Modulus git repositories and get started with the examples. The commands below launch the Modulus container and run an example from this repo.

docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --runtime nvidia \
--rm -it nvcr.io/nvidia/modulus/modulus:24.04 bash
git clone https://github.com/NVIDIA/modulus.git
cd modulus/examples/cfd/darcy_fno/
pip install warp-lang # install NVIDIA Warp to run the darcy example
python train_fno_darcy.py

From Source

Package

For a local build of the Modulus Python package from source use:

git clone git@github.com:NVIDIA/modulus.git && cd modulus

pip install --upgrade pip
pip install .

Source Container

To build the Modulus Docker image:

docker build -t modulus:deploy \
    --build-arg TARGETPLATFORM=linux/amd64 --target deploy -f Dockerfile .

Alternatively, you can run make container-deploy.

To build the CI image:

docker build -t modulus:ci \
    --build-arg TARGETPLATFORM=linux/amd64 --target ci -f Dockerfile .

Alternatively, you can run make container-ci.

Currently only linux/amd64 and linux/arm64 platforms are supported. If using linux/arm64, some dependencies like warp-lang might not install correctly.

Contributing

Modulus is an open source collaboration and its success is rooted in community contribution to further the field of Physics-ML. Thank you for contributing to the project so others can build on your contribution. For guidance on making a contribution to Modulus, please refer to the contributing guidelines.

Communication

  • GitHub Discussions: Discuss new architectures, implementations, Physics-ML research, etc.
  • GitHub Issues: Bug reports, feature requests, install issues, etc.
  • Modulus Forum: The Modulus Forum hosts an audience of new to moderate level users and developers for general chat, online discussions, collaboration, etc.

License

Modulus is provided under the Apache License 2.0. Please see LICENSE.txt for the full license text.

modulus's People

Contributors

akshaysubr, alexey-kamenev, azrael417, briacmb, crackcode123, dallasfoster, daviddpruitt, dearleiii, dlshu, fresleven, hasethinvd, ivanauyeung, jihyun-nv, jleinonen, joneseth1, ktangsali, loliverhennigh, lucapegolotti, mariusaurus, mnabian, mortezamardani, nbren12, nickgeneva, ram-cherukuri, sifanexisted, simonbyrne, stadlmax, tge25, xuhan314, yairchn

modulus's Issues

🚀[FEA]: Add flake8 to pre-commit hooks

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

Add better style guide enforcement using flake8

Describe any alternatives you have considered

ruff

๐Ÿ›[BUG]: CI Docker Image Has Modulus files still present in site packages

Version

0.3.0a0

On which installation method(s) does this occur?

Docker

Describe the issue

Presently the CI docker image has an issue where Modulus install files are still present on the system. This is because the editable install still places some linked files in site-packages.

This causes issues because pip will not install Modulus: it detects that the package is already installed at the highest version, even though the package is not truly there.

Minimum reproducible example

No response

Relevant log output

No response

Environment details

No response

๐Ÿ›[BUG]: Error with using modulus docker image as base image in dockerfile

Version

23.05 (docker)

On which installation method(s) does this occur?

Docker

Describe the issue

I am trying to use the docker image (nvcr.io/nvidia/modulus/modulus:23.05) as a starting point in a Dockerfile: I'd like to add some of my own packages on top. However, I get the error reported below when I try to use it as a starting point. The first line in the Dockerfile (FROM nvcr.io/nvidia/modulus/modulus:23.05) causes the issue.

From googling, it appears that there can be many reasons for the error message I get (e.g. a problem with NGC or not being logged in). When I use a different NVIDIA base image (e.g. FROM nvcr.io/nvidia/pytorch:$PYT_VER-py3 as env), everything works smoothly.

If you have any guidance on how to avoid this error, that would be much appreciated.

Minimum reproducible example

FROM nvcr.io/nvidia/modulus/modulus:23.05

Relevant log output

=> ERROR [internal] load metadata for nvcr.io/nvidia/modulus/modulus:23.05                                                                                                     0.4s

failed to solve with frontend dockerfile.v0: failed to create LLB definition: failed to authorize: rpc error: code = Unknown desc = failed to fetch anonymous token: unexpected status: 401 Unauthorized

Environment details

I get the error when I run

docker build  -t amahesh19/modulus_base:0.1 .

in the directory with the Dockerfile.

๐Ÿ›[BUG]: ERA5 DALI datapipe hangs indefinitely in multi-GPU/multi-Node setting if the datapipe size is not selected correctly.

Version

0.2.0

On which installation method(s) does this occur?

Docker

Describe the issue

This can mostly be fixed by modifying the number of samples in the datapipe (for example here) to be divisible by the number of processors/GPUs.

A long term fix would be to automatically avoid failure cases where the size is not exactly divisible by the number of GPUs.
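One way to picture the long-term fix is to trim the sample count to the largest multiple of the number of ranks before sharding. The sketch below is illustrative only: `usable_num_samples` and `shard_bounds` are hypothetical helpers, not part of the Modulus datapipe API, and the contiguous sharding layout is an assumption.

```python
def usable_num_samples(num_samples: int, world_size: int) -> int:
    """Largest sample count <= num_samples that divides evenly across ranks."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    return (num_samples // world_size) * world_size


def shard_bounds(num_samples: int, world_size: int, rank: int) -> range:
    """Contiguous, equally sized index range owned by `rank` (assumed layout)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    per_rank = usable_num_samples(num_samples, world_size) // world_size
    return range(rank * per_rank, (rank + 1) * per_rank)


if __name__ == "__main__":
    # 1461 samples on 8 GPUs: 5 samples are dropped so every rank gets 182.
    print(usable_num_samples(1461, 8))
    print([len(shard_bounds(1461, 8, r)) for r in range(8)])
```

Dropping the remainder keeps every rank on the same number of iterations, which is the condition the workaround above enforces by hand. Padding the last shard would be an alternative if no sample may be skipped.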

Minimum reproducible example

No response

Relevant log output

No response

Environment details

No response

🚀[FEA]: DistributedManager for SLURM when local processes can't access all GPUs

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

I use Modulus DistributedManager with SLURM. Right now, DistributedManager sets the local_rank based on the number of local processes on the node (this line).

local_rank = int(os.environ.get("SLURM_LOCALID"))

This line then sets the device based on the local_rank.

manager._device = torch.device(
            f"cuda:{manager.local_rank}" if torch.cuda.is_available() else "cpu"
        )

Notably, this line breaks if "SLURM_LOCALID" is greater than or equal to torch.cuda.device_count(), because device indices start at 0.

In my use case, however, I need to use the SBATCH --gpu-bind:map_gpus:0,1,2,3 flag on a node with 4 GPUs. With 4 processes per node and 4 GPUs per node, each process only sees 1 device called cuda:0, though that name actually refers to 4 different GPUs. (This forum explains why I need to use this flag.)

There may be other use cases where the number of local processes specified through SLURM may not equal the number of GPUs accessible (e.g. running FourCastNet with 4 GPUs and 1 process per GPU, but analyzing the output with more processes).

My request would be to add a flag to DistributedManager, through which I could specify that the behavior below is desired for SLURM as well.

manager._local_rank = rank % torch.cuda.device_count()

This ensures that torch.device is not called on a device that can't be accessed.
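The requested behavior can be sketched without touching Modulus itself. In the snippet below, `slurm_local_rank` and its `wrap` flag are hypothetical names, not DistributedManager API. It applies the modulo idea to SLURM_LOCALID rather than to the global rank, and takes the device count as an argument so it runs without a GPU (in practice that value would come from torch.cuda.device_count()).

```python
import os
from typing import Mapping, Optional


def slurm_local_rank(
    device_count: int,
    env: Optional[Mapping[str, str]] = None,
    wrap: bool = True,
) -> int:
    """Return a device index derived from SLURM_LOCALID.

    `wrap` is a hypothetical opt-in flag: when True, the local ID is wrapped
    with a modulo so it always maps to a device the process can see.
    """
    env = os.environ if env is None else env
    local_id = int(env.get("SLURM_LOCALID", "0"))
    if device_count < 1:
        return 0  # CPU-only process: there is no CUDA index to pick
    if wrap:
        return local_id % device_count
    if local_id >= device_count:
        raise RuntimeError(
            f"SLURM_LOCALID={local_id} but only {device_count} device(s) visible"
        )
    return local_id


if __name__ == "__main__":
    # Four tasks per node, each bound to a single visible GPU.
    for local_id in range(4):
        print(local_id, slurm_local_rank(1, {"SLURM_LOCALID": str(local_id)}))
```

With one visible device per process, every task resolves to index 0, which matches the cuda:0 naming described above. Keeping the wrap behind a flag preserves the current strict behavior for users who want a mismatch to fail loudly.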

Describe any alternatives you have considered

Without a flag, DistributedManager.initialize() returns an error because torch.device is used to access a device that is not available. I could make an equivalent for DistributedManager, or I could create a subclass of DistributedManager that overrides the initialize_slurm method. Let me know if that would be the preferred solution, and I can continue with my fix on my local end.

🚀[FEA]: Update SFNO to use `DistributedManager` without model parallelism

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request

Critical (currently preventing usage)

Please provide a clear description of problem you would like to solve.

SFNO currently uses a separate comm_v2 module for distributed setup. This should be changed so it can use the DistributedManager like all other models and utils in Modulus

Describe any alternatives you have considered

No response

Additional context

No response

🚀[FEA]: Add bandit to pre-commit hooks

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

Add bandit hook for security scanning

Describe any alternatives you have considered

flake8-bandit

ruff

๐Ÿ›[BUG]: Add sklearn as a dependency

Version

0.1.0

On which installation method(s) does this occur?

Pip

Describe the issue

sklearn is used in GraphCast for graph construction and needs to be added as a dependency

Minimum reproducible example

No response

Relevant log output

No response

Environment details

No response

🚀[FEA]: Add support for cosine zenith and static datasets to the climate datapipe

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

Cosine zenith is used in the SFNO model but is not supported in the Modulus dataloader for climate. Static datasets can also be handled in the dataloader.

Describe any alternatives you have considered

No response

Additional context

No response

🚀[FEA]: Update min-max normalization to Gaussian normalization in the Ahmed body datapipe

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

Gaussian normalization has shown improvements in accuracy compared to min-max normalization. Consider switching.

Describe any alternatives you have considered

No response

๐Ÿ›[BUG]: Fix labels for templates

Version

0.2.0

On which installation method(s) does this occur?

No response

Describe the issue

Template labels need correction, and the needs triage label should be added by default.

Minimum reproducible example

No response

Relevant log output

No response

Environment details

No response

โ›ฐ๏ธ[EPIC]: Model Parallel SFNO Integration

Tracking model parallel SFNO integration into Modulus core / launch

Core

  1. enhancement external
    akshaysubr
  2. 1 - On Deck enhancement
    akshaysubr
  3. 2 - In Progress enhancement
    NickGeneva
  4. 2 - In Progress bug
    mnabian
  5. 3 - Ready for Review distributed enhancement
    akshaysubr
  6. 0 - Backlog distributed enhancement
    akshaysubr

Launch

  1. 2 - In Progress enhancement
    mnabian
  2. 0 - Backlog enhancement
    NickGeneva

🚀[FEA]: Add large file pre-commit hook

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

Set max file size to 500kb

Describe any alternatives you have considered

No response

🚀[FEA]: Add S3 testing for filesystem abstraction

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

Should add testing for the S3 branch of _download_cached(...)

Describe any alternatives you have considered

Shelling out to the aws cli as before but that has security risks

Additional context

Should test both with recursive=True and recursive=False

🚀[FEA]: Batched support for the DLWP wrapper/plugin

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Critical (currently preventing usage)

Please provide a clear description of problem you would like to solve.

Currently the MIP plugin of DLWP only supports a single input. Batched support would be needed for better/faster inferencing workflows.

Describe any alternatives you have considered

No response

๐Ÿ›[BUG]: Container 23.08 is missing warp installation

Version

23.08

On which installation method(s) does this occur?

NGC container nvcr.io/nvidia/modulus/modulus:23.08

Describe the issue

The NGC container for version 23.08 does not include warp, which is used in e.g. modulus/datapipes/benchmarks/darcy.py.

Minimum reproducible example

$ docker run --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --runtime nvidia --rm -it nvcr.io/nvidia/modulus/modulus:23.08 bash
$ python
>>> import warp

Relevant log output

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'warp'

Environment details

nvcr.io/nvidia/modulus/modulus:23.08

🚀[FEA]: Better handling of the static dataset in GraphCast

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

Currently the static dataset is handled in the init method of the GraphCast model. This can be better handled by getting the static data from the dataloader instead.

Describe any alternatives you have considered

No response

📚[DOC]: Update GNN Datapipes docs after the refactor

How would you describe the priority of this documentation request

Critical (currently preventing usage)

Is this for new documentation, or an update to existing docs?

Update

Describe the incorrect/future/missing documentation

GNN Datapipes were refactored which breaks the current docs .rst files

DistributedManager with torch ProcessGroups

I'd like to use the DistributedManager alongside the metrics implemented in ensemble_metrics to calculate ensemble means of different subsets of ranks. I noticed that some of this functionality (at least for creating different groups of processes) is implemented in the distributed manager.

However, this code is commented out. Is the reason that this functionality is not safe, as described on this page? I was thinking of performing this operation and using barriers to ensure that processes are synchronized.

โ›ฐ๏ธ[EPIC]: Setting up more pre-commit hooks

Having a good suite of pre-commit hooks can help keep the code clean and ease development. We also now use the pre-commit recipes in our CI system. Additionally using pre-commit hooks before any code is pushed can help catch errors at the source reducing the burden on the CI system.

Tasks

  1. 0 - Backlog enhancement
    dallasfoster
  2. 0 - Backlog enhancement
    dallasfoster
  3. 0 - Backlog enhancement
    dallasfoster
  4. 0 - Backlog enhancement
    dallasfoster
  5. 0 - Backlog enhancement
    dallasfoster
  6. ? - Needs Triage enhancement
  7. 0 - Backlog enhancement
    NickGeneva
  8. ci enhancement

🚀[FEA]: Add distributed FFT autograd utility

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Critical (currently preventing usage)

Please provide a clear description of problem you would like to solve.

Ability to have a distributed FFT implementation with autograd functionality

Describe any alternatives you have considered

No response

๐Ÿ›[BUG]: Fix AMP in static capture

Version

0.1.0

On which installation method(s) does this occur?

Docker, Pip, Source

Describe the issue

AMP is not properly supported in static capture, occasionally resulting in NaNs.

Minimum reproducible example

No response

Relevant log output

No response

Environment details

No response

Other/Misc.

No response

Error with setting value of n in ensemble mean calculation in distributed environment

When I invoke the __call__ method in Mean in metrics/general/ensemble_metrics.py in a distributed environment with modulus's distributed manager, I get the following error:

  File "/global/common/software/m1517/amahesh/fcn_mip-env/lib/python3.8/site-packages/modulus/metrics/general/ensemble_metrics.py", line 163, in __call__
    dist.all_reduce(self.n, op=dist.ReduceOp.SUM)
  File "/global/common/software/m1517/amahesh/fcn_mip-env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
    return func(*args, **kwargs)
  File "/global/common/software/m1517/amahesh/fcn_mip-env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1700, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense

This is because n gets set to the CPU in this line. (Even if input is a CUDA tensor, torch.as_tensor(input.shape[0]) returns a torch.LongTensor, according to my tests.)

Perhaps the line above should be changed to

self.n = torch.as_tensor(input.shape[0]).to(self.device)

For the time being, I fixed the issue by using Mean's update method instead of the __call__ method, even for calculating the initial mean. This change fixed the issue.

📚[DOC]: Documenting Distributed Manager

How would you describe the priority of this documentation request

Critical (currently preventing usage)

Is this for new documentation, or an update to existing docs?

New

Describe the incorrect/future/missing documentation

It would be great if there was some documentation on the capabilities of the Distributed Manager, specifically how to use it to train in multi-GPU and multi node settings. Ideally there would also be an application of this solution provided and documented of an example case in Modulus-Launch. Many thanks!

🚀[FEA]: Add isort to pre-commit hooks

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Critical (currently preventing usage)

Please provide a clear description of problem you would like to solve.

Use isort to sort imports

Describe any alternatives you have considered

ruff

๐Ÿ›[BUG]: CUDA Graph capture failures during multi-node DDP runs

Version

0.2.0

On which installation method(s) does this occur?

Docker

Describe the issue

Multi-node runs fail during the CUDA Graph capture due to NCCL watchdog thread errors. The error logs look something like the following:

[18:14:55] - Attempting cuda graph building, this may take a bit...
[E ProcessGroupNCCL.cpp:830] [Rank 10] NCCL watchdog thread terminated with exception: CUDA error: operation not permitted when stream is capturing
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xae (0x7fe97f5b295e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7fe97f56b69d in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7fe994fd7e12 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x90 (0x7fe90b6dca20 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x78 (0x7fe90b6e1708 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x11b (0x7fe90b6e602b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x94 (0x7fe90b6e63d4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc2b3 (0x7fe94fcb22b3 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94b43 (0x7fe9966a8b43 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126a00 (0x7fe99673aa00 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 10] NCCL watchdog thread terminated with exception: CUDA error: operation not permitted when stream is capturing
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

This is mostly due to the following issue: pytorch/pytorch#104487 (comment)

A current workaround is to add a time delay between the warmup and the start of capture to allow the NCCL watchdogs to clean up work before starting the capture. This workaround will not be required after the PyTorch base container version used for Modulus is updated to 23.07.

Minimum reproducible example

No response

Relevant log output

No response

Environment details

No response

🚀[FEA]: Add `from_jax` and `ONNX` support for Models

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Critical (currently preventing usage)

Please provide a clear description of problem you would like to solve.

Currently we can convert torch.nn.Module models to Modulus modules. Ideally we would have this support for JAX and ONNX models as well.

Describe any alternatives you have considered

No response

Error with Ensemble Mean Calculation

Do lines 186 and 187 of ensemble_metrics.py calculate the right value? I think they may be incrementing self.sum and self.n of Mean by a value that is too large.

I think _update_mean returns the following quantity for n: the previous total number of elements + the number of new elements shown to the current rank (times some additional factor). Then, this quantity is summed across all ranks using torch.distributed.all_reduce. If I understand correctly, the desired behavior is not to increment self.n by this quantity reduced over all ranks. Rather, self.n should only be incremented by the number of new elements seen across all ranks. (A similar argument holds for self.sum.)

To test this, I put the code below in a script called test_modulus.py and ran srun -n 2 -c 64 -G 2 python3 -u test_modulus.py

from modulus.metrics.general.ensemble_metrics import Mean
from modulus.distributed import DistributedManager
import torch

if __name__ == '__main__':
    DistributedManager.initialize()
    dm = DistributedManager()
    if dm.rank == 0:
        print("World size: {}".format(dm.world_size))

    # Every rank contributes a single element per update.
    tensor = torch.Tensor([[1]]).to(dm.device)

    m = Mean(tensor.shape, device=dm.device)
    for a in range(5):
        _ = m.update(tensor+a)
        if dm.rank == 0:
            print("n after {} iterations".format(a+1))
            print(m.n)

    m.finalize()
    if dm.rank == 0:
        print("Final n:")
        print(m.n)
        print("Final sum: ")
        print(m.sum)

I got this output:

World size: 2
n after 1 iterations
tensor([2], device='cuda:0', dtype=torch.int32)
n after 2 iterations
tensor([8], device='cuda:0', dtype=torch.int32)
n after 3 iterations
tensor([26], device='cuda:0', dtype=torch.int32)
n after 4 iterations
tensor([80], device='cuda:0', dtype=torch.int32)
n after 5 iterations
tensor([242], device='cuda:0', dtype=torch.int32)
Final n:
tensor([242], device='cuda:0', dtype=torch.int32)
Final sum: 
tensor([[358.]], device='cuda:0')

However, wouldn't we expect n to be 4 after 2 iterations, 6 after 3 iterations, 8 after 4 iterations, and so on?
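The reported numbers are consistent with the running totals being folded into the all-reduced quantity and then added on top of the old totals. The pure-Python model below is a sketch under that assumption: the update rule is inferred from the output above, not copied from ensemble_metrics.py, and `simulate` is a hypothetical helper. Since every rank holds the same values here, an all-reduce sum is modeled as multiplication by the world size.

```python
def simulate(world_size, values, fixed):
    """Track n after each update and return (history_of_n, final_sum).

    Each rank sees one new element per step, so its contribution to the
    count is 1 and its contribution to the sum is `value`.
    """
    n, total = 0, 0.0
    history = []
    for value in values:
        if fixed:
            # Reduce only the new contributions, then accumulate.
            n += world_size * 1
            total += world_size * value
        else:
            # Assumed current behavior: old totals are included in the
            # reduced quantity and then added to the old totals again.
            n += world_size * (n + 1)
            total += world_size * (total + value)
        history.append(n)
    return history, total


if __name__ == "__main__":
    values = [1, 2, 3, 4, 5]  # tensor + a for a in range(5)
    print(simulate(2, values, fixed=False))  # ([2, 8, 26, 80, 242], 358.0)
    print(simulate(2, values, fixed=True))   # ([2, 4, 6, 8, 10], 30.0)
```

The first call reproduces both the n sequence and the final sum of 358 from the run above. The second gives the expected counts and a sum of 30, i.e. a mean of 3.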

NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Rfft(1) node with name '/Rfft'

I have a model that uses torch.fft.rfft and torch.fft.irfft. Since ONNX does not support these, I want to use modulus.models.layers.fft.rfft and modulus.models.layers.fft.irfft as a replacement. When I try to run the sample code below:

import modulus.models.layers.fft as fft
import torch
import onnxruntime

class Customrfft(torch.nn.Module):
    def forward(self, y):
        return fft.rfft(y, dim=-1)

t = torch.randn((1, 8, 64, 96))
model = Customrfft()

torch.onnx.export(model, t, 'test.onnx')
ort_session = onnxruntime.InferenceSession("test.onnx")
ort_outputs = ort_session.run(t)

I get error:

NotImplemented                            Traceback (most recent call last)

<ipython-input-5-8d37e437a72c> in <cell line: 4>()
      2 
      3 # Load the model
----> 4 ort_session = onnxruntime.InferenceSession("test.onnx")
      5 ort_outputs = ort_session.run(t)

1 frames

/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py in _create_inference_session(self, providers, provider_options, disabled_optimizers)
    433 
    434         # initialize the C++ InferenceSession
--> 435         sess.initialize_session(providers, provider_options, disabled_optimizers)
    436 
    437         self._sess = sess

NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Rfft(1) node with name '/Rfft'

Can someone please assist me with how to implement the two functions so that I can convert my model to ONNX and use it? Thanks.

๐Ÿ›[BUG]: Version main to 0.3.0a0

Version

0.2.0

On which installation method(s) does this occur?

No response

Describe the issue

Move the version on the main branch to the next alpha version (0.3.0a0).

Minimum reproducible example

No response

Relevant log output

No response

Environment details

No response

🚀[FEA]: Name distributed utilities more appropriately

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

Many distributed utilities are named as private functions (e.g., _gather) when they shouldn't be private.

Describe any alternatives you have considered

See PR discussion here: #92 (comment)

🚀[FEA]: Add a type checker to pre-commit hooks

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

It would be good to add a mypy pre-commit hook to catch type errors.

Describe any alternatives you have considered

pyright

🚀[FEA]: BF16 support in static capture

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

BF16 is used in GraphCast and needs to be supported in the static capture.

Describe any alternatives you have considered

No response

Additional context

No response

🚀[FEA]: ARM support for Modulus Docker container

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

The Modulus Docker container is currently only supported on the x86_64 (amd64) architecture. It would be nice to add support for arm64.

Describe any alternatives you have considered

No response

🐛[BUG]: OmegaConf's `ListConfig` and `DictConfig` are not JSON serializable

Version

0.3.0

On which installation method(s) does this occur?

No response

Describe the issue

With the recent checkpoint refactor, there is a constraint that the model arguments must be JSON serializable. The dominant approach for handling configs in Modulus is Hydra configs, which rely on OmegaConf. Some of OmegaConf's data types, including ListConfig and DictConfig, are not JSON serializable, which limits the use of the new checkpointing feature with Hydra configs. We need a custom JSON encoder to handle these types.
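A minimal sketch of what such an encoder might look like, not the fix that was adopted. It assumes ListConfig and DictConfig implement the standard `collections.abc` Sequence and Mapping interfaces; the encoder name `ContainerJSONEncoder` and the stand-in list class are hypothetical and only there so the snippet runs without OmegaConf installed. Converting the config with `OmegaConf.to_container(cfg, resolve=True)` before saving would be another route.

```python
import json
from collections.abc import Mapping, Sequence


class ContainerJSONEncoder(json.JSONEncoder):
    """Hypothetical encoder: turns mapping/sequence-like objects into dict/list."""

    def default(self, o):
        # default() is only called for objects json cannot encode natively.
        if isinstance(o, Mapping):
            return dict(o)
        if isinstance(o, Sequence) and not isinstance(o, (str, bytes)):
            return list(o)
        # Anything else still raises the usual TypeError.
        return super().default(o)


class StandInList(Sequence):
    """Stand-in for a ListConfig-like object (not an OmegaConf class)."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)


args = {"layers": StandInList([64, 64]), "activation": "gelu"}
print(json.dumps(args, cls=ContainerJSONEncoder))
```

With this approach the call in the traceback would become something like `json.dump(self._args, f, cls=ContainerJSONEncoder)`. Nested containers are handled because the encoder calls `default` again for each value it cannot encode.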

Minimum reproducible example

No response

Relevant log output

File "/usr/local/lib/python3.10/dist-packages/modulus/models/module.py", line 181, in save
    json.dump(self._args, f)
  File "/usr/lib/python3.10/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/usr/lib/python3.10/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/usr/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type ListConfig is not JSON serializable

Environment details

No response

🐛[BUG]: Fix datapipe unit tests to not error on failure for local testing

Version

0.1.0

On which installation method(s) does this occur?

Source

Describe the issue

Pytest for datapipe currently errors because one does not have access to test datasets locally most of the time.

Unit tests should skip / xfail if the CI data folder is not there.

They should error if this folder is present and the test fails.
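The behavior described above could be sketched with a reusable pytest marker. The environment variable name `MODULUS_TEST_DATA_DIR` and the default path are hypothetical placeholders, not the location actually used by the Modulus CI.

```python
import os

import pytest

# Hypothetical location of the CI datasets; adjust to the real folder.
DATA_DIR = os.environ.get("MODULUS_TEST_DATA_DIR", "/path/to/ci-data")

# Skip (rather than error) when the data folder is missing on this machine.
requires_ci_data = pytest.mark.skipif(
    not os.path.isdir(DATA_DIR),
    reason=f"CI data folder {DATA_DIR} not found",
)


@requires_ci_data
def test_data_folder_is_readable():
    # Reached only when the folder exists, so a failure here is a real error.
    assert len(os.listdir(DATA_DIR)) > 0
```

Because the marker only checks for the folder, tests still fail normally in CI where the data is present.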

Minimum reproducible example

No response

Relevant log output

No response

Environment details

No response

🐛[BUG]: pytest fails locally and inside release container

Version

github commit: 012abfc

On which installation method(s) does this occur?

No response

Describe the issue

It seems the Modulus installation in the container is missing the modulus.experimental and modulus.registry libraries.
Some pytest runs fail:

>> pytest modulus/test/datapipes

============================= test session starts ==============================
platform linux -- Python 3.10.6, pytest-7.4.0, pluggy-1.2.0
rootdir: /app/modulus
plugins: hydra-core-1.3.2, xdist-3.3.1, rerunfailures-11.1.2, hypothesis-5.35.1, shard-0.1.2, xdoctest-1.0.2
collected 30 items / 3 errors                                                  
Running 30 items in this shard

==================================== ERRORS ====================================
_____________ ERROR collecting test/datapipes/test_climate_hdf5.py _____________
ImportError while importing test module '/app/modulus/test/datapipes/test_climate_hdf5.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib/python3.10/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
modulus/test/datapipes/test_climate_hdf5.py:19: in <module>
    from modulus.experimental.datapipes.climate import ClimateHDF5Datapipe
E   ModuleNotFoundError: No module named 'modulus.experimental'
________________ ERROR collecting test/datapipes/test_darcy.py _________________
ImportError while importing test module '/app/modulus/test/datapipes/test_darcy.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib/python3.10/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
modulus/test/datapipes/test_darcy.py:19: in <module>
    from modulus.datapipes.benchmarks.darcy import Darcy2D
/usr/local/lib/python3.10/dist-packages/modulus/datapipes/benchmarks/darcy.py:18: in <module>
    import warp as wp
E   ModuleNotFoundError: No module named 'warp'
___________ ERROR collecting test/datapipes/test_kelvin_helmholtz.py ___________
ImportError while importing test module '/app/modulus/test/datapipes/test_kelvin_helmholtz.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib/python3.10/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
modulus/test/datapipes/test_kelvin_helmholtz.py:19: in <module>
    from modulus.datapipes.benchmarks.kelvin_helmholtz import KelvinHelmholtz2D
/usr/local/lib/python3.10/dist-packages/modulus/datapipes/benchmarks/kelvin_helmholtz.py:18: in <module>
    import warp as wp
E   ModuleNotFoundError: No module named 'warp'
=============================== warnings summary ===============================
../usr/local/lib/python3.10/dist-packages/sympy/external/importtools.py:5
  /usr/local/lib/python3.10/dist-packages/sympy/external/importtools.py:5: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
    from distutils.version import LooseVersion

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
ERROR modulus/test/datapipes/test_climate_hdf5.py
ERROR modulus/test/datapipes/test_darcy.py
ERROR modulus/test/datapipes/test_kelvin_helmholtz.py
!!!!!!!!!!!!!!!!!!! Interrupted: 3 errors during collection !!!!!!!!!!!!!!!!!!!!
========================= 1 warning, 3 errors in 4.08s =========================

Minimum reproducible example

No response

Relevant log output

No response

Environment details

**Environment:**
host:
Ubuntu 20.04.1,  A100 server
NVIDIA-SMI 525.105.17, Driver Version: 525.105.17, CUDA Version: 12.0
container:

docker run -it --gpus all --network=host --ipc=host -v $PWD:/app -w /app nvcr.io/nvidia/modulus/modulus:23.08 bash

🚀[FEA]: Multi-GPU CI

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

Multi-GPU CI is needed for testing model parallel utilities.

Describe any alternatives you have considered

No response

Additional context

No response

🚀[FEA]: Add optional `dev` dependency to install packages needed for development

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

Currently, there is no easy way to figure out which packages are needed for a development setup. For example, black, pre-commit, coverage, interrogate, etc. need to be installed to run the local checks, but this isn't easy to figure out.

The best solution would be to create an optional dependency in the pyproject.toml for a dev setup so you can install the package and dev dependencies while developing using

pip install -e .[dev]

Describe any alternatives you have considered

We could document these dependencies in the contribution guide, but that is less seamless than being able to install the packages directly.

🚀[FEA]: Add ability to change output size of encoder/processors in MeshGraphNet

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Critical (currently preventing usage)

Please provide a clear description of problem you would like to solve.

In the current implementation of MeshGraphNet, the encoder and processor MLP output dimensions match the encoder hidden dimension. However, this is not a strict requirement. For example, the output dimension of the encoder could be an additional parameter, and as long as the processor block input/output and decoder input match that dimension, the architecture would still be consistent.
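To illustrate the dimension bookkeeping only, here is a toy sketch that is not the Modulus MeshGraphNet: it has no edges or message passing, and the class name, the `latent_dim` parameter and the defaults are all hypothetical. It shows that the encoder output size can differ from the MLP hidden size as long as the processor input/output and the decoder input use the same latent size.

```python
import torch
from torch import nn


def mlp(in_dim: int, hidden_dim: int, out_dim: int) -> nn.Sequential:
    """Two-layer MLP whose output size is independent of its hidden size."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, out_dim)
    )


class ToyEncodeProcessDecode(nn.Module):
    """Hypothetical node-only stand-in with a separate latent dimension."""

    def __init__(self, in_dim, out_dim, hidden_dim=128, latent_dim=32, num_layers=2):
        super().__init__()
        # Encoder maps inputs to latent_dim, not to hidden_dim.
        self.encoder = mlp(in_dim, hidden_dim, latent_dim)
        # Processor blocks keep latent_dim on both sides so residuals line up.
        self.processor = nn.ModuleList(
            [mlp(latent_dim, hidden_dim, latent_dim) for _ in range(num_layers)]
        )
        # Decoder input must match the processor output.
        self.decoder = mlp(latent_dim, hidden_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encoder(x)
        for block in self.processor:
            h = h + block(h)  # residual update in latent space
        return self.decoder(h)


model = ToyEncodeProcessDecode(in_dim=6, out_dim=3)
print(model(torch.randn(10, 6)).shape)
```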

Describe any alternatives you have considered

No response

🚀[FEA]: Investigate using ruff to replace flake8, bandit, isort, etc

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

ruff can do almost everything that flake8, bandit, isort, etc. can do, and is supposed to be much faster as well. Moving to ruff would mean we only need ruff, in addition to black for formatting and mypy for type checking, which would simplify our pre-commit hooks while also making CI faster.

Describe any alternatives you have considered

🚀[FEA]: Add a way to configure model parallel process groups

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Critical (currently preventing usage)

Please provide a clear description of problem you would like to solve.

Custom model parallelism requires splitting up a single process group into multiple sub-groups, creating orthogonal process groups, etc. Having a way for the model to describe this partitioning and the DistributedManager to understand that description would be very useful.
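As a rough illustration of the partitioning involved, the sketch below computes the rank lists for two orthogonal sets of groups: model-parallel groups of consecutive ranks, and data-parallel groups that connect the same shard across replicas. The function name and the consecutive-rank layout are assumptions, not an existing DistributedManager API. In a real job, each list could be passed to `torch.distributed.new_group(ranks=...)`, which every process has to call for every group, including groups it does not belong to.

```python
def orthogonal_rank_groups(world_size: int, model_parallel_size: int):
    """Hypothetical helper: split ranks into model- and data-parallel groups."""
    if world_size % model_parallel_size != 0:
        raise ValueError("world_size must be divisible by model_parallel_size")
    num_replicas = world_size // model_parallel_size

    # Consecutive ranks hold the shards of one model replica.
    model_groups = [
        list(range(r * model_parallel_size, (r + 1) * model_parallel_size))
        for r in range(num_replicas)
    ]
    # Ranks holding the same shard index across all replicas.
    data_groups = [
        list(range(shard, world_size, model_parallel_size))
        for shard in range(model_parallel_size)
    ]
    return model_groups, data_groups


model_groups, data_groups = orthogonal_rank_groups(world_size=8, model_parallel_size=2)
print(model_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(data_groups)   # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Each rank appears in exactly one model-parallel group and one data-parallel group, which is what makes the two sets orthogonal.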

Describe any alternatives you have considered

No
