Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request
Medium
Please provide a clear description of the problem you would like to solve.
I use Modulus `DistributedManager` with SLURM. Right now, `DistributedManager` sets the `local_rank` from the SLURM-assigned local process ID on the node (this line):

```python
local_rank = int(os.environ.get("SLURM_LOCALID"))
```
This line then sets the device based on the `local_rank`:

```python
manager._device = torch.device(
    f"cuda:{manager.local_rank}" if torch.cuda.is_available() else "cpu"
)
```
Notably, this line breaks if `SLURM_LOCALID` is greater than or equal to `torch.cuda.device_count()`.
In my use case, however, I need to use the SBATCH `--gpu-bind=map_gpu:0,1,2,3` flag on a node with 4 GPUs. With 4 processes per node and 4 GPUs per node, each process sees only one device, named `cuda:0`, even though that name refers to a different physical GPU in each process. (This forum post explains why I need to use this flag.)
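To illustrate what the binding does, here is a pure-Python sketch (no GPUs required). The environment variable names are the real SLURM/CUDA ones; the `visible_device` helper and the GPU map are hypothetical, just mimicking how each bound task ends up seeing a single device named `cuda:0`:

```python
import os

def visible_device(slurm_localid: int, gpu_map=(0, 1, 2, 3)):
    """Sketch of --gpu-bind=map_gpu:0,1,2,3 on a 4-GPU node: each local
    task is pinned to one physical GPU, so inside every task
    CUDA_VISIBLE_DEVICES holds a single entry and torch would report
    exactly one device, always named "cuda:0"."""
    physical_gpu = gpu_map[slurm_localid]            # GPU this task is bound to
    os.environ["CUDA_VISIBLE_DEVICES"] = str(physical_gpu)
    device_count = 1                                 # what torch.cuda.device_count() would report
    logical_name = "cuda:0"                          # the same logical name in every task
    return logical_name, physical_gpu, device_count

for localid in range(4):
    print(localid, *visible_device(localid))
```

So all four tasks address `cuda:0`, but the name resolves to four different physical GPUs.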
There may be other use cases where the number of local processes launched through SLURM does not equal the number of accessible GPUs (e.g. running FourCastNet with 4 GPUs and 1 process per GPU, but analyzing the output with more processes).
My request would be to add a flag to `DistributedManager` through which I could specify that the behavior below is desired for SLURM as well:

```python
manager._local_rank = rank % torch.cuda.device_count()
```

This ensures that `torch.device` is never called with a device index that can't be accessed.
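The modulo mapping can be sanity-checked without any GPUs. A minimal sketch (`propose_local_rank` is a hypothetical name for illustration, not Modulus API):

```python
def propose_local_rank(rank: int, device_count: int) -> int:
    """Map a global rank onto a valid device index, wrapping around
    when there are more processes than visible devices."""
    return rank % device_count

# With --gpu-bind, every process sees exactly 1 device -> always index 0.
print([propose_local_rank(r, 1) for r in range(4)])   # expect [0, 0, 0, 0]

# 8 analysis processes over 4 visible GPUs -> indices wrap around 0..3.
print([propose_local_rank(r, 4) for r in range(8)])   # expect [0, 1, 2, 3, 0, 1, 2, 3]
```

In both cases every process ends up with a device index that actually exists, which is exactly what the current SLURM path does not guarantee.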
Describe any alternatives you have considered
Without such a flag, `DistributedManager.initialize()` raises an error because `torch.device` is used to access a device that is not available. I could make an equivalent for `DistributedManager`, or I could create a subclass of `DistributedManager` that overrides the `initialize_slurm` method. Let me know if that would be the preferred solution, and I can continue with my fix on my local end.
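For reference, the subclass workaround would look roughly like this. This is a sketch against a stand-in base class, since I can't paste the real Modulus internals here; the method and attribute names mirror the ones above, but everything else is an assumption:

```python
import os

class DistributedManager:
    """Stand-in for modulus' DistributedManager, reduced to the one
    method relevant here."""
    _local_rank = 0

    @classmethod
    def initialize_slurm(cls, port):
        # The real implementation reads SLURM_LOCALID (among other env
        # vars) and later builds torch.device(f"cuda:{local_rank}").
        cls._local_rank = int(os.environ.get("SLURM_LOCALID", "0"))

class WrappedSlurmManager(DistributedManager):
    @classmethod
    def initialize_slurm(cls, port, device_count=1):
        super().initialize_slurm(port)
        # Override: wrap the SLURM-assigned local rank onto the devices
        # that are actually visible, so the device index always exists.
        cls._local_rank = cls._local_rank % device_count
```

With `--gpu-bind` and one visible device per task, `SLURM_LOCALID=3` would then map to local rank `0` instead of the invalid `cuda:3`.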