
Comments (8)

bcharlier commented on August 28, 2024

Hi @timlacroix ,

When you set CUDA_VISIBLE_DEVICES=0, the NVIDIA driver exposes only the GPU with id=0. At compilation time, KeOps therefore detects only the GPU with id=0, so this is the expected behavior.

Can you try to call python test.py without setting the env variable CUDA_VISIBLE_DEVICES? It should work as you expect, since you already ask KeOps to run on GPU 0 through the .to('cuda:0') call...

For instance:

import torch
from pykeops.torch import LazyTensor

def test(data):
	neigh_state = LazyTensor(data[None, :, :])
	state = LazyTensor(data[:, None, :])
	all_distances = ((neigh_state - state) ** 2).sum(dim=2)
	return (- all_distances).logsumexp(dim=1)

tensor = torch.randn(10,128).to('cuda:0')
test(tensor) # should run on gpu 0

tensor1 = torch.randn(10,128).to('cuda:1')
print(torch.cuda.device_count())
test(tensor1) # should run on gpu 1... without recompiling

from keops.

timlacroix commented on August 28, 2024

Hi, I used CUDA_VISIBLE_DEVICES here to make the problem reproducible.

My question concerns a set-up where development (and thus compilation) happens on a machine with N GPUs and testing happens on a machine with M GPUs, with both sharing the same compilation cache.

Couldn't the number of GPUs available at compile time be used in the compiled code hash? That way, changing the number of GPUs would just force a rebuild rather than raise an error.
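The idea can be sketched in a few lines. This is only an illustration, not the actual KeOps hashing code: cache_key, its formula string, and the way the GPU count is folded into the key are all hypothetical.

```python
import hashlib

def cache_key(formula: str, num_gpus: int) -> str:
    """Hypothetical cache key that folds the visible-GPU count into the
    hash, so changing the number of GPUs triggers a rebuild rather than
    loading a stale shared library."""
    payload = f"{formula}|ngpus={num_gpus}".encode()
    return hashlib.sha256(payload).hexdigest()[:16]

# A different GPU count yields a different key for the same formula:
print(cache_key("SqDist(x,y)", 1) != cache_key("SqDist(x,y)", 2))  # True
```

With such a key, a node with a different GPU count would simply miss the cache and recompile, instead of loading a shared lib built for the wrong configuration.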


bcharlier commented on August 28, 2024

Maybe @joanglaunes knows this better than me, but I think it will not be possible to make the same shared lib work on 2 different systems. Why don't you define 2 separate cache folders?


bcharlier commented on August 28, 2024

> Hi, I used CUDA_VISIBLE_DEVICES here to make the problem reproducible.
>
> My question concerns a set-up where development (and thus compilation) happens on a machine with N GPUs and testing happens on a machine with M GPUs, with both sharing the same compilation cache.
>
> Couldn't the number of GPUs available at compile time be used in the compiled code hash? That way, changing the number of GPUs would just force a rebuild rather than raise an error.

OK, a quick solution could be to include the number of GPUs and their respective architectures in the name of the cache folder. Then, when you call your code from a different node, it will pick up the shared lib from the right cache dir.
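A rough sketch of that naming scheme, assuming the compute capabilities are gathered elsewhere (e.g. with torch.cuda.get_device_capability()); gpu_cache_dir is a hypothetical helper, not part of pykeops:

```python
import os

def gpu_cache_dir(base: str, gpu_archs: list) -> str:
    """Build a cache folder name from the number of GPUs and their
    compute architectures, e.g. 'keops_cache_2x_sm70-sm80'.
    gpu_archs is a list of (major, minor) compute-capability pairs,
    as returned per device by torch.cuda.get_device_capability()."""
    tag = f"{len(gpu_archs)}x_" + "-".join(f"sm{mj}{mn}" for mj, mn in gpu_archs)
    return os.path.join(base, f"keops_cache_{tag}")

# A node with two V100s (sm70) and a node with one A100 (sm80)
# would resolve to distinct cache folders:
print(gpu_cache_dir("/tmp", [(7, 0), (7, 0)]))
print(gpu_cache_dir("/tmp", [(8, 0)]))
```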


timlacroix commented on August 28, 2024

@bcharlier yes, if that's possible that would be great :)


bcharlier commented on August 28, 2024

Is the hostname unique in your case? I mean, is either of these outputs different on the various nodes:

import platform
print(platform.node())

import socket
print(socket.gethostname())


timlacroix commented on August 28, 2024

Both are different on the various nodes.
(However, I might want to vary the number of GPUs available at runtime on the same machine: for example, while developing, I have two things running on 1 GPU, then at some point I want to try one thing on 2 GPUs...)

I don't know if including the hostname is a good idea. It means using a separate cache folder per machine, which the user can already do, if necessary, by just using a random cache folder at runtime. In my case I would be happy to reuse the same cache across the various nodes of the cluster.
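The "random cache at runtime" workaround could look roughly like this; how pykeops is then pointed at the directory is left as a comment, since the exact setter depends on the pykeops version (an assumption, not shown in this thread):

```python
import os
import tempfile

# Create a throwaway, per-run cache directory. pykeops would then be
# pointed at it before any formula is compiled (e.g. via its
# build-folder setting; the exact API depends on the pykeops version).
cache_dir = tempfile.mkdtemp(prefix="keops_cache_")
print(os.path.isdir(cache_dir))  # True
```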


joanglaunes commented on August 28, 2024

Hello @timlacroix,
In fact, the technical problem for us is that the detection of GPUs and their properties is currently done at compile time, in the CMake scripts that are launched after the Python code detects that compilation is needed.
So, as @bcharlier suggests, the easiest solution for us is to include the hostname and node (plus, maybe, the content of CUDA_VISIBLE_DEVICES) in the hash code, because this is easy to do in Python.
That said, including the GPU properties in the hash code is maybe not so difficult either; I guess it can be done with GPUtil...
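For illustration, the GPU properties could be folded into a short tag roughly like this. The property list would come from GPUtil (or per-device torch.cuda queries); here it is hardcoded, and gpu_properties_tag is a hypothetical helper, not KeOps code:

```python
import hashlib
import json

def gpu_properties_tag(gpus: list) -> str:
    """Fold a list of GPU property dicts into a short, stable tag that
    could be appended to the compilation hash (hypothetical helper)."""
    canonical = json.dumps(sorted(gpus, key=lambda g: g["id"]), sort_keys=True)
    return hashlib.md5(canonical.encode()).hexdigest()[:8]

# Hardcoded example; in practice the dicts would be built from
# GPUtil.getGPUs() or torch.cuda device queries:
gpus = [{"id": 0, "name": "Tesla V100", "memoryTotal": 16160},
        {"id": 1, "name": "Tesla V100", "memoryTotal": 16160}]
print(gpu_properties_tag(gpus))  # an 8-character hex tag
```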

