
Comments (11)

tmorgan4 avatar tmorgan4 commented on May 20, 2024 2

By default, TensorFlow grabs all available memory on the GPU when the first process is created, so all subsequent processes fail since no memory remains. This behavior can be changed with the 'allow_growth' option, which lets each process's memory allocation expand dynamically as needed. This is covered in detail here: https://www.tensorflow.org/tutorials/using_gpu
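For reference, a minimal sketch of setting that option with the TF 1.x API (assuming each worker process creates its own `tf.Session`):

```python
import tensorflow as tf  # TF 1.x API

# Start each process with a small allocation and let it grow as needed,
# instead of grabbing all GPU memory at session creation.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
```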

With that said, these asynchronous algorithms are not optimized for GPU and, in my experience, perform much worse when forced to run on one. The A3C algorithm was released first, and a GPU-friendly version called A2C was released some time later. Something similar was done with PPO: OpenAI released a PPO2 algorithm that is optimized for GPU. These GPU-optimized algorithms trade async behavior for batching, which is where GPUs really shine.

It would be great to compare performance between different systems, but I've noticed the global_step/sec parameter Andrew has implemented in the TensorBoard monitor is greatly affected by many settings, making it difficult to compare. The best I have seen to date is around 1800 global_steps/sec using A3C with close-to-default settings on a dual Xeon 2669 workstation with 18 workers.

from btgym.

Kismuz avatar Kismuz commented on May 20, 2024 2

@vincetom1980,
yes indeed, @tmorgan4's comment is right to the point: A3C is good for those who don't have access to cheap GPU resources (like me :). As for performance, here is a post from the A2C developers:
https://blog.openai.com/baselines-acktr-a2c/
which can be summarised as 'it's better to run A2C on GPU than A3C on CPU'.

global_step/sec parameter Andrew has implemented in the tensorboard monitor is greatly affected by many settings making it difficult to compare.

yes, not a good metric, since here global_step is defined not as 'number of algorithm training iterations' but as 'number of environment steps made so far by all workers', and would better be named sampling number or so. I found it more convenient for this particular task. Anyway, train_global_step is easy to insert.

BTW I have included option to run several environments for each worker in a batch like this:

cluster_config = dict(
    host='127.0.0.1',
    port=12230,
    num_workers=4,  
    num_ps=1,
    num_envs=4,
    log_dir=os.path.expanduser('~/tmp/test_4_8'),
)
  • here 4 x 4 = 16 environment instances will be run, and each worker will get a batch of four rollouts per train step; but it seems such a setup further lowers sampling efficiency, and I'm not confident it can run well on GPU.


JaCoderX avatar JaCoderX commented on May 20, 2024 1

@tmorgan4, @Kismuz Thank you both for your replies.

I think for now I'll stick to CPU :)


vincetom1980 avatar vincetom1980 commented on May 20, 2024

tmorgan4,

Thanks for your comments!

Tom


vincetom1980 avatar vincetom1980 commented on May 20, 2024

Andrew,
Thank you for your Answer!

Tom


JaCoderX avatar JaCoderX commented on May 20, 2024

The A3C algorithm was designed to work with 'workers' that run on CPU, so running the whole framework on GPU doesn't make a lot of sense.

But what about running specific parts on GPU?

I'm currently experimenting with conv_1d_casual_encoder using a large time_dim=4096. My problem is that, because it adds a lot more parameters to the model, every step's computation takes considerably more time.

So I was thinking maybe I can wrap only the encoder with tf.device('/gpu:0'): and force the encoder block to run on GPU. This way everything would run on CPU except the convolution part, which is known to work very well on GPU.

I've made the following changes to the code:

  • wrapped the encoder
def conv_1d_casual_encoder(
...
    with tf.device('/gpu:0'):
        with tf.variable_scope(name_or_scope=name, reuse=reuse):
...
  • added a config to the Session for GPU logging/control.
    **not sure this was the right place to put the config, but tensorboard.py was the only place I found tf.Session() in the code
class BTgymMonitor():
...
    config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=False)
    config.gpu_options.allow_growth = True
    self.sess = self.tf.Session(config=config)

I ran a test using only 1 worker but I couldn't get it to work (error: CUDA out of memory).
I even get this error if I use with tf.device('/gpu:0'): on a simple operation inside the encoder.

The log TensorFlow generates shows there is an active GPU, and 'tensorflow-gpu' is the only version installed (and it works properly).

I'm having a hard time understanding the source of the problem.

Hopefully there is a solution, as CPU power is not enough to experiment with large time_dim efficiently.


Kismuz avatar Kismuz commented on May 20, 2024

@JacobHanouna,

A3C algorithm was designed to work on 'workers' that run on CPU. So running the whole framework on GPU doesn't make a lot of sense.

  • there exists an extension of A3C optimised for GPU, named A3G;
  • another option is the batched version, A2C; both can be found on arxiv / github

class BTgymMonitor()

...is deprecated and not related at all; do not use it. For the proper place to configure distributed TF device placement, see:

https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/worker.py#L195

https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/aac.py#L439

https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/launcher/base.py#L264

some explanations:

https://www.tensorflow.org/deploy/distributed


JaCoderX avatar JaCoderX commented on May 20, 2024

Thanks for the guidance @Kismuz
I will try to give it a go :)


JaCoderX avatar JaCoderX commented on May 20, 2024

I've been reading both the code and the TensorFlow distributed docs for a couple of hours; not the easiest topic to follow.

This is what I understand so far:

  • we define the number of workers in cluster_config and pass it to the launcher.
  • the launcher then gives each worker a task number and all the config data needed to run A3C.
  • all workers are then instantiated in the worker class
  • each worker then instantiates the A3C model in the BaseAAC class
  • in the A3C class, each worker is assigned a device that binds it to CPU computation.

I'm not sure how to modify the code so that I have another worker bound to GPU that is not part of A3C.

I'm not looking for something pretty, just a way to use something like with tf.device("/job:worker/task:{}/gpu:0".format(task)): over the encoder block.
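As a rough sketch of that idea (hypothetical, not the actual btgym code path; `build_model` and the layer shapes are made up for illustration), pinning just the encoder ops while the rest of the graph stays on the worker's CPU could look like:

```python
import tensorflow as tf  # TF 1.x API

def build_model(task, x):
    # Everything defaults to this worker's CPU device...
    with tf.device('/job:worker/task:{}/cpu:0'.format(task)):
        h = tf.layers.dense(x, 64, activation=tf.nn.relu)
        # ...except the conv encoder, pinned to that worker's GPU.
        with tf.device('/job:worker/task:{}/gpu:0'.format(task)):
            h = tf.expand_dims(h, -1)
            h = tf.layers.conv1d(h, filters=32, kernel_size=3, padding='same')
        # Back on CPU for the policy/value heads.
        logits = tf.layers.dense(tf.layers.flatten(h), 4)
    return logits
```

Note that `allow_soft_placement=True` in the session config would let TensorFlow fall back to CPU if a pinned op has no GPU kernel.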


Kismuz avatar Kismuz commented on May 20, 2024

@JacobHanouna, it is correct, except that it's essential to understand that it is the tensorflow graph (or even a specific part of it) that gets assigned to a specific device, not a python object or process (instance of worker etc.);

In a nutshell, there are replicas of the graph assigned to each worker process and one replica held by the parameter server process; the latter receives trainable-parameter updates from the workers' graphs (to be exact, it gets the computed gradients and applies them to its own variables following the optimiser rule); then each worker copies the updated variables into its own graph to work with.
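The variable-placement half of this arrangement is what `tf.train.replica_device_setter` automates in TF 1.x; a minimal sketch, with made-up addresses and graph:

```python
import tensorflow as tf  # TF 1.x distributed API

cluster = tf.train.ClusterSpec({
    'ps': ['127.0.0.1:12222'],
    'worker': ['127.0.0.1:12223', '127.0.0.1:12224'],
})

task = 0  # this worker's index
# Variables are placed on the parameter server; ops stay on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device='/job:worker/task:{}'.format(task),
        cluster=cluster)):
    w = tf.get_variable('w', [10, 4])        # lives on /job:ps/task:0
    x = tf.placeholder(tf.float32, [None, 10])
    loss = tf.reduce_mean(tf.matmul(x, w))   # computed on the worker
    train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)
```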

That's a big topic indeed, with a lot of pitfalls, and I do recommend digging through github for some well-written distributed code from the big guys; there is no guarantee that, even if one correctly assigns the computation-heavy part of the graph ops to a GPU device, there will be no lock-ups due to worker concurrency; that's why A2C is more efficient here: it forces each worker to produce its own batch in a synchronous manner, concatenates everything batch-wise, and sends it to the GPU in a single pass.


tmorgan4 avatar tmorgan4 commented on May 20, 2024

@JacobHanouna Making this work on GPU will require a fair amount of rework. You are most likely getting the 'CUDA out of memory' error because TensorFlow by default grabs all available memory on the device in the first session, so all other workers don't see any available memory.

You actually posted the solution above (from BTgymMonitor, which is not being used): you need to specify config.gpu_options.allow_growth = True. This tells TensorFlow to allocate a small amount of memory to start and expand as needed. You can also specify a fraction of memory for each process to allocate, if that is more convenient. It's all covered in detail under 'Allowing GPU memory growth':
https://www.tensorflow.org/guide/using_gpu
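The per-process fraction mentioned above is set via `per_process_gpu_memory_fraction`; e.g., to cap each worker at roughly a quarter of the card (the 0.25 value is just an example):

```python
import tensorflow as tf  # TF 1.x API

config = tf.ConfigProto()
# Each process may allocate at most ~25% of total GPU memory.
config.gpu_options.per_process_gpu_memory_fraction = 0.25
sess = tf.Session(config=config)
```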

As an aside, I have been digging through @Kismuz's code for a long time (a year?) and just finally understanding how certain parts work together. Andrew has done an extraordinary job especially considering he's done it nearly all himself.
