
ReverbDataset on TPU (reverb, 34 comments, CLOSED)

weichseltree commented on August 22, 2024
ReverbDataset on TPU

from reverb.

Comments (34)

sabelaraga commented on August 22, 2024

Hey Manuel,

The ReverbDataset op is CPU only, so you have to run the ReverbDataset on CPU, and you can use distribute_datasets_from_function. It takes a dataset_fn that executes on CPU, and the strategy transfers the data to the device. See examples of usage in TF Agents.
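A minimal runnable sketch of this pattern, with tf.distribute.OneDeviceStrategy standing in for TPUStrategy and a plain range dataset standing in for the ReverbDataset, so it runs without a TPU or a Reverb server:

```python
import tensorflow as tf

# Stand-in for TPUStrategy so the sketch runs on any machine; on a TPU
# you would build the strategy from a TPUClusterResolver instead.
strategy = tf.distribute.OneDeviceStrategy("/cpu:0")

def experience_dataset_fn(input_context):
    # In the real setup this would build the ReverbDataset; a range
    # dataset stands in for it here. The function itself runs on CPU.
    dataset = tf.data.Dataset.range(8).batch(2)
    return dataset.shard(input_context.num_input_pipelines,
                         input_context.input_pipeline_id)

# The strategy calls experience_dataset_fn once per input pipeline and
# handles the transfer of the resulting batches to the device.
dist_dataset = strategy.distribute_datasets_from_function(experience_dataset_fn)
for batch in dist_dataset:
    print(batch)
```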

Let us know if it doesn't help.

Sabela.

weichseltree commented on August 22, 2024

Thank you very much for your reply. Unfortunately, I cannot get it to work. If you would like to find out what's going wrong, please take a look at this Colab notebook I wrote that recreates my issue:

https://github.com/weichseltree/reverb_dataset_on_tpu/blob/main/reverb_dataset_on_tpu.ipynb

sabelaraga commented on August 22, 2024

Just to rule out other issues, can you check whether you have the same TF version in your VM and on the TPU (https://cloud.google.com/tpu/docs/version-switching?hl=en)?

I think in the colab you don't need the with tf.device('/CPU:0'): block inside experience_dataset_fn.

Thanks!

weichseltree commented on August 22, 2024

I tried your suggestion, but it doesn't fix the error.
I am using TensorFlow version 2.4.1, and the TPU is set to software version 2.4.1 as well.
The code runs fine when the TPU is not connected.

sabelaraga commented on August 22, 2024

For some reason it seems that it is trying to run the dataset op on the TPU. One thing I find weird in the colab is that the iterator is created outside of the strategy scope while the creation of the dataset is inside. AFAIK, the creation of the dataset can be inside or outside of the strategy scope (see the final section here).

weichseltree commented on August 22, 2024

If I play around with scope and device placement, nothing changes.

The error can occur in different places. In my notebook it fails on the .batch() operation, but if I don't call that, it simply fails on the .prefetch() or next(dataset) operation instead. So the root of the problem must be in the initialization of the ReverbDataset or the way it interacts with the interleave() op.

sabelaraga commented on August 22, 2024

Hey Manuel, I think we found the issue and it has to do with the software architecture of the Cloud TPUs (https://cloud.google.com/tpu/docs/system-architecture#software_architecture): the imports are not available to the binary that runs on the TPU host.

The solution is to use tf.data.service (here is an example of launching it with GKE) and register the dataset there.

I'll try to find some time to run the full workflow, but leaving here the pointers in case you want to give it a try!

Sabela.

weichseltree commented on August 22, 2024

Okay that looks promising. I tried the setup with a test dataset with two processes:

  1. first process not connecting to the TPU and running the DispatchServer as well as the WorkerServer
  2. second process connecting to the TPU and trying to fetch the dataset via tf.data.experimental.service.from_dataset_id

The second process now gets stuck at the next(iterator) call.
Without connecting the second process to the TPU it works fine, so I guess I will have to wait for your solution.

sabelaraga commented on August 22, 2024

Thanks for giving this a try! One question, is the DispatchServer running in the same colab as next(iterator)?

weichseltree commented on August 22, 2024

Since I am not aware that running multiple notebooks on the same Colab instance is possible, I just tried it on a GCP instance.
The DispatchServer is not running in the same notebook/process as the next(iterator) call. The two parts above really just mean two python processes running on the same machine.

weichseltree commented on August 22, 2024

Any updates on this?

sabelaraga commented on August 22, 2024

Two quick questions:

  • When you say 'gets stuck', I guess it's not logging any error?
  • (Not totally related) are you in the TFRC program?

weichseltree commented on August 22, 2024
  • No, it seems to have trouble connecting from the TPU, but there is no error. I made sure I don't use 'localhost' but the IP address of the machine running the DispatchServer, so I guess my setup should have worked.
  • I'm not part of the TFRC program right now, but as a startup we would love to be if that's possible. I used to receive credits for using TPUs a few years ago for an unrelated research project, which was really nice.

sabelaraga commented on August 22, 2024

And another couple of questions (to see if this is a problem of the tf.data service with Reverb or if there is something else):

  • if you use the tf.data.service and the other process doesn't run on TPU, does it work or does it get stuck as well?
  • When using the TPU & the tf.data service, does it get stuck on sampling if you register a non-reverb dataset (e.g., tf.data.Dataset.from_tensor_slices(...))?

weichseltree commented on August 22, 2024

I actually didn't use the ReverbDataset in my tests. The code I use is this:
First notebook:

import tensorflow as tf

dispatcher = tf.data.experimental.service.DispatchServer()
dispatcher_address = dispatcher.target.split("://")[1]
worker = tf.data.experimental.service.WorkerServer(
    tf.data.experimental.service.WorkerConfig(
        dispatcher_address=dispatcher_address))
dataset = tf.data.Dataset.range(10)
dataset_id = tf.data.experimental.service.register_dataset(
    dispatcher.target, dataset)
print(dataset_id)
print(dispatcher.target)

It then prints the dataset_id, usually 1000, and the target, something like 'grpc://localhost:44771'.
I then use another notebook to connect to it like this:

import os
import tensorflow as tf

print("Tensorflow version " + tf.__version__)
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver('<my-tpu-node-name>')  # TPU detection
    print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
except ValueError:
    raise BaseException('ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.TPUStrategy(tpu)

dataset = tf.data.experimental.service.from_dataset_id(
    processing_mode="parallel_epochs",
    service='grpc://<my-gcp-local-ip-address>:<dispatch-server-port>', # something like 'grpc://10.156.0.34:44771'
    dataset_id=<dataset-id>, # usually 1000
    element_spec=tf.TensorSpec(shape=(), dtype=tf.int64))

iterator = iter(dataset)
print(next(iterator))

The code runs fine if I don't connect to the TPU but gets stuck if I connect to the TPU. So there seems to be an issue with the tf.data.service on the TPU.
Both notebooks and the TPU node run tensorflow 2.4.1.

sabelaraga commented on August 22, 2024

Thanks for trying this out and confirming the details.

I think that the problem is still that the TPU worker cannot connect to the notebook runtime, and it's necessary to run the tf.data service separately (not in a notebook). This tutorial shows how to run it on a GKE cluster, and TPU should be able to communicate with it.

sabelaraga commented on August 22, 2024

Actually, it might be easier to make sure the notebook runs in the same VPC network as the TPU (see how to configure the notebook instance here).

weichseltree commented on August 22, 2024

I eventually figured out how to connect the TPU to the DispatchServer using a non-reverb dataset by opening the ports involved.

However, when I switch back to the ReverbDataset, the original issue reappears:

tensorflow.python.framework.errors_impl.NotFoundError: Failed to register dataset: Op type not registered 'ReverbDataset' in binary running on <gcp-instance-name>. Make sure the Op and Kernel are registered in the binary running in this process.

This code can be used to reproduce the issue, I didn't even connect to any TPU.

import tensorflow as tf
import acme
import acme.datasets

import logging
logging.getLogger('tensorflow').setLevel(logging.DEBUG)

dispatcher = tf.data.experimental.service.DispatchServer(
    tf.data.experimental.service.DispatcherConfig(port=5050))
dispatcher_address = dispatcher.target.split("://")[1]
worker = tf.data.experimental.service.WorkerServer(
    tf.data.experimental.service.WorkerConfig(
        dispatcher_address=dispatcher_address))

dataset = acme.datasets.make_reverb_dataset(
    'localhost:9999',
    environment_spec=acme.specs.EnvironmentSpec(
                          observations=tf.TensorSpec((), dtype=tf.float32),
                          actions=tf.TensorSpec((), dtype=tf.float32),
                          rewards=tf.TensorSpec((), dtype=tf.float32),
                          discounts=tf.TensorSpec((), dtype=tf.float32)),
    table='training',
    batch_size=1,
    prefetch_size=tf.data.experimental.AUTOTUNE,
    sequence_length=10)

tf.data.experimental.service.register_dataset(
    dispatcher.target, dataset)

input()

tensorflow: 2.4.1
dm-acme: 0.2.0
dm-reverb: 0.2.0

sabelaraga commented on August 22, 2024

Hey Manuel, sorry for the late reply. Do you have the notebook code used to access the dataset and reproduce the error?

Thanks!

thisiscam commented on August 22, 2024

Hi,

I just noticed this thread. I would like some clarification on this too: in the setup by @weichseltree, are you using TensorFlow (instead of, for example, JAX)? The notebook link https://github.com/weichseltree/reverb_dataset_on_tpu/blob/main/reverb_dataset_on_tpu.ipynb is dead, unfortunately, so I can only guess...

Based on my understanding per this comment, it's OK to run custom TensorFlow CPU ops on the VM attached to a TPU. Perhaps it's possible to build a standalone TF custom op for the Reverb dataset and then load it via tf.load_op_library (if this is not done already, that is)?
Please correct me if I'm wrong though!

Also to my understanding, this won't be useful unless one is using TensorFlow, or using an alpha release that allows direct ssh access to the TPU VM.

sabelaraga commented on August 22, 2024

Hey,
IIRC, the colab was all in TF.

The reverb dataset is already a TF custom op that runs on CPU, but the binary running on the TPU worker doesn't have access to it.

But you're right, the new architecture should solve the problem.

thisiscam commented on August 22, 2024

but the binary running on the TPU worker doesn't have access to it.

I see. Thanks for confirming.

thisiscam commented on August 22, 2024

Just curious, will it work if you put the shared library file in a GCS bucket though?

sabelaraga commented on August 22, 2024

Not sure what you mean. In this case, if you run a tf.data.service on a separate server, the TPU worker should fetch tensors with a tf.data.service client without having to know that they come from a Reverb Dataset.

thisiscam commented on August 22, 2024

I think TPUs can access cloud storage bucket: https://cloud.google.com/tpu/docs/storage-buckets.
I'm thinking that perhaps one can put the compiled shared library file for the custom reverb dataset op (e.g. reverb_custom_ops.so or some similar name) into a GCS bucket, and somehow direct the TPU VM to load the shared library.

I suppose this sounds like it is probably disallowed...

sabelaraga commented on August 22, 2024

Yeah, I'm not an expert on how the TPU VMs access Cloud storage, but I don't think the TPU worker could load that.

ebrevdo commented on August 22, 2024

I've asked Zak Stone to weigh in on this.

ebrevdo commented on August 22, 2024

Looks like the new single-VM TPU service is now available. You should be able to connect to a TPU host directly and run your Python code there. One nice advantage is that you can colocate your Reverb server and the learner in the same process, allowing Reverb to bypass the RPC layer and do zero-copy transfers between the two.

manuel-weichselbaum commented on August 22, 2024

Hello everyone. It's been a while, sorry for abandoning this issue.
I'm now working on the setup just like @ebrevdo described. I am on a Cloud TPU VM which comes with tensorflow 2.6.0 preinstalled. When I try to adapt the /usr/share/tpu/tensorflow/simple_example.py file by adding import reverb after import tensorflow as tf I get the following error:

Traceback (most recent call last):
  File "simple_example.py", line 4, in <module>
    import reverb
  File "/usr/local/lib/python3.8/dist-packages/reverb/__init__.py", line 27, in <module>
    from reverb import item_selectors as selectors
  File "/usr/local/lib/python3.8/dist-packages/reverb/item_selectors.py", line 19, in <module>
    from reverb import pybind
  File "/usr/local/lib/python3.8/dist-packages/reverb/pybind.py", line 1, in <module>
    import tensorflow as _tf; from .libpybind import *; del _tf
ImportError: libtensorflow_framework.so.2: cannot open shared object file: No such file or directory

I cannot find the file in the TensorFlow install directory; maybe this is a change from 2.5 to 2.6?

Simply copying the one from tensorflow 2.5 results in another error:

Traceback (most recent call last):
  File "simple_example.py", line 4, in <module>
    import reverb
  File "/usr/local/lib/python3.8/dist-packages/reverb/__init__.py", line 27, in <module>
    from reverb import item_selectors as selectors
  File "/usr/local/lib/python3.8/dist-packages/reverb/item_selectors.py", line 19, in <module>
    from reverb import pybind
  File "/usr/local/lib/python3.8/dist-packages/reverb/pybind.py", line 1, in <module>
    import tensorflow as _tf; from .libpybind import *; del _tf
ImportError: /usr/local/lib/python3.8/dist-packages/reverb/libpybind.so: undefined symbol: _ZN4absl12lts_202103245MutexD1Ev

Do I have to compile reverb on the Cloud TPU VM?
Thank you in advance for pointing me in the correct direction :)

ebrevdo commented on August 22, 2024

Looks like you'll need to use a version of reverb built for your version of TF. @tfboyd just released a new minor version to match TF 2.6.0. Give that a try?

tfboyd commented on August 22, 2024

0.4.0 was compiled with TF 2.6.0 and should match up as expected. https://pypi.org/project/dm-reverb/#history
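A quick way to compare the installed versions against that pairing (a sketch; it only prints what is installed and suggests the matching install command as a comment):

```shell
# Print the installed TensorFlow and dm-reverb versions so they can be
# compared against the pairing above (dm-reverb 0.4.0 <-> TF 2.6.0).
python3 - <<'EOF'
try:
    import tensorflow as tf
    print("tensorflow:", tf.__version__)
except ImportError:
    print("tensorflow: not installed")
try:
    import reverb
    print("dm-reverb:", getattr(reverb, "__version__", "unknown"))
except ImportError:
    print("dm-reverb: not installed")
EOF

# If they disagree, install a matching pair, e.g.:
#   pip install "tensorflow==2.6.0" "dm-reverb==0.4.0"
```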

manuel-weichselbaum commented on August 22, 2024

Okay, I think I see what's happening. A Cloud TPU VM comes with a custom tf-nightly 2.6.0 build, and that version gives the errors above. When I install the default TensorFlow 2.6.0 and Reverb 0.4.0, everything imports nicely. However, the TPU ops are then not available when I initialize the system as shown in the tutorial https://cloud.google.com/tpu/docs/tensorflow-quickstart-tpu-vm: InvalidArgumentError: No OpKernel was registered to support Op 'ConfigureDistributedTPU' (this also happens with the dm-reverb-nightly[tensorflow] version).

So I guess I have to build Reverb to match the pre-installed version. Is that possible? How would I deviate from the build-from-source tutorial in that scenario?

MorganeAyle commented on August 22, 2024

Hello,

I am also getting the following error when trying to sample from the ReplayDataset after having initialized a TPU:
tensorflow.python.framework.errors_impl.NotFoundError: 'ReverbDataset' is neither a type of a primitive operation nor a name of a function registered in binary running on n-9f826cf4-w-0. Make sure the operation or function is registered in the binary running in this process. [Op:DeleteIterator]

I am using TensorFlow 2.4.1 and Reverb 0.2.0. Everything works fine when sampling and training on a GPU with the same code. I tried the above-mentioned suggestions, but with no luck. Was anyone able to solve this issue?

sabelaraga commented on August 22, 2024

Since the last report on this issue is from 2021, and things have changed significantly since then, I'm going to close it. Please reopen (or create a new issue) if you experience the same problem.
