tensorflow / mesh Goto Github PK

Mesh TensorFlow: Model Parallelism Made Easier

License: Apache License 2.0

Python 99.90% Shell 0.10%

mesh's Issues

Regarding data and model parallelism of mnist python code in examples

I have made changes to the mnist.py in the examples section, as documented in the GitHub I have made the changes to achieve data parallelism and model parallelism. I have collected nvprof files for each of them. It seems to be a bit off. Because p2p interaction is happening in data parallelism but not in model parallelism. I went back and checked and re-created the files but still it looks the same. I am attaching the screenshots of nvprof. I have done this using 4 GPU's. I am also attaching the nvprof files.

link for model parallelism nvprof file:
https://drive.google.com/open?id=1omQ_neb7eUgmDRnYMmLUyKzD2inO4Kai

link for data parallelism nvprof file:
https://drive.google.com/open?id=1MHGdzexNIcV9L66x1VkUQ11DBcM5H_qv

tf2 in mesh_tensorflow/utils.py incompatible with tensor2tensor/rl

I'm wondering if tf2 is absolutely needed in mesh_tensorflow/utils.py? I'm trying to reproduce on the provided Google colab https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/rl
with tensorflow 1.13.1 and T2T 1.13.1 (the recommended config), but I got stuck at line 26 import tensorflow.compat.v2 as tf2 because I'm using tensorflow 1.13.1

Would it be possible to make mesh_tensorflow compatible with tensorflow v1?

Capture performance profile using Tensorboard

I would like to debug training/fine-tuning performance of mesh transformer on CPU/GPU.
Is it possible to capture performance profile using Tensorboard?
If so, is there an example or tutorial that I can follow?

Can you go across multiple nodes?

Is it possible to use devices that are on different machines? For example, in Horovod I can specify the IP addresses of multiple machines and do data parallelism across them. However, this requires me to specifically have MPI setup on each machine. It's unclear to me if this can be done with TF Mesh. Maybe with a tf.train.clusterspec and the parameter server model??

Thanks.
-Tony

Mesh Tensorflow

Any plans to support Mesh Tensorflow?

[Bug Fix] Evaluation and Prediction for Aligned model

Hello,

Both evaluation and prediction currently not working with the aligned model "Bert Style".

I have fixed this issue by adding a new if statement in "transformer/utils.py":

    elif mode == tf.estimator.ModeKeys.PREDICT:
      inputs = mtf_features["inputs"]
      if predict_fn:
        mtf_samples = predict_fn(
            model=transformer_model,
            features=mtf_features,
            variable_dtype=get_variable_dtype())
      elif isinstance(transformer_model, transformer.Unitransformer) and model_type == 'aligned':
        # pad so that there is enough room for the targets
        inputs = mtf.pad(
            inputs, [0, sequence_length["targets"]], length_dim.name)
        logits, _ = transformer_model.call_simple(
            inputs=inputs, variable_dtype=get_variable_dtype(),
            compute_loss=False,
            mode=tf.estimator.ModeKeys.PREDICT)

        label_c_dim = mtf.Dimension('vocab', 256)
        mtf_samples = mtf.argmax(logits, label_c_dim)

As well as "transformer/transformer.py" needs to be modified :

  def call_simple(self,
                  inputs = None,
                  targets = None,
                  compute_loss = False,
                  mode=tf.estimator.ModeKeys.TRAIN,
                  variable_dtype=mtf.VariableDType(tf.float32),
                  sequence_id=None,
                  subsequence_id=None,
                  position=None,
                  encoder_output=None,
                  encoder_sequence_id=None,
                  encoder_inputs=None,
                  shared_params=None,
                  layer_outputs=None,
                  encoder_layer_outputs=None,
                  num_microbatches=1):

The only thing that I am currently defining manually is "label_c_dim".
@adarob @craffel @nshazeer It will be great if you could merge my code or defining a better solution and find an automatic way to find the vocab size for "label_c_dim".

Non autoregressive Predict and Evaluate doesn’t Work

Hi,

I am using Google T5 library which is based on TensorFlow mesh for training a non-autoregressive model like Bert.

The training running without a problem, but both the prediction and the evaluation don't work because of the Unitransformer model expects only an autoregressive model for decoding.

ERROR:tensorflow:Error recorded from prediction_loop: must be autoregressive
  In call to configurable 'sample_autoregressive' (<function Unitransformer.sample_autoregressive at 0x7f2a3276a620>)
INFO:tensorflow:prediction_loop marked as finished
WARNING:tensorflow:Reraising captured error
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-86647c2a14e0> in <module>()
      2 model.eval(
      3     mixture_or_task_name="ss3",
----> 4     checkpoint_steps="all"
      5 )

31 frames
/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/transformer/transformer.py in sample_autoregressive(self, partial_sequences, stop_at_token, max_steps, temperature, variable_dtype, encoder_output, encoder_sequence_id, encoder_inputs, shared_params, has_partial_sequences, encoder_layer_outputs, never_end, remove_partial_sequences, sampling_keep_top_k)
    778     """
    779     if not self.autoregressive:
--> 780       raise ValueError("must be autoregressive")
    781 
    782     inputs = partial_sequences

ValueError: must be autoregressive
  In call to configurable 'sample_autoregressive' (<function Unitransformer.sample_autoregressive at 0x7f2a3276a620>)
  In call to configurable 'decode' (<function decode at 0x7f2a3270eb70>)

This is the gin file, that I have used:

import mesh_tensorflow.optimize
import mesh_tensorflow.transformer.learning_rate_schedules
import mesh_tensorflow.transformer.transformer_layers
import t5.models.mesh_transformer
import t5.data.sentencepiece_vocabulary

# Macros:
# ==============================================================================
d_ff = 3072
d_kv = 64
d_model = 768
dropout_rate = 0.1
MIXTURE_NAME = 'ss3'
num_heads = 12
num_layers = 12
model_parallelism = 1
split= "train"
tokens_per_batch = 65536

# Parameters for AdafactorOptimizer:
# ==============================================================================
AdafactorOptimizer.beta1 = 0.0
AdafactorOptimizer.clipping_threshold = 1.0
AdafactorOptimizer.decay_rate = None
AdafactorOptimizer.epsilon1 = 1e-30
AdafactorOptimizer.epsilon2 = 0.001
AdafactorOptimizer.factored = True
AdafactorOptimizer.min_dim_size_to_factor = 128
AdafactorOptimizer.multiply_by_parameter_scale = True

# Parameters for denoise:
# ==============================================================================
denoise.inputs_fn = @preprocessors.noise_span_to_unique_sentinel
denoise.noise_density = 0.15
denoise.noise_mask_fn = @preprocessors.iid_noise_mask
denoise.targets_fn = @preprocessors.nonnoise_span_to_unique_sentinel

# Parameters for DenseReluDense:	
# ==============================================================================	
DenseReluDense.dropout_rate = %dropout_rate	
DenseReluDense.hidden_size = %d_ff	

# Parameters for drop_noise_tokens:	
# ==============================================================================	
# None.	

# Parameters for drop_nonnoise_tokens:	
# ==============================================================================	
# None.

# Parameters for get_dataset:
# ==============================================================================

# Parameters for get_sentencepiece_model_path:
# ==============================================================================
get_sentencepiece_model_path.mixture_or_task_name = %MIXTURE_NAME

# Parameters for get_variable_dtype:
# ==============================================================================
get_variable_dtype.activation_dtype = 'bfloat16'

# Parameters for iid_noise_mask:
# ==============================================================================
# None.

# Parameters for LayerStack:
# ==============================================================================
LayerStack.dropout_rate = %dropout_rate	
LayerStack.norm_epsilon = 1e-06

# Parameters for learning_rate_schedule_noam:
# ==============================================================================
learning_rate_schedule_noam.linear_decay_fraction = 0.0
learning_rate_schedule_noam.multiplier = 1.0
learning_rate_schedule_noam.offset = 0
learning_rate_schedule_noam.warmup_steps = 10000

# Parameters for make_layer_stack:
# ==============================================================================
make_layer_stack.block_scope = True	
make_layer_stack.layers = \
    [@mesh_tensorflow.transformer.transformer_layers.SelfAttention,	
     @mesh_tensorflow.transformer.transformer_layers.DenseReluDense]	
make_layer_stack.num_layers = %num_layers

# Parameters for mesh_train_dataset_fn:
# ==============================================================================
mesh_train_dataset_fn.mixture_or_task_name = %MIXTURE_NAME

# Parameters for noise_span_to_unique_sentinel:
# ==============================================================================
# None.

# Parameters for nonnoise_span_to_unique_sentinel:
# ==============================================================================
# None.

# Parameters for pack_dataset:
# ==============================================================================

# Parameters for pack_or_pad:
# ==============================================================================
# None.

# Parameters for rate_num_examples:
# ==============================================================================
rate_num_examples.maximum = 524288
rate_num_examples.scale = 1.0
rate_num_examples.temperature = 1.0

# Parameters for reduce_concat_tokens:
# ==============================================================================
reduce_concat_tokens.batch_size = 128
reduce_concat_tokens.feature_key = 'targets'

# Parameters for run:
# ==============================================================================
run.autostack = True
run.batch_size = ('tokens_per_batch', %tokens_per_batch)
run.dataset_split = %split
run.ensemble_inputs = None
run.eval_checkpoint_step = None
run.eval_dataset_fn = None
run.eval_summary_dir = None
run.export_path = ''
run.iterations_per_loop = 100
run.keep_checkpoint_max = None
run.layout_rules = \
    'ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch'
run.learning_rate_schedule = @learning_rate_schedules.learning_rate_schedule_noam
run.mesh_shape = @mesh_tensorflow.transformer.utils.tpu_mesh_shape()
run.mode = 'train'
run.model_type = 'aligned'
run.optimizer = @optimize.AdafactorOptimizer
run.perplexity_eval_steps = 10
run.predict_fn = None
run.save_checkpoints_steps = 5000
run.sequence_length = {'inputs': 512, 'targets': 512}
run.train_dataset_fn = @t5.models.mesh_transformer.mesh_train_dataset_fn
run.train_steps = 786432
run.variable_filter = None
run.vocabulary = @t5.data.sentencepiece_vocabulary.SentencePieceVocabulary()

# Parameters for select_random_chunk:
# ==============================================================================
select_random_chunk.feature_key = 'targets'
select_random_chunk.max_length = 65536

# Parameters for SelfAttention:
# ==============================================================================
SelfAttention.attention_kwargs = None	
SelfAttention.dropout_rate = %dropout_rate	
SelfAttention.key_value_size = %d_kv	
SelfAttention.num_heads = %num_heads	
SelfAttention.num_memory_heads = 0	
SelfAttention.relative_attention_num_buckets = 32	
SelfAttention.relative_attention_type = 'bias_shared'	
SelfAttention.shared_kv = False

# Parameters for SentencePieceVocabulary:
# ==============================================================================
SentencePieceVocabulary.extra_ids = 100
SentencePieceVocabulary.sentencepiece_model_file = \
    @t5.models.mesh_transformer.get_sentencepiece_model_path()

# Parameters for serialize_num_microbatches:
# ==============================================================================
serialize_num_microbatches.tokens_per_microbatch_per_replica = 2048

# Parameters for split_tokens:
# ==============================================================================
split_tokens.feature_key = 'targets'
split_tokens.min_tokens_per_segment = None

# Parameters for split_tokens_to_inputs_length:
# ==============================================================================
# None.

# Parameters for tpu_estimator_model_fn:
# ==============================================================================
tpu_estimator_model_fn.outer_batch_size = 1
tpu_estimator_model_fn.tpu_summaries = False

# Parameters for tpu_mesh_shape:
# ==============================================================================
tpu_mesh_shape.ensemble_parallelism = None
tpu_mesh_shape.model_parallelism = %model_parallelism
tpu_mesh_shape.tpu_topology = %tpu_topology

# Parameters for Unitransformer:
# ==============================================================================
Unitransformer.d_model = %d_model	
Unitransformer.ensemble = None	
#Unitransformer.input_full_attention = True	
Unitransformer.label_smoothing = 0.0	
Unitransformer.loss_denominator = None	
Unitransformer.loss_fn = None	
Unitransformer.loss_on_targets_only = False	
Unitransformer.max_length = 512	
Unitransformer.name = 'transformer'	
Unitransformer.positional_embedding = True	
Unitransformer.shared_embedding_and_softmax_weights = True	
Unitransformer.vocab_divisor = 128	
Unitransformer.z_loss = 0.0001

# Parameters for unsupervised:
# ==============================================================================
unsupervised.preprocessors = \
    [@preprocessors.select_random_chunk,
     @preprocessors.reduce_concat_tokens,
     @preprocessors.split_tokens_to_inputs_length,
     @preprocessors.denoise]

Is there a solution for that or currently the non-autoregressive doesn't work for eval and predict ?

README.md is outdated

Run the Transfomer model (no Tensor2Tensor dependencies)
The file "examples/transformer_standalone.py" does not exist anymore.
There is a pull request
and transformer_standalone.py has been replaced by mesh_tensorflow.transformer.main
9 months ago, but it wasn't merged with the main branch.

Mesh Tensorflow requires `tensorflow.python.tpu.ops `?

I am running an experiment that requires:

tensorflow==1.13.1 or tensorflow-gpu==1.13.1
tensor2tensor==1.11.0

In tensor2tensor==1.11.0 and mesh-tensorflow==0.1.1, it imports mesh_tensorflow which further imports tensorflow.python.tpu.ops :

import mesh_tensorflow as mtf
#File "/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/__init__.py", line 26, in <module>     
from mesh_tensorflow import simd_mesh_impl
  #File "/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/simd_mesh_impl.py", line 32, in <module>
from tensorflow.python.tpu.ops import tpu_ops  # pylint: disable=g-direct-tensorflow-import
#ModuleNotFoundError: No module named 'tensorflow.python.tpu'

In my version of TF 1.13.1 there is no tensorflow.python.tpu. Any way to fix this error? Which version of mesh_tensorflow should I downgrade to?

Could you please set to False the default value of ignore_comments?

Could you please set to False the default value of ignore_comments?

mesh/mesh_tensorflow/transformer/utils.py

Line 766 in 7de6e9b

def get_inputs_from_file(input_filename, ignore_comments=True):

I'm using T5 and it took me a while to find out why some of the lines in the input files were being discarded.

Incorrect tensorflow dependency requirements

In setup.py, the tensorflow requirement is >=1.15. However, in mesh_tensorflow.utils:

with tf.summary.create_file_writer(model_dir).as_default():

here tf.summary is a 2.0 module. so when using gin config

utils.tpu_estimator_model_fn.tpu_summaries = True

it throws an error with tensorflow 1.15:

Convolution layers in mesh tensorflow

I like to run the following Keras example deduced from here

# 1D CNN neural network
model_m = Sequential()
model_m.add(Reshape((TIME_PERIODS, num_sensors), input_shape=(input_shape,)))
model_m.add(Conv1D(100, 10, activation='relu', input_shape=(TIME_PERIODS, num_sensors)))
model_m.add(Conv1D(100, 10, activation='relu'))
model_m.add(MaxPooling1D(3))
model_m.add(Conv1D(160, 10, activation='relu'))
model_m.add(Conv1D(160, 10, activation='relu'))
model_m.add(GlobalAveragePooling1D())
model_m.add(Dropout(0.5))
model_m.add(Dense(num_classes, activation='softmax'))

on more than 1 machine (maybe having two nodes of CPUs each having multiple cores). Can I use mesh_tensorflow graphs for convolutional layers?
I like to apply both data and spatial parallelism on this example (maybe on a bigger data) on two identical machines. Would you please help me with this? I couldn't find many examples of using TFMesh.
Thanks

Question on params['context']

In the toy_model_tpu.py exampe, params['context'] is used to understand device assignments and host placements. Where is its value populated?

def model_fn(features, labels, mode, params):
...
if FLAGS.use_tpu:
ctx = params['context']

Distributed Mesh-TF

I want to run mnist.py example via mpirun to use devices from different nodes, ¿it is possible actually?

Layers and Session Support

I have an image classification model defined in Keras that I'm attempting to parallelize with MTF. However, it's not clear to me whether MTF support exists for keras.layers/tf.layers or if I'll need to recreate my model in MTF. Does MTF support keras.layers or tf.layers?
Does MTF exclusively use sessions for training or is there support for TF 2.0 eager execution?

If the answer is "no" to either of the above questions, is there any plan to add support in the future?

Support for MultiworkerMirroredStrategy?

Is it possible to incorporate MultiworkerMirroredStrategy into Mesh TF? I would like to run model + data parallelism on a supercomputer that has multiple GPUs on multiple nodes.

It seems that, by default, MultiworkerMirroredStrategy uses all possible GPUs and replicates the model across nodes, making model parallelism by Mesh TF difficult to run on multiple nodes.

README Questions

Hi there,

Thanks for creating this framework. I was trying to run the transformer example provided in the README.md and I realized some files are missing in the repository.
Could you please update those files?

For example, examples/transformer_standalone.py is missing. I looked at the commit history and still could not find it. Seems like it missed to push in.

python examples/transformer_standalone.py --tpu=$TPU --data_dir=$DATA_DIR --model_dir=$MODEL_DIR --gin_file=$MODEL --gin_file=$LAYOUT --gin_param="run.mode='train'"

Version:
Tensorflow : v1.13
mesh-tensorflow : head of the repo.

(Sorry I could not label as per contribution guidelines as the permissions are not available to do. )

Finetuning a `bfloat16` checkpoint with `float32`

I'm trying to fine-tuning a released T5 checkpoint in float32,
but I get the following error:

2020-09-03 16:33:42.380962: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Invalid argument: tensor_name =
/block_018/layer_002/layer_norm/scale; expected dtype float does not equal original dtype bfloat16

Is what I'm trying to do supported? These are the relevant parts I set:
--gin_param="get_variable_dtype.activation_dtype = 'float32'"
--gin_param="get_variable_dtype.master_dtype = 'float32'"
--gin_param="get_variable_dtype.slice_dtype = 'float32'"
--gin_file="gs://t5-data/pretrained_models/3B/operative_config.gin"

(We explicitly want float32)

Add mtf-nightly

We need a nightly package so that, for example, Tensor2Tensor's open source does not break when it runs Travis builds using the latest functionality here.

[Bug] brackets missing

Hello

I found a bug (brackets missing) about mesh tensorflow.
Please check the code.
https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/simd_mesh_impl.py
Line 882: return _default_value
It should be return _default_value() , right ?

Regards,
Andy

Support for multiple meshes and efficient communication between them

For example, we may want to load training data on a mesh of 64 cpu-machines and infeed them to a mesh of 512 tpu-cores. We do not need this for language tasks where the data is tiny, but it will be important for other tasks.

GPipe vs mesh?

Any comments about GPipe which was supposed to be open sourced by Google soon?

Looks like both GPipe and Mesh can do model/data parallelism.

mtf.pow gradient unstable

easy to get NaNs when x is close to 0

Examples of image-classification models

@NikiP: this is resolved, right?

_infer_binary_broadcast_shape should allow dimensions of size 1

currently broadcasting semantics aren't the same as regular tensorflow

Preventing leak in packed sequences

When packing is done here https://github.com/tensorflow/mesh/blob/6a812c8bb847e081e976533ed497c7c5016bb1ec/mesh_tensorflow/transformer/dataset.py
Each packed sequence has multiple examples ("segments"). I'm trying to figure out where do you prevent information to leak between these examples (e.g in attention).

I came across this

mesh/mesh_tensorflow/layers.py

Line 1813 in 4db643b

def attention_mask_same_segment(

But I see it is not used anywhere.

I can't seem to find where the information leak is prevented elsewhere. Can you clarify?

mixed precision support on GPUs

Hi,
To speed up training on V100 GPUs, I'd like to run mesh tf using mixed precision. While TensorFlow has an easy to use automatic mixed precision feature, it requires the optimizer to be a tf.train.Optimizer. This won't work on mesh tf's optimizers.

My question is: how can I use mixed precision on GPUs with mesh tf? If not supported yet, can you add some support for this? Thanks.

Performance on GPUs and multiple GPU support

We tried to run Mesh-TensorFlow to train T5 on GPUs following the instructions on T5's repository, but the training is extremely slow.

global_step/sec: 0.0467347
examples/sec: 0.186939

The training script successfully detected GPUs (showing "Adding visible gpu devices: ..."), but most of computation seems to run on a CPU.
By enabling log_device_placement, we can see many operators on both CPUs and GPUs.
ProfilerHook showed that it actually uses both, but I couldn't know if the behavior is expected or not.

I am wondering if Mesh-TensorFlow runs on GPUs in a practical sense.
I found an issue that mentioned a similar problem, but it was closed with no answer (#35).

I also failed to find reliable documents about training on multiple GPUs.
An existing issue #20 mentioned the same question, but no answer was given.

I appreciate if someone could give us any information regarding the above questions.

package published to pypi is broken?

cluster@master0:~/diseaseTools$ clear
cluster@master0:~/diseaseTools$ docker run -it python:3.6-jessie sh
# pip install mesh-tensorflow

Collecting mesh-tensorflow
  Downloading https://files.pythonhosted.org/packages/7b/9a/8f46d2bf6ecc8f622a4d3a7a9838c340bf0e6523a2bfc2a56a0ce870d2d8/mesh_tensorflow-0.0.1-py2.py3-none-any.whl
Collecting six (from mesh-tensorflow)
  Downloading https://files.pythonhosted.org/packages/67/4b/141a581104b1f6397bfa78ac9d43d8ad29a7ca43ea90a2d863fe3056e86a/six-1.11.0-py2.py3-none-any.whl
Collecting future (from mesh-tensorflow)
  Downloading https://files.pythonhosted.org/packages/00/2b/8d082ddfed935f3608cc61140df6dcbf0edea1bc3ab52fb6c29ae3e81e85/future-0.16.0.tar.gz (824kB)
    100% |████████████████████████████████| 829kB 21.6MB/s
Building wheels for collected packages: future
  Running setup.py bdist_wheel for future ... done
  Stored in directory: /root/.cache/pip/wheels/bf/c9/a3/c538d90ef17cf7823fa51fc701a7a7a910a80f6a405bf15b1a
Successfully built future
Installing collected packages: six, future, mesh-tensorflow
Successfully installed future-0.16.0 mesh-tensorflow-0.0.1 six-1.11.0
# # python
Python 3.6.6 (default, Oct 16 2018, 07:22:54)
[GCC 4.9.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mesh_tensorflow as mtf
>>> mtf.Graph()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'mesh_tensorflow' has no attribute 'Graph'
>>> mtf.__path__
['/usr/local/lib/python3.6/site-packages/mesh_tensorflow']
>>> quit()
# ls /usr/local/lib/python3.6/site-packages/mesh_tensorflow
__init__.py  __pycache__  import_test.py
#

as shown there is nothing inside the package.

When I do the equivalent with the dev install, pip install -e "git+https://github.com/tensorflow/mesh.git#egg=mesh-tensorflow" things work.

mtf.dropout is inverted

mtf.dropout(x, 0.1) means dropout with 90% probability.

tf.dropout(x, 0.1) means dropout with 10% probability.

For around a month, this has caused an agonizing bug with a GPT project that was ported to mesh tensorflow.

Is there a reason this is inverted? Is it too late to change? If not, you might want to issue some sort of warning, somewhere. Although mtf doesn't explicitly say that it's compatible with the tf api, it was somewhat shocking to end-users that it inverted a basic operation.

easy: implement mtf.expand_dims and mtf.squeeze

mtf.expand_dims can be implemented in terms of stack
mtf.squeeze can be implemented in terms of reduce_sum + some sanity checks

setting MTF_SEQUENCE_MODE while using PlacementMeshImpl results in error [low priority]

  File "/root/code/src/mesh/mesh_tensorflow/ops.py", line 656, in copy_masters_to_slices
    return mesh_impl.copy_master_to_slice_ops[-1]
AttributeError: 'PlacementMeshImpl' object has no attribute 'copy_master_to_slice_ops'

Perhaps sequence mode should be a flag to SimdMeshImpl instead of an environment variable

Regarding change in code that will convert layout to use both model and data parallelism

mesh_shape = [("processor_rows", 2), ("processor_cols", 2)]
layout_rules = [("batch", "processor_rows"), ("hidden", "processor_cols")]

The above code change is mentioned to be using both model and data parallelism. But we will get "mesh_size error". So we need to change the value for mesh_size also. It should be *mesh_size=len(mesh_shape)len(mesh_shape[0])

Split along layers

Is it possible to split it such that layers are split along some dimension of the mesh too?

For example:

Mesh shape: x:16,y:32
Layout: layers: x, hidden: y

If I had 32 layers, for example, I'd like the result to have 2 layers on the first slice of x, 2 layers on the next slice, etc. Ideally, something like GPipe where the forward and backward passes are pipelined so that 15/16ths of the devices don't sit idle would be preferable, but even being able to do the split naïvely would be useful.

The memory consumption does not include the backwards phase?

Dear authors,

I have read the code of auto-mesh. I found that when calculating the memory consumption given a schedule, it only included the consumption by the forward phase, but did not include the backward phase. This confused me, because the backpropagation also produces new data in memory.

Is there something I missed? or you did this way on purpose?

thanks for your answering,
Xiaoda

PROBLEM=./mesh_tensorflow/transformer/gin/problems/lm1b.gin

Line 13 in lm1b.gin "dataset.get_tfds_vocabulary.dataset_name = %dataset_name"
causes an error

There is no function named "get_tfds_vocabulary"
in /mesh_tensorflow/transformer/dataset.py

To fix the error the line can be replaced with
"vocabulary.get_tfds_vocabulary.dataset_name=%dataset_name"

ScalarMultiplyOperation accepts non-scalar tensors

not a big deal, but can lead to some accidental bugs. possibly a strict check would be better

Communication Between TPU Cores and Encoder->Reduce->Decoder Pattern

My understanding from the readme is that there is some flexibility in the TPU Mesh, but all operations must replicated on all TPU cores.

Will there ever be support for reducing an encoder split across 8 cores to run a decoder on a single core?

Effectively, the graph would take an input of (cores * bs, other shapes) and the output would simply be (1, other shapes). A example usage would be encoding a set of tweets and outputting a single summary.

Mesh overlapping is allowed ?

Hi, does MTF support overlapped meshes? For example, for a NN model with 6 layers, I want to parallelize three first layers with 1d mesh and three remaining with 2d mesh. These two meshes are overlapping on a 4 devices. If it's not allow in MTF, is there any solution to do that?

Operations necessary for spatial-partitioning (spatially-partitioned convolution, etc)

@NikiP: this is resolved right?

.travis.yml: The 'sudo' tag is now deprecated in Travis CI

Travis are now recommending removing the sudo tag.

"If you currently specify sudo: false in your .travis.yml, we recommend removing that configuration"

Does mesh tensorflow really support GPU training?

Hi, I have been trying to use mesh tensorflow on GPUs. I ran the mnist.py example to test the speed using GPU and CPU, by setting CUDA_VISIBLE_DEVICES variable (removed convolutional layers due to cuDNN version). However, using GPUs I obtained 80-100 global_steps/sec, and got similars values using CPU. I originally doubts the real support for GPUs from my attempts to train T5 model using GPUs. Do you have a working example that demonstrates the support for GPUs, particularly on the aspect of speed?

Set up website under tensorflow.org

Alternatively, we may not want to commit to more open-source platforms (this website, but also a mailing list). Instead, we may want to look into how Mesh TF could be merged into core TF. If that's the future, this TODO would only be useful for the short-term.

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process

When I was running the mnist.py, it occurred that in mnist_dataset.py, function download,
os.remove(zipped_filepath) couldn't work due to PermissionError.
Therefore, changing this code into this might works.
try: os.remove(zipped_filepath) except PermissionError: pass

SelfAttention & EncDecAttention in mesh transformer allow different values for query, key, value

This paper Low-Rank Bottleneck in Multi-head Attention Models suggests that we could fix the head size and keep hidden size unchanged. Could you support setting d_k, d_q, d_v independently instead of d_kv.

[t2t_vocabulary.py:32] Failed to load tensor2tensor

In mesh_tensorflow/transformer/t2t_vocabulary.py, it need to import subword_text_encoder_ops from tensor2tensor.data_generators.ops. However, there is no subword_text_encoder_ops.py in tensor2tensor repository. Tensor2tensor only contains subword_text_encoder_ops.cc file.

Running on multiple GPU

Hello I am trying to run the mnist python code in example section. When I tried to run them I observed they are using only 1 GPU, for all the three data parallelism, model parallelism , data and model parallelism. How can I make them to run on multiple GPU's.

Support for training with multiple TPUs

The mtf_transformer in Tensor2Tensor defaults to a mesh configuration for TPUs that uses 32 cores or 4 Cloud TPUs. I wasn't able to find documentation on utilizing more than a single Cloud TPU, but I tried it anyway with TPU_NAME=grpc://tpu0:8470,grpc://tpu1:8470 and got an error:

*** InternalError: Invalid system configuration: 1x1 host topology with 0 missing hosts, but 2 hosts in total.

I am using TF 1.11.0 and the meshTF in Tensor2Tensor 1.9.0, for compatibility with Cloud TPU.

mtf.reduce_mean crashes when reducing over no elements

culprit:
return reduce_sum(x, output_shape=output_shape) * (output_shape.size / x.shape.size)

Desired behavior:

when reduced dimension is size 0, should return a tensor of NaNs
more importantly, when reduced dimension is non-zero, should just return a new tensor of size zero

related bug: division by 0 shouldn't crash, should return +- inf
relevant line: return ScalarMultiplyOperation(x1, 1.0 / x2).outputs[0]

Running the transformer model with Tensor2Tensor using Mesh-Tensorflow(GPU implementation)

I am trying to run the transformer model with Tensor2tensor using mesh-tensorflow (GPU-implementation) but I am facing few errors.

steps to reproduce:

PROBLEM=translate_enfr_wmt32k
MODEL=mtf_transformer
HPARAMS=mtf_transformer_paper_tr_0_mesh_8
DATA_DIR=$HOME/t2t_data
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS
mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR
datagen:
t2t-datagen
--data_dir=$DATA_DIR
--tmp_dir=$TMP_DIR
--problem=$PROBLEM
train:
t2t-trainer
--data_dir=$DATA_DIR
--problem=$PROBLEM
--model=$MODEL
--hparams_set=$HPARAMS
--output_dir=$TRAIN_DIR
--train_steps=10

error
tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Multiple OpKernel registrations match NodeDef '{{node transformer/dropout/binary_op/parallel_0_1/Less}}': 'op: "Less" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_BFLOAT16 } } }' and 'op: "Less" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_BFLOAT16 } } }'
[[transformer/dropout/binary_op/parallel_0_1/Less]]

tensorflow / mesh Goto Github PK

mesh's Issues

Recommend Projects

Recommend Topics

Recommend Org