tensorflow / mesh
Mesh TensorFlow: Model Parallelism Made Easier
License: Apache License 2.0
I have made changes to mnist.py in the examples section, as documented on GitHub, to achieve data parallelism and model parallelism, and I have collected nvprof files for each. Something seems off: p2p interaction is happening in data parallelism but not in model parallelism. I went back, checked, and re-created the files, but the result looks the same. I ran this on 4 GPUs and am attaching the nvprof screenshots as well as the nvprof files.
link for model parallelism nvprof file:
https://drive.google.com/open?id=1omQ_neb7eUgmDRnYMmLUyKzD2inO4Kai
link for data parallelism nvprof file:
https://drive.google.com/open?id=1MHGdzexNIcV9L66x1VkUQ11DBcM5H_qv
I'm wondering whether tf2 is absolutely needed in mesh_tensorflow/utils.py. I'm trying to reproduce the provided Google Colab (https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/rl) with tensorflow 1.13.1 and T2T 1.13.1 (the recommended config), but I got stuck at line 26, import tensorflow.compat.v2 as tf2, because I'm using tensorflow 1.13.1.
Would it be possible to make mesh_tensorflow compatible with tensorflow v1?
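A guarded import along these lines is the kind of workaround I had in mind (just a sketch on my side, not the repo's code); whether the rest of utils.py then works on TF 1.x is a separate question:

# Hypothetical workaround: fall back gracefully when tensorflow.compat.v2
# is unavailable (e.g. on TF 1.13.1).
try:
    import tensorflow.compat.v2 as tf2
except ImportError:
    tf2 = None  # anything that needs tf2 would have to be guarded or disabled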
I would like to debug training/fine-tuning performance of mesh transformer on CPU/GPU.
Is it possible to capture a performance profile using TensorBoard?
If so, is there an example or tutorial that I can follow?
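Something like the following is what I have in mind (a minimal sketch assuming an Estimator-based training loop; the step interval and output directory are placeholders):

# Dump timeline traces during training so they can be inspected in
# TensorBoard or chrome://tracing.
import tensorflow.compat.v1 as tf

profiler_hook = tf.train.ProfilerHook(
    save_steps=100,             # write a trace every 100 steps
    output_dir="/tmp/profile",  # timeline-*.json files land here
    show_memory=True)

# estimator.train(input_fn=train_input_fn, hooks=[profiler_hook])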
Is it possible to use devices that are on different machines? For example, in Horovod I can specify the IP addresses of multiple machines and do data parallelism across them, but that requires me to have MPI set up on each machine. It's unclear to me whether this can be done with TF Mesh. Maybe with a tf.train.ClusterSpec and the parameter server model?
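For reference, this is the kind of setup I'm imagining (just a sketch; the addresses are placeholders, and whether a Mesh TF mesh can actually span these workers is exactly my question):

# Standard TF cluster definition for two worker machines and one parameter server.
import tensorflow.compat.v1 as tf

cluster = tf.train.ClusterSpec({
    "worker": ["10.0.0.1:2222", "10.0.0.2:2222"],
    "ps": ["10.0.0.3:2222"],
})
# Each machine would start its own server with its job name and task index.
server = tf.train.Server(cluster, job_name="worker", task_index=0)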
Thanks.
-Tony
Any plans to support Mesh Tensorflow?
Hello,
Both evaluation and prediction are currently not working with the aligned model ("Bert style").
I have fixed this issue by adding a new elif branch in "transformer/utils.py":
elif mode == tf.estimator.ModeKeys.PREDICT:
    inputs = mtf_features["inputs"]
    if predict_fn:
        mtf_samples = predict_fn(
            model=transformer_model,
            features=mtf_features,
            variable_dtype=get_variable_dtype())
    elif isinstance(transformer_model, transformer.Unitransformer) and model_type == 'aligned':
        # pad so that there is enough room for the targets
        inputs = mtf.pad(
            inputs, [0, sequence_length["targets"]], length_dim.name)
        logits, _ = transformer_model.call_simple(
            inputs=inputs, variable_dtype=get_variable_dtype(),
            compute_loss=False,
            mode=tf.estimator.ModeKeys.PREDICT)
        label_c_dim = mtf.Dimension('vocab', 256)
        mtf_samples = mtf.argmax(logits, label_c_dim)
The signature of call_simple in "transformer/transformer.py" also needs to be modified:
def call_simple(self,
                inputs=None,
                targets=None,
                compute_loss=False,
                mode=tf.estimator.ModeKeys.TRAIN,
                variable_dtype=mtf.VariableDType(tf.float32),
                sequence_id=None,
                subsequence_id=None,
                position=None,
                encoder_output=None,
                encoder_sequence_id=None,
                encoder_inputs=None,
                shared_params=None,
                layer_outputs=None,
                encoder_layer_outputs=None,
                num_microbatches=1):
The only thing that I am currently defining manually is "label_c_dim".
@adarob @craffel @nshazeer It would be great if you could merge my code or define a better solution that finds the vocab size for "label_c_dim" automatically.
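One possibility (a sketch I have not verified against the codebase) would be to recover the vocab dimension from the logits computed above instead of hard-coding it:

# Hypothetical replacement for mtf.Dimension('vocab', 256): reuse the vocab
# dimension that is already present in the logits shape.
label_c_dim = logits.shape.get_dim_by_name("vocab")
mtf_samples = mtf.argmax(logits, label_c_dim)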
Hi,
I am using the Google T5 library, which is based on Mesh TensorFlow, to train a non-autoregressive model like BERT.
Training runs without a problem, but both prediction and evaluation fail because the Unitransformer model only supports autoregressive decoding:
ERROR:tensorflow:Error recorded from prediction_loop: must be autoregressive
In call to configurable 'sample_autoregressive' (<function Unitransformer.sample_autoregressive at 0x7f2a3276a620>)
INFO:tensorflow:prediction_loop marked as finished
WARNING:tensorflow:Reraising captured error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-19-86647c2a14e0> in <module>()
2 model.eval(
3 mixture_or_task_name="ss3",
----> 4 checkpoint_steps="all"
5 )
31 frames
/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/transformer/transformer.py in sample_autoregressive(self, partial_sequences, stop_at_token, max_steps, temperature, variable_dtype, encoder_output, encoder_sequence_id, encoder_inputs, shared_params, has_partial_sequences, encoder_layer_outputs, never_end, remove_partial_sequences, sampling_keep_top_k)
778 """
779 if not self.autoregressive:
--> 780 raise ValueError("must be autoregressive")
781
782 inputs = partial_sequences
ValueError: must be autoregressive
In call to configurable 'sample_autoregressive' (<function Unitransformer.sample_autoregressive at 0x7f2a3276a620>)
In call to configurable 'decode' (<function decode at 0x7f2a3270eb70>)
This is the gin file that I have used:
import mesh_tensorflow.optimize
import mesh_tensorflow.transformer.learning_rate_schedules
import mesh_tensorflow.transformer.transformer_layers
import t5.models.mesh_transformer
import t5.data.sentencepiece_vocabulary
# Macros:
# ==============================================================================
d_ff = 3072
d_kv = 64
d_model = 768
dropout_rate = 0.1
MIXTURE_NAME = 'ss3'
num_heads = 12
num_layers = 12
model_parallelism = 1
split= "train"
tokens_per_batch = 65536
# Parameters for AdafactorOptimizer:
# ==============================================================================
AdafactorOptimizer.beta1 = 0.0
AdafactorOptimizer.clipping_threshold = 1.0
AdafactorOptimizer.decay_rate = None
AdafactorOptimizer.epsilon1 = 1e-30
AdafactorOptimizer.epsilon2 = 0.001
AdafactorOptimizer.factored = True
AdafactorOptimizer.min_dim_size_to_factor = 128
AdafactorOptimizer.multiply_by_parameter_scale = True
# Parameters for denoise:
# ==============================================================================
denoise.inputs_fn = @preprocessors.noise_span_to_unique_sentinel
denoise.noise_density = 0.15
denoise.noise_mask_fn = @preprocessors.iid_noise_mask
denoise.targets_fn = @preprocessors.nonnoise_span_to_unique_sentinel
# Parameters for DenseReluDense:
# ==============================================================================
DenseReluDense.dropout_rate = %dropout_rate
DenseReluDense.hidden_size = %d_ff
# Parameters for drop_noise_tokens:
# ==============================================================================
# None.
# Parameters for drop_nonnoise_tokens:
# ==============================================================================
# None.
# Parameters for get_dataset:
# ==============================================================================
# Parameters for get_sentencepiece_model_path:
# ==============================================================================
get_sentencepiece_model_path.mixture_or_task_name = %MIXTURE_NAME
# Parameters for get_variable_dtype:
# ==============================================================================
get_variable_dtype.activation_dtype = 'bfloat16'
# Parameters for iid_noise_mask:
# ==============================================================================
# None.
# Parameters for LayerStack:
# ==============================================================================
LayerStack.dropout_rate = %dropout_rate
LayerStack.norm_epsilon = 1e-06
# Parameters for learning_rate_schedule_noam:
# ==============================================================================
learning_rate_schedule_noam.linear_decay_fraction = 0.0
learning_rate_schedule_noam.multiplier = 1.0
learning_rate_schedule_noam.offset = 0
learning_rate_schedule_noam.warmup_steps = 10000
# Parameters for make_layer_stack:
# ==============================================================================
make_layer_stack.block_scope = True
make_layer_stack.layers = \
[@mesh_tensorflow.transformer.transformer_layers.SelfAttention,
@mesh_tensorflow.transformer.transformer_layers.DenseReluDense]
make_layer_stack.num_layers = %num_layers
# Parameters for mesh_train_dataset_fn:
# ==============================================================================
mesh_train_dataset_fn.mixture_or_task_name = %MIXTURE_NAME
# Parameters for noise_span_to_unique_sentinel:
# ==============================================================================
# None.
# Parameters for nonnoise_span_to_unique_sentinel:
# ==============================================================================
# None.
# Parameters for pack_dataset:
# ==============================================================================
# Parameters for pack_or_pad:
# ==============================================================================
# None.
# Parameters for rate_num_examples:
# ==============================================================================
rate_num_examples.maximum = 524288
rate_num_examples.scale = 1.0
rate_num_examples.temperature = 1.0
# Parameters for reduce_concat_tokens:
# ==============================================================================
reduce_concat_tokens.batch_size = 128
reduce_concat_tokens.feature_key = 'targets'
# Parameters for run:
# ==============================================================================
run.autostack = True
run.batch_size = ('tokens_per_batch', %tokens_per_batch)
run.dataset_split = %split
run.ensemble_inputs = None
run.eval_checkpoint_step = None
run.eval_dataset_fn = None
run.eval_summary_dir = None
run.export_path = ''
run.iterations_per_loop = 100
run.keep_checkpoint_max = None
run.layout_rules = \
'ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch'
run.learning_rate_schedule = @learning_rate_schedules.learning_rate_schedule_noam
run.mesh_shape = @mesh_tensorflow.transformer.utils.tpu_mesh_shape()
run.mode = 'train'
run.model_type = 'aligned'
run.optimizer = @optimize.AdafactorOptimizer
run.perplexity_eval_steps = 10
run.predict_fn = None
run.save_checkpoints_steps = 5000
run.sequence_length = {'inputs': 512, 'targets': 512}
run.train_dataset_fn = @t5.models.mesh_transformer.mesh_train_dataset_fn
run.train_steps = 786432
run.variable_filter = None
run.vocabulary = @t5.data.sentencepiece_vocabulary.SentencePieceVocabulary()
# Parameters for select_random_chunk:
# ==============================================================================
select_random_chunk.feature_key = 'targets'
select_random_chunk.max_length = 65536
# Parameters for SelfAttention:
# ==============================================================================
SelfAttention.attention_kwargs = None
SelfAttention.dropout_rate = %dropout_rate
SelfAttention.key_value_size = %d_kv
SelfAttention.num_heads = %num_heads
SelfAttention.num_memory_heads = 0
SelfAttention.relative_attention_num_buckets = 32
SelfAttention.relative_attention_type = 'bias_shared'
SelfAttention.shared_kv = False
# Parameters for SentencePieceVocabulary:
# ==============================================================================
SentencePieceVocabulary.extra_ids = 100
SentencePieceVocabulary.sentencepiece_model_file = \
@t5.models.mesh_transformer.get_sentencepiece_model_path()
# Parameters for serialize_num_microbatches:
# ==============================================================================
serialize_num_microbatches.tokens_per_microbatch_per_replica = 2048
# Parameters for split_tokens:
# ==============================================================================
split_tokens.feature_key = 'targets'
split_tokens.min_tokens_per_segment = None
# Parameters for split_tokens_to_inputs_length:
# ==============================================================================
# None.
# Parameters for tpu_estimator_model_fn:
# ==============================================================================
tpu_estimator_model_fn.outer_batch_size = 1
tpu_estimator_model_fn.tpu_summaries = False
# Parameters for tpu_mesh_shape:
# ==============================================================================
tpu_mesh_shape.ensemble_parallelism = None
tpu_mesh_shape.model_parallelism = %model_parallelism
tpu_mesh_shape.tpu_topology = %tpu_topology
# Parameters for Unitransformer:
# ==============================================================================
Unitransformer.d_model = %d_model
Unitransformer.ensemble = None
#Unitransformer.input_full_attention = True
Unitransformer.label_smoothing = 0.0
Unitransformer.loss_denominator = None
Unitransformer.loss_fn = None
Unitransformer.loss_on_targets_only = False
Unitransformer.max_length = 512
Unitransformer.name = 'transformer'
Unitransformer.positional_embedding = True
Unitransformer.shared_embedding_and_softmax_weights = True
Unitransformer.vocab_divisor = 128
Unitransformer.z_loss = 0.0001
# Parameters for unsupervised:
# ==============================================================================
unsupervised.preprocessors = \
[@preprocessors.select_random_chunk,
@preprocessors.reduce_concat_tokens,
@preprocessors.split_tokens_to_inputs_length,
@preprocessors.denoise]
Is there a solution for this, or does the non-autoregressive model currently not work for eval and predict?
Run the Transformer model (no Tensor2Tensor dependencies)
The file "examples/transformer_standalone.py" does not exist anymore.
There is a pull request from 9 months ago in which transformer_standalone.py was replaced by mesh_tensorflow.transformer.main, but it was never merged into the main branch.
I am running an experiment that requires tensorflow==1.13.1 (or tensorflow-gpu==1.13.1) and tensor2tensor==1.11.0.
With tensor2tensor==1.11.0 and mesh-tensorflow==0.1.1, importing mesh_tensorflow in turn imports tensorflow.python.tpu.ops:
import mesh_tensorflow as mtf
  File "/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/__init__.py", line 26, in <module>
    from mesh_tensorflow import simd_mesh_impl
  File "/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/simd_mesh_impl.py", line 32, in <module>
    from tensorflow.python.tpu.ops import tpu_ops  # pylint: disable=g-direct-tensorflow-import
ModuleNotFoundError: No module named 'tensorflow.python.tpu'
In my version of TF (1.13.1) there is no tensorflow.python.tpu. Is there any way to fix this error? Which version of mesh_tensorflow should I downgrade to?
Could you please set the default value of ignore_comments to False?
mesh/mesh_tensorflow/transformer/utils.py, line 766 in 7de6e9b
I'm using T5, and it took me a while to find out why some of the lines in the input files were being discarded.
I would like to run the following Keras example, deduced from here:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Reshape, Conv1D, MaxPooling1D,
                                     GlobalAveragePooling1D, Dropout, Dense)

# 1D CNN neural network (TIME_PERIODS, num_sensors, input_shape and
# num_classes are defined earlier in the original tutorial)
model_m = Sequential()
model_m.add(Reshape((TIME_PERIODS, num_sensors), input_shape=(input_shape,)))
model_m.add(Conv1D(100, 10, activation='relu', input_shape=(TIME_PERIODS, num_sensors)))
model_m.add(Conv1D(100, 10, activation='relu'))
model_m.add(MaxPooling1D(3))
model_m.add(Conv1D(160, 10, activation='relu'))
model_m.add(Conv1D(160, 10, activation='relu'))
model_m.add(GlobalAveragePooling1D())
model_m.add(Dropout(0.5))
model_m.add(Dense(num_classes, activation='softmax'))
I would like to run this on more than one machine (say two CPU nodes, each with multiple cores). Can I use mesh_tensorflow graphs for convolutional layers?
I would like to apply both data and spatial parallelism to this example (possibly on bigger data) on two identical machines. Would you please help me with this? I couldn't find many examples of using TF Mesh.
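To make the question concrete, this is the kind of single-machine setup I understand from examples/mnist.py and the README (device strings and mesh sizes here are placeholders); what I don't know is how to extend it across two machines:

import mesh_tensorflow as mtf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

# 2x2 mesh: split the batch over one axis and a hidden dimension over the other.
mesh_shape = [("rows", 2), ("cols", 2)]
layout_rules = [("batch", "rows"), ("hidden", "cols")]
devices = ["/cpu:0"] * 4  # single machine; spanning two machines is the open question

mesh_impl = mtf.placement_mesh_impl.PlacementMeshImpl(
    mesh_shape, layout_rules, devices)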
Thanks
In the toy_model_tpu.py example, params['context'] is used to determine device assignments and host placements. Where is its value populated?
def model_fn(features, labels, mode, params):
  ...
  if FLAGS.use_tpu:
    ctx = params['context']
I want to run the mnist.py example via mpirun to use devices from different nodes. Is that actually possible?
I have an image classification model defined in Keras that I'm attempting to parallelize with MTF. However, it's not clear to me whether MTF support exists for keras.layers/tf.layers or if I'll need to recreate my model in MTF. Does MTF support keras.layers or tf.layers?
Does MTF exclusively use sessions for training or is there support for TF 2.0 eager execution?
If the answer is "no" to either of the above questions, is there any plan to add support in the future?
Is it possible to incorporate MultiWorkerMirroredStrategy into Mesh TF? I would like to run model + data parallelism on a supercomputer that has multiple GPUs on multiple nodes.
It seems that, by default, MultiWorkerMirroredStrategy uses all available GPUs and replicates the model across nodes, which makes model parallelism with Mesh TF difficult to run on multiple nodes.
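For context, this is the standard way the strategy is instantiated (plain TF distribution API, shown only to frame the question):

import tensorflow as tf

# Standard TF multi-worker data parallelism; the question is whether a Mesh TF
# mesh can be combined with, or restricted alongside, this strategy.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()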
Hi there,
Thanks for creating this framework. I was trying to run the transformer example provided in the README.md and realized some files are missing from the repository.
Could you please update those files?
For example, examples/transformer_standalone.py is missing. I looked at the commit history and still could not find it; it seems it was never pushed.
python examples/transformer_standalone.py --tpu=$TPU --data_dir=$DATA_DIR --model_dir=$MODEL_DIR --gin_file=$MODEL --gin_file=$LAYOUT --gin_param="run.mode='train'"
Version:
Tensorflow : v1.13
mesh-tensorflow : head of the repo.
(Sorry, I could not add labels as per the contribution guidelines because I don't have the permissions to do so.)
I'm trying to fine-tune a released T5 checkpoint in float32,
but I get the following error:
2020-09-03 16:33:42.380962: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Invalid argument: tensor_name = /block_018/layer_002/layer_norm/scale; expected dtype float does not equal original dtype bfloat16
Is what I'm trying to do supported? These are the relevant parts I set:
--gin_param="get_variable_dtype.activation_dtype = 'float32'"
--gin_param="get_variable_dtype.master_dtype = 'float32'"
--gin_param="get_variable_dtype.slice_dtype = 'float32'"
--gin_file="gs://t5-data/pretrained_models/3B/operative_config.gin"
(We explicitly want float32)
We need a nightly package so that, for example, Tensor2Tensor's open source does not break when its Travis builds use the latest functionality here.
Hello,
I found a bug (missing parentheses) in Mesh TensorFlow. Please check the code:
https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/simd_mesh_impl.py
Line 882: return _default_value
It should be return _default_value(), right?
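A tiny generic illustration of the difference (not the repo's code):

_default_value = lambda: 0.0
broken = _default_value    # hands back the function object itself
fixed = _default_value()   # produces the intended value, 0.0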
Regards,
Andy
For example, we may want to load training data on a mesh of 64 CPU machines and infeed it to a mesh of 512 TPU cores. We do not need this for language tasks, where the data is tiny, but it will be important for other tasks.
Any comments about GPipe which was supposed to be open sourced by Google soon?
Looks like both GPipe and Mesh can do model/data parallelism.
easy to get NaNs when x is close to 0
@NikiP: this is resolved, right?
Currently, broadcasting semantics aren't the same as in regular TensorFlow.
When packing is done here: https://github.com/tensorflow/mesh/blob/6a812c8bb847e081e976533ed497c7c5016bb1ec/mesh_tensorflow/transformer/dataset.py
each packed sequence contains multiple examples ("segments"). I'm trying to figure out where you prevent information from leaking between these examples (e.g. in attention).
I came across mesh/mesh_tensorflow/layers.py, line 1813 in 4db643b, but I can't seem to find where the information leak is prevented elsewhere. Can you clarify?
Hi,
To speed up training on V100 GPUs, I'd like to run Mesh TF using mixed precision. While TensorFlow has an easy-to-use automatic mixed precision feature, it requires the optimizer to be a tf.train.Optimizer, which won't work with Mesh TF's optimizers.
My question is: how can I use mixed precision on GPUs with Mesh TF? If it is not supported yet, could you add support for this? Thanks.
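For reference, this is the standard TF mechanism I'm referring to (sketched assuming TF 1.14/1.15, where it lives under tf.train.experimental; the point of the issue is that Mesh TF's optimizers are not tf.train.Optimizer subclasses, so this wrapper doesn't apply):

import tensorflow as tf

# Automatic mixed precision graph rewrite wraps a tf.train.Optimizer.
opt = tf.train.AdamOptimizer(learning_rate=1e-3)
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)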
We tried to run Mesh-TensorFlow to train T5 on GPUs following the instructions on T5's repository, but the training is extremely slow.
global_step/sec: 0.0467347
examples/sec: 0.186939
The training script successfully detected GPUs (showing "Adding visible gpu devices: ..."), but most of the computation seems to run on a CPU.
By enabling log_device_placement, we can see many operators on both CPUs and GPUs, and ProfilerHook showed that both are actually used, but I couldn't tell whether this behavior is expected.
I am wondering whether Mesh-TensorFlow runs on GPUs in a practical sense.
I found an issue that mentioned a similar problem, but it was closed with no answer (#35). I also failed to find reliable documentation about training on multiple GPUs. An existing issue (#20) mentioned the same question, but no answer was given.
I would appreciate it if someone could give us any information regarding the above questions.
cluster@master0:~/diseaseTools$ docker run -it python:3.6-jessie sh
# pip install mesh-tensorflow
Collecting mesh-tensorflow
Downloading https://files.pythonhosted.org/packages/7b/9a/8f46d2bf6ecc8f622a4d3a7a9838c340bf0e6523a2bfc2a56a0ce870d2d8/mesh_tensorflow-0.0.1-py2.py3-none-any.whl
Collecting six (from mesh-tensorflow)
Downloading https://files.pythonhosted.org/packages/67/4b/141a581104b1f6397bfa78ac9d43d8ad29a7ca43ea90a2d863fe3056e86a/six-1.11.0-py2.py3-none-any.whl
Collecting future (from mesh-tensorflow)
Downloading https://files.pythonhosted.org/packages/00/2b/8d082ddfed935f3608cc61140df6dcbf0edea1bc3ab52fb6c29ae3e81e85/future-0.16.0.tar.gz (824kB)
100% |████████████████████████████████| 829kB 21.6MB/s
Building wheels for collected packages: future
Running setup.py bdist_wheel for future ... done
Stored in directory: /root/.cache/pip/wheels/bf/c9/a3/c538d90ef17cf7823fa51fc701a7a7a910a80f6a405bf15b1a
Successfully built future
Installing collected packages: six, future, mesh-tensorflow
Successfully installed future-0.16.0 mesh-tensorflow-0.0.1 six-1.11.0
# python
Python 3.6.6 (default, Oct 16 2018, 07:22:54)
[GCC 4.9.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mesh_tensorflow as mtf
>>> mtf.Graph()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'mesh_tensorflow' has no attribute 'Graph'
>>> mtf.__path__
['/usr/local/lib/python3.6/site-packages/mesh_tensorflow']
>>> quit()
# ls /usr/local/lib/python3.6/site-packages/mesh_tensorflow
__init__.py __pycache__ import_test.py
#
As shown, there is nothing inside the package. When I instead do the dev install, pip install -e "git+https://github.com/tensorflow/mesh.git#egg=mesh-tensorflow", things work.
mtf.dropout(x, 0.1) means dropout with 90% probability, whereas tf.dropout(x, 0.1) means dropout with 10% probability.
For around a month, this has caused an agonizing bug with a GPT project that was ported to mesh tensorflow.
Is there a reason this is inverted? Is it too late to change? If not, you might want to issue some sort of warning, somewhere. Although mtf doesn't explicitly say that it's compatible with the tf api, it was somewhat shocking to end-users that it inverted a basic operation.
mtf.expand_dims can be implemented in terms of stack
mtf.squeeze can be implemented in terms of reduce_sum + some sanity checks
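Rough sketches of what that could look like (untested; the helper names are made up):

import mesh_tensorflow as mtf

def expand_dims_via_stack(x, new_dim_name, axis=0):
  # Stacking a single tensor introduces a new dimension of size 1.
  return mtf.stack([x], new_dim_name, axis=axis)

def squeeze_via_reduce_sum(x, dim_name):
  # Summing over a size-1 dimension removes it without changing any values;
  # the assertion is the "sanity check" mentioned above.
  dim = x.shape.get_dim_by_name(dim_name)
  assert dim.size == 1, "can only squeeze a dimension of size 1"
  return mtf.reduce_sum(x, reduced_dim=dim)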
File "/root/code/src/mesh/mesh_tensorflow/ops.py", line 656, in copy_masters_to_slices
return mesh_impl.copy_master_to_slice_ops[-1]
AttributeError: 'PlacementMeshImpl' object has no attribute 'copy_master_to_slice_ops'
Perhaps sequence mode should be a flag to SimdMeshImpl instead of an environment variable
mesh_shape = [("processor_rows", 2), ("processor_cols", 2)]
layout_rules = [("batch", "processor_rows"), ("hidden", "processor_cols")]
The above code change is described as using both model and data parallelism, but running it produces a mesh_size error, so the value of mesh_size must be changed as well. It should be mesh_size = len(mesh_shape) * len(mesh_shape[0]).
Is it possible to split it such that layers are split along some dimension of the mesh too?
For example:
Mesh shape: x:16,y:32
Layout: layers: x, hidden: y
If I had 32 layers, for example, I'd like the result to have 2 layers on the first slice of x, 2 layers on the next slice, etc. Ideally, something like GPipe where the forward and backward passes are pipelined so that 15/16ths of the devices don't sit idle would be preferable, but even being able to do the split naïvely would be useful.
Dear authors,
I have read the code of auto-mesh. I found that when calculating the memory consumption for a given schedule, it only includes the consumption of the forward phase, not the backward phase. This confuses me, because backpropagation also produces new data in memory.
Is there something I missed, or did you do it this way on purpose?
Thanks for your answer,
Xiaoda
Line 13 in lm1b.gin, "dataset.get_tfds_vocabulary.dataset_name = %dataset_name", causes an error: there is no function named "get_tfds_vocabulary" in /mesh_tensorflow/transformer/dataset.py.
To fix the error, the line can be replaced with "vocabulary.get_tfds_vocabulary.dataset_name = %dataset_name".
Not a big deal, but it can lead to accidental bugs; possibly a strict check would be better.
My understanding from the readme is that there is some flexibility in the TPU mesh, but all operations must be replicated on all TPU cores.
Will there ever be support for reducing an encoder split across 8 cores so that a decoder can run on a single core?
Effectively, the graph would take an input of (cores * bs, other shapes) and the output would simply be (1, other shapes). An example usage would be encoding a set of tweets and outputting a single summary.
Hi, does MTF support overlapping meshes? For example, for a NN model with 6 layers, I want to parallelize the first three layers with a 1D mesh and the remaining three with a 2D mesh, where the two meshes overlap on 4 devices. If this is not allowed in MTF, is there any other way to do it?
@NikiP: this is resolved, right?
Travis is now recommending removing the sudo tag:
"If you currently specify sudo: false in your .travis.yml, we recommend removing that configuration"
Hi, I have been trying to use Mesh TensorFlow on GPUs. I ran the mnist.py example to test the speed on GPU and CPU by setting the CUDA_VISIBLE_DEVICES variable (I removed the convolutional layers due to a cuDNN version issue). However, using GPUs I obtained 80-100 global_steps/sec and got similar values using the CPU. I had already come to doubt the real GPU support from my attempts to train a T5 model on GPUs. Do you have a working example that demonstrates GPU support, particularly in terms of speed?
Alternatively, we may not want to commit to more open-source platforms (this website, but also a mailing list). Instead, we may want to look into how Mesh TF could be merged into core TF. If that's the future, this TODO would only be useful for the short-term.
When I was running mnist.py, it turned out that in mnist_dataset.py, in the download function, os.remove(zipped_filepath) failed with a PermissionError. Therefore, changing the code to the following might work:
try:
  os.remove(zipped_filepath)
except PermissionError:
  pass
The paper "Low-Rank Bottleneck in Multi-head Attention Models" suggests that we could fix the head size while keeping the hidden size unchanged. Could you support setting d_k, d_q, and d_v independently instead of a single d_kv?
In mesh_tensorflow/transformer/t2t_vocabulary.py, subword_text_encoder_ops needs to be imported from tensor2tensor.data_generators.ops. However, there is no subword_text_encoder_ops.py in the tensor2tensor repository; tensor2tensor only contains the subword_text_encoder_ops.cc file.
Hello, I am trying to run the mnist python code in the examples section. When I ran it, I observed that only 1 GPU is used in all three cases: data parallelism, model parallelism, and combined data and model parallelism. How can I make it run on multiple GPUs?
The mtf_transformer in Tensor2Tensor defaults to a mesh configuration for TPUs that uses 32 cores, i.e. 4 Cloud TPUs. I wasn't able to find documentation on utilizing more than a single Cloud TPU, but I tried it anyway with TPU_NAME=grpc://tpu0:8470,grpc://tpu1:8470 and got an error:
*** InternalError: Invalid system configuration: 1x1 host topology with 0 missing hosts, but 2 hosts in total.
I am using TF 1.11.0 and the meshTF in Tensor2Tensor 1.9.0, for compatibility with Cloud TPU.
culprit:
return reduce_sum(x, output_shape=output_shape) * (output_shape.size / x.shape.size)
Desired behavior:
related bug: division by 0 shouldn't crash, should return +- inf
relevant line: return ScalarMultiplyOperation(x1, 1.0 / x2).outputs[0]
I am trying to run the transformer model with Tensor2Tensor using Mesh TensorFlow (GPU implementation), but I am facing a few errors.
steps to reproduce:
PROBLEM=translate_enfr_wmt32k
MODEL=mtf_transformer
HPARAMS=mtf_transformer_paper_tr_0_mesh_8
DATA_DIR=$HOME/t2t_data
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS
mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR
datagen:
t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --problem=$PROBLEM
train:
t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --train_steps=10
error
tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Multiple OpKernel registrations match NodeDef '{{node transformer/dropout/binary_op/parallel_0_1/Less}}': 'op: "Less" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_BFLOAT16 } } }' and 'op: "Less" device_type: "CPU" constraint { name: "T" allowed_values { list { type: DT_BFLOAT16 } } }'
[[transformer/dropout/binary_op/parallel_0_1/Less]]