snuspl / parallax
A Tool for Automatic Parallelization of Deep Learning Training in Distributed Multi-GPU Environments.
License: Apache License 2.0
Step numbers of run_meta_XX files should be changed. For example, if you run the 7th iteration (global step 7) of the nmt example code, the profiled data from that iteration is saved in run_meta_10. The global step value and the number in the saved file name should be consistent.
For now, run_metadata is saved with the following code: file.write('%s' % metadata). However, this is incompatible with the metadata loading API, metadata.MergeFromString(). Thus, metadata should be written to the file with file.write(metadata.SerializeToString()) if you want to load the file easily.
run_metadata is saved as a human-readable file, and this format is incompatible with TF's API for loading the metadata file. run_metadata should be saved in a format that is compatible with TF's metadata loading API.
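As a sketch of what the fix could look like, the save routine could write the binary protobuf encoding and name the file after the global step; save_run_metadata and the naming scheme here are illustrative assumptions, not Parallax's actual code.

```python
import os

def save_run_metadata(metadata, profile_dir, global_step):
    # Name the file after the global step so the step value and the saved
    # file name stay consistent (hypothetical naming scheme).
    path = os.path.join(profile_dir, 'run_meta_%d' % global_step)
    with open(path, 'wb') as f:
        # SerializeToString() emits the binary protobuf encoding, which can
        # be loaded back with RunMetadata().MergeFromString(); writing
        # '%s' % metadata would emit the human-readable text format, which
        # MergeFromString() cannot parse.
        f.write(metadata.SerializeToString())
```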
Since the current in-graph replication logic makes both the system and the user API complicated, we can explore whether between-graph replication for PS would make sense in terms of performance.
In-graph replication with replica operation names in a single worker
Compare the performance and choose between-graph replication if it does not slow down training
Horovod version update
The current Horovod version is 0.11.2.
Update the version to 0.16.3.
According to NMT, the embedding parameter goes to a worker if the number of embedding partitions is one. This happens in Parallax, too.
PS mode fails when the number of embedding partitions is one.
It should work regardless of the number of embedding partitions.
To reproduce, run the NMT example with "num_embedding_partitions=1".
To implement RNN models, a user can represent the RNN hidden state with a Variable to correctly pass it between multiple session runs.
However, Parallax currently just places it on the PS, and it is not replicated to the workers even when the developer specifies that the variable is a 'local variable' that should not be shared. This leads to incomplete convergence of the model.
Furthermore, the fact that it is the user's responsibility to specify which variables should be replicated (GLOBAL_VARIABLE) and which should not (LOCAL_VARIABLE) also seems to be a problem, since Parallax aims to support 'automatic parallelization'.
Users have to distinguish between local and global variables by themselves.
Parallax should automatically detect local variables, or at least help the user with a warning.
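A minimal sketch of one possible placement rule, assuming TF-style variable collections; the helper and the collection-key names below are illustrative stand-ins, not Parallax's actual logic.

```python
# Collection keys mirroring TF's GraphKeys constants (assumed names).
GLOBAL_VARIABLES = 'variables'
LOCAL_VARIABLES = 'local_variables'

def placement_for(var_collections):
    """Place shared (global) variables on the PS; keep local variables,
    such as an RNN hidden state carried across session runs, as a
    per-worker replica so every worker updates its own copy."""
    if LOCAL_VARIABLES in var_collections:
        return 'replicate'        # per-worker copy, never shared
    return 'ps'                   # sharded on the parameter servers
```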
Fix the broken dependency between assigning global gradients and reading them in hybrid communication.
The first worker has no dependency between assigning global gradients and reading them, so it reads the gradients from the previous iteration.
The first worker should read global gradients only after the updates for the current iteration.
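A toy illustration of the ordering bug, using stand-in names rather than Parallax's actual ops: if the read is not forced to run after the current iteration's assign, the reader observes the previous iteration's value.

```python
class GlobalGradient:
    """Stand-in for a gradient variable stored on the PS."""
    def __init__(self):
        self.value = 'grad_iter_0'
    def assign(self, v):
        self.value = v
    def read(self):
        return self.value

g = GlobalGradient()
# Buggy order on the first worker: the read runs before the current
# iteration's assign, so the previous iteration's gradient is observed.
stale = g.read()
g.assign('grad_iter_1')
# Fixed order: assign for the current iteration, then read.
g.assign('grad_iter_2')
fresh = g.read()
```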
Add a profiling option to ParallaxConfig.
Save RunMetadata to the profile_dir when the global_step is one of profile_steps.
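A minimal sketch of such an option, assuming a small config object holds the directory and the step list; ProfileConfig and its field names are assumptions for illustration, not the actual ParallaxConfig API.

```python
class ProfileConfig:
    """Hypothetical profiling option attached to ParallaxConfig."""
    def __init__(self, profile_dir=None, profile_steps=()):
        self.profile_dir = profile_dir
        self.profile_steps = set(profile_steps)

    def should_profile(self, global_step):
        # Capture RunMetadata only on the explicitly requested steps.
        return (self.profile_dir is not None
                and global_step in self.profile_steps)
```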
Hi,
Good day. I have tried to run the Simple example in your code, and hit an issue while running it.
CUDA Toolkit 9.0
cuDNN SDK v7
OpenMPI 3.0.0
NCCL 2.1.15 (for CUDA 9.0)
Below is the output from running the simple example:
bash: line 0: export: `-c': not a valid identifier
bash: line 0: export: `-u': not a valid identifier
[[43684,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: node03
Another transport will be used instead, although this may result in
lower performance.
2020-05-14 05:49:50.857407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1412] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:03:00.0
totalMemory: 7.93GiB freeMemory: 7.82GiB
2020-05-14 05:49:50.857465: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1491] Adding visible gpu devices: 0
2020-05-14 05:49:51.340142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-14 05:49:51.340220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:978] 0
2020-05-14 05:49:51.340235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0: N
2020-05-14 05:49:51.340349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1104] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7535 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
2020-05-14 05:49:51.602832: E tensorflow/core/framework/op_segment.cc:53] Create kernel failed: Not found: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]. Registered: device='CPU'
2020-05-14 05:49:51.602953: E tensorflow/core/common_runtime/executor.cc:630] Executor failed to create kernel. Not found: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]. Registered: device='CPU'
[[{{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Traceback (most recent call last):
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1292, in _do_call
return fn(*args)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1277, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1367, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]. Registered: device='CPU'
[[{{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 137, in <module>
tf.app.run()
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 133, in main
parallax.parallel_run(single_gpu_graph, resource_info)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/common/runner.py", line 189, in parallel_run
return parallax_run_mpi(**kwargs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/mpi/runner.py", line 192, in parallax_run_mpi
config=sess_config)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 921, in init
stop_grace_period_secs=stop_grace_period_secs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 643, in init
self._sess = _RecoverableSession(self._coordinated_creator)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1107, in init
_WrappedSession.init(self, self._create_session())
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
return self._sess_creator.create_session()
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
self.tf_sess = self._session_creator.create_session()
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 566, in create_session
init_fn=self._scaffold.init_fn)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 287, in prepare_session
sess.run(init_op, feed_dict=init_feed_dict)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 887, in run
run_metadata_ptr)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1110, in _run
feed_dict_tensor, options, run_metadata)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1286, in _do_run
run_metadata)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1308, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]. Registered: device='CPU'
[[{{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Caused by op 'w', defined at:
File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 137, in <module>
tf.app.run()
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 133, in main
parallax.parallel_run(single_gpu_graph, resource_info)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/common/runner.py", line 189, in parallel_run
return parallax_run_mpi(**kwargs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/mpi/runner.py", line 158, in parallax_run_mpi
tf.train.import_meta_graph(mpi_meta_graph_def)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1666, in import_meta_graph
meta_graph_or_file, clear_devices, import_scope, **kwargs)[0]
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1688, in _import_meta_graph_with_return_elements
**kwargs))
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/meta_graph.py", line 806, in import_scoped_meta_graph_with_return_elements
return_elements=return_elements)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 442, in import_graph_def
_ProcessNewOps(graph)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 234, in _ProcessNewOps
for new_op in graph._add_new_tf_operations(compute_devices=False): # pylint: disable=protected-access
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3438, in _add_new_tf_operations
for c_op in c_api_util.new_tf_operations(self)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3438, in
for c_op in c_api_util.new_tf_operations(self)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3297, in _create_op_from_tf_operation
ret = Operation(c_op, self)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1768, in init
self._traceback = tf_stack.extract_stack()
NotFoundError (see above for traceback): No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]. Registered: device='CPU'
[[{{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Add a test set to test:
various models, including:
various environments, including:
In the meantime, we can add CI integration to test (a subset of) the above requirements and check the coding style, etc.
Enable automatic variable partitioning for large variables.
Currently, variable partitioning is done manually.
Parallax should find a near-optimal number of partitions automatically.
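As a rough sketch of what an automatic rule could look like (the 64 MB shard budget and the PS-count cap are illustrative assumptions, not Parallax's actual heuristic):

```python
def num_partitions(num_elements, bytes_per_element=4,
                   max_shard_bytes=64 * 1024 * 1024, num_ps=4):
    """Pick a partition count so each shard of a large variable stays
    under a byte budget, capped by the number of PS tasks."""
    total = num_elements * bytes_per_element
    parts = -(-total // max_shard_bytes)   # ceil division
    return max(1, min(parts, num_ps))
```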
Parallax uses TensorFlow 1.6. Upgrade TensorFlow to 1.11 when it's released.
Parallax uses TensorFlow 1.6
Parallax uses TensorFlow 1.11
Remove the run function.
Currently, parallax.parallel_run receives a run function, including the number of iterations.
parallax.parallel_run should instead return a session for the distributed version of the graph.
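A hypothetical mock contrasting the proposed interface with the current style; parallel_run_v2 and FakeSession are illustrative stand-ins, not Parallax's actual API.

```python
class FakeSession:
    """Stand-in for the session the new API would return."""
    def __init__(self):
        self.steps = 0
    def run(self, fetches):
        self.steps += 1
        return fetches

def parallel_run_v2(graph, resource_info):
    # Proposed behavior: build the distributed graph and hand back a
    # session, instead of receiving a run function and driving the
    # training loop internally.
    return FakeSession()

sess = parallel_run_v2(graph=None, resource_info={})
for _ in range(3):        # the user now owns the iteration count
    sess.run('train_op')
```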
Hierarchical profile data
The 'Single Device Graph' example in Quick Start may need a modification. Maybe with single_device_graph.as_default_graph(): in line 4 should be changed to with single_device_graph.as_default():?
Running the graph code results in the following error:
Traceback (most recent call last):
File "graph.py", line 4, in <module>
with single_device_graph.as_default_graph():
AttributeError: 'Graph' object has no attribute 'as_default_graph'
No error is expected, as it should build a valid TensorFlow graph.
Execute the given Python code using the CPython 2.7.12 interpreter.
N/A
Add hybrid communication, which uses both MPI and PS.
The new communication method uses the advantages of MPI and PS to speed up training of RNN (LM1B, NMT) models.
Currently, a DL model runs on either PS or MPI.
Add one more option, Hybrid.
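One plausible shape for such a hybrid scheme, sketched under the assumption that sparse (embedding) gradients favor the PS while dense gradients favor MPI allreduce; the function and option names are illustrative, not Parallax's actual API.

```python
def comm_method(grad_is_sparse, run_option='HYBRID'):
    """Choose the transport for one gradient under the given run option."""
    if run_option == 'MPI':
        return 'allreduce'
    if run_option == 'PS':
        return 'parameter_server'
    # HYBRID: decide per gradient -- sparse updates are cheap to apply on
    # the PS, while dense tensors aggregate efficiently with allreduce.
    return 'parameter_server' if grad_is_sparse else 'allreduce'
```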
Bug fix for killing launch_ps processes.
N/A
launch_ps processes should be cleanly terminated when the main app is killed.
N/A
To reproduce, run any app with run_option=PS and send SIGINT; the launch_ps processes will not be terminated.
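A minimal sketch of one common fix, assuming launch_ps is spawned as a subprocess on a POSIX system: start it in its own process group and forward termination signals so the PS processes die with the main app. spawn_ps and the command are placeholders, not Parallax's actual code.

```python
import os
import signal
import subprocess

def spawn_ps(cmd):
    # Put the child in its own process group so we can kill launch_ps and
    # any processes it forked with a single call.
    proc = subprocess.Popen(cmd, preexec_fn=os.setsid)

    def _cleanup(signum, frame):
        # Forward the signal to the whole process group, then exit.
        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
        raise SystemExit(128 + signum)

    signal.signal(signal.SIGINT, _cleanup)
    signal.signal(signal.SIGTERM, _cleanup)
    return proc
```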
Hi,
I'd like to try using this. It seems really nice. I have a question (please note I'm new to TensorFlow).
I have N machines, each with one GPU, M gigabytes of disk storage, and J gigabytes of memory, and my training dataset is accessible to all machines. How do I configure things so that, when training in async data-parallel mode (the PS option), only M and J gigabytes are used per machine, so I can avoid any memory errors?
Could you set up a Google Group for this project so we may ask these kinds of questions there?
Thank you!