snuspl / parallax
A Tool for Automatic Parallelization of Deep Learning Training in Distributed Multi-GPU Environments.
License: Apache License 2.0
Step numbers of run_meta_XX files should be changed. For example, if you run the 7th iteration (global step 7) of the nmt example code, the profiled data from that iteration is saved in run_meta_10. The global step value and the number in the saved file name should be consistent.
For now, run_metadata is saved with the following code: file.write('%s' % metadata). However, this is incompatible with the metadata loading API, metadata.MergeFromString(). Thus, metadata should be written to the file with file.write(metadata.SerializeToString()) if you want to load the file easily.
run_metadata is saved as a human-readable file, and this format is incompatible with TF's API for loading the metadata file. run_metadata should be saved in a format that is compatible with TF's metadata loading API.
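As a sketch of what the fix could look like, the save routine could write the binary protobuf encoding and name the file after the global step; save_run_metadata and the naming scheme here are illustrative assumptions, not Parallax's actual code.

```python
import os

def save_run_metadata(metadata, profile_dir, global_step):
    # Name the file after the global step so the step value and the saved
    # file name stay consistent (hypothetical naming scheme).
    path = os.path.join(profile_dir, 'run_meta_%d' % global_step)
    with open(path, 'wb') as f:
        # SerializeToString() emits the binary protobuf encoding, which can
        # be loaded back with RunMetadata().MergeFromString(); writing
        # '%s' % metadata would emit the human-readable text format, which
        # MergeFromString() cannot parse.
        f.write(metadata.SerializeToString())
```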
Since the current in-graph replication logic makes both the system and the user API complicated, we can explore whether between-graph replication for PS would make sense in terms of performance.
In-graph replication with replica operation names in a single worker
Compare the performance and choose between-graph replication if it does not slow down training
Horovod version update
The current Horovod version is 0.11.2.
Update the version to 0.16.3.
According to NMT, the embedding parameter goes to a worker if the number of embedding partitions is one. This happens in Parallax, too.
PS mode fails when the number of embedding partitions is one.
It should work regardless of the number of embedding partitions.
To reproduce, run the NMT example with "num_embedding_partitions=1".
To implement RNN models, a user can represent the RNN hidden state with a Variable to correctly pass it between multiple session runs.
However, Parallax currently just places it on the PS, and it is not replicated to the workers even when the developer specifies that the variable is a 'local variable' that should not be shared. This leads to incomplete convergence of the model.
Furthermore, the fact that it is the user's responsibility to specify which variables should be replicated (GLOBAL_VARIABLE) and which should not (LOCAL_VARIABLE) also seems to be a problem, since Parallax aims to support 'automatic parallelization'.
Users have to distinguish between local and global variables by themselves.
Parallax should automatically detect local variables, or at least help the user with a warning.
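A minimal sketch of one possible placement rule, assuming TF-style variable collections; the helper and the collection-key names below are illustrative stand-ins, not Parallax's actual logic.

```python
# Collection keys mirroring TF's GraphKeys constants (assumed names).
GLOBAL_VARIABLES = 'variables'
LOCAL_VARIABLES = 'local_variables'

def placement_for(var_collections):
    """Place shared (global) variables on the PS; keep local variables,
    such as an RNN hidden state carried across session runs, as a
    per-worker replica so every worker updates its own copy."""
    if LOCAL_VARIABLES in var_collections:
        return 'replicate'        # per-worker copy, never shared
    return 'ps'                   # sharded on the parameter servers
```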
Fix the broken dependency between assigning global gradients and reading them in hybrid communication.
The first worker has no dependency between assigning global gradients and reading them, so it reads the gradients from the previous iteration.
The first worker should read global gradients only after the updates for the current iteration.
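A toy illustration of the ordering bug, using stand-in names rather than Parallax's actual ops: if the read is not forced to run after the current iteration's assign, the reader observes the previous iteration's value.

```python
class GlobalGradient:
    """Stand-in for a gradient variable stored on the PS."""
    def __init__(self):
        self.value = 'grad_iter_0'
    def assign(self, v):
        self.value = v
    def read(self):
        return self.value

g = GlobalGradient()
# Buggy order on the first worker: the read runs before the current
# iteration's assign, so the previous iteration's gradient is observed.
stale = g.read()
g.assign('grad_iter_1')
# Fixed order: assign for the current iteration, then read.
g.assign('grad_iter_2')
fresh = g.read()
```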
Add a profiling option to ParallaxConfig.
Save RunMetadata to the profile_dir when the global_step is one of profile_steps.
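A minimal sketch of such an option, assuming a small config object holds the directory and the step list; ProfileConfig and its field names are assumptions for illustration, not the actual ParallaxConfig API.

```python
class ProfileConfig:
    """Hypothetical profiling option attached to ParallaxConfig."""
    def __init__(self, profile_dir=None, profile_steps=()):
        self.profile_dir = profile_dir
        self.profile_steps = set(profile_steps)

    def should_profile(self, global_step):
        # Capture RunMetadata only on the explicitly requested steps.
        return (self.profile_dir is not None
                and global_step in self.profile_steps)
```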
Hi,
Good day. I have tried to run the Simple example in your code, and hit an issue while running it.
CUDA Toolkit 9.0
cuDNN SDK v7
OpenMPI 3.0.0
NCCL 2.1.15 (for CUDA 9.0)
Below is the output from running the simple example:
bash: line 0: export: `-c': not a valid identifier
bash: line 0: export: `-u': not a valid identifier
[[43684,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: node03
Another transport will be used instead, although this may result in
lower performance.
2020-05-14 05:49:50.857407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1412] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:03:00.0
totalMemory: 7.93GiB freeMemory: 7.82GiB
2020-05-14 05:49:50.857465: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1491] Adding visible gpu devices: 0
2020-05-14 05:49:51.340142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-14 05:49:51.340220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:978] 0
2020-05-14 05:49:51.340235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0: N
2020-05-14 05:49:51.340349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1104] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7535 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
2020-05-14 05:49:51.602832: E tensorflow/core/framework/op_segment.cc:53] Create kernel failed: Not found: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]. Registered: device='CPU'
2020-05-14 05:49:51.602953: E tensorflow/core/common_runtime/executor.cc:630] Executor failed to create kernel. Not found: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]. Registered: device='CPU'
[[{{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Traceback (most recent call last):
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1292, in _do_call
return fn(*args)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1277, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1367, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]. Registered: device='CPU'
[[{{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 137, in <module>
tf.app.run()
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 133, in main
parallax.parallel_run(single_gpu_graph, resource_info)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/common/runner.py", line 189, in parallel_run
return parallax_run_mpi(**kwargs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/mpi/runner.py", line 192, in parallax_run_mpi
config=sess_config)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 921, in init
stop_grace_period_secs=stop_grace_period_secs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 643, in init
self._sess = _RecoverableSession(self._coordinated_creator)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1107, in init
_WrappedSession.init(self, self._create_session())
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
return self._sess_creator.create_session()
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
self.tf_sess = self._session_creator.create_session()
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 566, in create_session
init_fn=self._scaffold.init_fn)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 287, in prepare_session
sess.run(init_op, feed_dict=init_feed_dict)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 887, in run
run_metadata_ptr)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1110, in _run
feed_dict_tensor, options, run_metadata)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1286, in _do_run
run_metadata)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1308, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]. Registered: device='CPU'
[[{{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Caused by op 'w', defined at:
File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 137, in <module>
tf.app.run()
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 133, in main
parallax.parallel_run(single_gpu_graph, resource_info)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/common/runner.py", line 189, in parallel_run
return parallax_run_mpi(**kwargs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/mpi/runner.py", line 158, in parallax_run_mpi
tf.train.import_meta_graph(mpi_meta_graph_def)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1666, in import_meta_graph
meta_graph_or_file, clear_devices, import_scope, **kwargs)[0]
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1688, in _import_meta_graph_with_return_elements
**kwargs))
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/meta_graph.py", line 806, in import_scoped_meta_graph_with_return_elements
return_elements=return_elements)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 442, in import_graph_def
_ProcessNewOps(graph)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 234, in _ProcessNewOps
for new_op in graph._add_new_tf_operations(compute_devices=False): # pylint: disable=protected-access
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3438, in _add_new_tf_operations
for c_op in c_api_util.new_tf_operations(self)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3438, in
for c_op in c_api_util.new_tf_operations(self)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3297, in _create_op_from_tf_operation
ret = Operation(c_op, self)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1768, in init
self._traceback = tf_stack.extract_stack()
NotFoundError (see above for traceback): No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]. Registered: device='CPU'
[[{{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Add a test set to test:
various models, including:
various environments, including:
In the meantime, we can add CI integration to test (a subset of) the above requirements and check the coding style, etc.
Enable automatic variable partitioning for large variables.
Currently, variable partitioning is done manually.
Parallax should find a near-optimal number of partitions automatically.
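As a rough sketch of what an automatic rule could look like (the 64 MB shard budget and the PS-count cap are illustrative assumptions, not Parallax's actual heuristic):

```python
def num_partitions(num_elements, bytes_per_element=4,
                   max_shard_bytes=64 * 1024 * 1024, num_ps=4):
    """Pick a partition count so each shard of a large variable stays
    under a byte budget, capped by the number of PS tasks."""
    total = num_elements * bytes_per_element
    parts = -(-total // max_shard_bytes)   # ceil division
    return max(1, min(parts, num_ps))
```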
Parallax uses TensorFlow 1.6. Upgrade TensorFlow to 1.11 when it's released.
Parallax uses TensorFlow 1.6
Parallax uses TensorFlow 1.11
Remove the run function.
Currently, parallax.parallel_run receives a run function, including the number of iterations.
parallax.parallel_run should instead return a session for the distributed version of the graph.
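A hypothetical mock contrasting the proposed interface with the current style; parallel_run_v2 and FakeSession are illustrative stand-ins, not Parallax's actual API.

```python
class FakeSession:
    """Stand-in for the session the new API would return."""
    def __init__(self):
        self.steps = 0
    def run(self, fetches):
        self.steps += 1
        return fetches

def parallel_run_v2(graph, resource_info):
    # Proposed behavior: build the distributed graph and hand back a
    # session, instead of receiving a run function and driving the
    # training loop internally.
    return FakeSession()

sess = parallel_run_v2(graph=None, resource_info={})
for _ in range(3):        # the user now owns the iteration count
    sess.run('train_op')
```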
Hierarchical profile data
The 'Single Device Graph' example in Quick Start may need a modification. Maybe with single_device_graph.as_default_graph(): in line 4 should be changed to with single_device_graph.as_default():?
Running the graph code results in the following error:
Traceback (most recent call last):
File "graph.py", line 4, in <module>
with single_device_graph.as_default_graph():
AttributeError: 'Graph' object has no attribute 'as_default_graph'
No error is expected, as it should build a valid TensorFlow graph.
Execute the given Python code using the CPython 2.7.12 interpreter.
N/A
Add hybrid communication, which uses both MPI and PS.
The new communication method uses the advantages of MPI and PS to speed up training of RNN (LM1B, NMT) models.
Currently, a DL model runs on either PS or MPI.
Add one more option, Hybrid.
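One plausible shape for such a hybrid scheme, sketched under the assumption that sparse (embedding) gradients favor the PS while dense gradients favor MPI allreduce; the function and option names are illustrative, not Parallax's actual API.

```python
def comm_method(grad_is_sparse, run_option='HYBRID'):
    """Choose the transport for one gradient under the given run option."""
    if run_option == 'MPI':
        return 'allreduce'
    if run_option == 'PS':
        return 'parameter_server'
    # HYBRID: decide per gradient -- sparse updates are cheap to apply on
    # the PS, while dense tensors aggregate efficiently with allreduce.
    return 'parameter_server' if grad_is_sparse else 'allreduce'
```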
Bug fix for killing launch_ps processes.
N/A
launch_ps processes should be cleanly terminated when the main app is killed.
N/A
To reproduce, run any app with run_option=PS and send SIGINT; the launch_ps processes will not be terminated.
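A minimal sketch of one common fix, assuming launch_ps is spawned as a subprocess on a POSIX system: start it in its own process group and forward termination signals so the PS processes die with the main app. spawn_ps and the command are placeholders, not Parallax's actual code.

```python
import os
import signal
import subprocess

def spawn_ps(cmd):
    # Put the child in its own process group so we can kill launch_ps and
    # any processes it forked with a single call.
    proc = subprocess.Popen(cmd, preexec_fn=os.setsid)

    def _cleanup(signum, frame):
        # Forward the signal to the whole process group, then exit.
        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
        raise SystemExit(128 + signum)

    signal.signal(signal.SIGINT, _cleanup)
    signal.signal(signal.SIGTERM, _cleanup)
    return proc
```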
Hi,
I'd like to try using this. It seems really nice. I have a question (please note I'm new to TensorFlow).
I have N machines, each with one GPU, M gigabytes of disk storage, and J gigabytes of memory, and my training dataset is accessible to all machines. How do I configure things so that, when training in async data-parallel mode (the PS option), only M and J gigabytes are used per machine, so I can avoid any memory errors?
Could you set up a Google Group for this project so we may ask these kinds of questions there?
Thank you!