ibm / matrix-capsules-with-em-routing

A TensorFlow implementation of "Matrix Capsules with EM Routing" by Hinton et al. (2018).

License: Apache License 2.0

Python 99.71% Shell 0.29%
capsules hinton matrix-capsules em-routing dynamic-routing capsnet capsule-networks

matrix-capsules-with-em-routing's People

Contributors: ashleygritzman, imgbotapp

matrix-capsules-with-em-routing's Issues

failed to run cuBLAS routine cublasGemmBatchedEx issue

Hi Ashley,
Thanks for your great work.
When I ran the code, it failed with the following output:

2019-10-03 13:25:00.047383: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-10-03 13:25:00.477672: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-03 13:25:00.478676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.695
pciBusID: 0000:01:00.0
totalMemory: 7.92GiB freeMemory: 7.53GiB
2019-10-03 13:25:00.478708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-10-03 13:25:08.880050: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-03 13:25:08.880113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2019-10-03 13:25:08.880130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2019-10-03 13:25:08.881112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7286 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-10-03 13:25:09.813176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-10-03 13:25:09.813211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-03 13:25:09.813215: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2019-10-03 13:25:09.813218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2019-10-03 13:25:09.813350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7286 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-10-03 13:26:52.896021: E tensorflow/stream_executor/cuda/cuda_blas.cc:652] failed to run cuBLAS routine cublasGemmBatchedEx: CUBLAS_STATUS_NOT_SUPPORTED
2019-10-03 13:26:52.897701: E tensorflow/stream_executor/cuda/cuda_blas.cc:2574] Internal: failed BLAS call, see log for details
2019-10-03 13:26:53 CRITICAL: Traceback (most recent call last):
File "/home/jeff/anaconda2/envs/tf_36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1292, in _do_call
return fn(args)
File "/home/jeff/anaconda2/envs/tf_36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1277, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/jeff/anaconda2/envs/tf_36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1367, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[3612672,4,4], b.shape=[3612672,4,4], m=4, n=4, k=4, batch_size=3612672
[[{{node tower_0/lyr.conv_caps1/votes/MatMul}} = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower_0/lyr.conv_caps1/votes/Tile_1, tower_0/lyr.conv_caps1/votes/Tile, ^swap_out_tower_0/gradients/tower_0/lyr.conv_caps1/votes/MatMul_grad/MatMul_1_0, ^swap_out_tower_0/gradients/tower_0/lyr.conv_caps1/votes/MatMul_grad/MatMul_1)]]
[[{{node tower_0/class_caps/activation_out/_23}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1082_tower_0/class_caps/activation_out", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

My computer system information:
Linux Ubuntu 16.04
Nvidia GPU GeForce GTX 1070, 8G
CUDA 9.0/cuDNN 7.3
Python 3.6.8
Tensorflow version: 1.11.0-gpu

I first met this problem with CUDA 9.2 and cuDNN 7.6; I downgraded to CUDA 9.0 and cuDNN 7.3, but the issue remains.
I also tried reducing 'batch_size' from 64 to 2, but the problem is the same. Any idea why it fails?
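For reference, one thing I still want to try is letting TensorFlow allocate GPU memory on demand rather than reserving nearly all of it up front (a minimal TF 1.x sketch, not specific to this repo; I am not sure it addresses the cuBLAS failure):

import tensorflow as tf

# Allocate GPU memory on demand instead of claiming almost all of it
# at session creation; large up-front allocations are one common
# trigger for Blas/cuBLAS launch failures.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    pass  # build and run the training graph here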

pip requirements need to be fixed

I tried to use pip install -r requirements.txt to install the dependencies.
However, some of the pinned versions cannot be resolved:

mkl-fft==1.0.12 (only 1.0.6 is available)
mkl-random==1.0.2 (only 1.0.1.1)
mkl-service==2.0.2 (not found)
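One possible workaround (assuming the code does not depend on those exact patch releases) is to relax the pins in requirements.txt to the nearest versions pip can resolve:

mkl-fft>=1.0.6
mkl-random>=1.0.1

mkl-service may need to come from conda rather than pip, since the pinned build does not appear to be available on PyPI.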

spatial_routing_matrix = utl.create_routing_map(child_space=1, k=1, s=1) ?

Hi Ashley,
In layers.py, 'def fc_caps()' creates the spatial routing matrix with 'spatial_routing_matrix = utl.create_routing_map(child_space=1, k=1, s=1)', where child_space is 1. But I don't think it has to be 1 at this point: given the tensor shape flow leading up to it, (64, 7, 7, 8, *) ---> (64, 5, 5, 16, *), the child_space should be 5 instead of 1.
Moreover, with child_space=1 the generated spatial_routing_matrix has shape (1, 1), which would make the subsequent 'em_routing()' incorrect.
What do you think? Maybe my reasoning is wrong somewhere.
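To illustrate my concern with a minimal numpy sketch (assuming create_routing_map returns a binary child-to-parent connectivity matrix of shape (child_space**2, parent_space**2)):

import numpy as np

# With child_space=1, k=1, s=1 the routing map degenerates to a single
# entry: one child position connected to one parent position.
trivial_map = np.ones((1 * 1, 1 * 1))   # shape (1, 1)

# Whereas with a 5x5 child grid feeding a fully connected capsule layer,
# I would expect every spatial position to connect to the one parent:
expected_map = np.ones((5 * 5, 1))      # shape (25, 1)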

Kindly regards
Jeff

Routing by agreement with Transformer-based for NMT

Hello all :)

I'm trying to use routing by agreement with a Transformer-based model for an NMT task. The proposed idea is to use each attention head's output as an input capsule for a capsule network, fusing the semantic and spatial information from the different heads to help improve the correctness of the output sentence. As below:

[figure: routing]

The implementation code is here, and the PyTorch issue is here.

I have gotten quite bad results so far. Kindly, I would appreciate any suggestions on what to work on.
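To make the setup concrete, here is a minimal sketch of how I form the input capsules from the attention heads (hypothetical shapes; the real model uses the Transformer's own dimensions):

import tensorflow as tf

batch, seq_len, num_heads, d_head = 16, 32, 8, 64

# One (batch, seq_len, d_head) output tensor per attention head.
head_outputs = [tf.random_normal((batch, seq_len, d_head))
                for _ in range(num_heads)]

# Stack the heads as input capsules: (batch, seq_len, num_heads, d_head).
# A routing-by-agreement layer then fuses the num_heads capsules at each
# position into a single fused representation.
input_capsules = tf.stack(head_outputs, axis=2)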

I look forward to your feedback.

Try to continue to train from a checkpoint

Hi guys,

When I try to continue training the network from a checkpoint directory, I use the "load_dir" flag:
python3 train_val.py --load_dir=./logs/smallNORB/20200103_/train/checkpoint
But the code returns:
"load_ckpt directory exists but cannot find a valid checkpoint to resore, consider using the reset flag"
I have checked the directory and there are checkpoints from previous training.
Did I make a mistake somewhere in this process?
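For reference, this is how I checked what TensorFlow considers the latest checkpoint (a quick TF 1.x snippet; note that latest_checkpoint expects the training directory containing the 'checkpoint' index file, not the 'checkpoint' file itself, in case that matters):

import tensorflow as tf

# Returns the newest checkpoint prefix recorded in the 'checkpoint'
# index file, or None if no valid checkpoint is found.
ckpt = tf.train.latest_checkpoint("./logs/smallNORB/20200103_/train")
print(ckpt)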

cost_j_h = (beta_v + 0.5*tf.log(var_j)) ?

Hi Ashley,

For 'def m_step()' in em_routing.py, I see that you compute 'cost_j_h = (beta_v + 0.5*tf.log(var_j)) * rr_prime_sum * layer_norm_factor' before 'cost_j = tf.reduce_sum(cost_j_h, axis=-1, keepdims=True, name="cost_j")'.
My question is whether this leads to beta_v being counted 'h' times, because 'beta_v + 0.5*tf.log(var_j)' broadcasts beta_v over all h elements of the last dimension.
According to formula (2) in the 'Matrix Capsules with EM Routing' paper, it should be something like (beta_v + sum of cost_j_h) instead of sum of (beta_v + cost_j_h). What do you think? Maybe I am wrong.
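For reference, my reading of the M-step in the paper (paper notation, where beta_u plays the role of beta_v here) is:

\begin{aligned}
\left(\sigma_h^j\right)^2 &= \frac{\sum_i r_{ij}\,\bigl(V_{ih}^j - \mu_h^j\bigr)^2}{\sum_i r_{ij}} \\
\mathrm{cost}_h &\leftarrow \bigl(\beta_u + \log \sigma_h^j\bigr) \sum_i r_{ij} \\
a_j &\leftarrow \mathrm{logistic}\Bigl(\lambda\bigl(\beta_a - \textstyle\sum_h \mathrm{cost}_h\bigr)\Bigr)
\end{aligned}

So the question is whether beta_u is meant to be added once per component h (as the broadcast does) or once per capsule j.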

Kindly regards
Jeff
