Comments (15)
Hi, @yhenon!
I appreciate the comment and the explanation (I was puzzled by your custom BatchNormalization implementation)! I’ll take a look. The BatchNormalization(axis=bn_axis+1) sounds like the better way to go!
I’m curious, would you be interested in helping out? Your package was certainly an inspiration!
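To make the `axis=bn_axis+1` suggestion concrete: when an extra ROI/time dimension is inserted after the batch axis, the channel axis index shifts up by one. A tiny NumPy sketch of that index shift (illustrative only, shapes chosen for the example):

```python
import numpy as np

bn_axis = 3                           # channel axis of a 4D (batch, h, w, c) tensor
x4 = np.random.rand(64, 8, 8, 3)      # ordinary 4D input
x5 = np.random.rand(64, 4, 8, 8, 3)   # same data with an ROI/time axis at position 1

# The channel axis moves from index 3 to index 4, hence bn_axis + 1.
assert x4.shape[bn_axis] == x5.shape[bn_axis + 1] == 3
```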
from keras-rcnn.
I was not aware of this problem. Thank you for pointing it out. As far as I understand, the TimeDistributed layer should apply to a tensor whose shape excludes the time dimension. If this is not the case for BatchNormalization, it might be an issue for Keras itself, because that would be inconsistent with the other layers. I'm not sure whether the issue is caused by the extra dimension or by something else. That seems interesting and I will look into it.
Hi @yhenon ,
I've tried the TimeDistributed BatchNormalization with the following sample:
import numpy as np
from keras.layers import *
from keras.models import *
import keras.backend as K

img_size = 8
batch_size = 64
num_time_steps = 4
num_channels = 3

K.set_learning_phase(1)

X = np.random.rand(batch_size, num_time_steps,
                   img_size, img_size, num_channels)
x = K.variable(X)
y = TimeDistributed(BatchNormalization(axis=-1))(x)
print(K.int_shape(y))

norm = K.eval(y)
for i in range(num_time_steps):
    for j in range(num_channels):
        print(norm[:, i, ..., j].mean(), norm[:, i, ..., j].std())
And the results were:
(64, 4, 8, 8, 3)
-5.02914e-08 0.994199
2.79397e-09 0.99404
-2.79397e-08 0.994136
-3.21306e-08 0.994103
3.53903e-08 0.99412
2.79397e-08 0.994144
3.35276e-08 0.993953
-8.3819e-09 0.994113
1.11759e-08 0.994066
-8.73115e-08 0.993973
-7.45058e-09 0.993843
-6.61239e-08 0.99407
This seems to match what we want from BatchNormalization: it returns activations normalized per batch, with the normalization applied independently to each data stream.
@JihongJu After looking over your code, I still think my issue stands (though I may be missing something). To clarify my point a bit:
TimeDistributed(BatchNorm()) seems to work fine at training time (as you point out), as it normalizes using the statistics of the mini-batch.
TimeDistributed(BatchNorm()) does not work fine at test time, as it normalizes using statistics computed on the training set. However, these statistics never get updated when the BN layer is in a TimeDistributed wrapper.
The problem stems from your line K.set_learning_phase(1), which runs BN in train mode. However, you can't keep K.set_learning_phase(1) at test time, since it makes a number of layers (like dropout) behave undesirably.
Here's a more complete example, where we compute the stats on a batch at both train and test time, using both approaches to BN:
import numpy as np
from keras.layers import *
from keras.models import *
import keras.backend as K

def test_bn(batch_norm_type, learning_phase):
    K.set_learning_phase(learning_phase)
    img_size = 8
    batch_size = 64
    num_time_steps = 4
    num_channels = 3
    inputs = Input(shape=(num_time_steps, img_size, img_size, num_channels))
    if batch_norm_type == 'time_dist':
        # momentum increased for faster update of dataset statistics
        x = TimeDistributed(BatchNormalization(axis=-1, momentum=0.5))(inputs)
    elif batch_norm_type == 'flat':
        x = BatchNormalization(axis=4, momentum=0.5)(inputs)
    model = Model(inputs=inputs, outputs=x)
    model.compile(loss='mae', optimizer='sgd')
    X = np.random.rand(batch_size, num_time_steps, img_size, img_size, num_channels)
    Y = np.random.rand(batch_size, num_time_steps, img_size, img_size, num_channels)
    history = model.fit(X, Y, epochs=4, verbose=0)
    P = model.predict(X)
    print('bn_type: {:10} | learning_phase: {} | mean: {:14} | std: {:14}'.format(
        batch_norm_type, learning_phase,
        P.mean(), P.std()))
    return

for batch_norm_type in ['time_dist', 'flat']:
    for learning_phase in [0, 1]:
        test_bn(batch_norm_type, learning_phase)
And the corresponding output:
bn_type: time_dist | learning_phase: 0 | mean: 0.498111873865 | std: 0.287745058537
bn_type: time_dist | learning_phase: 1 | mean: 0.00772533146665 | std: 0.973870813847
bn_type: flat | learning_phase: 0 | mean: 0.0119582833722 | std: 0.961674869061
bn_type: flat | learning_phase: 1 | mean: 0.0076981917955 | std: 0.973761022091
@0x00b1 Hi!
To be clear, in my implementation, I was just implementing what the paper said:
For the usage of BN layers, after pretraining, we compute the BN statistics (means and variances) for each layer on the ImageNet training set. Then the BN layers are fixed during fine-tuning for object detection. As such, the BN layers become linear activations with constant offsets and scales, and BN statistics are not updated by fine-tuning. We fix the BN layers mainly for reducing memory consumption in Faster R-CNN training.
This also provided a way of dealing with the above issue, so I left it.
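The "linear activations with constant offsets and scales" point is easy to verify: with frozen statistics, BN collapses to an affine map y = a*x + b. A small NumPy sketch of the algebra (illustrative only, not the keras-rcnn code; parameter values are made up):

```python
import numpy as np

def frozen_bn(x, mu, var, gamma, beta, eps=1e-5):
    """BatchNormalization with fixed statistics:
    y = gamma * (x - mu) / sqrt(var + eps) + beta."""
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def as_affine(mu, var, gamma, beta, eps=1e-5):
    """The same layer collapsed to a constant (scale, offset) pair."""
    a = gamma / np.sqrt(var + eps)
    return a, beta - a * mu

x = np.random.rand(8)
a, b = as_affine(mu=0.3, var=0.25, gamma=2.0, beta=0.1)
# The frozen BN layer and the affine map agree elementwise,
# which is why fixing BN saves memory: no statistics to track.
assert np.allclose(frozen_bn(x, 0.3, 0.25, 2.0, 0.1), a * x + b)
```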
I would certainly be interested in helping - my original implementation is rather limited in scope and full of hacks, and a better-quality Keras Faster R-CNN would be desirable.
@yhenon Hmm, now I get the point. In that case, I agree with you, adding a flat BN, instead of a time distributed BN, to the 5D tensor seems fine.
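As a sanity check on what the two variants compute at train time, here is a NumPy sketch (illustrative, not Keras internals) of the per-channel statistics a flat BN pools versus a TimeDistributed one:

```python
import numpy as np

def flat_bn_stats(x):
    """Per-channel statistics a 'flat' BN (axis=-1) uses on a 5D tensor:
    pooled over batch, time/ROI, and both spatial axes at once."""
    axes = tuple(range(x.ndim - 1))          # (0, 1, 2, 3) for 5D input
    return x.mean(axis=axes), x.std(axis=axes)

def time_dist_bn_stats(x):
    """Per-channel statistics a TimeDistributed BN uses: pooled over
    batch and spatial axes, separately for each time step."""
    return x.mean(axis=(0, 2, 3)), x.std(axis=(0, 2, 3))

x = np.random.rand(64, 4, 8, 8, 3)           # (batch, time, h, w, channels)
mu_flat, sd_flat = flat_bn_stats(x)          # shape (3,): one value per channel
mu_td, sd_td = time_dist_bn_stats(x)         # shape (4, 3): one row per time step

normed = (x - mu_flat) / sd_flat
# After flat normalization every channel has zero mean and unit std.
assert np.allclose(normed.mean(axis=(0, 1, 2, 3)), 0.0)
```

Both normalize sensibly in train mode; the difference @yhenon identified is that only the un-wrapped layer keeps its moving statistics updated for test time.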
Can I abuse this issue to ask why TimeDistributed layers are necessary? Is it to perform computation per ROI (meaning the term 'time distributed' is a bit poorly chosen here)? I noticed in py-faster-rcnn that they are limited to single-batch training only, presumably because Caffe blobs are limited to 4D. If you have batch_size > 1 and ROIs, your blob would need 5 dimensions (batch_id, roi_id, height, width, channels). Is the use of TimeDistributed intended to get this fifth dimension?
In addition, I noticed that for Keras the moving average / variation is not updated when in test mode (see here). Wouldn't that be an issue? Shouldn't it be updated during test mode? Should this be fixed in Keras? So many questions :)
@JihongJu I played with this too. I think @yhenon is correct. And I believe the suggestion by @yhenon will work (i.e. BatchNormalization(axis=bn_axis + 1)).
@yhenon Want to send a PR? 😄
Can I abuse this issue to ask why TimeDistributed layers are necessary? Is it to perform computation per ROI (meaning the term 'time distributed' is a bit poorly chosen here)? I noticed in py-faster-rcnn that they are limited to single batch training only, presumably because Caffe blobs are limited to 4d. If you have batch_size > 1 and ROIs, your blob would need 5 dimensions (batch_id, roi_id, height, width, channels). Is the use of TimeDistributed intended to get this fifth dimension?
Yep. Your instincts are right. It’s a super clever hack by @yhenon to exploit the TimeDistributed wrapper’s batching to iterate across a variable number of regions. And, I agree, TimeDistributed is a bad name. I think Distributed (or Batched) would make more sense. (cc: @fchollet)
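For intuition, TimeDistributed essentially folds the second dimension (here, the ROI dimension) into the batch dimension, applies the wrapped layer once, and unfolds the result. A rough NumPy sketch of that idea (not the actual Keras implementation; the pooling "layer" is a stand-in):

```python
import numpy as np

def time_distributed(inner, x):
    """Sketch of TimeDistributed: merge the batch and second (ROI) axes,
    apply the inner layer once, then restore the original leading axes."""
    batch, rois = x.shape[0], x.shape[1]
    folded = x.reshape((batch * rois,) + x.shape[2:])   # (batch*rois, h, w, c)
    out = inner(folded)
    return out.reshape((batch, rois) + out.shape[1:])

# A toy stand-in "layer": global average pooling over the spatial axes.
gap = lambda t: t.mean(axis=(1, 2))

x = np.random.rand(2, 5, 7, 7, 3)   # (batch, num_rois, h, w, channels)
y = time_distributed(gap, x)
assert y.shape == (2, 5, 3)          # the inner layer ran once per ROI, per image
```

This is why the wrapper handles a variable number of regions for free: the inner layer only ever sees a 4D tensor.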
In addition, I noticed that for Keras the moving average / variation is not updated when in test mode (see here). Wouldn't that be an issue? Shouldn't it be updated during test mode? Should this be fixed in Keras? So many questions :)
Hrm. Why do you think it should be updated during test (i.e. inference or prediction)?
Hrm. Why do you think it should be updated during test (i.e. inference or prediction)?
I'm not sure, but it sounds like the moving average / variation should depend on your current data, not on the data you trained on. I will read more on BatchNormalization today to see how it should behave.
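For what it's worth, the moving statistics in Keras are an exponential moving average that is updated only during training; at test time the stored values are used as-is, so by design they reflect the training data. A NumPy sketch of the update rule (illustrative; the momentum value and initialization here are assumptions, not Keras defaults):

```python
import numpy as np

def update_moving_stats(moving_mean, moving_var, batch, momentum=0.99):
    """Exponential moving average update applied during *training* only.
    At test time the stored values are used unchanged."""
    mu, var = batch.mean(), batch.var()
    moving_mean = momentum * moving_mean + (1.0 - momentum) * mu
    moving_var = momentum * moving_var + (1.0 - momentum) * var
    return moving_mean, moving_var

rng = np.random.default_rng(0)
mm, mv = 0.0, 1.0   # typical initial values: zero mean, unit variance
for _ in range(500):
    batch = rng.normal(3.0, 2.0, size=256)   # training data: mean 3, var 4
    mm, mv = update_moving_stats(mm, mv, batch, momentum=0.9)

# The moving statistics converge toward the training distribution,
# which is exactly what inference-time normalization uses.
assert abs(mm - 3.0) < 0.2 and abs(mv - 4.0) < 0.5
```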
I pushed a PR to fix this issue: keras-team/keras#7467. I believe it's a more generic solution than the bn_axis+1 one, and it fixes the root problem in the TimeDistributed layer.
Thanks to @waleedka for making that PR which has now been merged!
Re-running the above snippet with a freshly checked out keras install gives:
bn_type: time_dist | learning_phase: 0 | mean: 0.0126442806795 | std: 0.960675358772
bn_type: time_dist | learning_phase: 1 | mean: 0.00776057131588 | std: 0.973823308945
bn_type: flat | learning_phase: 0 | mean: 0.0131619861349 | std: 0.961250126362
bn_type: flat | learning_phase: 1 | mean: 0.00772643135861 | std: 0.97383749485
Which is the desired output. This should keep the API a bit simpler, since TimeDistributed() can now be applied to all layers in the final-stage classifier. It means people will need to update their Keras version to the latest, but that's OK.
Awesome! Thanks for the update, @yhenon and thanks for the work @waleedka!
@waleedka please feel free to add yourself to the CONTRIBUTORS file!
What is the best way to extend this script to batch inference / training? @yhenon