Comments (10)
Yes, Layer Norm support is planned, though no specific ETA yet.
from haste.
@bratao, `IndRNN` is complete and ready for prime-time use. I've kept this issue open to track the full `LayerNormIndRNN` implementation. That still needs some work which I haven't gotten around to yet.
from haste.
Sounds like a reasonable addition. The IndRNN paper doesn't mention layer norm – do you have an algorithm that's known to produce good results? I'm wondering if we can get away with applying layer norm only on the input like this:
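A rough NumPy sketch of what I mean – the `layer_norm` helper and step signature here are just illustrative, not Haste's API:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def indrnn_step(x, h_prev, W, u, b, gamma, beta, act=np.tanh):
    # Normalize only the input projection x @ W; the elementwise
    # recurrence u * h_prev is left un-normalized.
    return act(layer_norm(x @ W, gamma, beta) + u * h_prev + b)
```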
How important are cell clipping, input dropout, recurrent dropout, and non-tanh activation functions in practice? And what weight initialization scheme do you propose?
from haste.
@sharvil Wonderful.
WEIGHT INITIALIZATION (reference: the authors' repo):

`kernel`: authors recommend Normal w/ small std. In my application, `'glorot_normal'` and `'glorot_uniform'` have yielded comparable results (didn't try others); note that Keras's `'glorot_normal'` is truncated (I suggest all default Normals be truncated, to avoid unlikely but possible extreme weight values). I'd default to `'glorot_normal'`, but no strong inclination.
`recurrent_kernel`: authors recommend a sophisticated initialization scheme w/ timesteps-based clipping. Several points:

- `timesteps`-based clipping will be a bit involved to implement in a Keras API-friendly manner.
- I'm not very convinced by the need for such elaborate clipping; per this graph I made based on the authors' excerpt, the difference between clipped and simply `[-1, 1]` weights is quite small for long sequences, and the authors themselves note clipping to be redundant for short (<20 timesteps) sequences. More importantly, being around 1 at all may be harmful (see below).
- Most implementations default to uniform `[-1, 1]`; I recommend against this. In my application, `[-.2, .2]` has worked best, and `[-1, 1]` was a wreck. My explanation: large weights yield large pre-activations, driving `tanh` into saturation (and `relu` into explosion), harming backprop for long sequences (see the toy demo under ACTIVATION below). With my scheme, I achieved near picture-perfect gradients for 160+ timesteps.
- My recommendation: `[-.5, .5]` (uniform), with a docstring mention of the difference w/ the authors. IndRNNs are likelier to be used for long-sequence tasks, where `[-1, 1]` can be a bad default.

Caveats:
- My experiments are limited to signal classification; results may differ in other domains.
- If defaulting to `[-1, 1]`, it's worth mentioning in a docstring that smaller bounds are worth trying for longer sequences.
- Whatever the default, I suggest `haste` provide a more convenient way to initialize via uniform or (truncated) normal. TF/Keras requires an import; instead, we can take a `dict` like `{'uniform': 1}` to mean `RandomUniform(-1, 1)`, or `{'normal': .01}` to mean `TruncatedNormal(stddev=.01)` (sketch below).
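A minimal sketch of such a helper (`get_initializer` is my name for it, not an existing or proposed haste function):

```python
import tensorflow as tf

def get_initializer(spec):
    # Map a one-entry dict to a Keras initializer:
    #   {'uniform': 1}   -> RandomUniform(-1, 1)
    #   {'normal': 0.01} -> TruncatedNormal(stddev=0.01)
    (kind, value), = spec.items()
    if kind == 'uniform':
        return tf.keras.initializers.RandomUniform(-value, value)
    if kind == 'normal':
        return tf.keras.initializers.TruncatedNormal(stddev=value)
    raise ValueError(f"unknown initializer spec: {spec}")
```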
`bias`: `use_bias=True`, initialize to zeros. Same as the authors'.
ACTIVATION: `'relu'` bombed badly in my application; `'selu'` was stabler – but this is rather inherent to long sequences. The authors' success with `relu` may be domain-specific; for a safer general default, and what proved superior in my case, I recommend `'tanh'`.
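To make the saturation/explosion point concrete, here's a toy unrolled recurrence (my own illustration: a bare elementwise IndRNN-style state update, no inputs or training):

```python
import numpy as np

def unroll(u, act, T=160, h0=0.5):
    # Iterate the elementwise recurrence h_t = act(u * h_{t-1}).
    h = h0
    for _ in range(T):
        h = act(u * h)
    return h

relu = lambda z: np.maximum(z, 0.0)
for u in (0.2, 1.0, 1.5):
    print(f"u={u}: tanh -> {unroll(u, np.tanh):.3g}, "
          f"relu -> {unroll(u, relu):.3g}")

# u=1.5: relu blows up while tanh pins near a saturating fixed point
# (~0.86) where tanh' is small, so gradients through many steps vanish;
# |u| well below 1 instead decays states (and gradients) toward zero.
```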
LAYER NORMALIZATION:
The benefits of LayerNorm or BatchNorm, especially implemented recurrently for RNNs, are basically universal (some interesting reading here) – and will be even more pronounced for long-sequence tasks with typical vanishing gradients. For IndRNNs, it remains important to normalize both the input-to-hidden and hidden-to-hidden transforms, and separately; the authors of recurrent batch normalization go so far as to normalize gates separately, with sound arguments. The idea is that information-flow dynamics are unique to each transform, and IndRNN's `recurrent_kernel` is additionally, distinctly, a vector.
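As a minimal sketch of what I mean by normalizing the two transforms separately (my own NumPy illustration, not a proposed implementation):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def ln_indrnn_step(x, h_prev, W, u, b, ln_x, ln_h, act=np.tanh):
    # Separate (gamma, beta) pairs and separate statistics per transform,
    # since each transform's information-flow dynamics are distinct.
    zx = layer_norm(x @ W, *ln_x)       # input-to-hidden transform
    zh = layer_norm(u * h_prev, *ln_h)  # elementwise recurrent transform
    return act(zx + zh + b)
```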
Btw, Recurrent BN may prove superior to Recurrent LN, though it's harder to implement – but that's for another time.
DROPOUTS: should behave the same as TF/Keras's `SimpleRNN`. Though `recurrent_dropout` as-is is problematic (for all RNNs) – I'll clarify another time; can mimic TF for starters (roughly sketched below).
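For reference, TF/Keras-style recurrent dropout samples its mask once per forward pass and reuses it at every timestep – a rough, simplified sketch of my own:

```python
import numpy as np

def recurrent_dropout_mask(shape, rate, rng):
    # One mask per forward pass (TF/Keras style), reused at every timestep;
    # inverted-dropout scaling keeps the expected activation unchanged.
    keep = 1.0 - rate
    return rng.binomial(1, keep, size=shape) / keep

mask = recurrent_dropout_mask((32, 512), 0.25, np.random.default_rng(0))
# Inside the unrolled loop, the recurrent term becomes:
#   h = act(x_t @ W + u * (h_prev * mask) + b)
```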
from haste.
Thanks for the detailed writeup. I'm following your recommendations for the most part, but I don't think dropout is a good addition. Specifically, recurrent dropout is known to be A Bad Idea™ for RNNs, as you pointed out, and dropout between layers can be implemented trivially by the caller – it doesn't need to be implemented by the Haste layer.
from haste.
@sharvil I see your commit history – good progress so far, though I hope LayerNorm support is planned, as it can make IndRNNs vastly more powerful for very long sequences. Regarding recurrent dropout, I disagree – it can be a very effective regularizer if used properly, though I wouldn't follow TensorFlow's implementation of it. I'll dedicate an Issue to this sometime.
from haste.
I have tested IndRNN – do I understand correctly that it is not production-ready yet?
Training fails, as the gradient of `indrnn_cell_36/recurrent_scale:0` has the wrong shape:
grad shape: `(512, 512)` (all rows but the first are zeros)
weight shape: `(512,)`
from haste.
@amurashov thanks for the report. Looks like a tensor-shape bug in my TensorFlow integration, which I've now fixed. IndRNN is at a functional stage; LayerNormIndRNN is usable, but only performs layer norm on the inputs, not on the recurrent connection.
from haste.
Yes, appears fixed. Thanks!
from haste.
@sharvil Thank you for your awesome work. This is pure gold!!
Can we consider IndRNN production-ready?
from haste.