Does a Transformer-based ASR model need less time than an LSTM in the inference stage? (returnn-experiments, 7 comments, closed)

rwth-i6 commented on June 4, 2024

Does a Transformer-based ASR model need less time than an LSTM in the inference stage?

from returnn-experiments.

Comments (7)

scharoun commented on June 4, 2024

A Transformer-based model seems to need more layers. So wouldn't that affect inference performance?

albertz commented on June 4, 2024

Hi,

Yes, a Transformer can train faster. However, our models are often quite big and seem to need longer to converge, and we also have a very fast native CUDA LSTM implementation, so in the end there is not much of a difference. But maybe @kazuki-irie can comment more on that.

For inference, Transformer models are usually slower. Also, there is a quadratic component in the runtime and memory complexity (self-attention scales with the square of the sequence length), which dominates at some point for longer sequences. Maybe @kazuki-irie or @curufinwe can give some more details on that.
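
As a rough illustration of the quadratic component mentioned above, here is a toy operation count for a single layer. This is a sketch under simplifying assumptions, not a measurement from RETURNN: the function names are hypothetical, `dim` stands for the model dimension, the LSTM is unidirectional with equal input and hidden size, and feed-forward sublayers, multi-head details, and hardware effects are all ignored.

```python
def self_attention_cost(seq_len: int, dim: int) -> int:
    """Approximate multiply-adds for one self-attention layer over a whole sequence."""
    projections = 4 * seq_len * dim * dim    # Q, K, V and output projections (linear in seq_len)
    attention = 2 * seq_len * seq_len * dim  # QK^T scores plus weighted sum over V (quadratic)
    return projections + attention


def lstm_cost(seq_len: int, dim: int) -> int:
    """Approximate multiply-adds for one unidirectional LSTM layer over a whole sequence."""
    # 4 gates, each with an input matrix and a recurrent matrix of size dim x dim
    return 8 * seq_len * dim * dim


if __name__ == "__main__":
    dim = 512
    for seq_len in (256, 1024, 4096):
        ratio = self_attention_cost(seq_len, dim) / lstm_cost(seq_len, dim)
        print(f"seq_len={seq_len}: attention/LSTM cost ratio = {ratio:.2f}")
```

Under these assumptions the two counts meet at `seq_len = 2 * dim`, and beyond that the quadratic term grows faster than anything in the LSTM. Real inference speed also depends on the number of layers, batching, and the kernel implementation, so this only shows the scaling trend.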

kazuki-irie commented on June 4, 2024

I suppose the original question was about encoder-decoder ASR models (not language models, correct?).
So I do not think I have anything I can add to @albertz's answer.

albertz commented on June 4, 2024

Ah, sorry, I somehow assumed language models. But for the ASR models the situation is very similar, so what I said should hold there as well. For training times, you can also see our ASRU paper, where we compare Transformer vs LSTM for ASR.

scharoun commented on June 4, 2024

@albertz Thanks for your reply! Did you compare inference time?

albertz commented on June 4, 2024

We did, but I'm not sure if we have some tables showing systematic comparisons. But what I said is also what we observed in experiments:

For inference, Transformer models are usually slower. Also, there is a quadratic component in the runtime and memory complexity (self-attention scales with the square of the sequence length), which dominates at some point for longer sequences.

The quadratic component cannot really be changed unless you change the model, so the original model will never work on long sequences. But there are various solutions that modify the Transformer model to get rid of the quadratic component.
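
To make one such modification concrete, the sketch below counts how many (query, key) pairs are scored under full self-attention versus attention restricted to a local window. Windowed attention is only one example of this family of changes and is an assumption here, not necessarily the approach referred to in this thread. The function name and the `window` parameter are hypothetical.

```python
from typing import Optional


def attended_pairs(seq_len: int, window: Optional[int] = None) -> int:
    """Count (query, key) pairs scored by self-attention.

    window=None means full attention (every query sees every key).
    Otherwise each query only sees keys at most `window` positions away.
    """
    if window is None:
        return seq_len * seq_len
    total = 0
    for q in range(seq_len):
        lo = max(0, q - window)            # first visible key position
        hi = min(seq_len - 1, q + window)  # last visible key position
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    for seq_len in (1000, 2000, 4000):
        full = attended_pairs(seq_len)
        local = attended_pairs(seq_len, window=10)
        print(f"seq_len={seq_len}: full={full}, windowed={local}")
```

With a fixed window the pair count grows roughly linearly with the sequence length instead of quadratically. Whether recognition quality holds up under such a restriction is a separate question.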

Even for a fixed maximum sequence length, the Transformer model is slower and takes more memory. This can be reduced by using less self-attention in the model, but the question is how to do that while still keeping good performance. @kazuki-irie is working on that.

scharoun commented on June 4, 2024

@albertz Thank you, I get it.
