
spe's People

Contributors

ahmdtaha, aliutkus, cifkao, slseanwu

spe's Issues

Pre-trained models

I would like to ask for the pre-trained models from the pop piano experiment used in the paper. Could you please provide them?
Thanks

Wrong axis in JAX SPE summation

For the JAX implementation: on line 210 of spe.py, should the axis summed over be -1 instead of -2? When using -2, the size of the last output dimension is num_realizations rather than the query/key dimension:

return (spe[:, :keys.shape[1]] * keys[..., None]).sum(axis=-1)
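
For illustration, here is a minimal shape check. All sizes are made up, and the shapes of keys and spe are my assumptions based on the behaviour described above:

import jax.numpy as jnp

# hypothetical sizes, just to make the shapes concrete
batch, seq, keys_dim, num_realizations = 2, 10, 32, 64
keys = jnp.ones((batch, seq, keys_dim))
spe = jnp.ones((batch, seq + 4, keys_dim, num_realizations))

prod = spe[:, :keys.shape[1]] * keys[..., None]  # (batch, seq, keys_dim, num_realizations)
print(prod.sum(axis=-2).shape)  # (2, 10, 64): last dim is num_realizations
print(prod.sum(axis=-1).shape)  # (2, 10, 32): last dim is keys_dim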

Package release

Once the paper is published, we should put the packages on PyPI.
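
As a rough sketch, assuming the repository already carries standard packaging metadata (a setup.py or pyproject.toml) and that the build and twine tools are installed, the release flow could be as simple as:

python -m build
python -m twine upload dist/*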

Very slow algorithm, is that normal?

Hello,

I implemented the algorithm in the Vision Transformer architecture in the following way:

# inside __init__()
self.spe = SineSPE(num_heads=head_cnt, in_features=in_dim, num_sines=5, num_realizations=64)
self.filter = SPEFilter(gated=False, code_shape=self.spe.code_shape)

# inside forward()
q, k = self.filter(q, k, self.spe(q.shape[:2]))
qk, kp = performer(...)
out = lin_attention(...)

The model I am using has 4 layers, 6 heads, embedding dimension 384, and patch_size=4.

Training for 100 epochs on CIFAR-100 converges to 42.3% with SPE and 45.3% without. Although the accuracy gap can be expected, with SPE the training time is around 6x longer. Is that normal?
Performers + ViT takes 39 minutes.
Performers + ViT + SPE takes around 4 hours.
For both I am using 2 Titan XP GPUs.

This is very problematic for me because I was considering scaling these experiments up to ImageNet.

I would also like to know how I can implement the indexing T = N^2 for images (where did you do this in the LRA benchmark?), as described in Section 2 of the paper.
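
For clarity, here is a sketch of what I have in mind, assuming the patch grid is flattened row-major and reusing the SineSPE/SPEFilter calls from my snippet above (img_size, patch_size, and batch_size are my own variables):

# hypothetical: flatten an N x N patch grid into a sequence of T = N**2 positions
N = img_size // patch_size  # e.g. 32 // 4 = 8 patches per side on CIFAR
T = N * N                   # T = N**2 positions, row-major order
q, k = self.filter(q, k, self.spe((batch_size, T)))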

Many thanks!
