pbloem / former

999.0 23.0 168.0 34.94 MB

Simple transformer implementation from scratch in pytorch.

Home Page: http://peterbloem.nl/blog/transformers

License: MIT License

Python 100.00%

machine-learning transformer pytorch

former's Issues

Question about k × hk weight matrices

Hi, first of all, thanks for this great explanation.

My question is, in the blog post under the section In Pytorch: complete self-attention it says

but it's actually more efficient to combine these for all heads into three single k×hk matrices

Shouldn't it be three hk × k matrices, since this is what the weights of Linear layers look like?
Thanks

Comparing SelfAttention classes

Hello.

First of all, thanks a lot for your post Transformers from scratch, it is one of the best and most complete explanations about it that I've read.
I have a question regarding your implementation of the SelfAttention class, specifically the computation of queries, keys and values for all heads, for example:

self.tokeys = nn.Linear(emb, emb * heads, bias=False)

I understood that as projecting an input X with emb dimensions into `heads` vectors of emb dimensions each.

However, other implementations define something such as:
self.dk = emb // heads
self.tokeys = nn.Linear(emb, emb)
and then perform something like:
keys = self.tokeys(x).view(nbatches, -1, heads, self.dk).transpose(1, 2)

which I understood as projecting an input X with emb dimensions and separating heads subvectors of dk dimensions in order to perform the attention operation. This is based on The Annotated Transformer code.

Did I understand your implementation correctly? If yes, are these implementations equivalent? Is there a significant difference between them in terms of "architectural power"?

Sorry for the long question.
Thanks in advance!
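
A rough way to compare the two variants quoted above is to count the weights in the key projection. The sketch below uses plain Python arithmetic rather than PyTorch; the `emb` and `heads` values are illustrative assumptions, the helper names are hypothetical, and bias terms are ignored.

```python
def wide_key_params(emb: int, heads: int) -> int:
    # nn.Linear(emb, emb * heads, bias=False): every head gets a full emb-dimensional key.
    return emb * (emb * heads)

def narrow_key_params(emb: int, heads: int) -> int:
    # nn.Linear(emb, emb): the output is later split into `heads` chunks of size emb // heads.
    assert emb % heads == 0, "emb must be divisible by heads"
    return emb * emb

emb, heads = 128, 8  # illustrative values, not taken from the repository
wide = wide_key_params(emb, heads)
narrow = narrow_key_params(emb, heads)
per_head_dim_wide = emb
per_head_dim_narrow = emb // heads
print(wide, narrow, wide // narrow)
```

Under these assumptions the two are not the same computation: the wide version has `heads` times as many projection weights, and each head works with `emb` dimensions instead of `emb // heads`.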

token_embedding for non-text sequences

Hi Peter,

Thanks for the insightful blog on how to build transformers from scratch. I'm experiencing what's more likely to be a user error than an actual code issue and was hoping you could provide me with a pointer on how to go about it.

In brief, I'm trying to perform sequence classification on multi-feature, non-text sequences. Specifically, each sequence is 5 features by 100 timepoints large and has one label. The data points include discrete locations in 2D space, cf. positions on a chessboard, and are all integers. The main issue probably resides in the fact that I'm not presenting the data correctly. During the first forward pass of the training data, when generating the token embedding (tokens = self.token_embedding(x) in Transformer), I'm getting:

File "xxx/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 1852, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

What's unclear to me is whether this issue is due to mismatching tensor sizes or whether my particular dataset is incompatible with the typical use case of nn.Embedding. For completeness, self.token_embedding is 176 by 5, i.e. the number of unique rows/tokens within the dataset (hypothetically, the vocabulary size) vs. the number of features (hypothetical embedding size). Any pointers would be much appreciated.

Best,
Arjen
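
One possible reading of the error above: `nn.Embedding(num_embeddings, dim)` is a lookup table, and it raises an `IndexError` whenever an input index falls outside `0 .. num_embeddings - 1`. A minimal sketch in plain Python, assuming each 5-feature timepoint is treated as one token; the `build_vocab` and `encode` helpers and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, ...]

def build_vocab(sequences: List[List[Row]]) -> Dict[Row, int]:
    """Assign each distinct feature row a contiguous id starting at 0."""
    vocab: Dict[Row, int] = {}
    for seq in sequences:
        for row in seq:
            if row not in vocab:
                vocab[row] = len(vocab)
    return vocab

def encode(seq: List[Row], vocab: Dict[Row, int]) -> List[int]:
    """Turn a sequence of feature rows into a sequence of token ids."""
    return [vocab[row] for row in seq]

# Toy data: two sequences of three timepoints, five integer features each.
data = [
    [(1, 4, 0, 2, 7), (1, 5, 0, 2, 7), (1, 4, 0, 2, 7)],
    [(3, 3, 1, 0, 0), (1, 5, 0, 2, 7), (8, 8, 8, 8, 8)],
]
vocab = build_vocab(data)
ids = [encode(seq, vocab) for seq in data]
print(len(vocab), ids)
```

Every id is then below `len(vocab)`, so an embedding table with `len(vocab)` rows can index it safely. The input would hold one integer per timepoint, i.e. shape (batch, time) rather than (batch, time, features).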

AttributeError: module 'torch' has no attribute 'triu_indices'

When I run the code and set the mask to True, the following problem occurs:

AttributeError: module 'torch' has no attribute 'triu_indices'

def mask_(matrices, maskval=0.0, mask_diagonal=True):
    b, h, w = matrices.size()
    indices = torch.triu_indices(h, w, offset=0 if mask_diagonal else 1)
    matrices[:, indices[0], indices[1]] = maskval

High compute time, what is a reasonable generator model to get somewhat good results to play with?

Great resource and course! Thanks for the great work!

I converted parts of this repository to Google Colab and just ran the generate.py which would take something like 1-2 days to compute (~32-48 hrs). What is a reasonable num_batches value to train the generator model?

Regards,
Juergen

Hi, question about the sliced-up version of self-attention

In the blog you say there is a more efficient way of implementing it ("see lecture at the top"). Do you mean the YouTube video at the top?

But there is no code explanation in the video. Do I have to watch the video and implement it myself, or are there any blog posts about the more efficient way of doing self-attention?

thanks a lot!

Using the trained model

How do I use the trained model to generate text?
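
A minimal sketch of autoregressive sampling, assuming the trained model maps a sequence of token ids to a list of logits for the next token. The `next_token_logits` callable and the `sample` and `generate` helpers are hypothetical stand-ins, not the repository's API.

```python
import math
import random
from typing import Callable, List, Sequence

def sample(logits: Sequence[float], temperature: float = 1.0) -> int:
    """Sample an index from softmax(logits / temperature); temperature 0 means argmax."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return random.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(next_token_logits: Callable[[List[int]], Sequence[float]],
             seed: List[int], length: int, context: int,
             temperature: float = 1.0) -> List[int]:
    """Repeatedly feed the last `context` tokens to the model and append a sampled token."""
    tokens = list(seed)
    for _ in range(length):
        logits = next_token_logits(tokens[-context:])
        tokens.append(sample(logits, temperature))
    return tokens

# Stand-in "model": always prefers the token after the last one (mod 4).
def toy_model(tokens: List[int]) -> List[float]:
    return [5.0 if i == (tokens[-1] + 1) % 4 else 0.0 for i in range(4)]

out = generate(toy_model, seed=[0], length=6, context=3, temperature=0.0)
print(out)
```

With a real model, the stand-in would be replaced by a forward pass over the last `context` tokens, and the generated ids would be mapped back to characters or words.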

slide 50, v and q mixed up

where? slide 50, https://dlvu.github.io/slides/dlvu.lecture12.pdf

typo: the v and q input variables of source-attention layer are wrong.

"... In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder..." (Vaswani et al., 2017, p. 5)

Blog is down

The linked post at http://peterbloem.nl/blog/transformers does not resolve. Maybe a DNS record issue?

Module import error

$ python experiments/classify.py

Traceback (most recent call last):
  File "experiments/classify.py", line 1, in <module>
    import former
  File "/home/yky/miniconda3/envs/former/lib/python3.7/site-packages/former/__init__.py", line 1, in <module>
    from .modules import SelfAttention, SelfAttentionWide, TransformerBlock, SelfAttentionRelative, SelfAttentionNarrow
  File "/home/yky/miniconda3/envs/former/lib/python3.7/site-packages/former/modules.py", line 1, in <module>
    from .util import mask_, d, slice_diag
ModuleNotFoundError: No module named 'former.util'

ModuleNotFoundError: No module named 'past'

I am practicing pytorch and got an error as follows.

Traceback (most recent call last):
  File "D:/Programing/tensorboardtest.py", line 1, in <module>
    from torch.utils.tensorboard import SummaryWriter
  File "C:\Users\jjong\AppData\Roaming\Python\Python36\site-packages\torch\utils\tensorboard\__init__.py", line 6, in <module>
    from .writer import FileWriter, SummaryWriter # noqa F401
  File "C:\Users\jjong\AppData\Roaming\Python\Python36\site-packages\torch\utils\tensorboard\writer.py", line 18, in <module>
    from ._convert_np import make_np
  File "C:\Users\jjong\AppData\Roaming\Python\Python36\site-packages\torch\utils\tensorboard\_convert_np.py", line 12, in <module>
    from caffe2.python import workspace
  File "C:\Users\jjong\AppData\Roaming\Python\Python36\site-packages\caffe2\python\workspace.py", line 15, in <module>
    from past.builtins import basestring
ModuleNotFoundError: No module named 'past'

Can any expert help me solve this?

ModuleNotFoundError: No module named 'past'

fix: pip install future

Accuracies on the examples

Hi, I wonder whether you can provide some reference accuracy for the experiments you have put on the page: https://github.com/pbloem/former/tree/master/experiments

That would help me understand how well this model performs.

Thanks

conda installation failed

Thanks for the nice post on transformer!

However, this repo seems to fail to install with the current conda.

ResolvePackageNotFound:
  - openssl==1.1.1c=h7b6447c_1
  - ncurses==6.1=he6710b0_1
  - libstdcxx-ng==9.1.0=hdf63c60_0
  - zlib==1.2.11=h7b6447c_3
  - libedit==3.1.20181209=hc058e9b_0
  - libffi==3.2.1=hd88cf55_4
  - tk==8.6.8=hbc83047_0
  - sqlite==3.29.0=h7b6447c_0
  - readline==7.0=h7b6447c_5
  - xz==5.2.4=h14c3975_4
  - python==3.7.4=h265db76_1
  - libgcc-ng==9.1.0=hdf63c60_0

Why the use of log_softmax ?

I'm trying to understand why you used log_softmax instead of softmax, as it will not produce a probability distribution that sums to 1 in the end. Am I missing something?
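
A short numerical check, written in plain Python rather than PyTorch: log-softmax returns the logarithm of the softmax probabilities, so exponentiating it recovers a distribution that sums to 1. The `log_softmax` helper here is a hand-written stand-in, not the library function, and the logits are made up.

```python
import math
from typing import List

def log_softmax(logits: List[float]) -> List[float]:
    """Compute log(softmax(logits)) with the usual max-subtraction for stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

logits = [2.0, 1.0, 0.1]
logp = log_softmax(logits)
probs = [math.exp(v) for v in logp]  # back to ordinary probabilities
target = 0
nll = -logp[target]                  # negative log-likelihood of the target class
print(logp, sum(probs), nll)
```

Log-probabilities are what a negative log-likelihood loss such as PyTorch's `F.nll_loss` expects, and staying in log space avoids taking the log of very small probabilities.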

Calculation of memory required in "Going big"

For a sequence length t, this is a dense matrix containing t^2 elements. At standard 32-bit precision, and with t=1000 a batch of 16 such matrices takes up about 250Mb of memory.

Just a small question: how do we arrive at 250 Mb?

A single matrix would be 1 million elements, each of which is 4 bytes, is my understanding correct?
So a single matrix would require 4M bytes, or 4 Mb. And then we would need 16 of them for a batch, correct?
Then wouldn't this result in 64 Mb in total?

Just trying to check my understanding here. Thanks so much for the great article!
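
The arithmetic in the question can be checked directly. This sketch counts only one stored copy of the batch of t × t matrices; it does not model multiple heads or anything kept for the backward pass, which are outside what the quoted sentence states.

```python
t = 1000             # sequence length
batch = 16           # matrices per batch
bytes_per_float = 4  # 32-bit precision

elements = batch * t * t
total_bytes = elements * bytes_per_float
print(total_bytes / 1e6, "MB")     # decimal megabytes
print(total_bytes / 2**20, "MiB")  # binary mebibytes
```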

Question about the tool used to make the figures in the blog post

Hello, first thank you very much for the awesome post and the code. Clearly explained and illustrated.

I must say, the illustrations in the blog post are beautiful. I was wondering what tool you used to make them.

Sorry for this unrelated question, feel free to close this issue if this is not the place to ask it.
Thanks.

Masking done for the upper or lower triangle?

Hi,

I believe the comment should be masking the upper not the lower triangle of the matrix? Both the blog and the code referred to the upper half of the matrix.

Thanks.

former/former/modules.py

Lines 58 to 59 in 7b12ae6

    
    if self.mask: # mask out the lower half of the dot matrix,including the diagonal
        mask_(dot, maskval=float('-inf'), mask_diagonal=False)

No Field in torchtext.data

Thank you for the super explanation of the transformer!

I am a beginner and trying to learn the code. I notice that both torchtext.legacy.data and torchtext.data.Field are deprecated. I get the error message AttributeError: module 'torchtext.data' has no attribute 'Field'.

Is there a solution for this error?

I highly appreciate your feedback!

Einsum to avoid transpose and reshape

Hi,

Thank you for the great post about Transformers.
Actually, you can avoid transpose/reshape using torch.einsum.

Here is an example that behaves exactly as your implementation (except mask=True, and asserts 😊):

def forward_einsum(self, x):
    b, t, e = x.size()
    h = self.heads

    keys    = self.tokeys(x).view(b, t, h, e)
    queries = self.toqueries(x).view(b, t, h, e)
    values  = self.tovalues(x).view(b, t, h, e)

    dot = torch.einsum('bthe,bihe->bhti', queries, keys) / math.sqrt(e)
    dot = F.softmax(dot, dim=-1)

    out = torch.einsum('bhtd,bdhe->bthe', dot, values)

    # we can move reshape of weights to init; I left it here just to compare with the original implementation
    out = torch.einsum('bthe,khe->btk', out, self.unifyheads.weight.view(e,h,e)) 
    return out + self.unifyheads.bias

Although the code becomes very short, it's probably hard to understand for people who don't know einsum notation, so this is probably not the best code to explain the idea 😊.

Issue with masking

Thanks for your great tutorial. I think your code on GitHub is probably fine, but I believe there is an error in the mask as defined in the post.

I believe the masking line should read

indices = torch.triu_indices(t, t, offset=1)

In the current implementation, dot should be t x t. Furthermore, an offset of 0 will create a row of all -inf, which gives a complete row of NaNs when fed into softmax. See pytorch/pytorch#24816
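
The NaN behaviour described above can be reproduced without PyTorch. The sketch below is a plain-Python stand-in: `triu_mask` marks the positions where column >= row + offset (the positions an upper-triangular index with that offset selects), and `masked_softmax` is a hand-written softmax; both helper names are hypothetical.

```python
import math
from typing import List

NEG_INF = float("-inf")

def triu_mask(t: int, offset: int) -> List[List[bool]]:
    """True where column >= row + offset, i.e. on or above the offset diagonal."""
    return [[col >= row + offset for col in range(t)] for row in range(t)]

def masked_softmax(row: List[float], mask: List[bool]) -> List[float]:
    vals = [NEG_INF if m else v for v, m in zip(row, mask)]
    mx = max(vals)
    # If the whole row is masked, -inf - (-inf) is nan and the result is all nan.
    exps = [math.exp(v - mx) for v in vals]
    total = sum(exps)
    return [e / total for e in exps]

t = 3
scores = [[0.0] * t for _ in range(t)]

with_diag = [masked_softmax(r, m) for r, m in zip(scores, triu_mask(t, offset=0))]
without_diag = [masked_softmax(r, m) for r, m in zip(scores, triu_mask(t, offset=1))]
print(with_diag[0])
print(without_diag)
```

With offset 0 the first row has no unmasked entry, so its softmax is undefined; with offset 1 every row keeps at least its diagonal element.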

Weight matrix dimensions after transformation

This might be a dumb question.
Could you explain the dimensions of the weight matrix after transforming the input with K,Q,V?

I tried doing it myself. Am I overlooking an obvious error here?
Assuming a batch size of 1 for simplicity.

Understanding multiheaded attention.

In the blog post you say we can implement multi-headed attention by splitting the input into chunks and processing each chunk separately. I tried to write down the equation for what this might look like. Say we have 4 vectors w, x, y, z. We use 3 heads, so we can write x as x1, x2, x3, and likewise for y, w and z. The attention weight for x_i, y_j is a_ij. Let capital X be [x1, x2, x3], let Axy be [ax1y1, ax2y2, ax3y3], and so on, and let * represent element-wise multiplication. Then we can write the transformed x after applying the attention weights as

x' = Axx * X + Axy * Y + Axw * W + Axz * Z. This is with three heads.

Now, if we weren't splitting it into chunks, we would do

x' = axx'*X + axy'*Y + axw'*W + axz'*Z with multiple heads so here say 3 (here small a means it is a scalar). Then we would have 3 such values and if we take the mean to get the values of x_final then we would get (axx'1 + axx'2 + axx'3) / 3 and hence we can write

x' = axx'm * X + axy'm * Y + axw'm * W + axz'm * Z where axy'm is (ax1y1m + ax2y2m + ax3y3m) / 3

So in the first case we are multiplying different dimensions of A with different weights while in the second case we are multiplying all dimensions of x by the same number. Can you explain this part please. I hope my question is clear

improve performance in deeper network using your multi head attention code

Hi Mr. Peter,
Thanks for the insightful blog on how to build transformers from scratch. I am a master's student and I used your code for my thesis. In my code, after the embedding layer (pretrained embedding + position embedding), I used only a multi-head attention layer, whose output was given to a Bi-LSTM network as input.
But the results do not improve on the accuracy of the single Bi-LSTM model. What do you think is the reason for this problem? How do I fix it?
thanks.

data.Field no longer supported in torchtext

line 11 in classify.py should be changed to: "from torchtext.legacy import data, datasets, vocab"

Is narrow attention implemented correctly?

Hey man! Thanks so much for writing this amazing exposition on transformers. I understand things so much better because of you. And the einsum link just changed my life for ever.

I had a doubt regarding the Narrow Self-Attention implementation...

In former/modules.py, class SelfAttentionNarrow you have the following code for __init__:

former/former/modules.py

Lines 90 to 95 in 6a3295c

    
    s = emb // heads
    # - We will break the embedding into `heads` chunks and feed each to a different attention head

    self.tokeys    = nn.Linear(s, s, bias=False)
    self.toqueries = nn.Linear(s, s, bias=False)
    self.tovalues  = nn.Linear(s, s, bias=False)

Now in the forward method you do:

former/former/modules.py

Lines 105 to 110 in 6a3295c

    
    s = e // h
    x = x.view(b, t, h, s)

    keys    = self.tokeys(x)
    queries = self.toqueries(x)
    values  = self.tovalues(x)

Does this not mean that you only have one attention head rather than h of them? As all the h s-sized pieces of x are going through the same linear layer.
From my understanding there should be h Linear layers of size s x s each. Or a single weight matrix of shape h x s x s depending on your implementation.
Am I going wrong somewhere? It would be really awesome if you clarify.
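
The difference the question points at can be sketched in plain Python. Applying one s × s layer to a tensor of shape (b, t, h, s) multiplies every chunk by the same matrix, whereas per-head weights would use a separate matrix per chunk. The helper names and the tiny matrices below are hypothetical illustrations.

```python
from typing import List

Matrix = List[List[float]]

def matvec(w: Matrix, v: List[float]) -> List[float]:
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def shared_projection(x: List[float], w: Matrix, heads: int) -> List[List[float]]:
    """Split x into `heads` chunks and apply the same s-by-s matrix to each chunk."""
    s = len(x) // heads
    return [matvec(w, x[i * s:(i + 1) * s]) for i in range(heads)]

def per_head_projection(x: List[float], ws: List[Matrix]) -> List[List[float]]:
    """Split x into len(ws) chunks and apply a separate s-by-s matrix to each chunk."""
    s = len(x) // len(ws)
    return [matvec(w, x[i * s:(i + 1) * s]) for i, w in enumerate(ws)]

x = [1.0, 2.0, 3.0, 4.0]      # emb = 4, two heads, s = 2
w = [[0.0, 1.0], [1.0, 0.0]]  # one 2x2 matrix that swaps the two entries
w2 = [[2.0, 0.0], [0.0, 2.0]] # a second, different matrix

shared = shared_projection(x, w, heads=2)
separate = per_head_projection(x, [w, w2])
shared_params = 2 * 2         # s * s
separate_params = 2 * 2 * 2   # h * s * s
print(shared, separate)
```

In the shared variant the heads still produce different keys, queries and values because they see different chunks of the input, but they do so with tied weights. Whether that is intended is a question for the author.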

Why is dividing by e**(1/4) for both keys and queries more memory efficient?

Hi,

Would you mind explaining why the following code is more memory efficient than just dividing one of them by sqrt(e)?

former/former/modules.py

Lines 48 to 52 in 7b12ae6

    
    queries = queries / (e ** (1/4))
    keys    = keys / (e ** (1/4))
    # - Instead of dividing the dot products by sqrt(e), we scale the keys and values.
    #   This should be more memory efficient

Thank you.
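
A small check in plain Python that the two scalings give the same dot product, together with one plausible reading of the memory comment: the division touches the queries and keys (b × t × e elements each) rather than the b × t × t dot-product matrix, which is larger whenever t > 2e. That interpretation is an assumption, not something the source confirms, and the sizes below are illustrative.

```python
import math

e = 64
q = [0.5] * e
k = [2.0] * e

# Scale queries and keys by e ** (1/4) each, then take the dot product.
scale = e ** (1 / 4)
dot_pre = sum((qi / scale) * (ki / scale) for qi, ki in zip(q, k))

# Take the dot product first, then divide by sqrt(e).
dot_post = sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(e)

# Number of elements each approach has to divide.
b, t = 16, 1000
pre_scaled_elements = 2 * b * t * e   # queries and keys
post_scaled_elements = b * t * t      # dot-product matrix
print(dot_pre, dot_post, pre_scaled_elements, post_scaled_elements)
```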

ImportError: attempted relative import with no known parent package

Hi @pbloem,
Thank you for a great blog post. Running

python experiments/classify.py

on my regular conda environment gives the output

(MyDeeplearningEnv) PS C:\Users\Ruben\Desktop\former> python experiments/classify.py
Traceback (most recent call last):
  File "experiments/classify.py", line 1, in <module>
    from _context import former
  File "C:\Users\Ruben\Desktop\former\experiments\_context.py", line 7, in <module>
    import former
  File "C:\Users\Ruben\Desktop\former\former\__init__.py", line 1, in <module>
    from .modules import SelfAttention, SelfAttentionWide, TransformerBlock, SelfAttentionRelative, SelfAttentionNarrow
  File "C:\Users\Ruben\Desktop\former\former\modules.py", line 1, in <module>
    from former import util
  File "C:\Users\Ruben\Desktop\former\former\util\__init__.py", line 1, in <module>
    from .util import mask_, d, here, contains_nan, tic, toc, \
  File "C:\Users\Ruben\Desktop\former\former\util\util.py", line 6, in <module>
    import transformers as trf
  File "C:\Users\Ruben\Desktop\former\former\transformers.py", line 5, in <module>
    from .modules import TransformerBlock
ImportError: attempted relative import with no known parent package

I tried creating a clean environment using

pip install torch tb-nightly tqdm numpy torchtext
pip install future

This resulted in more f-string errors, which I removed, and then I got the same error as above. Creating an environment from the environment.yml fails to resolve.

Any help is appreciated.

How to tokenize a testing phrase

Hi everyone! Thanks a lot for this nice tutorial and code for learning transformers!

I am trying to recreate the sample of the tutorial:

https://peterbloem.nl/blog/transformers

And I was able to train and serialize a model for the IMDB Dataset.

Currently, I want to test the model with new validation phrases. Nevertheless, I cannot find a way to tokenize the phrase into the required data shape, as in the provided sample:

#Load dataset
tdata, _ = datasets.IMDB.splits(TEXT, LABEL)
train, test = tdata.split(split_ratio=0.8)

#Preprocess data
TEXT.build_vocab(train, max_size=50_000 - 2)
LABEL.build_vocab(train)

#Create iterators
train_iter, test_iter = data.BucketIterator.splits((train, test), batch_size=4, device=util.d())

I see that the tokens are generated in some part of the BucketIterator (or the dataset itself):

for batch in tqdm.tqdm(test_iter):

    input = batch.text[0]
    label = batch.label - 1

As in the dataset , I can see the phrases separated into words:

print(test_iter.data()[0].text)
print(test_iter.data()[0].label)

generates:

['i', "wouldn't", 'rent', 'this', 'one', 'even', 'on', 'dollar', 'rental', 'night.']
neg

So, if I want to test a phrase in the model, like:

#Try the model
input = ["this", "movie", "is", "incredible", "boring"]

How can I tokenize the words correctly to feed them into the model?

Thanks in advance for your response.

Greetings!
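
A minimal sketch of the usual approach: reuse the vocabulary built at training time to map each word to its integer index, falling back to an unknown-word index. The `stoi` dictionary below is a small stand-in for the string-to-index mapping of the trained vocabulary (with legacy torchtext this would be `TEXT.vocab.stoi`, assuming the same `TEXT` field is still available); the `encode` helper and the index values are hypothetical. Lowercasing is assumed because the tokens printed above are lowercase.

```python
from typing import Dict, List

def encode(words: List[str], stoi: Dict[str, int], unk_index: int = 0) -> List[int]:
    """Lowercase each word and look up its index, using unk_index for unseen words."""
    return [stoi.get(w.lower(), unk_index) for w in words]

# Stand-in vocabulary; index 0 is reserved for unknown words, 1 for padding.
stoi = {"<unk>": 0, "<pad>": 1, "this": 2, "movie": 3, "is": 4, "boring": 5}

phrase = ["this", "movie", "is", "incredible", "boring"]
ids = encode(phrase, stoi)
batch = [ids]  # a batch containing one sequence of five token ids
print(batch)
```

The nested list would then be converted to an integer tensor in the same layout as `batch.text[0]` before being passed to the model.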

Weight scaling should be for keys not values?

Hi @pbloem ,

Thanks for the great post. Best I've read on this topic.

Reading through the code, when performing weight scaling, I wonder whether the line here should be scaling keys not values?

former/former/modules.py

Lines 48 to 49 in 2077d00

    
    queries = queries / (e ** (1/4))
    values  = values  / (e ** (1/4))

A couple of lines down from here we have

former/former/modules.py

Line 54 in 2077d00

dot = torch.bmm(queries, keys.transpose(1, 2))

Best.

How can we visualize the self-attention map

Hi, @pbloem ,

How can we visualize the self-attention map? Could you provide the corresponding code snippet?

Thanks!
