pbloem / former
Simple transformer implementation from scratch in pytorch.
Home Page: http://peterbloem.nl/blog/transformers
License: MIT License
Hi, first of all, thanks for this great explanation.
My question is about the blog post section "In Pytorch: complete self-attention", where it says
but it's actually more efficient to combine these for all heads into three single k×hk matrices
Shouldn't it be three hk × k matrices, since that is what the weights of Linear layers look like?
Thanks
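On the shape question above: PyTorch documents the weight of nn.Linear as having shape (out_features, in_features) and applies it as y = x Aᵀ. A stored hk × k weight and a k × hk matrix multiplied on the right of a row vector therefore describe the same map. A minimal pure-Python sketch of that equivalence follows; the `linear` helper and the toy sizes are illustrative assumptions, not code from the repository.

```python
# Toy sizes: embedding dimension k and number of heads h (illustrative values).
k, h = 3, 2

def linear(x, weight):
    """Bias-free linear layer following the documented nn.Linear convention:
    weight has shape (out_features, in_features) and y = x @ weight.T."""
    return [sum(w_ji * x_i for w_ji, x_i in zip(row, x)) for row in weight]

# Stored weight: hk rows, k columns (the "hk x k" view).
weight = [[float(r * k + c) for c in range(k)] for r in range(h * k)]

# Its transpose is the "k x hk" matrix that multiplies a row vector on the right.
weight_t = [[weight[r][c] for r in range(h * k)] for c in range(k)]

x = [1.0, 2.0, 3.0]
y_stored = linear(x, weight)
y_right = [sum(x[i] * weight_t[i][j] for i in range(k)) for j in range(h * k)]

print(len(weight), len(weight[0]))      # 6 3
print(len(weight_t), len(weight_t[0]))  # 3 6
print(y_stored == y_right)              # True
```

Both descriptions are consistent; they differ only in whether the matrix is written as it is stored or as it is applied.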
Hello.
First of all, thanks a lot for your post Transformers from scratch, it is one of the best and most complete explanations about it that I've read.
I have a question regarding your implementation of the SelfAttention class, specifically the computation of the queries, keys and values for all heads, for example:
self.tokeys = nn.Linear(emb, emb * heads, bias=False)
I understood that as projecting an input X with emb dimensions into heads vectors of emb dimensions each.
However, other implementations define something such as:
self.dk = emb // heads
self.tokeys = nn.Linear(emb, emb)
and then perform something like:
keys = self.tokeys(x).view(nbatches, -1, heads, self.dk).transpose(1, 2)
which I understood as projecting an input X with emb dimensions and separating heads subvectors of dk dimensions in order to perform the attention operation. This is based on The Annotated Transformer code.
Did I understand your implementation correctly? If yes, are these implementations equivalent? Is there a significant difference between them in terms of "architectural power"?
Sorry for the long question.
Thanks in advance!
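To make the comparison in the question above concrete, the sketch below counts the projection weights for the two layouts it describes: a "wide" projection `nn.Linear(emb, emb * heads)` and a "narrow" one `nn.Linear(emb, emb)` whose output is split into `heads` chunks of size `emb // heads`. The sizes and function names are illustrative assumptions, and biases are ignored.

```python
def projection_weights(emb, heads, wide):
    """Number of weights in one key/query/value projection (bias ignored)."""
    out_features = emb * heads if wide else emb
    return emb * out_features

def per_head_dim(emb, heads, wide):
    """Size of the key vector that each head works with."""
    if wide:
        return emb
    assert emb % heads == 0, "the narrow layout needs emb divisible by heads"
    return emb // heads

emb, heads = 128, 8  # illustrative values

wide_weights = projection_weights(emb, heads, wide=True)
narrow_weights = projection_weights(emb, heads, wide=False)

print(wide_weights, per_head_dim(emb, heads, wide=True))      # 131072 128
print(narrow_weights, per_head_dim(emb, heads, wide=False))   # 16384 16
print(wide_weights // narrow_weights)                         # 8
```

Under these assumptions the wide layout has `heads` times as many projection weights, and each head attends over full-size vectors. The two are therefore different parameterizations rather than rewrites of the same computation.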
Hi Peter,
Thanks for the insightful blog on how to build transformers from scratch. I'm experiencing what's more likely to be a user error than an actual code issue and was hoping you could provide me with a pointer on how to go about it.
In brief, I'm trying to perform sequence classification on multi-feature, non-text sequences. Specifically, each sequence is 5 features by 100 timepoints large and has one label. The data points include discrete locations in 2D space, cf. positions on a chessboard, and are all integers. The main issue probably resides in the fact that I'm not presenting the data correctly. During the first forward pass of the training data, when generating the token embedding (tokens = self.token_embedding(x) in Transformer), I'm getting:
File "xxx/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 1852, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
What's unclear to me is whether this issue is due to mismatching tensor sizes or to my particular dataset being incompatible with the typical use case of nn.Embedding. For completeness, self.token_embedding is 176 by 5, i.e. the number of unique rows/tokens within the dataset (hypothetically, the vocabulary size) vs. the number of features (hypothetical embedding size). Any pointers would be much appreciated.
Best,
Arjen
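nn.Embedding is a lookup table: it expects integer indices in the range [0, num_embeddings), and an index outside that range raises the IndexError shown above. One possible way to get there, assuming each unique 5-feature row is meant to be treated as a single token (the "vocabulary" reading in the question), is to map rows to consecutive ids before they reach the embedding layer. The helper names and toy data below are hypothetical.

```python
def build_row_vocab(sequences):
    """Assign a consecutive integer id to every distinct feature row."""
    vocab = {}
    for seq in sequences:
        for row in seq:
            # setdefault keeps the first id assigned to a row.
            vocab.setdefault(tuple(row), len(vocab))
    return vocab

def encode(sequences, vocab):
    """Replace every feature row by its integer id."""
    return [[vocab[tuple(row)] for row in seq] for seq in sequences]

# Toy data: 2 sequences, 3 timepoints each, 2 features per timepoint.
sequences = [
    [[0, 7], [3, 4], [0, 7]],
    [[3, 4], [9, 9], [1, 2]],
]

vocab = build_row_vocab(sequences)
ids = encode(sequences, vocab)
num_embeddings = len(vocab)

print(ids)             # [[0, 1, 0], [1, 2, 3]]
print(num_embeddings)  # 4

# Every id is a valid index into a table with num_embeddings rows.
assert all(0 <= i < num_embeddings for seq in ids for i in seq)
```

With ids like these, the first argument of the embedding layer would be the vocabulary size and the second a freely chosen embedding dimension, which does not have to equal the number of raw features.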
When I run the code and set the mask to True, the following problem occurs:
AttributeError: module 'torch' has no attribute 'triu_indices'
def mask_(matrices, maskval=0.0, mask_diagonal=True):
    b, h, w = matrices.size()
    indices = torch.triu_indices(h, w, offset=0 if mask_diagonal else 1)
    matrices[:, indices[0], indices[1]] = maskval
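The AttributeError suggests that the installed PyTorch build does not provide torch.triu_indices, so upgrading PyTorch is likely the simplest fix. For reference, the index pairs that the call produces can be sketched in plain Python. The helper name `triu_index_lists` is hypothetical; on an older installation its output could presumably be wrapped with `torch.tensor([rows, cols])`.

```python
def triu_index_lists(h, w, offset=0):
    """Row and column indices of all entries with col - row >= offset,
    listed in row-major order (offset=0 includes the main diagonal)."""
    rows, cols = [], []
    for r in range(h):
        for c in range(w):
            if c - r >= offset:
                rows.append(r)
                cols.append(c)
    return rows, cols

# Mask including the diagonal (mask_diagonal=True in mask_ above).
print(triu_index_lists(3, 3, offset=0))  # ([0, 0, 0, 1, 1, 2], [0, 1, 2, 1, 2, 2])

# Mask excluding the diagonal (mask_diagonal=False).
print(triu_index_lists(3, 3, offset=1))  # ([0, 0, 1], [1, 2, 2])
```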
Great resource and course! Thanks for the great work!
I converted parts of this repository to Google Colab and just ran the generate.py which would take something like 1-2 days to compute (~32-48 hrs). What is a reasonable num_batches value to train the generator model?
Regards,
Juergen
In the blog you say there is a more efficient way of implementing this ("see lecture at the top"). Do you mean the YouTube video at the top?
There is no code explanation in the video, though. Do I have to watch the video and implement it myself, or are there any blogs about the more efficient way of doing self-attention?
thanks a lot!
How do I use the trained model to generate text?
where? slide 50, https://dlvu.github.io/slides/dlvu.lecture12.pdf
typo: the v and q input variables of the source-attention layer are wrong.
"... In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder..." (Vaswani et al., 2017, p. 5)
The linked post at http://peterbloem.nl/blog/transformers does not resolve. Maybe a DNS record issue?
$ python experiments/classify.py
Traceback (most recent call last):
File "experiments/classify.py", line 1, in <module>
import former
File "/home/yky/miniconda3/envs/former/lib/python3.7/site-packages/former/__init__.py", line 1, in <module>
from .modules import SelfAttention, SelfAttentionWide, TransformerBlock, SelfAttentionRelative, SelfAttentionNarrow
File "/home/yky/miniconda3/envs/former/lib/python3.7/site-packages/former/modules.py", line 1, in <module>
from .util import mask_, d, slice_diag
ModuleNotFoundError: No module named 'former.util'
I am practicing pytorch and got an error as follows.
Traceback (most recent call last):
File "D:/Programing/tensorboardtest.py", line 1, in <module>
from torch.utils.tensorboard import SummaryWriter
File "C:\Users\jjong\AppData\Roaming\Python\Python36\site-packages\torch\utils\tensorboard\__init__.py", line 6, in <module>
from .writer import FileWriter, SummaryWriter # noqa F401
File "C:\Users\jjong\AppData\Roaming\Python\Python36\site-packages\torch\utils\tensorboard\writer.py", line 18, in <module>
from ._convert_np import make_np
File "C:\Users\jjong\AppData\Roaming\Python\Python36\site-packages\torch\utils\tensorboard\_convert_np.py", line 12, in <module>
from caffe2.python import workspace
File "C:\Users\jjong\AppData\Roaming\Python\Python36\site-packages\caffe2\python\workspace.py", line 15, in <module>
from past.builtins import basestring
ModuleNotFoundError: No module named 'past'
Can any expert help me solve this?
fix: pip install future
Hi, I wonder whether you can provide some reference accuracy for the experiments you have put on the page: https://github.com/pbloem/former/tree/master/experiments
That would help me understand how well this model performs.
Thanks
Thanks for the nice post on transformer!
However, this repo seems to fail to install with current conda.
ResolvePackageNotFound:
- openssl==1.1.1c=h7b6447c_1
- ncurses==6.1=he6710b0_1
- libstdcxx-ng==9.1.0=hdf63c60_0
- zlib==1.2.11=h7b6447c_3
- libedit==3.1.20181209=hc058e9b_0
- libffi==3.2.1=hd88cf55_4
- tk==8.6.8=hbc83047_0
- sqlite==3.29.0=h7b6447c_0
- readline==7.0=h7b6447c_5
- xz==5.2.4=h14c3975_4
- python==3.7.4=h265db76_1
- libgcc-ng==9.1.0=hdf63c60_0
Trying to understand why you used log_softmax instead of softmax, as it will not produce a probability distribution that sums to 1 in the end. Am I missing something?
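On the log_softmax question above: log_softmax returns the logarithms of the softmax probabilities, so exponentiating its output recovers a distribution that sums to 1. A common reason to use it (assumed here, since the source does not confirm which loss the repository pairs it with) is that a negative log-likelihood loss works directly on log-probabilities. A small standard-library sketch with illustrative values:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax: x_i - log(sum_j exp(x_j))."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

logits = [2.0, 0.5, -1.0]  # illustrative class scores
log_probs = log_softmax(logits)

# The log-probabilities themselves are all <= 0 and do not sum to 1 ...
print(all(lp <= 0.0 for lp in log_probs))        # True

# ... but their exponentials form a proper probability distribution.
probs = [math.exp(lp) for lp in log_probs]
print(abs(sum(probs) - 1.0) < 1e-12)             # True

# Negative log-likelihood of a target class is read off directly.
target = 0
nll = -log_probs[target]
print(nll > 0.0)                                 # True
```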
For a sequence length t, this is a dense matrix containing t² elements. At standard 32-bit precision, and with t=1000 a batch of 16 such matrices takes up about 250Mb of memory.
Just a small question: how do we arrive at 250 Mb?
A single matrix would be 1 million elements, each of which is 4 bytes, is my understanding correct?
So a single matrix would require 4M bytes, or 4 Mb. And then we would need 16 of them for a batch, correct?
Then wouldn't this result in 64 Mb in total?
Just trying to check my understanding here. Thanks so much for the great article!
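The arithmetic in the question above can be checked directly. The sketch below reproduces only the assumptions stated in the question (one t × t matrix per batch element, 4 bytes per element); it does not attempt to explain the figure quoted from the blog.

```python
t = 1000              # sequence length
batch_size = 16       # matrices per batch
bytes_per_float = 4   # 32-bit precision

elements_per_matrix = t * t
bytes_per_matrix = elements_per_matrix * bytes_per_float
total_bytes = batch_size * bytes_per_matrix

print(elements_per_matrix)    # 1000000
print(bytes_per_matrix)       # 4000000
print(total_bytes / 1e6)      # 64.0 (megabytes, using 1 MB = 10**6 bytes)
```

Under these assumptions the 64 MB total in the question is correct. Any additional factor, such as one matrix per attention head, would multiply it, but the quoted passage does not say which factors it includes.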
Hello, first thank you very much for the awesome post and the code. Clearly explained and illustrated.
I must say, the illustrations in the blog post are beautiful. I was wondering what tool you used to make them.
Sorry for this unrelated question, feel free to close this issue if this is not the place to ask it.
Thanks.
Hi,
I believe the comment should be masking the upper not the lower triangle of the matrix? Both the blog and the code referred to the upper half of the matrix.
Thanks.
Lines 58 to 59 in 7b12ae6
Thank you for the super explanation of the transformer!
I am a beginner trying to learn the code. I notice that both torchtext.legacy.data and torchtext.data.Field are deprecated. I get the error message AttributeError: module 'torchtext.data' has no attribute 'Field'.
Is there a solution for this error?
I highly appreciate your feedback!
Hi,
Thank you for the great post about Transformers.
Actually, you can avoid transpose/reshape using torch.einsum.
Here is an example that behaves exactly as your implementation (except mask=True, and asserts 😊):
def forward_einsum(self, x):
    b, t, e = x.size()
    h = self.heads
    keys = self.tokeys(x).view(b, t, h, e)
    queries = self.toqueries(x).view(b, t, h, e)
    values = self.tovalues(x).view(b, t, h, e)
    dot = torch.einsum('bthe,bihe->bhti', queries, keys) / math.sqrt(e)
    dot = F.softmax(dot, dim=-1)
    out = torch.einsum('bhtd,bdhe->bthe', dot, values)
    # we can move reshape of weights to init; I left it here just to compare with the original implementation
    out = torch.einsum('bthe,khe->btk', out, self.unifyheads.weight.view(e,h,e))
    return out + self.unifyheads.bias
Although the code becomes very short, it's probably hard to understand for people who don't know einsum notation, so this is definitely not the best code to explain the idea 😊.
Thanks for your great tutorial. I think your code on GitHub is probably fine, but I believe there is an error in the mask as defined in the post.
I believe the masking line should read
indices = torch.triu_indices(t, t, offset=1)
In the current implementation, dot should be t x t. Furthermore, an offset of 0 will create a row of all -inf, which gives a complete row of NaNs when fed into softmax. See pytorch/pytorch#24816
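The effect of the offset described above can be illustrated without PyTorch. The sketch below builds the mask for a small t × t score matrix, sets masked entries to -inf, and applies a max-shifted softmax row by row. The helper names are hypothetical and the scores are placeholder zeros.

```python
import math

def upper_mask(t, offset):
    """True where (row, col) is masked, i.e. where col - row >= offset."""
    return [[(c - r) >= offset for c in range(t)] for r in range(t)]

def softmax(vals):
    """Max-shifted softmax; an all -inf row yields NaNs (inf - inf)."""
    mx = max(vals)
    exps = [math.exp(v - mx) for v in vals]
    total = sum(exps)
    return [e / total for e in exps]

def masked_first_row(t, offset):
    """Softmax over the first row of an all-zero score matrix after masking."""
    mask = upper_mask(t, offset)
    row = [-math.inf if masked else 0.0 for masked in mask[0]]
    return softmax(row)

def fully_masked_rows(t, offset):
    """Indices of rows in which every position is masked."""
    return [r for r, row in enumerate(upper_mask(t, offset)) if all(row)]

print(fully_masked_rows(3, offset=0))   # [0]
print(fully_masked_rows(3, offset=1))   # []

print(all(math.isnan(p) for p in masked_first_row(3, offset=0)))  # True
print(masked_first_row(3, offset=1))                              # [1.0, 0.0, 0.0]
```

With offset=0 the first row has no unmasked position left, which is where the NaNs come from. With offset=1 every position can still attend to itself.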
In the blog post you say we can implement multi-head attention by splitting the input into chunks and processing each chunk separately. I tried to write down the equation for what this might look like. Say we have 4 vectors w, x, y, z. We use 3 heads, so we can write x as x1, x2, x3, and likewise for y, w and z. The attention weight for x_i, y_j is a_ij. We can then write the transformed x after applying the attention weights as follows, where the capital X is simply [x1, x2, x3], Axy is [ax1y1, ax2y2, ax3y3], and so on, and * represents element-wise multiplication:
x' = Axx * X + Axy * Y + Axw * W + Axz * Z. This is with three heads.
Now if we weren't splitting it into chunks, we would do
x' = axx' * X + axy' * Y + axw' * W + axz' * Z for each of the heads, so here say 3 (a small a means it is a scalar). We would then have 3 such values, and if we take the mean to get the value of x_final, we would get (axx'1 + axx'2 + axx'3) / 3, and hence we can write
x' = axx'm * X + axy'm * Y + axw'm * W + axz'm * Z, where axy'm is (ax1y1m + ax2y2m + ax3y3m) / 3
So in the first case we are multiplying different dimensions of A with different weights, while in the second case we are multiplying all dimensions of x by the same number. Can you explain this part, please? I hope my question is clear.
hi Mr. Peter.
Thanks for the insightful blog on how to build transformers from scratch. I am a master's student and I used your code for my thesis. In my code, after the embedding layer (pretrained embedding + position embedding), I used only a multi-head attention layer, whose output was given to a Bi-LSTM network as input.
But the results do not improve on the accuracy of the Bi-LSTM model alone. What do you think is the reason for this problem? How do I fix it?
thanks.
line 11 in classify.py should be changed to: "from torchtext.legacy import data, datasets, vocab"
Hey man! Thanks so much for writing this amazing exposition on transformers. I understand things so much better because of you. And the einsum link just changed my life for ever.
I had a doubt regarding the Narrow Self-Attention implementation...
In former/modules.py, class SelfAttentionNarrow, you have the following code for __init__:
Lines 90 to 95 in 6a3295c
Now in the forward method you do:
Lines 105 to 110 in 6a3295c
Does this not mean that you only have one attention head rather than h of them? As all the h s-sized pieces of x are going through the same linear layer.
From my understanding there should be h Linear layers of size s x s each. Or a single weight matrix of shape h x s x s, depending on your implementation.
Am I going wrong somewhere? It would be really awesome if you clarify.
Hi,
Would you mind explaining why the following code is more memory efficient than just dividing one of them by sqrt(e)?
Lines 48 to 52 in 7b12ae6
Thank you.
Hi @pbloem,
Thank you for a great blog post. Running
python experiments/classify.py
on my regular conda environment gives the output
(MyDeeplearningEnv) PS C:\Users\Ruben\Desktop\former> python experiments/classify.py
Traceback (most recent call last):
File "experiments/classify.py", line 1, in <module>
from _context import former
File "C:\Users\Ruben\Desktop\former\experiments\_context.py", line 7, in <module>
import former
File "C:\Users\Ruben\Desktop\former\former\__init__.py", line 1, in <module>
from .modules import SelfAttention, SelfAttentionWide, TransformerBlock, SelfAttentionRelative, SelfAttentionNarrow
File "C:\Users\Ruben\Desktop\former\former\modules.py", line 1, in <module>
from former import util
File "C:\Users\Ruben\Desktop\former\former\util\__init__.py", line 1, in <module>
from .util import mask_, d, here, contains_nan, tic, toc, \
File "C:\Users\Ruben\Desktop\former\former\util\util.py", line 6, in <module>
import transformers as trf
File "C:\Users\Ruben\Desktop\former\former\transformers.py", line 5, in <module>
from .modules import TransformerBlock
ImportError: attempted relative import with no known parent package
I tried creating a clean environment using
pip install torch tb-nightly tqdm numpy torchtext
pip install future
This resulted in more f-string errors, which I removed, and then I got the same error as above. Creating an environment from the environment.yml fails to resolve.
Any help is appreciated.
Hi everyone! Thanks a lot for this nice tutorial and code for learning transformers!
I am trying to recreate the sample from the tutorial:
https://peterbloem.nl/blog/transformers
And I was able to train and serialize a model for the IMDB Dataset.
Currently, I want to test the model with new validation phrases. Nevertheless, I cannot find a way to tokenize the phrase into the required data shape, as in the provided sample:
#Load dataset
tdata, _ = datasets.IMDB.splits(TEXT, LABEL)
train, test = tdata.split(split_ratio=0.8)
#Preprocess data
TEXT.build_vocab(train, max_size=50_000 - 2)
LABEL.build_vocab(train)
#Create iterators
train_iter, test_iter = data.BucketIterator.splits((train, test), batch_size=4, device=util.d())
I see that the tokens are generated in some part of the BucketIterator (or the dataset itself):
for batch in tqdm.tqdm(test_iter):
input = batch.text[0]
label = batch.label - 1
As in the dataset , I can see the phrases separated into words:
print(test_iter.data()[0].text)
print(test_iter.data()[0].label)
generates:
['i', "wouldn't", 'rent', 'this', 'one', 'even', 'on', 'dollar', 'rental', 'night.']
neg
So, suppose I want to test a phrase in the model, like:
#Try the model
input = ["this", "movie", "is", "incredible", "boring"]
How can I tokenize the words correctly to feed them into the model?
Thanks in advance for your response.
Greetings!
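One possible approach to the question above is to map each word to its vocabulary index by hand and wrap the result in a batch of size one. The sketch below uses a small hypothetical `stoi` dictionary in place of the real vocabulary; with the legacy torchtext API used in the question, the equivalent lookup table would presumably be `TEXT.vocab.stoi`. Lowercasing the tokens is also an assumption, suggested by the lowercase tokens printed from the dataset.

```python
def numericalize(tokens, stoi, unk_token="<unk>"):
    """Map tokens to integer ids, falling back to the unknown-token id."""
    unk_id = stoi[unk_token]
    return [stoi.get(tok.lower(), unk_id) for tok in tokens]

# Hypothetical vocabulary standing in for the one built by build_vocab.
stoi = {"<unk>": 0, "<pad>": 1, "this": 2, "movie": 3, "is": 4, "boring": 5}

tokens = ["this", "movie", "is", "incredible", "boring"]
ids = numericalize(tokens, stoi)

# A batch with a single sequence, i.e. shape (1, sequence_length).
batch = [ids]
print(batch)  # [[2, 3, 4, 0, 5]]
```

The nested list would then be converted to an integer tensor on the model's device before being passed to the model. "incredible" maps to the unknown-token id here only because it is missing from the toy vocabulary.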
Hi, @pbloem ,
How can we visualize the self-attention map? Could you provide the corresponding code snippet?
Thanks!