pbloem / former


Simple transformer implementation from scratch in PyTorch.

Home Page: http://peterbloem.nl/blog/transformers

License: MIT License

Language: Python 100.00%
Topics: machine-learning, transformer, pytorch

former's People

Contributors

koen-dejonghe, pbloem


former's Issues

Question about k × hk weight matrices

Hi, first of all, thanks for this great explanation.

My question is: in the blog post, under the section In Pytorch: complete self-attention, it says

but it's actually more efficient to combine these for all heads into three single k×hk matrices

Shouldn't it be three hk × k matrices, since this is what the weights of Linear layers look like?
Thanks
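
For reference, a minimal check of what PyTorch actually stores (illustrative sizes): nn.Linear keeps its weight as an (out_features, in_features) tensor and computes x @ W.T, so the stored weight of a k → hk projection is indeed hk × k, while the blog's k×hk presumably describes the same map written the other way around.

    import torch.nn as nn

    k, h = 8, 4
    tokeys = nn.Linear(k, h * k, bias=False)
    print(tokeys.weight.shape)   # torch.Size([32, 8]), i.e. (h*k, k) = (out_features, in_features)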

Comparing SelfAttention classes

Hello.

First of all, thanks a lot for your post Transformers from scratch, it is one of the best and most complete explanations about it that I've read.
I have a question regarding your implementation of the SelfAttention class, especially the computation of the queries, keys and values for all heads, for example:

self.tokeys = nn.Linear(emb, emb * heads, bias=False)

I understood that as projecting an input X with emb dimensions into heads vectors of emb dimensions each.

However, other implementations define something such as:
self.dk = emb // heads
self.tokeys = nn.Linear(emb, emb)
and then perform something like:
keys = self.tokeys(x).view(nbatches, -1, heads, self.dk).transpose(1, 2)

which I understood as projecting an input X with emb dimensions and separating it into heads subvectors of dk dimensions in order to perform the attention operation. This is based on The Annotated Transformer code.
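
For concreteness, here is a small shape comparison of the two projection styles (a sketch with illustrative sizes, not taken from either codebase):

    import torch
    import torch.nn as nn

    emb, heads, b, t = 8, 2, 1, 5
    x = torch.randn(b, t, emb)

    # "wide" style: one key vector of size emb per head
    tokeys_wide = nn.Linear(emb, emb * heads, bias=False)
    keys_wide = tokeys_wide(x).view(b, t, heads, emb)                      # (1, 5, 2, 8)

    # "narrow" / Annotated Transformer style: one emb-sized projection, split into heads chunks
    dk = emb // heads
    tokeys_narrow = nn.Linear(emb, emb, bias=False)
    keys_narrow = tokeys_narrow(x).view(b, t, heads, dk).transpose(1, 2)   # (1, 2, 5, 4)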

Did I understand your implementation correctly? If yes, are these implementations equivalent? Is there a significant difference between them in terms of "architectural power"?

Sorry for the long question.
Thanks in advance!

token_embedding for non-text sequences

Hi Peter,

Thanks for the insightful blog on how to build transformers from scratch. I'm experiencing what's more likely to be a user error than an actual code issue and was hoping you could provide me with a pointer on how to go about it.

In brief, I'm trying to perform sequence classification on multi-feature, non-text sequences. Specifically, each sequence is 5 features by 100 timepoints large and has one label. The data points include discrete locations in 2D space, cf. positions on a chessboard, and are all integers. The main issue probably resides in the fact that I'm not presenting the data correctly. During the first forward pass of the training data, when generating the token embedding (tokens = self.token_embedding(x) in Transformer), I'm getting:

File "xxx/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 1852, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

What's unclear to me is whether this issue is due to mismatching tensor sizes or whether my particular dataset is incompatible with the typical use case of nn.Embedding. For completeness, self.token_embedding is 176 by 5, i.e. the number of unique rows/tokens within the dataset (hypothetically, the vocabulary size) by the number of features (the hypothetical embedding size). Any pointers would be much appreciated.
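
For context, a minimal sketch of the constraint nn.Embedding imposes (the 176 and 5 mirror the sizes described above): the input must contain integer indices in the range [0, num_embeddings), one index per token.

    import torch
    import torch.nn as nn

    token_embedding = nn.Embedding(176, 5)              # vocabulary of 176 tokens, embedding size 5
    ok = token_embedding(torch.tensor([[0, 42, 175]]))  # works: every index is < 176
    # token_embedding(torch.tensor([[176]]))            # IndexError: index out of range in self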

Best,
Arjen

AttributeError: module 'torch' has no attribute 'triu_indices'

When I run the code and set the mask to True, the following problem occurs:

AttributeError: module 'torch' has no attribute 'triu_indices'

def mask_(matrices, maskval=0.0, mask_diagonal=True):
    b, h, w = matrices.size()
    indices = torch.triu_indices(h, w, offset=0 if mask_diagonal else 1)
    matrices[:, indices[0], indices[1]] = maskval
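
The error suggests a PyTorch version that predates torch.triu_indices. One possible workaround (a sketch, not part of the repository) is to build the same (2, n) index tensor from an upper-triangular 0/1 mask:

    import torch

    def triu_indices_compat(h, w, offset=0):
        # nonzero() on the upper-triangular mask yields the (row, col) positions;
        # transposing gives the same (2, n) layout as torch.triu_indices
        return torch.ones(h, w).triu(offset).nonzero().t()

    indices = triu_indices_compat(4, 4, offset=1)
    print(indices)  # indices[0] holds row indices, indices[1] holds column indices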

Hi, question about the sliced-up version of self-attention

In the blog you say there is a more efficient implementation ("see lecture at the top"). Do you mean the YouTube video at the top?

But there is no code explanation in the video. Do I have to watch the video and implement it myself, or is there a blog post about the more efficient version of self-attention?

Thanks a lot!

Module import error

$ python experiments/classify.py

Traceback (most recent call last):
  File "experiments/classify.py", line 1, in <module>
    import former
  File "/home/yky/miniconda3/envs/former/lib/python3.7/site-packages/former/__init__.py", line 1, in <module>
    from .modules import SelfAttention, SelfAttentionWide, TransformerBlock, SelfAttentionRelative, SelfAttentionNarrow
  File "/home/yky/miniconda3/envs/former/lib/python3.7/site-packages/former/modules.py", line 1, in <module>
    from .util import mask_, d, slice_diag
ModuleNotFoundError: No module named 'former.util'

ModuleNotFoundError: No module named 'past'

I am practicing pytorch and got an error as follows.

Traceback (most recent call last):
  File "D:/Programing/tensorboardtest.py", line 1, in <module>
    from torch.utils.tensorboard import SummaryWriter
  File "C:\Users\jjong\AppData\Roaming\Python\Python36\site-packages\torch\utils\tensorboard\__init__.py", line 6, in <module>
    from .writer import FileWriter, SummaryWriter  # noqa F401
  File "C:\Users\jjong\AppData\Roaming\Python\Python36\site-packages\torch\utils\tensorboard\writer.py", line 18, in <module>
    from ._convert_np import make_np
  File "C:\Users\jjong\AppData\Roaming\Python\Python36\site-packages\torch\utils\tensorboard\_convert_np.py", line 12, in <module>
    from caffe2.python import workspace
  File "C:\Users\jjong\AppData\Roaming\Python\Python36\site-packages\caffe2\python\workspace.py", line 15, in <module>
    from past.builtins import basestring
ModuleNotFoundError: No module named 'past'

Can any expert help me solve this?

conda installation failed

Thanks for the nice post on transformer!

However, this repo seems to fail to install with the current conda.

ResolvePackageNotFound:
  - openssl==1.1.1c=h7b6447c_1
  - ncurses==6.1=he6710b0_1
  - libstdcxx-ng==9.1.0=hdf63c60_0
  - zlib==1.2.11=h7b6447c_3
  - libedit==3.1.20181209=hc058e9b_0
  - libffi==3.2.1=hd88cf55_4
  - tk==8.6.8=hbc83047_0
  - sqlite==3.29.0=h7b6447c_0
  - readline==7.0=h7b6447c_5
  - xz==5.2.4=h14c3975_4
  - python==3.7.4=h265db76_1
  - libgcc-ng==9.1.0=hdf63c60_0

Why the use of log_softmax?

I'm trying to understand why you used log_softmax instead of softmax, as it will not produce a probability distribution that sums to 1 in the end. Am I missing something?
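
For what it's worth, a small check of the relationship between the two (random logits, nothing from the repo): exponentiating log_softmax recovers a distribution that sums to 1, and log_softmax followed by NLLLoss matches cross_entropy on the raw logits.

    import torch
    import torch.nn.functional as F

    logits = torch.randn(3, 5)
    log_probs = F.log_softmax(logits, dim=1)
    print(log_probs.exp().sum(dim=1))       # tensor([1., 1., 1.]) -- still a valid distribution
    target = torch.tensor([0, 2, 4])
    print(F.nll_loss(log_probs, target))    # same value as ...
    print(F.cross_entropy(logits, target))  # ... cross-entropy on the raw logits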

Calculation of memory required in "Going big"

For a sequence length t, this is a dense matrix containing t² elements. At standard 32-bit precision, and with t=1000 a batch of 16 such matrices takes up about 250Mb of memory.

Just a small question: how do we arrive at 250 Mb?

A single matrix would be 1 million elements, each of which is 4 bytes; is my understanding correct?
So a single matrix would require 4 million bytes, or 4 MB. And then we would need 16 of them for a batch, correct?
Then wouldn't this result in 64 MB in total?
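
The arithmetic in the question, written out in plain Python (the t = 2000 line is added here only for comparison):

    t, batch, bytes_per_float = 1000, 16, 4
    per_matrix = t * t * bytes_per_float      # 4_000_000 bytes  =  4 MB
    per_batch = batch * per_matrix            # 64_000_000 bytes = 64 MB
    print(per_matrix / 1e6, per_batch / 1e6)  # 4.0 64.0
    # for comparison: with t = 2000, the batch takes 16 * 2000**2 * 4 bytes = 256 MB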

Just trying to check my understanding here. Thanks so much for the great article!

Question about the tool used to make the figures in the blog post

Hello, first thank you very much for the awesome post and the code. Clearly explained and illustrated.

I must say, the illustrations in the blog post are beautiful. I was wondering what tool you used to make them.

Sorry for this unrelated question, feel free to close this issue if this is not the place to ask it.
Thanks.

Masking done for the upper or lower triangle?

Hi,

I believe the comment should say that the masking is applied to the upper, not the lower, triangle of the matrix. Both the blog and the code refer to the upper half of the matrix.

Thanks.

former/former/modules.py

Lines 58 to 59 in 7b12ae6

if self.mask: # mask out the lower half of the dot matrix, including the diagonal
    mask_(dot, maskval=float('-inf'), mask_diagonal=False)
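
A standalone check of which entries torch.triu_indices actually selects (illustrative sizes):

    import torch

    h = w = 4
    indices = torch.triu_indices(h, w, offset=1)
    m = torch.zeros(h, w)
    m[indices[0], indices[1]] = float('-inf')
    print(m)  # only entries strictly above the diagonal become -inf, i.e. the upper triangle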

No Field in torchtext.data

Thank you for the super explanation of the transformer!

I am a beginner and trying to learn the code. I notice that both torchtext.legacy.data and torchtext.data.Field are deprecated. I get the error message AttributeError: module 'torchtext.data' has no attribute 'Field'.

Is there a solution for this error?

I highly appreciate your feedback!

Einsum to avoid transpose and reshape

Hi,

Thank you for the great post about Transformers.
Actually, you can avoid transpose/reshape using torch.einsum.

Here is an example that behaves exactly like your implementation (except for the mask=True case and the asserts 😊):

def forward_einsum(self, x):
    b, t, e = x.size()
    h = self.heads

    keys    = self.tokeys(x).view(b, t, h, e)
    queries = self.toqueries(x).view(b, t, h, e)
    values  = self.tovalues(x).view(b, t, h, e)

    dot = torch.einsum('bthe,bihe->bhti', queries, keys) / math.sqrt(e)
    dot = F.softmax(dot, dim=-1)

    out = torch.einsum('bhtd,bdhe->bthe', dot, values)

    # we can move reshape of weights to init; I left it here just to compare with the original implementation
    out = torch.einsum('bthe,khe->btk', out, self.unifyheads.weight.view(e,h,e)) 
    return out + self.unifyheads.bias

Although the code becomes very short, it's probably hard to understand for people who don't know einsum notation, so this is definitely not the best code for explaining the idea 😊.

Issue with masking

Thanks for your great tutorial. I think your code on GitHub is probably fine, but I believe there is an error in the mask as defined in the post.

I believe the masking line should read

indices = torch.triu_indices(t, t, offset=1)

In the current implementation, dot should be t x t. Furthermore, an offset of 0 will create a row of all -inf, which gives a complete row of NaNs when fed into softmax. See pytorch/pytorch#24816
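
A standalone demonstration of the failure mode described above:

    import torch
    import torch.nn.functional as F

    # with offset=0 the first row of dot is masked entirely, and softmax over all -inf gives NaNs
    fully_masked_row = torch.full((4,), float('-inf'))
    print(F.softmax(fully_masked_row, dim=0))  # tensor([nan, nan, nan, nan])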

Weight matrix dimensions after transformation

This might be a dumb question.
Could you explain the dimensions of the weight matrix after transforming the input with K,Q,V?
(screenshot attached)

I tried doing it myself; am I overlooking an obvious error here?
Assuming a batch size of 1 for simplicity.
(screenshot attached)

Understanding multiheaded attention.

In the blog post you say we can implement multi-head attention by splitting the input into chunks and processing each chunk separately. I tried to write down the equation for what this might look like. Say we have 4 vectors w, x, y, z. With 3 heads we can split x into chunks x1, x2, x3, and likewise for w, y and z. The attention weight between chunks x_i and y_j is a_{xi yj}. Writing X for [x1, x2, x3], A_xy for [a_{x1 y1}, a_{x2 y2}, a_{x3 y3}] and so on, with * denoting element-wise multiplication, the transformed x after applying the attention weights is

x' = A_xx * X + A_xy * Y + A_xw * W + A_xz * Z. This is with three heads.

Now, if we weren't splitting the input into chunks, a single head would compute

x' = a_xx * X + a_xy * Y + a_xw * W + a_xz * Z, where the lowercase a's are scalars. With 3 such heads we would get 3 such values, and if we take their mean to obtain x_final, the effective weight on X is (a_xx^1 + a_xx^2 + a_xx^3) / 3, so we can write

x' = a_xx^m * X + a_xy^m * Y + a_xw^m * W + a_xz^m * Z, where a_xy^m = (a_xy^1 + a_xy^2 + a_xy^3) / 3.

So in the first case we multiply different dimensions of X by different weights, while in the second case we multiply all dimensions of X by the same number. Can you explain this part please? I hope my question is clear.

Improve performance in a deeper network using your multi-head attention code

Hi Mr. Peter,
Thanks for the insightful blog on how to build transformers from scratch. I am a master's student and I used your code for my thesis. In my code, after the embedding layer (pretrained embedding + position embedding) I only used a multi-head attention layer, and its output was given to a Bi-LSTM network as input.
But the results do not improve over the accuracy of the single Bi-LSTM model. What do you think is the reason for this problem? How do I fix it?
Thanks.

Is narrow attention implemented correctly?

Hey man! Thanks so much for writing this amazing exposition on transformers. I understand things so much better because of you. And the einsum link just changed my life for ever.

I had a doubt regarding the Narrow Self-Attention implementation...

In former/modules.py, class SelfAttentionNarrow you have the following code for __init__:

former/former/modules.py

Lines 90 to 95 in 6a3295c

s = emb // heads
# - We will break the embedding into `heads` chunks and feed each to a different attention head
self.tokeys = nn.Linear(s, s, bias=False)
self.toqueries = nn.Linear(s, s, bias=False)
self.tovalues = nn.Linear(s, s, bias=False)

Now in the forward method you do:

former/former/modules.py

Lines 105 to 110 in 6a3295c

s = e // h
x = x.view(b, t, h, s)
keys = self.tokeys(x)
queries = self.toqueries(x)
values = self.tovalues(x)

Does this not mean that you only have one attention head rather than h of them, since all h s-sized pieces of x go through the same linear layer?
From my understanding there should be h Linear layers of size s x s each, or a single weight tensor of shape h x s x s, depending on the implementation.
Am I going wrong somewhere? It would be really awesome if you could clarify.
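
For illustration, one way to give every head its own s × s projection with a single weight tensor, as the question suggests (a sketch, not the repository's implementation):

    import torch
    import torch.nn as nn

    b, t, h, s = 2, 7, 4, 16
    x = torch.randn(b, t, h, s)                       # input already split into h chunks of size s

    tokeys = nn.Parameter(torch.randn(h, s, s))       # one s x s key projection per head
    keys = torch.einsum('bths,hsd->bthd', x, tokeys)  # each chunk uses its own head's matrix
    print(keys.shape)                                 # torch.Size([2, 7, 4, 16])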

ImportError: attempted relative import with no known parent package

Hi @pbloem,
Thank you for a great blog post. Running

python experiments/classify.py

on my regular conda environment gives the output

(MyDeeplearningEnv) PS C:\Users\Ruben\Desktop\former> python experiments/classify.py
Traceback (most recent call last):
  File "experiments/classify.py", line 1, in <module>
    from _context import former
  File "C:\Users\Ruben\Desktop\former\experiments\_context.py", line 7, in <module>
    import former
  File "C:\Users\Ruben\Desktop\former\former\__init__.py", line 1, in <module>
    from .modules import SelfAttention, SelfAttentionWide, TransformerBlock, SelfAttentionRelative, SelfAttentionNarrow
  File "C:\Users\Ruben\Desktop\former\former\modules.py", line 1, in <module>
    from former import util
  File "C:\Users\Ruben\Desktop\former\former\util\__init__.py", line 1, in <module>
    from .util import mask_, d, here, contains_nan, tic, toc, \
  File "C:\Users\Ruben\Desktop\former\former\util\util.py", line 6, in <module>
    import transformers as trf
  File "C:\Users\Ruben\Desktop\former\former\transformers.py", line 5, in <module>
    from .modules import TransformerBlock
ImportError: attempted relative import with no known parent package

I tried creating a clean environment using

pip install torch tb-nightly tqdm numpy torchtext
pip install future

This resulted in more f-string errors, which I removed, and then I got the same error as above. Creating an environment from the environment.yml fails to resolve.

Any help is appreciated.

How to tokenize a testing phrase

Hi everyone! Thanks a lot for this nice tutorial and code for learning transformers!

I am trying to recreate the sample of the tutorial:

https://peterbloem.nl/blog/transformers

And I was able to train and serialize a model for the IMDB Dataset.

Currently, I want to test the model with new validation phrases. However, I cannot find a way to tokenize a phrase into the required data shape, as in the provided sample:

#Load dataset
tdata, _ = datasets.IMDB.splits(TEXT, LABEL)
train, test = tdata.split(split_ratio=0.8)

#Preprocess data
TEXT.build_vocab(train, max_size=50_000 - 2)
LABEL.build_vocab(train)

#Create iterators
train_iter, test_iter = data.BucketIterator.splits((train, test), batch_size=4, device=util.d())

I see that the tokens are generated in some part of the BucketIterator (or the dataset itself):

for batch in tqdm.tqdm(test_iter):

    input = batch.text[0]
    label = batch.label - 1

In the dataset, I can see the phrases separated into words:

print(test_iter.data()[0].text)
print(test_iter.data()[0].label)

generates:

['i', "wouldn't", 'rent', 'this', 'one', 'even', 'on', 'dollar', 'rental', 'night.']
neg

So, what if I want to test a phrase with the model, like:

#Try the model
input = ["this", "movie", "is", "incredible", "boring"] 

How can I tokenize the phrase correctly to feed it into the model?
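
One way this could work, assuming the legacy torchtext TEXT field built above and the repo's util.d() helper (a sketch, untested against the exact tutorial code): look each token up in TEXT.vocab.stoi and stack the indices into a batch of one sequence.

    import torch

    tokens = ["this", "movie", "is", "incredible", "boring"]
    unk = TEXT.vocab.stoi.get("<unk>", 0)  # index used for out-of-vocabulary words
    indices = torch.tensor([[TEXT.vocab.stoi.get(tok, unk) for tok in tokens]], device=util.d())
    # indices now has the same (batch, time) shape as batch.text[0] above,
    # so it can be passed to the trained model directly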

Thanks in advance for your response.

Greetings!
