bkhanal-11 / transformers


License: MIT License


Attention is All You Need

A from-scratch implementation of the Transformer, as presented in the paper "Attention Is All You Need".

Useful references:

- Excellent illustration of Transformers: Illustrated Guide to Transformers Neural Network: A step by step explanation
- Keys, queries and values in the attention mechanism: What exactly are keys, queries, and values in attention mechanisms?
- Positional encoding: Transformer Architecture: The Positional Encoding
- Data flow, parameters, and dimensions in the Transformer: Into The Transformer; Transformers: report on Attention Is All You Need

The Transformer architecture is a popular type of neural network used in natural language processing (NLP) tasks, such as machine translation and text classification. It was first introduced in a paper by Vaswani et al. in 2017.

At a high level, the Transformer consists of an encoder and a decoder, each built from a stack of identical layers. Each encoder layer has two sub-layers: a self-attention layer and a position-wise feed-forward layer. The self-attention layer lets the model attend to different parts of the input sequence, while the feed-forward layer applies a non-linear transformation to the self-attention output. (Decoder layers additionally contain a third sub-layer that attends over the encoder's output.)
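The layer structure described above can be sketched as follows. This is a minimal numpy illustration (not the repository's actual code), assuming a toy encoder where the self-attention and feed-forward sub-layers are passed in as plain functions:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward sub-layer: a ReLU between two linear maps,
    # applied independently to every position (row) of x.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder(x, layers):
    # `layers` is a list of (self_attention_fn, feed_forward_fn) pairs;
    # each identical layer transforms the whole sequence before passing it on.
    for attn, ff in layers:
        x = ff(attn(x))
    return x
```

The real architecture also wraps each sub-layer in a residual connection and layer normalization, which this sketch omits for brevity.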

Architecture

Now, let's break down the math behind the self-attention layer. Suppose we have an input sequence of length $N$, represented as a matrix $X$ whose rows are word embeddings. We want to compute a new sequence of vectors $Z$ in which each output vector is a weighted sum of all the input vectors:

$$ Z = AX $$

where $A$ is an $N \times N$ matrix of weights. Rather than fixing these weights, self-attention computes them dynamically from the similarity between each pair of input vectors. To do so, we first project $X$ into a query matrix $Q$, a key matrix $K$, and a value matrix $V$:

$$ Q = XW_q \\ K = XW_k \\ V = XW_v $$

where $W_q$, $W_k$, and $W_v$ are learned weight matrices. The attention weights are then computed as the softmax of the scaled dot products between queries and keys, and used to mix the values:

$$ \text{Attention}(Q, K, V) = \text{softmax}(\frac{Q K^{T}}{\sqrt{d_k}})V $$

where $d_k$ is the dimensionality of the key vectors. The softmax function ensures that the attention weights sum to $1$, and the scaling factor of $\frac{1}{\sqrt{d_k}}$ helps stabilize the gradients during training.
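The formula above translates almost line for line into code. Here is a hedged numpy sketch (illustrative only, not the repository's implementation) of scaled dot-product attention:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability; each row then sums to 1.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (N, N): similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row is a distribution over positions
    return weights @ V                  # weighted sum of the value vectors
```

Note the division by $\sqrt{d_k}$ before the softmax: without it, large dot products would push the softmax into regions with vanishingly small gradients.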

Finally, the output of the self-attention layer is obtained by applying an output projection to this weighted sum of the value vectors:

$$ Z = \text{Attention}(Q, K, V) W_o $$

where $W_o$ is another learned weight matrix. The output of the self-attention layer is then passed through a feedforward layer with a ReLU activation function, and the process is repeated for each layer in the encoder and decoder.
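Putting the pieces together, a complete self-attention layer followed by the feed-forward sub-layer might look like this. Again a numpy sketch under the simplifying assumptions above (single head, no residual connections or layer normalization):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention_layer(X, Wq, Wk, Wv, Wo):
    # Learned projections produce queries, keys and values from the same input.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))  # dynamic attention weights
    return (A @ V) @ Wo                  # weighted values, then output projection

def feed_forward(X, W1, b1, W2, b2):
    # Position-wise feed-forward sub-layer with a ReLU activation.
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2
```

One encoder layer is then `feed_forward(self_attention_layer(X, ...), ...)`, and the stack repeats this for each layer.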

Multi-Head Attention
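In multi-head attention, the queries, keys and values are split into several smaller heads that attend independently, and their outputs are concatenated before the output projection $W_o$. A hedged numpy sketch of this splitting (assuming `d_model` is divisible by `num_heads`):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    # Wq/Wk/Wv/Wo are (d_model, d_model); each head attends in a
    # d_model // num_heads dimensional subspace of the projections.
    N, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    def split(M):
        # (N, d_model) -> (num_heads, N, d_head)
        return M.reshape(N, num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, N, N)
    heads = softmax(scores) @ Vh                           # (heads, N, d_head)
    concat = heads.transpose(1, 0, 2).reshape(N, d_model)  # concatenate heads
    return concat @ Wo
```

Each head can specialize in a different kind of relationship between positions, which is part of what makes the mechanism effective.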

Overall, the Transformer architecture is a powerful tool for NLP tasks, and its self-attention mechanism allows it to model long-range dependencies in the input sequence.
