πŸ–‹οΈ Makespeare

Makespeare is a GPT-style transformer that I coded from scratch and trained on the tiny-shakespeare dataset. The project is inspired by Andrej Karpathy's video (https://youtu.be/kCc8FmEb1nY), which I used as a reference only to overcome certain obstacles.

πŸ› οΈ Tools

📑 Data

The transformer was trained on the tiny-shakespeare dataset containing 40,000 lines of text from Shakespeare's plays. Click here for the dataset.

An excerpt from the dataset:

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.
...

πŸ—οΈ Transformer Architecture

Input Embedding

For the transformer to be able to interpret text, we need to convert the input text into something a computer can understand - ✨Numbers✨. This is done by:

1. Tokenisation

  • Splitting up text into multiple parts or tokens

2. Encoding:

  • Giving a unique numerical ID to each unique token
  • Thus, every unique word is mapped to a unique numerical ID.
  • In practice, a dictionary is used to keep track of the ID of each word. The number of word-ID pairs present in the dictionary is known as its vocabulary size (referred to as vocab_size in the code).
| Word | ID  |
| ---- | --- |
| Cat  | 1   |
| Dog  | 2   |
| ...  | ... |
  • If a word that is not present in the dictionary is encountered, special rules are followed to assign an ID to it.

3. Vectorisation:

  • Converting each token into a learnable n-dimensional vector
  • For example, how similar two words are can be measured by the distance between their corresponding points in n-dimensional space (similarity increases the closer the points are).
  • The dimension of each such vector is fixed and corresponds to embedding_len in the code. Some sources also refer to this as d_model (model dimension).
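
A minimal sketch of these three steps in PyTorch (word-level for illustration; the names `stoi`, `vocab_size` and `embedding_len` mirror the terms above and are not necessarily the repository's exact code):

```python
import torch
import torch.nn as nn

text = "the cat sat on the mat"

# 1. Tokenisation: split the text into tokens (here, naively by whitespace)
tokens = text.split()

# 2. Encoding: a dictionary mapping each unique token to a unique numerical ID
stoi = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
vocab_size = len(stoi)
token_ids = torch.tensor([stoi[tok] for tok in tokens])    # tensor([4, 0, 3, 2, 4, 1])

# 3. Vectorisation: map each ID to a learnable embedding_len-dimensional vector
embedding_len = 64
embedding = nn.Embedding(vocab_size, embedding_len)
x = embedding(token_ids)                                   # shape: (len(tokens), embedding_len)
```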

Positional Embedding

Matrix of learnable vectors that represent the respective position of each token in a sentence.

Such embeddings allow the transformer to learn how words need to be in a certain order to make sense in a sentence.
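
Continuing the sketch above, a learnable positional embedding can be another `nn.Embedding`, indexed by position rather than by token ID, and added to the token embeddings (`context_length = 256` here is an assumed maximum sequence length, not necessarily the repository's value):

```python
context_length = 256                                 # assumed maximum sequence length

pos_embedding = nn.Embedding(context_length, embedding_len)

T = token_ids.shape[0]                               # length of the current sequence
positions = torch.arange(T)                          # tensor([0, 1, 2, ..., T-1])
x = embedding(token_ids) + pos_embedding(positions)  # token information + position information
```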


Encoder (Not Used in Model)

Multi-Head Dot Product Self Attention

Mechanism through which the model can learn which words to focus on in a particular sentence. Attention is computed by:

  1. Generating 3 matrices, namely, the Query (Q), Key (K) and Value (V) as follows:

$$Q = X \cdot W_Q$$

$$K = X \cdot W_K$$

$$V = X \cdot W_V$$

where

$X =$ Matrix containing input embedding vectors 👉 (context_length, embedding_len)

$W_Q, W_K, W_V =$ Matrices with separately learnable weights 👉 (embedding_len, embedding_len)

  2. Splitting the Query, Key and Value matrices into num_head heads
  3. Computing attention on each head as follows:

$$Attention(Q,K,V) = Softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where

$d_k =$ Dimension of each vector in $K$

  4. Concatenating the attention values calculated for each head into a single attention matrix
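
A condensed sketch of the steps above in PyTorch (the names `embedding_len` and `num_head` follow the code's terminology; the repository's actual module may differ, and the output projection usually applied after concatenation is omitted for brevity):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embedding_len, num_head):
        super().__init__()
        assert embedding_len % num_head == 0
        self.num_head = num_head
        self.head_dim = embedding_len // num_head            # d_k
        self.W_Q = nn.Linear(embedding_len, embedding_len, bias=False)
        self.W_K = nn.Linear(embedding_len, embedding_len, bias=False)
        self.W_V = nn.Linear(embedding_len, embedding_len, bias=False)

    def forward(self, x):                                    # x: (batch, T, embedding_len)
        B, T, E = x.shape
        # 1. Generate Q, K and V by multiplying the input with learnable weights
        Q, K, V = self.W_Q(x), self.W_K(x), self.W_V(x)
        # 2. Split each matrix into num_head heads: (B, num_head, T, head_dim)
        split = lambda M: M.view(B, T, self.num_head, self.head_dim).transpose(1, 2)
        Q, K, V = split(Q), split(K), split(V)
        # 3. Scaled dot-product attention, computed on every head in parallel
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.head_dim)
        out = F.softmax(scores, dim=-1) @ V
        # 4. Concatenate the heads back into a single (B, T, embedding_len) matrix
        return out.transpose(1, 2).contiguous().view(B, T, E)
```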

Residual Connection & Normalisation

The computed attention matrix is added to the attention block's input matrix. This is known as a residual connection.

The residual output then undergoes normalisation for better and faster training.

graph BT;
id1(Input) --> id2(Attention)
id1(Input) --> id3((+))
id2(Attention) --> id3((+))
id3((+)) --> id4(Residual)
id4(Residual) --> id5(Normalisation)
style id1 fill:#005f73, stroke:#005f73
style id2 fill:#0a9396, stroke:#0a9396
style id3 fill:#ca6702, stroke:#ca6702
style id4 fill:#bb3e03, stroke:#bb3e03
style id5 fill:#9b2226, stroke:#9b2226

Note: Makespeare uses a slightly modified version of this step: the attention block's input matrix is normalised first, attention is computed on this normalised input, and the residual addition is performed last. This is known as pre-normalisation and is simply a rearrangement of the steps above:

graph BT;
id1(Input) --> id5(Normalisation)
id1(Input) --> id3((+))
id2(Attention) --> id3((+))
id3((+)) --> id4(Residual)
id5(Normalisation) --> id2(Attention)
style id1 fill:#005f73, stroke:#005f73
style id2 fill:#0a9396, stroke:#0a9396
style id3 fill:#ca6702, stroke:#ca6702
style id4 fill:#bb3e03, stroke:#bb3e03
style id5 fill:#9b2226, stroke:#9b2226
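
The difference between the two orderings is easiest to see as a short sketch (`attention` and `norm` stand for the sub-layers above; this is an illustration, not the repository's code):

```python
# Post-normalisation (ordering in the original Transformer)
x = norm(x + attention(x))

# Pre-normalisation (ordering used by Makespeare)
x = x + attention(norm(x))
```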

Feedforward Neural Network

The output of the attention sub-layer is then fed to a feedforward neural network, again with pre-normalisation and a residual connection.

graph BT;
id1(Attention_Output) --> id2(Normalisation)
id2(Normalisation) --> id3(Feedforward_NN)
id3(Feedforward_NN) --> id4((+))
id1(Attention_Output) --> id4((+))
id4((+)) --> id5(Residual)
style id1 fill:#005f73, stroke:#005f73
style id2 fill:#0a9396, stroke:#0a9396
style id3 fill:#ca6702, stroke:#ca6702
style id4 fill:#bb3e03, stroke:#bb3e03
style id5 fill:#9b2226, stroke:#9b2226
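
Putting the two sub-layers together, one pre-normalised block might look like the following sketch (it reuses the `MultiHeadSelfAttention` class from the attention section; the hidden-size multiplier of 4 is an assumption, not necessarily the repository's hyperparameter):

```python
class TransformerBlock(nn.Module):
    def __init__(self, embedding_len, num_head):
        super().__init__()
        self.norm1 = nn.LayerNorm(embedding_len)
        self.attention = MultiHeadSelfAttention(embedding_len, num_head)
        self.norm2 = nn.LayerNorm(embedding_len)
        # Feedforward network: expand, apply a non-linearity, project back down
        self.ffn = nn.Sequential(
            nn.Linear(embedding_len, 4 * embedding_len),
            nn.ReLU(),
            nn.Linear(4 * embedding_len, embedding_len),
        )

    def forward(self, x):
        x = x + self.attention(self.norm1(x))   # pre-norm attention + residual connection
        x = x + self.ffn(self.norm2(x))         # pre-norm feedforward + residual connection
        return x
```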

Decoder

Like GPT, Makespeare is a decoder-only transformer.

Multi Head Causal Self Attention

An attention mechanism similar to dot product self-attention, with the only difference being that queries are not given access to any succeeding key-value pairs. In other words, no future tokens are accessed by the decoder while predicting the current token.

A (context_length, context_length) mask is used to accomplish this. The mask is set to $-\infty$ at positions corresponding to future tokens and to $0$ everywhere else, so that adding it to $QK^T$ before the softmax drives the attention weights of future tokens to zero.

$$Mask = \begin{bmatrix} 0 & -\infty & -\infty \\ 0 & 0 & -\infty \\ 0 & 0 & 0 \end{bmatrix}, \qquad QK^T = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix}$$

$$QK^T + Mask = \begin{bmatrix} a & -\infty & -\infty \\ d & e & -\infty \\ g & h & i \end{bmatrix}$$
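
In PyTorch, such a mask is commonly built with `torch.tril` and applied to the attention scores before the softmax; a self-contained sketch (the tensors here are random stand-ins, not the model's actual values):

```python
import math
import torch
import torch.nn.functional as F

context_length, d_k = 3, 16
Q = torch.randn(context_length, d_k)
K = torch.randn(context_length, d_k)

# Lower-triangular matrix: position i is only allowed to attend to positions <= i
allowed = torch.tril(torch.ones(context_length, context_length))

scores = Q @ K.T / math.sqrt(d_k)                          # (context_length, context_length)
scores = scores.masked_fill(allowed == 0, float('-inf'))   # future positions set to -inf
weights = F.softmax(scores, dim=-1)                        # future positions receive weight 0
```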
