Nano-GPT

We break the video tutorial by Andrej Karpathy on building a Generative Pre-trained Transformer (GPT) from scratch into 10 sections. The tutorial follows ideas covered in the paper Attention Is All You Need.

Using PyTorch, we begin by coding a single-layer neural-network bigram model that takes a single token as input and generates the next token. We then progress in complexity all the way to a decoder with 6 Transformer layers. The final model takes a sequence as input and generates the next token.
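
As a rough sketch (not necessarily the exact code from the tutorial), the bigram starting point is essentially a single embedding table whose rows are the next-token logits for each input token:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):
    """Each token id indexes a row of logits for the next token."""
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx: (B, T) batch of token indices
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)  # last time step
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
```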

Key concepts

  1. Self-Attention Mechanism:

    • The core innovation of the Transformer is the self-attention mechanism. It allows the model to weigh the importance of different parts of the input sequence when processing a particular token, replacing the need for fixed-length context windows or recurrence. Because we build a decoder, the self-attention mechanism only attends to the current and earlier positions in the sequence; future positions are masked out.

    • Scaled Dot-Product Attention: Self-attention computes attention scores by taking the dot product of a query vector with each key vector and scaling the result by the square root of the key dimension. A softmax over these scores gives the attention weights, which are used to compute a weighted sum of the value vectors. A minimal single-head sketch appears after this list.

  2. Multi-Head Attention:

    • Multi-head attention captures different types of relationships in the input data. Instead of using a single attention mechanism, multiple mechanisms (heads) are used in parallel.
    • Each head learns a different representation of the input sequence; the heads' outputs are concatenated and projected to form the final output.
  3. Positional Encoding:

    • Since the Transformer model has no built-in notion of token order (unlike sequential models such as RNNs), positional encodings are added to the input embeddings to give the model information about the position of tokens in the sequence. The sketch after this list uses learned position embeddings, as the tutorial does.
  4. Feed-Forward Neural Networks:

    • After the self-attention sub-layer, each decoder block contains a position-wise feed-forward neural network as its second sub-layer (included in the block sketch after this list).
  5. Residual Connections and Layer Normalization:

    • To facilitate training deep networks, each sub-layer is wrapped in a residual connection and combined with layer normalization. The original paper applies layer normalization after each sub-layer; the tutorial uses the now-common pre-norm variant, applying it before. Both pieces appear in the decoder block sketch after this list.
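
Item 1 can be made concrete with a short sketch of one causal self-attention head. Names such as `n_embd`, `head_size`, and `block_size` are illustrative hyperparameters, not necessarily those used in the scripts: queries and keys produce scaled scores, a lower-triangular mask hides future positions, and the softmax weights form a weighted average of the values.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class Head(nn.Module):
    """One head of causal (masked) self-attention."""
    def __init__(self, n_embd, head_size, block_size, dropout=0.1):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular mask so position t attends only to positions <= t
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # scaled dot-product attention scores
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        v = self.value(x)  # (B, T, head_size)
        return wei @ v     # (B, T, head_size)
```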
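
Continuing the same sketch, items 2, 4, and 5 combine into a decoder block: several heads run in parallel and are concatenated, a small feed-forward network follows the attention, and both sub-layers are wrapped in residual connections with layer normalization (pre-norm placement here; the original paper's post-norm placement would move the LayerNorms after the residual additions).

```python
class MultiHeadAttention(nn.Module):
    """Several attention heads in parallel, concatenated and projected."""
    def __init__(self, n_embd, num_heads, head_size, block_size, dropout=0.1):
        super().__init__()
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size, dropout) for _ in range(num_heads)]
        )
        self.proj = nn.Linear(num_heads * head_size, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.dropout(self.proj(out))


class FeedForward(nn.Module):
    """Position-wise feed-forward network applied after attention."""
    def __init__(self, n_embd, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


class Block(nn.Module):
    """Decoder block: attention then feed-forward, each wrapped in a
    residual connection with layer normalization (pre-norm here)."""
    def __init__(self, n_embd, n_head, block_size, dropout=0.1):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_embd, n_head, head_size, block_size, dropout)
        self.ffwd = FeedForward(n_embd, dropout)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # residual around attention
        x = x + self.ffwd(self.ln2(x))  # residual around feed-forward
        return x
```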
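
Finally, item 3 shows up in the model's forward pass. The sketch below (reusing the Block above, with assumed names) adds learned position embeddings to the token embeddings before the stack of decoder blocks and projects the result to next-token logits:

```python
class NanoGPT(nn.Module):
    """Token + position embeddings feeding a stack of decoder blocks."""
    def __init__(self, vocab_size, n_embd, n_head, n_layer, block_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(
            *[Block(n_embd, n_head, block_size) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(n_embd)           # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)  # (B, T, n_embd)
        pos_emb = self.position_embedding_table(
            torch.arange(T, device=idx.device)
        )                                          # (T, n_embd)
        x = tok_emb + pos_emb    # broadcast: same positions for every sequence
        x = self.blocks(x)
        x = self.ln_f(x)
        return self.lm_head(x)   # (B, T, vocab_size) logits
```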

Technical Note

The scripts were run on a MacBook Pro M2 Max, so the device is set to "mps" to take advantage of the GPU. The scripts include a code snippet to accommodate "cuda" or "cpu".
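
The device-selection logic described above presumably resembles the following common pattern (the exact snippet in the scripts may differ):

```python
import torch

# Prefer the Apple-silicon GPU (MPS), then CUDA, then fall back to CPU.
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
```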
