Giter Site home page Giter Site logo

minbpe-hs's Introduction

minbpe-hs

Introduction
Installation
Usage
Tests
Parity With Python

Introduction

minbpe-hs is a port of Andrej Karpathy's concise byte-level byte pair encoding (BPE) implementation, minbpe, to Haskell. During training, BPE compresses a sequence by repeatedly finding the most frequent pair of elements in the input and merging them into a new token whilst maintaining a record of the merges and the tokens. Encoding is performed by recursively merging pairs with the smallest merge index until there remain no pairs to be merged, and encoded representations can be decoded backed into the original through vocabulary look-ups. Given its recursive nature, BPE lends itself naturally to functional programming, and the objective of this repository is to offer an alternative functional formulation of it that is orthogonal to typical implementations in imperative languages such as Python or Rust to aid learners in gaining a firmer grasp on BPE by viewing it through a different lens.

Similar to minbpe, BPE.Basic and BPE.Regex both supply the same fundamental functionalities of training, encoding, and decoding, but the former treats a text corpus as one continuous sequence whereas the latter, in addition to supporting special tokens, divides it into smaller segments using a regex pattern and considers merges only within these segments. BPE.Base contains core operations employed by the two tokenizers and also provides storing & loading utilities. There is no counterpart to minbpe's GPT4Tokenizer as tiktoken does not expose a Haskell frontend.

Installation

Please run cabal install to install minbpe-hs. Alternatively, you can pass the source directory of this repository to GHC via the -i flag when compiling code that uses minbpe-hs (e.g., ghc -isrc MyProgram.hs).

Usage

BPE's Wikipedia entry discusses a simple example that can be reproduced using minbpe-hs as follows.

{-# LANGUAGE OverloadedStrings #-}

import BPE.Base
import BPE.Basic

main :: IO ()
main = do
    let (merges, vocab) = trainTokenizer (256 + 3) "aaabdaaabac" -- 256 initial tokens (bytes), 3 merges
    putStrLn $ show $ encode merges "aaabdaaabac" -- [258, 100, 258, 97, 99]
    putStrLn $ show $ decode vocab [258, 100, 258, 97, 99] -- "aaabdaaabac"
    saveMergesAndVocab "toy" merges vocab -- Saves toy.merges and toy.vocab

More examples can be found in the examples/ folder. On average, the training portions of the examples run 2-3x faster than minbpe, and encoding long strings of text can be 5-6x times faster.

Tests

The test suite can be run using cabal test or by directly executing tests/Main.hs, assuming the test dependencies have already been installed.

Parity With Python

Generally, the results of minbpe-hs are identical to those of the equivalent Python implementation (i.e., minbpe), with one major exception: regex-pcre, the Haskell regex package used by this project, does not properly handle non-ASCII characters in that it matches the UTF-8-encoded bytes of a single Unicode character separately, so the output of BPE.Regex would be incompatible with that of minbpe's RegexTokenizer should the input not be purely ASCII.

minbpe-hs's People

Contributors

bobmcdear avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.