Giter Site home page Giter Site logo

tokenizer's Introduction

Tokenizer

Tokenizer is a Rust project inspired by OpenAI, aimed at providing a basic implementation of the Byte Pair Encoding (BPE) algorithm. This project serves as a learning opportunity for Rust enthusiasts, particularly those interested in the field of artificial intelligence.

Introduction

Byte Pair Encoding (BPE) is a popular technique used in natural language processing (NLP) tasks, particularly in tokenization. It involves iteratively merging the most frequent pair of symbols in a corpus, effectively learning subword units that are useful for various NLP tasks.

Tokenizer implements the BPE algorithm in Rust, providing a foundation for further exploration and experimentation in tokenization and NLP.

Features

  • BPE Implementation: Provides a basic implementation of the Byte Pair Encoding algorithm in Rust.
  • Extensible: Designed with modularity in mind, allowing for easy expansion with additional modules for different tokenization techniques, such as GPT-2 tokenization.
  • MIT License: Released under the MIT License, enabling anyone to use, modify, and distribute the project freely.

Usage

  1. Clone the Repository:

    git clone https://github.com/usama3627/tokenizer.git
  2. Install Rust: Ensure that you have Rust installed on your system. You can install it using rustup.

  3. Build and Run:

    cd tokenizer
    cargo build
    cargo run
  4. Training Data: To train the tokenizer, download the training dataset from the provided Link (huggingface dataset) and rename it to myresponse.json. Place the dataset in the project directory. For testing, I am using 100 rows of tweets.

Dependencies

  • serde = { version = "1.0.104", features = ["derive"] }
  • serde_json = "1.0.48"

Ensure these dependencies are specified in your Cargo.toml file.

Future Work

In future iterations of the project, the following enhancements can be considered:

  • Modularization: Refactor the code into modules, separating concerns such as BPE implementation and GPT-2 tokenization.
  • Optimizations: Explore optimizations to improve the performance of the tokenization process.
  • Documentation: Enhance documentation to provide detailed explanations of the algorithms and codebase.
  • Additional Tokenization Techniques: Integrate additional tokenization techniques to provide a comprehensive toolkit for NLP tasks.

License

Tokenizer is released under the MIT License. Feel free to use, modify, and distribute the project according to the terms of the license.

Contributors

Contributions to the project are welcome. Fork the repository, make your changes, and submit a pull request.

Acknowledgements

Tokenizer was inspired by the work of OpenAI and Andrej Karpathy and aims to contribute to the Rust and AI communities.

tokenizer's People

Contributors

usama3627 avatar

Stargazers

Saeed Ahmad avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.