This project forked from mlc-ai/tokenizers-cpp

Universal cross-platform tokenizers binding to HF and sentencepiece

License: Apache License 2.0


tokenizers-cpp

This project provides a cross-platform C++ tokenizer binding library that can be universally deployed. It wraps and binds the HuggingFace tokenizers library and sentencepiece, and provides a minimal common interface in C++.

The main goal of the project is to enable tokenizer deployment for language model applications on native platforms with minimal dependencies, and to remove some of the barriers of cross-language bindings. This project is developed in part with, and used in, MLC LLM. We have tested the following platforms:

  • iOS
  • Android
  • Windows
  • Linux
  • Web browser

Getting Started

The easiest way to get started is to add this project as a submodule and then include it via add_subdirectory in your CMake project. You also need to turn on C++17 support.

  • First, make sure you have Rust installed.
  • If you are cross-compiling, make sure you install the necessary Rust target. For example, run rustup target add aarch64-apple-ios to install the iOS target.
  • You can then link the library.

See the example folder for an example CMake project.

Example Code

#include <tokenizers_cpp.h>

#include <string>
#include <vector>

using tokenizers::Tokenizer;

// LoadBytesFromFile is a helper that reads a whole file into a std::string.
// It is not defined in this snippet.

// Expects the tokenizer definition at dist/tokenizer.json.
void HuggingFaceTokenizerExample() {
  // Read blob from file.
  auto blob = LoadBytesFromFile("dist/tokenizer.json");
  // Note: all the current factory APIs take an in-memory blob as input.
  // This gives some flexibility in how these blobs can be read.
  auto tok = Tokenizer::FromBlobJSON(blob);
  std::string prompt = "What is the capital of Canada?";
  // Call Encode to turn the prompt into token ids.
  std::vector<int> ids = tok->Encode(prompt);
  // Call Decode to turn the ids back into a string.
  std::string decoded_prompt = tok->Decode(ids);
}

// Expects the sentencepiece model at dist/tokenizer.model.
void SentencePieceTokenizerExample() {
  // Read blob from file.
  auto blob = LoadBytesFromFile("dist/tokenizer.model");
  // Note: all the current factory APIs take an in-memory blob as input.
  // This gives some flexibility in how these blobs can be read.
  auto tok = Tokenizer::FromBlobSentencePiece(blob);
  std::string prompt = "What is the capital of Canada?";
  // Call Encode to turn the prompt into token ids.
  std::vector<int> ids = tok->Encode(prompt);
  // Call Decode to turn the ids back into a string.
  std::string decoded_prompt = tok->Decode(ids);
}
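
The examples above call LoadBytesFromFile without showing its definition. Since the factory functions only need an in-memory blob, one plausible way to write such a helper is sketched below. This is an assumption for illustration, not the project's confirmed implementation: the signature, the std::string return type, and the error handling are all hypothetical choices.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper (an assumption, not a confirmed tokenizers-cpp API):
// reads the entire file at `path` into a std::string, byte for byte.
std::string LoadBytesFromFile(const std::string& path) {
  // Open in binary mode so that no newline translation is applied.
  std::ifstream fs(path, std::ios::in | std::ios::binary);
  if (!fs) {
    throw std::runtime_error("Cannot open " + path);
  }
  // Copy every byte from the stream buffer into the string.
  return std::string(std::istreambuf_iterator<char>(fs),
                     std::istreambuf_iterator<char>());
}
```

Because the blob is an ordinary in-memory string, the same bytes could equally come from an embedded resource, a network download, or an app bundle, which may matter on platforms where direct file access is restricted.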

Extra Details

Currently, the project generates three static libraries:

  • libtokenizers_c.a: the C binding to the tokenizers Rust library
  • libsentencepiece.a: the sentencepiece static library
  • libtokenizers_cpp.a: the C++ binding implementation

If you are using an IDE, you can likely first use CMake to generate these libraries and then add them to your development environment. If you are using CMake, target_link_libraries(yourlib tokenizers_cpp) will automatically link in the other two libraries. You can also check out MLC LLM as an example of a complete LLM chat application integration.

JavaScript Support

We use emscripten to expose tokenizers-cpp to wasm and JavaScript. Check out the web folder for more details.

Acknowledgements

This project is only possible thanks to the open-source ecosystems whose shoulders we stand on. It is based on the sentencepiece and tokenizers libraries.

Contributors

bbuf, charliefruan, erjanmx, hzfengsy, junrushao, luiyen, manuongithub, songkq, tqchen
