instant-clip-tokenizer's People

Contributors

djc, michael-p


instant-clip-tokenizer's Issues

Create Python bindings

Expose the functionality of the Rust crate to Python, by creating a wrapper library using PyO3.

In addition to the methods on Tokenizer, there should also be a convenience method that takes a list of strings, tokenizes each of them, adds <start_of_text> and <end_of_text> markers, truncates to a given context length, and writes the results into the rows of a numpy array. This corresponds to the tokenize convenience function from the Python library.

Open questions:

  • Should we also provide something similar to the decode convenience function from the Python library? We do not use it anywhere, and it does not even work out of the box on my computer.
  • The Python library also does some pre-processing on the input text prior to encoding, namely replacing HTML entities and fixing mojibake. This has been left out of the Rust library intentionally. However, should we do it in the Python wrapper just before calling into Rust, so that the wrapper is a drop-in replacement for the Python library?

Make lowercasing input text responsibility of caller

Right now, the Tokenizer::encode method lowercases its input string. It is questionable whether this is the correct place to do this, since it is a form of pre-processing, or whether lowercasing should instead be the responsibility of the caller (that is, callers would have to pass in already-lowercased text).

Option 1 (keep it in Tokenizer::encode):

  • Pros: when using the standard vocabulary file, inputs must be lowercase; otherwise the resulting tokenization is wrong
  • Cons: the caller might already have lowercased the input, so in this (probably rare) case we pay for an extra allocation

Option 2 (let the caller do it):

  • Pros: might be slightly more performant; no pre-processing done by tokenizer
  • Cons: users of the tokenizer must read the documentation to find out that they have to provide lowercased input, or they will get subtly wrong results

Use custom error type for `Tokenizer::with_vocabulary`

Currently, the Tokenizer::with_vocabulary constructor returns a std::io::Error, abusing the io::ErrorKind::Other variant for situations where the vocabulary data format is invalid.

We should probably provide a custom crate-level Error type and use that instead, with one variant for I/O errors and further variants for invalid data formats.
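A minimal sketch of what such a crate-level error type could look like (the variant names below are assumptions for illustration, not a final API):

```rust
use std::fmt;
use std::io;

/// Hypothetical crate-level error type: one variant wraps the underlying
/// I/O error, another covers malformed vocabulary data.
#[derive(Debug)]
pub enum Error {
    /// Underlying I/O failure while reading the vocabulary.
    Io(io::Error),
    /// The vocabulary data itself was malformed.
    InvalidVocabulary(String),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::Io(e) => write!(f, "I/O error: {}", e),
            Error::InvalidVocabulary(msg) => write!(f, "invalid vocabulary data: {}", msg),
        }
    }
}

impl std::error::Error for Error {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            Error::Io(e) => Some(e),
            Error::InvalidVocabulary(_) => None,
        }
    }
}

/// Lets `?` convert I/O errors automatically inside `with_vocabulary`.
impl From<io::Error> for Error {
    fn from(e: io::Error) -> Self {
        Error::Io(e)
    }
}

fn main() {
    let err = Error::InvalidVocabulary("duplicate merge rule".into());
    println!("{}", err);
}
```

With the `From<io::Error>` impl, existing `?` propagation inside the constructor keeps working unchanged.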

Set up CI/CD

Some simple CI/CD setup is necessary. At least the following should be checked:

  • code compiles
  • tests are green
  • formatting is ok
  • Clippy is happy
  • code compiles with the MSRV
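Assuming GitHub Actions is used (an assumption; the issue does not name a CI provider), a minimal workflow covering the checks above might look like:

```yaml
name: CI
on: [push, pull_request]

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo build --all-targets                  # code compiles
      - run: cargo test                                 # tests are green
      - run: cargo fmt --all -- --check                 # formatting is ok
      - run: cargo clippy --all-targets -- -D warnings  # Clippy is happy
  # A separate job would install the crate's MSRV toolchain and run
  # `cargo build` with it; the exact version to pin depends on the crate.
```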

Replace word-splitting regex with a custom parser

Instead of using a regular expression for splitting the input text into words (in Tokenizer::word_split) we could write a custom parser, thereby getting rid of the dependency on the regex crate and perhaps improving performance slightly.
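As a rough illustration, a hand-rolled splitter could classify characters into alphabetic runs, single digits, and punctuation runs. This is a simplified sketch only: the real CLIP pattern also special-cases contractions such as 's and 'll, which are ignored here.

```rust
/// Simplified word splitter: alphabetic runs, single digits, and runs of
/// other non-whitespace characters become separate words. Contraction
/// handling from the original regex is intentionally omitted.
fn word_split(text: &str) -> Vec<&str> {
    let mut words = Vec::new();
    let mut chars = text.char_indices().peekable();
    while let Some(&(start, c)) = chars.peek() {
        if c.is_whitespace() {
            chars.next();
            continue;
        }
        // Pick the character class of the run starting at `start`.
        let same_class: fn(char) -> bool = if c.is_alphabetic() {
            |c| c.is_alphabetic()
        } else if c.is_numeric() {
            // The original pattern matches digits one at a time.
            chars.next();
            words.push(&text[start..start + c.len_utf8()]);
            continue;
        } else {
            |c| !c.is_whitespace() && !c.is_alphabetic() && !c.is_numeric()
        };
        // Consume the whole run of same-class characters.
        let mut end = start;
        while let Some(&(i, c)) = chars.peek() {
            if same_class(c) {
                end = i + c.len_utf8();
                chars.next();
            } else {
                break;
            }
        }
        words.push(&text[start..end]);
    }
    words
}

fn main() {
    println!("{:?}", word_split("hello, 2 worlds!!"));
}
```

A benchmark against the regex version would be needed before committing to this, since the regex crate's DFA is already quite fast.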

Compare tokenizations of Rust tokenizer with original Python code

We should compare the output of the tokenization (and also decoding) of the Rust implementation against the original Python code on some large dataset, to ensure the Rust implementation is 100% compatible.

Questions:

  • What dataset should we use for comparison? Maybe texts from the dataset CLIP was trained on?
  • Can we also run this comparison in CI? Might be too expensive, and might mean we need to cache the dataset somehow

Add fuzzing to crate

It would be good to have some simple fuzz tests, since the inputs to the tokenizer will often be untrusted user input.

In addition to just checking for panics, we could also verify in the fuzz test that lowercase_and_strip_whitespace(decode(encode(input))) == lowercase_and_strip_whitespace(input) holds for every input, where lowercase_and_strip_whitespace lowercases its input and removes all whitespace.
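The normalization helper for that round-trip property is straightforward. The sketch below assumes a fuzz target (e.g. via cargo-fuzz) would call the crate's real encode/decode, which are shown only as comments here:

```rust
/// Lowercase the input and remove all whitespace, so that the round-trip
/// comparison ignores the two transformations the tokenizer is allowed
/// to make.
fn lowercase_and_strip_whitespace(s: &str) -> String {
    s.to_lowercase().chars().filter(|c| !c.is_whitespace()).collect()
}

fn main() {
    // The fuzz target itself would assert, for arbitrary `input`:
    //
    //     lowercase_and_strip_whitespace(&decode(&encode(input)))
    //         == lowercase_and_strip_whitespace(input)
    //
    // (`encode`/`decode` stand for the crate's real functions and are
    // not reimplemented in this sketch.)
    assert_eq!(lowercase_and_strip_whitespace("Hello,\tWorld !"), "hello,world!");
}
```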

Cache tokenization results for individual words

The original Python library uses an internal dictionary (SimpleTokenizer#cache) where it stores the tokenization result for each word it has tokenized so far.
This obviously speeds things up for frequently occurring words; the drawback, however, is that the cache is unbounded in size. Therefore, if the tokenizer is used on user-provided input (e.g. for a search API endpoint), a malicious user could feed it random sequences of characters, eventually causing an out-of-memory condition.
For this reason the Rust implementation does not yet use an internal cache.

We have a few options here:

  • Option 1: Simply don't provide a cache. Tokenization of individual words is already so fast that a cache might actually slow things down, especially if the cache implementation is more complex (i.e. not simply an unbounded AHashMap but instead something bounded like an LRU cache).
  • Option 2: Let the user decide. We could define a trait Cache and change the Tokenizer::encode method to fn encode<C: Cache>(&self, text: &str, out: &mut Vec<Token>, cache: &mut C). We should probably also provide some simple implementations like NoCache (does nothing) and UnboundedCache (just uses an AHashMap, equivalent to the Python version). If a user wants something fancier than that, they can implement it themselves.
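Option 2 could be sketched as follows. Trait and type names are taken from the issue text but remain assumptions, `Token` is a stand-in for the crate's real token type, and std's HashMap replaces AHashMap to keep the sketch dependency-free:

```rust
use std::collections::HashMap;

/// Stand-in for the crate's real `Token` type.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Token(pub u16);

/// Proposed trait: the tokenizer asks the cache before tokenizing a word
/// and stores the result on a miss.
pub trait Cache {
    fn get(&self, word: &str) -> Option<&[Token]>;
    fn insert(&mut self, word: &str, tokens: &[Token]);
}

/// Does nothing; every word is tokenized from scratch.
pub struct NoCache;

impl Cache for NoCache {
    fn get(&self, _word: &str) -> Option<&[Token]> {
        None
    }
    fn insert(&mut self, _word: &str, _tokens: &[Token]) {}
}

/// Unbounded map, equivalent to the Python library's SimpleTokenizer#cache.
#[derive(Default)]
pub struct UnboundedCache {
    map: HashMap<String, Vec<Token>>,
}

impl Cache for UnboundedCache {
    fn get(&self, word: &str) -> Option<&[Token]> {
        self.map.get(word).map(Vec::as_slice)
    }
    fn insert(&mut self, word: &str, tokens: &[Token]) {
        self.map.insert(word.to_owned(), tokens.to_vec());
    }
}

fn main() {
    let mut cache = UnboundedCache::default();
    assert!(cache.get("hello").is_none());
    cache.insert("hello", &[Token(1), Token(2)]);
    println!("cache hit: {:?}", cache.get("hello"));
}
```

Making `encode` generic over `C: Cache` keeps the no-cache path zero-cost: with `NoCache`, the `get`/`insert` calls are trivially inlined away.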
