instant-clip-tokenizer's People

Contributors

djc, michael-p


instant-clip-tokenizer's Issues

Create Python bindings

Expose the functionality of the Rust crate to Python, by creating a wrapper library using PyO3.

In addition to the methods on Tokenizer, there should also be a convenience method that takes a list of strings, tokenizes each of them, adds <start_of_text> and <end_of_text> markers, truncates to a given context length, and writes the results into the rows of a numpy array. This corresponds to the tokenize convenience function from the Python library.

Open questions:

  • Should we also provide something similar to the decode convenience function from the Python library? We do not use it anywhere, and it does not even work out of the box on my computer.
  • The Python library also does some pre-processing on the input text prior to encoding, namely replacing HTML entities and fixing mojibake. This has been left out of the Rust library intentionally. However, should we do it in the Python wrapper just before calling into Rust, so that the wrapper is a drop-in replacement for the Python library?

Make lowercasing input text responsibility of caller

Right now, the Tokenizer::encode method lowercases its input string. It is questionable whether this is the correct place to do this, since it is a form of pre-processing, or whether lowercasing should instead be the responsibility of the caller (that is, callers would have to pass in already-lowercased text).

Option 1 (keep it in Tokenizer::encode):

  • Pros: when using the standard vocabulary file, inputs must be lowercase; otherwise the resulting tokenization is wrong
  • Cons: the caller might already have lowercased the input, so in this (probably rare) case we pay for an extra allocation

Option 2 (let the caller do it):

  • Pros: might be slightly more performant; no pre-processing done by tokenizer
  • Cons: users of the tokenizer must read the documentation to find out that they have to provide lowercased input, or they will get subtly wrong results

Use custom error type for `Tokenizer::with_vocabulary`

Currently, the Tokenizer::with_vocabulary constructor returns a std::io::Error, abusing the io::ErrorKind::Other variant for situations where the vocabulary data format is invalid.

We should probably provide a custom crate-level Error type and use that instead, with one variant for I/O errors and further variants for invalid data formats.
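A minimal sketch of what such a crate-level error type could look like (the variant names below are assumptions for illustration, not a final API):

```rust
use std::fmt;
use std::io;

/// Hypothetical crate-level error type: one variant wraps the underlying
/// I/O error, another covers malformed vocabulary data.
#[derive(Debug)]
pub enum Error {
    /// Underlying I/O failure while reading the vocabulary.
    Io(io::Error),
    /// The vocabulary data itself was malformed.
    InvalidVocabulary(String),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::Io(e) => write!(f, "I/O error: {}", e),
            Error::InvalidVocabulary(msg) => write!(f, "invalid vocabulary data: {}", msg),
        }
    }
}

impl std::error::Error for Error {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            Error::Io(e) => Some(e),
            Error::InvalidVocabulary(_) => None,
        }
    }
}

/// Lets `?` convert I/O errors automatically inside `with_vocabulary`.
impl From<io::Error> for Error {
    fn from(e: io::Error) -> Self {
        Error::Io(e)
    }
}

fn main() {
    let err = Error::InvalidVocabulary("duplicate merge rule".into());
    println!("{}", err);
}
```

With the `From<io::Error>` impl, existing `?` propagation inside the constructor keeps working unchanged.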

Set up CI/CD

Some simple CI/CD setup is necessary. At least the following should be checked:

  • code compiles
  • tests are green
  • formatting is ok
  • Clippy is happy
  • code compiles with the MSRV
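Assuming GitHub Actions is used (an assumption; the issue does not name a CI provider), a minimal workflow covering the checks above might look like:

```yaml
name: CI
on: [push, pull_request]

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo build --all-targets                  # code compiles
      - run: cargo test                                 # tests are green
      - run: cargo fmt --all -- --check                 # formatting is ok
      - run: cargo clippy --all-targets -- -D warnings  # Clippy is happy
  # A separate job would install the crate's MSRV toolchain and run
  # `cargo build` with it; the exact version to pin depends on the crate.
```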

Replace word-splitting regex with a custom parser

Instead of using a regular expression for splitting the input text into words (in Tokenizer::word_split) we could write a custom parser, thereby getting rid of the dependency on the regex crate and perhaps improving performance slightly.
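As a rough illustration, a hand-rolled splitter could classify characters into alphabetic runs, single digits, and punctuation runs. This is a simplified sketch only: the real CLIP pattern also special-cases contractions such as 's and 'll, which are ignored here.

```rust
/// Simplified word splitter: alphabetic runs, single digits, and runs of
/// other non-whitespace characters become separate words. Contraction
/// handling from the original regex is intentionally omitted.
fn word_split(text: &str) -> Vec<&str> {
    let mut words = Vec::new();
    let mut chars = text.char_indices().peekable();
    while let Some(&(start, c)) = chars.peek() {
        if c.is_whitespace() {
            chars.next();
            continue;
        }
        // Pick the character class of the run starting at `start`.
        let same_class: fn(char) -> bool = if c.is_alphabetic() {
            |c| c.is_alphabetic()
        } else if c.is_numeric() {
            // The original pattern matches digits one at a time.
            chars.next();
            words.push(&text[start..start + c.len_utf8()]);
            continue;
        } else {
            |c| !c.is_whitespace() && !c.is_alphabetic() && !c.is_numeric()
        };
        // Consume the whole run of same-class characters.
        let mut end = start;
        while let Some(&(i, c)) = chars.peek() {
            if same_class(c) {
                end = i + c.len_utf8();
                chars.next();
            } else {
                break;
            }
        }
        words.push(&text[start..end]);
    }
    words
}

fn main() {
    println!("{:?}", word_split("hello, 2 worlds!!"));
}
```

A benchmark against the regex version would be needed before committing to this, since the regex crate's DFA is already quite fast.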

Compare tokenizations of Rust tokenizer with original Python code

We should compare the output of the tokenization (and also decoding) of the Rust implementation against the original Python code on some large dataset, to ensure the Rust implementation is 100% compatible.

Questions:

  • What dataset should we use for comparison? Maybe texts from the dataset CLIP was trained on?
  • Can we also run this comparison in CI? Might be too expensive, and might mean we need to cache the dataset somehow

Add fuzzing to crate

It would be good to have some simple fuzz tests, since the inputs to the tokenizer will often be untrusted user input.

In addition to just checking for panics, we could also verify in the fuzz test that lowercase_and_strip_whitespace(decode(encode(input))) == lowercase_and_strip_whitespace(input) holds for every input, where lowercase_and_strip_whitespace lowercases its input and removes all whitespace.
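The normalization helper for that round-trip property is straightforward. The sketch below assumes a fuzz target (e.g. via cargo-fuzz) would call the crate's real encode/decode, which are shown only as comments here:

```rust
/// Lowercase the input and remove all whitespace, so that the round-trip
/// comparison ignores the two transformations the tokenizer is allowed
/// to make.
fn lowercase_and_strip_whitespace(s: &str) -> String {
    s.to_lowercase().chars().filter(|c| !c.is_whitespace()).collect()
}

fn main() {
    // The fuzz target itself would assert, for arbitrary `input`:
    //
    //     lowercase_and_strip_whitespace(&decode(&encode(input)))
    //         == lowercase_and_strip_whitespace(input)
    //
    // (`encode`/`decode` stand for the crate's real functions and are
    // not reimplemented in this sketch.)
    assert_eq!(lowercase_and_strip_whitespace("Hello,\tWorld !"), "hello,world!");
}
```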

Cache tokenization results for individual words

The original Python library uses an internal dictionary (SimpleTokenizer#cache) where it stores the tokenization result for each word it has tokenized so far.
This obviously speeds things up for frequently occurring words; the drawback, however, is that the cache is unbounded in size. Therefore, if the tokenizer is used on user-provided input (e.g. for a search API endpoint), a malicious user could feed it random sequences of characters, eventually causing an out-of-memory condition.
For this reason the Rust implementation does not yet use an internal cache.

We have a few options here:

  • Option 1: Simply don't provide a cache. Tokenization of individual words is already so fast that a cache might actually slow things down, especially if the cache implementation is more complex (i.e. not simply an unbounded AHashMap but instead something bounded like an LRU cache).
  • Option 2: Let the user decide. We could define a trait Cache and change the Tokenizer::encode method to fn encode<C: Cache>(&self, text: &str, out: &mut Vec<Token>, cache: &mut C). We should probably also provide some simple implementations like NoCache (does nothing) and UnboundedCache (just uses an AHashMap, equivalent to the Python version). If a user wants something fancier than that, they can implement it themselves.
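Option 2 could be sketched as follows. Trait and type names are taken from the issue text but remain assumptions, `Token` is a stand-in for the crate's real token type, and std's HashMap replaces AHashMap to keep the sketch dependency-free:

```rust
use std::collections::HashMap;

/// Stand-in for the crate's real `Token` type.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Token(pub u16);

/// Proposed trait: the tokenizer asks the cache before tokenizing a word
/// and stores the result on a miss.
pub trait Cache {
    fn get(&self, word: &str) -> Option<&[Token]>;
    fn insert(&mut self, word: &str, tokens: &[Token]);
}

/// Does nothing; every word is tokenized from scratch.
pub struct NoCache;

impl Cache for NoCache {
    fn get(&self, _word: &str) -> Option<&[Token]> {
        None
    }
    fn insert(&mut self, _word: &str, _tokens: &[Token]) {}
}

/// Unbounded map, equivalent to the Python library's SimpleTokenizer#cache.
#[derive(Default)]
pub struct UnboundedCache {
    map: HashMap<String, Vec<Token>>,
}

impl Cache for UnboundedCache {
    fn get(&self, word: &str) -> Option<&[Token]> {
        self.map.get(word).map(Vec::as_slice)
    }
    fn insert(&mut self, word: &str, tokens: &[Token]) {
        self.map.insert(word.to_owned(), tokens.to_vec());
    }
}

fn main() {
    let mut cache = UnboundedCache::default();
    assert!(cache.get("hello").is_none());
    cache.insert("hello", &[Token(1), Token(2)]);
    println!("cache hit: {:?}", cache.get("hello"));
}
```

Making `encode` generic over `C: Cache` keeps the no-cache path zero-cost: with `NoCache`, the `get`/`insert` calls are trivially inlined away.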
