
Comments (11)

alasdairforsythe commented on May 29, 2024

This coming update is a major one (it's complete, except for waiting on the vocabularies to finish building and the Python implementation). All the vocabularies now prefer to split words with the space in front, and I'm seeing at least 10% gains as well as increased consistency. The "balanced" vocabularies actually perform better on English than the "compressed" vocabularies, which appears to be because balancing reduces the number of tokens allocated to the various combinations of words and symbols with spaces, tabs and newlines that arise from a significant part of the English dataset being code. Obviously that also increases the token count for code significantly, but it's much more consistent (and it's optional anyway).

I've also sped up tokenization by 15x in Go and 60x in Python. It now runs about as fast as it can loop through the text and do the branching logic for 3 branches; the actual finding of tokens is optimized away. So this new version will be several times faster than tiktoken.


alasdairforsythe commented on May 29, 2024

Hi, thanks for writing that detailed and well thought out suggestion.

I feel a little bad that you wrote that out so nicely and my answer is "it's already done"... in the new version, which I will release this week.

What I did is implement a 'D' deleteToken. In simple terms, any sequence of letters or numbers that is not already preceded by a space has "D " inserted in front. That means all words now consistently begin with a space. The tokenizer can optimize as it likes, but I'm nudging it by giving it the rule that it should side with optimizing towards "space letter" or "space number" beginnings whenever there are 2 or more branches of equal weight.
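
A minimal sketch of that idea in Python (the marker character, the regex, and the handling of the start of the text are assumptions for illustration, not the exact capcode implementation):

```python
import re

# Hypothetical forward-delete marker: "delete the character that follows me".
D = "D"

def add_delete_markers(text: str) -> str:
    """Insert 'D ' before any letter/digit run not already preceded by a space,
    so that every word consistently begins with a space."""
    return re.sub(r"(?<![ \w])(\w)", D + r" \1", text)

def strip_delete_markers(text: str) -> str:
    """Decoder side: each 'D' deletes the character after it (here, the space)."""
    return text.replace(D + " ", "")

sample = "hello,world (42)"
encoded = add_delete_markers(sample)
print(encoded)                                   # D hello,D world (D 42)
assert strip_delete_markers(encoded) == sample   # round-trips back to the input
```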

I've also added 4 modes: 0 = maximize chr/tok, 1 = balanced, 2 = consistent, 3 = maximize consistency (required)
Anything greater than 0 will trim tokens that don't follow certain rules. At maximum consistency it'll essentially be forced to have only 1 token for the same word, instead of allowing it to be merged in with punctuation, etc.

I'll have a think about whether a joiner will be useful. It's past midnight and my brain is refusing to think properly, so I'll come back to it tomorrow.


alasdairforsythe commented on May 29, 2024

It seems to be like this:

  • If you assume a space, you need a token to undo it
  • If you don't assume it, you need a token to add it

It can be normalized so there are spaces:

  • always before
  • always after
  • never
  • both

The optimal choice will be either before or after, but not both or neither, because there is only one space between words.

The solution then is to normalize for one direction. This is me thinking out loud. I do wish there were a neat solution that allowed one version of a word without requiring additional tokens to either join or space them. The level 3 mode will result in separate word/prefix and suffix tokens, which is not great for compression, but it will be easy for the language model to learn.
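
A toy illustration of those four options on a two-word string (the token shapes are made up for the example):

```python
# Each convention below shows the tokens a vocab would need to reproduce
# "hello world" exactly. Token shapes are invented for illustration only.
text = "hello world"

conventions = {
    "space before": ["hello", " world"],      # first word carries no leading space
    "space after":  ["hello ", "world"],      # last word carries no trailing space
    "never":        ["hello", " ", "world"],  # needs an explicit space token
    "both":         ["hello ", " world"],     # naive join doubles the space
}

for name, tokens in conventions.items():
    decoded = "".join(tokens)
    status = "ok" if decoded == text else "needs an undo/fix-up token"
    print(f"{name:12} {len(tokens)} tokens -> {decoded!r:17} {status}")
```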


alasdairforsythe commented on May 29, 2024

The new version (1.0) is finally released. No prebuilt vocabularies yet though, the first few of those will be up tomorrow.


kosiakk commented on May 29, 2024

Practically speaking, there will be two types of tokens: words and punctuation. We can also call them alphabetic and symbolic.

Symbolic tokens are spelled precisely and preserve all their whitespace: "!", " !" and "! " would all be different tokens, if the optimizer decides so.
No additional whitespace is added before or after symbols. I'd guess Chinese characters would also fall into this category, because they don't use spaces or capitalization.

Alphabetic tokens are subject to Capcode and the proposed Spacecode.
The tokenizer expects one space between pairs of alphabetic tokens, and the token-to-string decoder would insert it by default.

The default case is one space between words.
If two words are unconventionally separated with zero, two, three or more spaces, there will be special tokens between them:

Input              Tokens                      Comment
"helloworld"       hello + world               Unusual concatenation requires additional tokens
"hello world"      hello world                 Most common default case is just 2 tokens
"hello  world"     hello [2-space token] world Two spaces
"hello   world"    hello [3-space token] world Three spaces
"hello, world"     hello ", " world            Symbolic token contains all the spaces it needs

Just like in Capcode, we might foresee modifier tokens (see the decoding sketch after this list):

  • + for a single concatenation of two tokens
  • [+ and +] to concatenate longer words, though their usefulness needs to be tested
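
A rough decoding sketch for that scheme (the single "+" joiner and the token classes are from the proposal above; the function itself and its rules are assumptions, not TokenMonster code):

```python
# Decoder sketch: alphabetic tokens get one implicit space between them,
# symbolic tokens are spelled exactly as stored, and a "+" joiner token
# suppresses the implicit space. All names and rules here are illustrative.

JOIN = "+"  # hypothetical single-concatenation token

def is_alphabetic(token: str) -> bool:
    # Word-like token: letters (plus apostrophes), no punctuation or spaces.
    return any(c.isalpha() for c in token) and all(c.isalpha() or c == "'" for c in token)

def decode(tokens: list[str]) -> str:
    out, prev_alpha, suppress = [], False, False
    for tok in tokens:
        if tok == JOIN:
            suppress = True          # swallow the next implicit space
            continue
        if is_alphabetic(tok):
            if prev_alpha and not suppress:
                out.append(" ")      # the assumed space between word tokens
            prev_alpha = True
        else:
            prev_alpha = False       # symbolic tokens carry their own spacing
        suppress = False
        out.append(tok)
    return "".join(out)

print(decode(["hello", "world"]))        # hello world   (space assumed)
print(decode(["hello", "+", "world"]))   # helloworld    (joiner removes it)
print(decode(["hello", ", ", "world"]))  # hello, world  (symbolic token spells it)
```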

My guess is that it would improve both compression and the semantic clarity of most English text.


alasdairforsythe commented on May 29, 2024

The issue with all methods is what to do with words made of multiple tokens. The current version prefers to tokenize with a space after words. I would have liked to go with that and implement it with a backspace token, but the issue there is that it would mess with the decoding, since a future token could affect a previous token. That's why I had to go with a forward delete.

By assuming a space and adding tokens for non-space, you'd (almost) double the number of tokens required to represent multi-token words. The D deleteToken adds the opportunity for consistency, but the optimizer doesn't have to take it, and it doesn't always (I can see that in my tests), partly because tokens that don't begin with a space can be used both as parts of longer words and as individual words.

The current version's method of letting it choose anything has the issue that it produces, as you showed, multiple versions of words. That makes it take longer (require more training) for a small language model to learn. My solution is to offer those different "modes" so the user can choose whether they prefer compression or consistency. But it would be nice if there were a one-size-fits-all solution.


kosiakk commented on May 29, 2024

If you assume a space, you need a token to undo it
If you don't assume it, you need a token to add it

Exactly! The idea is to assume a space (between pairs of specific tokens) and utilize a dedicated token to remove it when necessary.

normalize for one direction

OpenAI adopted a similar approach:
"hello world,world" is encoded as "hello", " world", ",", "world", where the two "world" tokens are completely different because of the space. On top of that, "World" and " World" are two more tokens, resulting in 4 distinct tokens for what humans might consider a single word, just placed differently.

"helloworld" becomes "hell", "ow", "orld", which will make German much harder to tokenize (it omits spaces quite often!)
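
That behaviour is easy to check with tiktoken (using the "gpt2" encoding here as an assumption about which OpenAI vocabulary is meant):

```python
# Show how OpenAI's BPE stores the space inside the token, so "world" and
# " world" (and "World", " World") end up as distinct vocabulary entries.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # assumption: the GPT-2 vocab (the one with vocab.json)

for text in ["hello world,world", "helloworld", "World", " World"]:
    ids = enc.encode(text)
    pieces = [enc.decode_single_token_bytes(i).decode("utf-8", "replace") for i in ids]
    print(f"{text!r:22} -> {pieces}")
```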

always before

Yes, OpenAI uses the "space-ish" symbol Ġ, representing whitespace or word boundaries, stored in the vocab.json. This symbol is then cleaned up during decoding.

always before, after, never, or both
because there is only one space between words

Only "never" looks reasonable to me: a space is just not part of the word! It's a technical detail that must be handled by the tokenizer. All programming languages agree with this view; they discard whitespace and comments as soon as possible (Python uses indentation level in place of curly braces, but still ignores spaces around =). Decoding is similar to automatic code formatting: it would insert spaces appropriately and automatically.

Of course, we need exceptions for written natural languages. That's why "new line" and "two spaces" tokens exist, and that's why I'm proposing a "no space" token as well.

My motivation is not the zip-style compression, but rather enhancing the AI model's efficiency by maintaining the semantic relationship between tokens.


kosiakk commented on May 29, 2024

Shall we run an experiment?

  1. Tokenize a corpus with a decent vocab.
  2. Patch the vocab by trimming all the spaces from alphabetic tokens (and remove collisions).
  3. Redo the same corpus.
    The current tokenizer will be forced to insert a space token between all words, because none of the trimmed tokens would have a space before or after.
  4. Simulate the proposal (a counting sketch follows):
    4.1. How many single-space tokens would be removed (a space between words is assumed by default)?
    4.2. How many word-joiner tokens would be added (for concatenated word pieces)?

This gives us a lower estimate of the possible gains.
In practice, the improvement will be better, because the patched vocab will be ¼ smaller.
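
A sketch of step 4 over an already-tokenized corpus (the token strings, the word-likeness test, and the counting rule are all simplifying assumptions; this does not use tokenmonster's real training pipeline):

```python
# Rough estimate of step 4: walk an already-tokenized corpus (token strings)
# and count, at each boundary between two word-like tokens, whether the
# proposal would save an implicit space (4.1) or cost a joiner token (4.2).
# The word-likeness test and the counting rule are simplifying assumptions.

def is_wordlike(tok: str) -> bool:
    return any(c.isalpha() for c in tok)

def estimate_gains(token_strings: list[str]) -> dict:
    spaces_saved = 0    # 4.1: spaces that become implicit between words
    joiners_added = 0   # 4.2: joiners needed where word pieces touch directly
    prev = None
    for tok in token_strings:
        if prev is not None and is_wordlike(prev) and is_wordlike(tok):
            if prev.endswith(" ") or tok.startswith(" "):
                spaces_saved += 1
            else:
                joiners_added += 1
        prev = tok
    return {"spaces_saved": spaces_saved,
            "joiners_added": joiners_added,
            "net_change": joiners_added - spaces_saved}

print(estimate_gains(["hello", " world", ",", "world", " tok", "en"]))
# {'spaces_saved': 2, 'joiners_added': 1, 'net_change': -1}
```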


alasdairforsythe commented on May 29, 2024

I appreciate your input and I'd be happy for the collaboration.

In terms of compression it won't be better, because the vocabularies are already optimal. If you run the training multiple times it generates the same vocabulary, give or take a few of the borderline tokens.

What makes a difference to the final vocabulary is changing the greediness of the tokenization algorithm, the normalization in some cases, and of course the vocabulary size.

I'm saying that to lead up to an important point, which is that 100256 is a whole different beast than, let's say, 32000. 100256 is so big that it has spare tokens: the optimizer has plenty of available tokens for making different versions of words merged with punctuation, or with spaces on different sides. That's not true of 32000. At 32000, it's feeling the pressure.

So any analysis you're doing is best done on 32000, which is the ideal one-fits-all model size in my opinion.

To implement your suggestion, it's best done on the tokens that come out of getalltokens before the training; otherwise you're comparing two different vocabulary sizes. However, it might be quite similar to the difference between mode 0 and mode 3 in the new version, so I'd suggest you wait for that first.

Another issue with a merge token is that it adds a lot of complexity to the tokenization. It'd be massively more complicated and run several times slower, because it needs to check not just for the next token but also for the possibility of a merge in various positions. The easiest/most-efficient way to implement it would be to add the spaces to the tokens and dataset during training and then swap that over at the end, but that's just the same as using extra space and delete tokens as I'm already doing (only the other way around).


kosiakk commented on May 29, 2024

Ok, let's wait for your upcoming results with the delete token!

I've checked all available sizes of English:

Vocab    Alphabetic   % of vocab   Distinct after trim   Duplicates   % of distinct   % of vocab
100256   67569        67%          55579                 11990        22%             12%
65536    45148        69%          36969                 8179         22%             12%
50256    35129        70%          28620                 6509         23%             13%
32000    22896        72%          18509                 4387         24%             14%
24000    17499        73%          14106                 3393         24%             14%

Rules (a counting sketch follows the list):

  • A token is alphabetic if it contains only the letters a..z, the capcode marker C, spaces, commas, periods, or an apostrophe, and at least one letter.
  • Distinct after trim counts alphabetic tokens after removing all leading and trailing whitespace.
  • Duplicates is the difference between the two.
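
A sketch of how those rules could be applied to a list of decoded token strings (loading the vocabulary itself is left out; the allowed character set mirrors the bullets above):

```python
# Counting sketch for the rules above, applied to decoded token strings.
# Loading the actual vocabulary file is out of scope here.
import string

# a..z, the capcode marker C, space, comma, period, apostrophe
ALLOWED = set(string.ascii_lowercase) | set(" C,.'")

def is_alphabetic(tok: str) -> bool:
    return bool(tok) and all(c in ALLOWED for c in tok) and any(c.isalpha() for c in tok)

def trim_stats(tokens: list[str]) -> dict:
    alphabetic = [t for t in tokens if is_alphabetic(t)]
    distinct = {t.strip() for t in alphabetic}  # "distinct after trim"
    return {"vocab": len(tokens),
            "alphabetic": len(alphabetic),
            "distinct_after_trim": len(distinct),
            "duplicates": len(alphabetic) - len(distinct)}

# "hello", " hello" and "hello " all trim to the same entry -> 2 duplicates
print(trim_stats(["hello", " hello", "hello ", ", ", "42", " the cat"]))
# {'vocab': 6, 'alphabetic': 4, 'distinct_after_trim': 2, 'duplicates': 2}
```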

The fractions remain consistent across vocab sizes. If anything, I'd say the tokenizer can be improved even more for smaller vocab sizes.

the vocabularies are already optimal

Yes, certainly.
I'm also not suggesting changing any of the optimizer's rules.

changing the tokenization algorithm greediness, and the normalization

Yes, exactly that. It feels like whitespace normalization could improve compression: a quarter of the alphabetic tokens could be replaced by a new rule in the tokenizer, and TokenMonster could nom-nom-nom 12-14% more tokens for other needs.
Given the new tokenizer rule, the vocab optimizer will be affected and could do an even better job.

The specific way to do the space normalization seems different in our minds, but I agree that eventually it must lead to the same result. I'm looking forward to it :-)
Thank you for the great work!


kosiakk commented on May 29, 2024

That's brilliant, thank you! I can't wait to see the new vocab :-)

