Comments (11)
The coming update is a major one (it's complete, except for the vocabularies still being built and the Python implementation). All the vocabularies now prefer to split words with the space in front, and I'm seeing gains of at least 10% as well as increased consistency. The "balanced" vocabularies actually perform better on English than the "compressed" vocabularies, which appears to be because the change reduces the number of tokens allocated to the various combinations of words and symbols with spaces, tabs and newlines (a significant part of the English dataset is code). Obviously that also increases the token count for code significantly, but it's much more consistent (and it's optional anyway).
I've also sped up tokenization by 15x in Go and 60x in Python. It now runs about as fast as it can loop through the text and execute the branching logic for three branches; the actual lookup of tokens is optimized away. This new version will be several times faster than tiktoken.
from tokenmonster.
Hi, thanks for writing that detailed and well thought out suggestion.
I feel a little bad that you wrote that out so nicely and my answer is "it's already done"... in the new version, which I will release this week.
What I did is implement a 'D' deleteToken. In simple terms, any sequence of letters or numbers that are not already preceded by a space have "D " inserted in front. That means all words now consistently begin with a space. The tokenizer can optimize as it likes, but I'm nudging it by giving it the rule that it should side with optimizing towards "space letter" or "space number" beginnings whenever there are 2 or more branches of equal weight.
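As a rough illustration, that preprocessing step could look something like this sketch (the literal "D " marker string and the ASCII-only definition of a word are my assumptions, not TokenMonster's actual implementation):

```python
import re

# Hedged sketch: insert "D " before any run of letters/digits that is
# not already preceded by a space, so that every word begins with a
# space and the D token deletes that space again during decoding.
def add_forward_delete(text):
    # Negative lookbehind: the previous character is neither a space
    # nor part of the same alphanumeric run (string start qualifies).
    return re.sub(r"(?<![ A-Za-z0-9])([A-Za-z0-9])", r"D \1", text)
```

For example, `add_forward_delete("foo.bar baz")` returns `"D foo.D bar baz"`: the words not already preceded by a space get the marker, while `baz` is left alone.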
I've also added 4 modes: 0 = maximize chr/tok, 1 = balanced, 2 = consistent, 3 = maximize consistency (required)
Anything greater than 0 will trim tokens that don't follow certain rules. At maximum consistency it'll essentially be forced to have only one token per word, instead of allowing words to be merged with punctuation, etc.
I'll have a think about whether a joiner would be useful. It's past midnight and my brain is refusing to think properly, so I'll come back to it tomorrow.
from tokenmonster.
It seems to be like this:
- If you assume a space, you need a token to undo it
- If you don't assume it, you need a token to add it
It can be normalized so there are spaces:
- always before
- always after
- never
- both
The most optimal will be either before or after, but not both or neither because there is only one space between words.
The solution then is to normalize in one direction. (This is me thinking out loud.) I do wish there were a neat solution that allows a single version of each word without requiring additional tokens to either join or space them. The level 3 mode will result in separate word/prefix and suffix tokens, which is not great for compression, but it will be easy for the language model to learn.
from tokenmonster.
The new version (1.0) is finally released. No prebuilt vocabularies yet though, the first few of those will be up tomorrow.
from tokenmonster.
Practically speaking, there will be two types of tokens: words and punctuation. We can also call them alphabetic and symbolic.
Symbolic tokens are spelled precisely and preserve all their whitespace: `!`, ` !`, and `! ` are all different tokens, if the optimizer decides so.
No additional whitespace is added before or after symbols. I'd guess Chinese characters would also fall into this category, because they don't use spaces or capitalization.
Alphabetic tokens are subject to Capcode and the proposed Spacecode.
The tokenizer expects one space between pairs of alphabetic tokens, and the token-to-string decoder inserts it by default.
The default case is one space between words.
If two words are unconventionally separated with zero, two, three or more spaces, there will be special tokens between them:
Input | Tokens | Comment |
---|---|---|
`helloworld` | `hello` `+` `world` | Unusual concatenation requires additional tokens |
`hello world` | `hello` `world` | The most common, default case is just 2 tokens |
`hello  world` | `hello` + extra-space token + `world` | Two spaces |
`hello   world` | `hello` + extra-space tokens + `world` | Three spaces |
`hello, world` | `hello` `, ` `world` | Symbolic token contains all the spaces it needs |
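The behaviour in the table above could be sketched as a tiny decoder. Everything here is an assumption for illustration: `+` as a hypothetical joiner token, and "alphabetic" approximated as "starts with a letter":

```python
# Hypothetical decoder sketch for the assumed-space proposal: one space
# is inserted between adjacent alphabetic tokens; a "+" joiner token
# suppresses it; symbolic tokens carry their own whitespace, so no
# space is added around them.
def decode(tokens):
    out = []
    prev_alpha = False
    join_next = False
    for tok in tokens:
        if tok == "+":           # assumed joiner token
            join_next = True
            continue
        alpha = tok[:1].isalpha()
        if prev_alpha and alpha and not join_next:
            out.append(" ")      # default: one space between words
        out.append(tok)
        prev_alpha = alpha
        join_next = False
    return "".join(out)
```

With this sketch, `decode(["hello", "world"])` gives `"hello world"`, `decode(["hello", "+", "world"])` gives `"helloworld"`, and `decode(["hello", ", ", "world"])` gives `"hello, world"` because the symbolic `", "` token already contains its space.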
Just like in Capcode, we might foresee modifier tokens: `+` for a single concatenation of two tokens, and `[+` / `+]` to concatenate longer words, but their usefulness needs to be tested.
My guess is that it would improve both compression and the semantic clarity of most English texts.
from tokenmonster.
The issue with all methods is what to do with words made of multiple tokens. The current version prefers to tokenize with the space after words. I would have liked to keep that and implement a backspace token, but that would mess with decoding, since a future token could affect a previous token. That's why I had to go with a forward delete.
By assuming a space and adding tokens for non-space, it would mean (almost) doubling the number of tokens required to represent multi-token words. The D deleteToken adds the opportunity for consistency, but the optimizer doesn't have to take it, and it doesn't always (I can see that in my tests), partly because tokens that don't begin with a space can be used both as parts of longer words and as individual words.
The current method of letting it choose anything has the issue that, as you showed, it produces multiple versions of words. That makes it take longer (require more training) for a small language model to learn. My solution is to offer those different "modes" so the user can choose whether they prefer compression or consistency. But it would be nice if there were a one-size-fits-all solution.
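A minimal sketch of why forward delete keeps decoding one-directional (the literal "D" token form is my assumption): the marker only affects the token that follows it, so no earlier output is ever rewritten.

```python
# Hedged sketch of forward-delete decoding: "D" removes the space that
# the following token would otherwise start with. Unlike a backspace
# token, output already emitted is never modified.
def decode_forward_delete(tokens):
    out = []
    delete_next = False
    for tok in tokens:
        if tok == "D":                       # assumed delete-token form
            delete_next = True
            continue
        if delete_next and tok.startswith(" "):
            tok = tok[1:]
        delete_next = False
        out.append(tok)
    return "".join(out)
```

For example, `decode_forward_delete(["D", " hello", " world"])` returns `"hello world"`: the leading space of the D-marked token is removed, and everything after it decodes normally.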
from tokenmonster.
> If you assume a space, you need a token to undo it
> If you don't assume it, you need a token to add it
Exactly! The idea is to assume a space (between pairs of specific tokens) and utilize a dedicated token to remove it when necessary.
> normalize for one direction
OpenAI adopted a similar approach:
`hello world,world` is encoded as `hello`, ` world`, `,`, `world`, where the two "world" tokens are completely different because of the space. On top of that, `World` and ` World` are two more tokens, resulting in 4 distinct tokens for what humans might consider a single word, just placed differently.
`helloworld` becomes `hell` `ow` `orld`, which will make the German language much harder to tokenize (they omit spaces quite often!)
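The effect is easy to reproduce with a toy greedy longest-match tokenizer. This is not OpenAI's actual BPE, and the vocabulary below is made up purely for illustration:

```python
# Toy longest-match tokenizer over a made-up vocabulary, illustrating
# how "world", " world", "World" and " World" become four distinct
# tokens once spaces are baked into the vocabulary entries.
VOCAB = {"hello", " hello", "world", " world", "World", " World", ","}

def greedy_tokenize(text, vocab=VOCAB):
    toks, i = [], 0
    while i < len(text):
        # Try the longest candidate substring starting at i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                toks.append(text[i:j])
                i = j
                break
        else:                     # fall back to a single character
            toks.append(text[i])
            i += 1
    return toks
```

Here `greedy_tokenize("hello world,world")` yields `["hello", " world", ",", "world"]`: the same word surfaces as two unrelated vocabulary entries depending only on the preceding space.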
> always before
Yes, OpenAI uses the "space-ish" symbol `Ġ`, representing whitespace or word boundaries, stored in the vocab.json. This symbol is then cleaned up during decoding.
> always before, after, never, or both
> because there is only one space between words
Only "never" looks reasonable to me - space is just not a part of the word! It's a technical detail, which must be handled by the tokenizer. All programming languages agree with this view - they discard whitespace and comments as soon as possible (Python uses indentation level instead of curly braces, but still ignores spaces around `=`). Decoding is similar to automatic code formatting - it would insert spaces appropriately and automatically.
Of course, we need exceptions for written natural languages. That's why new line and two spaces tokens exist, and that's why I'm proposing no space token as well.
My motivation is not the zip-style compression, but rather enhancing the AI model's efficiency by maintaining the semantic relationship between tokens.
from tokenmonster.
Shall we run an experiment?
1. Tokenize a corpus with a decent vocab.
2. Patch the vocab by trimming all the spaces from alphabetic tokens (and remove collisions).
3. Redo the same corpus. The current tokenizer will be forced to insert single-space tokens between words.
4. Simulate the proposal:
   - 4.1. How many single-space tokens would be removed (a space between words is assumed by default)?
   - 4.2. How many word-joiner tokens would be added (for concatenated word pieces)?
This gives us the lower estimate of the possible gains.
In practice, the improvement will be better, because the patched vocab will be a quarter smaller.
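Step 2 can be simulated in a few lines, assuming the vocabulary is just an iterable of token strings (loading a real TokenMonster vocabulary is out of scope here, and the alphabetic test is a simplification):

```python
def trim_and_count(vocab):
    """Trim whitespace from alphabetic tokens; return the patched
    vocabulary and the number of collisions (duplicates) removed."""
    def is_alphabetic(tok):
        # Simplified rule: letters and apostrophes only, after trimming.
        body = tok.strip()
        return body != "" and all(c.isalpha() or c == "'" for c in body)
    patched, duplicates = set(), 0
    for tok in vocab:
        t = tok.strip() if is_alphabetic(tok) else tok
        if t in patched:
            duplicates += 1
        else:
            patched.add(t)
    return patched, duplicates
```

For a toy input like `["hello", " hello", "hello ", " world", ", "]`, the patched vocabulary collapses to `{"hello", "world", ", "}` with 2 duplicates counted, mirroring the "Duplicates" column of the table below.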
from tokenmonster.
I appreciate your input and I'd be happy to collaborate.
In terms of compression it won't be better, because the vocabularies are already optimal. If you run the training multiple times it generates the same vocabulary, give or take a few of the borderline tokens.
What makes a difference to the final vocabulary is changing the tokenization algorithm greediness, and the normalization in some cases, and of course the vocabulary size.
I'm saying that to lead up to an important point, which is that 100256 is a whole different beast from, say, 32000. 100256 is so big that it has spare tokens: the optimizer has plenty available for making different versions of words merged with punctuation, or with spaces on different sides. That's not true of 32000. At 32000, it's feeling the pressure.
So any analysis you're doing is best done at 32000, which in my opinion is the ideal one-size-fits-all vocabulary size.
Your suggestion is best implemented on the tokens that come out of getalltokens, before training; otherwise you're comparing two different vocabulary sizes. However, it might end up quite similar to the difference between mode 0 and mode 3 in the new version, so I'd suggest waiting for that first.
Another issue with a merge token is that it adds a lot of complexity to tokenization. It would be massively more complicated and run several times slower, because the tokenizer has to check not just the next token but the possibility of a merge in various positions. The easiest and most efficient way to implement it would be to add the spaces to the tokens and dataset during training and then swap them over at the end, but that's just the same as the extra space and delete tokens I'm already using (only the other way around).
from tokenmonster.
Ok, let's wait for your upcoming results with the delete token!
I've checked all available sizes of English:
Vocab | Alphabetic | ...of Vocab | Distinct after trim | Duplicates | ...of Distinct | ...of Vocab |
---|---|---|---|---|---|---|
100256 | 67569 | 67% | 55579 | 11990 | 22% | 12% |
65536 | 45148 | 69% | 36969 | 8179 | 22% | 12% |
50256 | 35129 | 70% | 28620 | 6509 | 23% | 13% |
32000 | 22896 | 72% | 18509 | 4387 | 24% | 14% |
24000 | 17499 | 73% | 14106 | 3393 | 24% | 14% |
Rules:
- A token is alphabetic if it contains only the letters `a`..`z`, `C`, spaces, `,`, `.`, or an `'`, and at least one letter.
- "Distinct after trim" counts alphabetic tokens after removing all leading and trailing whitespace.
- Duplicates are the difference between the two.
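The first rule can also be written as a single regular expression (a sketch; `C` is taken literally as the capcode marker):

```python
import re

# Matches tokens built only from a-z, C, space, comma, period and
# apostrophe, containing at least one lowercase letter somewhere.
ALPHABETIC = re.compile(r"^[a-zC ,.']*[a-z][a-zC ,.']*$")

def is_alphabetic(token):
    return ALPHABETIC.match(token) is not None
```

So ` hello` and `it's` classify as alphabetic, while `...` (no letter) and `42` do not.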
The fractions remain consistent across vocab sizes. If anything, I'd say the tokenizer can be improved even more for smaller vocab sizes.
> the vocabularies are already optimal
Yes, certainly.
I'm also not suggesting changing any rules of the optimizer.
> changing the tokenization algorithm greediness, and the normalization
Yes, exactly that. It feels like whitespace normalization could improve compression: a quarter of the alphabetic tokens could be replaced by a new tokenizer rule, and TokenMonster could nom-nom-nom 12-14% more tokens for other needs.
Given the new tokenizer rule, the vocab optimizer will be affected and could do an even better job.
The specific way to do the space normalization seems different in our minds, but I agree - eventually it must lead to the same result. I'm looking forward :-)
Thank you for the great work!
from tokenmonster.
That's brilliant, thank you! I can't wait to see the new vocab :-)
from tokenmonster.