alasdairforsythe / tokenmonster
Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
License: MIT License
I have two datasets, a train_set and an eval_set.
When a single tokenizer instance, loaded with vocab.load_multiprocess_safe and passed to each dataset, is used, the tokenizer simply refuses to function, regardless of whether it is frozen or whether the datasets are active at the same time.
I can work around the issue by using plain vocab.load instead, but then I get warnings about multiprocessing, so I had to debug further by passing a separate tokenizer instance to each dataset. This is not ideal, but it is at least functional.
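In case it's useful, here is a minimal sketch of the workaround, assuming a PyTorch-style Dataset; TextDataset, train_files and eval_files are hypothetical stand-ins for my actual setup, and the vocab name is just an example:
import tokenmonster
from torch.utils.data import Dataset

train_files = ["train_01.txt"]  # hypothetical paths
eval_files = ["eval_01.txt"]    # hypothetical paths

class TextDataset(Dataset):
    """Toy dataset that owns its own tokenizer instance."""
    def __init__(self, files, vocab):
        self.files = files
        self.vocab = vocab  # one tokenizer per dataset, not shared

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        with open(self.files[idx], encoding="utf-8") as f:
            return self.vocab.tokenize(f.read())

# Separate instances instead of one shared tokenizer:
train_set = TextDataset(train_files, tokenmonster.load("englishcode-32000-consistent-v1"))
eval_set = TextDataset(eval_files, tokenmonster.load("englishcode-32000-consistent-v1"))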
Simply as an FYI. I appreciate the work you've done so far, it's always nice to see independent coders and researchers doing cool things.
I'm currently training a language model with your tokenizer. I have to say that preliminary results are amazing and am very impressed.
I have a question though: how is uppercase handled by the tokenizer? It's maybe a side effect of the tokenizer, but my current model seems to produce a lot of fully uppercased texts (90% of the time), even though the training data does not contain that much uppercased text. What's strange is that the produced text is very coherent and likely comes from lowercased training data. I suspect that uppercase is some kind of state that can be switched on and off in your tokenizer and that the model does not learn to switch it off (EDIT: this is very likely, as my model is able to code html in uppercase). Another hypothesis is an error in your code.
I'm training on Eleuther's Pile with a setup I'm very familiar with, and only the tokenizer changed, so the error is not on my side of things.
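For what it's worth, a quick round-trip check along these lines would show whether the uppercase state survives tokenize/decode; this is just a sketch using one of the project's pretrained vocab names, not my actual training vocab:
import tokenmonster

vocab = tokenmonster.load("englishcode-32000-consistent-v1")
for text in ["the quick brown fox", "THE QUICK BROWN FOX", "The Quick Brown Fox"]:
    ids = vocab.tokenize(text)
    # if uppercase is a capcode state, the decoded text should still match the input
    print(repr(text), "->", len(ids), "tokens ->", repr(vocab.decode(ids)))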
First and foremost, very impressive work!
As a low-level JS performance enthusiast, I'd be interested to see how much faster I'd be able to make the JS implementation on V8 in particular (with expected gains across all JITs I'm aware of).
And by "faster" it's of course implied to mean repeatably, measurably, explainably, and significantly faster. (Just say no to microbenchmarks).
Main strategies include well-known run-of-the-mill techniques like enforcing 100% monomorphic code and other related JIT-appeasing goodness.
Would this be of any interest whatsoever to you? Absolutely fine if not, but I wanted to extend you a nerdy E.T. glow-finger of enthusiasm and test the waters before deciding to proceed on my own instead.
Apologies in advance for sending this much text your way unsolicited.
All the best, and again, great work. 👏
Just an idea. When I train TokenMonster on an immensely big dataset, I notice that at a certain small vocab size the workers struggle to remove any more tokens from the vocab. I take that as a sign that the vocab size is already approaching optimal when it reaches that state.
What do you think?
Originally posted by Calvinnncy97 September 4, 2023
Hey guys,
Specifically, I would like to ask which flags were set to train these 2 tokenizers. I can't find any flags that force the tokenizer to have only single-word tokens.
Thank you.
Hi, I'm reaching out because I couldn't find a way to contact you privately, so I apologize for how out of place this message is. Would you happen to be available for a meeting some time soon? I can be reached at [email protected], or here directly if you prefer :)
Thank you for your work.
I tried to train a vocab with the new code but it is failing:
Loading 3.dict
2023/07/03 21:07:33 Parsing 3.special.json
Charset: UTF-8, Capcode Enabled
Optimization mode: 4 (strict)
Vocabulary size: 65536
Single byte tokens: 233
Loading normalized.tsv
panic: assignment to entry in nil map
goroutine 1 [running]:
main.main()
trainvocab.go:1551 +0x1d33
Hi,
I was just trying out the code tokenizers; it seems like all the code-65536-* models are unable to decode:
import tokenmonster
tokenizer = tokenmonster.load("code-65536-balanced-nocapcode-v1")
tokens = tokenizer.tokenize("hello world") # [ 127 51042]
decoded_string = tokenizer.decode(tokens)
print(decoded_string)
> ''
The 100k and 32k models work.
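A sketch of a check that might narrow it down, decoding the tokens one at a time (this assumes decode accepts a short list of ids, as in the snippet above):
import tokenmonster

tokenizer = tokenmonster.load("code-65536-balanced-nocapcode-v1")
tokens = tokenizer.tokenize("hello world")
print(list(tokens))
for t in tokens:
    # decode each id on its own to see whether a specific token fails to decode
    print(int(t), "->", repr(tokenizer.decode([int(t)])))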
I tried adding some special tokens to the vocab of a pretrained model and made a PR for a minor code fix. When I try to encode strings, these new tokens are sometimes broken into many tokens instead of being encoded as a single token.
How do I make sure my special tokens always map to the same id?
code to reproduce what I am seeing:
vocab = tokenmonster.load("englishcode-32000-consistent-v1")
vocab.modify(["<|im_start|>", "<|im_end|>", "<s>"], None, None, 0)
vocab.resize(32000, reset_token_ids=False)
# Tokenize some text
text = [
    "<s>Some text to turn into token IDs. Why is this happening?<|im_end|>",
    "<s>Some text to turn into token IDs. <|im_end|>",
    "<s>Some text to turn into token IDs....<|im_end|>",
]
for t in text:
    print(vocab.tokenize(t))  # the special tokens are sometimes split into several tokens
The MIT license has some conditions not present in 0BSD, which makes 0BSD the more FOSS-friendly choice.
I have a suggestion to discuss that could enhance the tokenization in your already amazing approach. I wonder if there's a benefit to consistently pushing all spaces to the front (similar to what OpenAI does), end, or using some other strategy.
Currently, I don't see a specific strategy in english-100256-capcode. The patterns seem to stem from the statistical properties of the corpus:
Format | Count of tokens (after trim) | Count of tokens (unmodified) |
---|---|---|
word | 47380 | 47125 |
word+space | 45516 | 48173 |
space+word | 1447 | 2652 |
space+word+space | 1447 | 2050 |
other | 4466 | 256 |
The difference between the columns is subtle and shows up with multi-space tokens.
There is a noticeable overlap between formats. We can also count how many forms each trimmed word has:
Forms | 1 | 2 | 3 | 4 |
---|---|---|---|---|
Words | 68897 | 11286 | 471 | 727 |
68897 (69%) of all tokens are alright. They might have spaces today, but at least there is exactly one version.
If we address this, we can save 2 * 11286 + 3 * 471 + 4 * 727 = 26893 = 27% of token ids and reuse them for something else.
I also believe it might help you with performance, because some tokens will be combined early in the process.
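For reference, the per-word form counts above can be reproduced with something along these lines; it assumes a plain-text vocabulary exported via exporttokens with -txt, one token per line, which is my assumption about the export format rather than anything documented:
from collections import Counter, defaultdict

forms = defaultdict(set)
with open("vocab.txt", encoding="utf-8") as f:
    for line in f:
        token = line.rstrip("\n")
        trimmed = token.strip(" ")
        if trimmed:
            forms[trimmed].add(token)

# histogram: how many space-variants (forms) each trimmed word has
print(Counter(len(variants) for variants in forms.values()))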
Capcode does a great job at treating Greedy• and greedy• as the same token (40734). However, issues can arise when considering alternate spellings, such as •greedy. By extending the Capcode idea to composite words, we could address these concerns.
What about •greedy? If such a token could appear, it would introduce a near-duplicate. Currently there are no alternative spellings, so greedy+space is the only token. Can it be used at the end of a sentence (greedy.)? Maybe that is exactly the use-case you foresee for the delete token?
•and• is 18564, and• is 13207, and there is no •and.
All text tokens should have no spaces at the start or end. Punctuation tokens would remain unchanged.
In this approach, TokenMonster and ungreedy would be spelled using a version of the zero-width joiner or word joiner from Unicode:
Token
+Monster
is an
un
+greedy
tokenizer
and
vocabulary
builder
!
Or, with Capcode:
^token
+^monster
is an
un
+greedy
tokenizer
and
vocabulary
builder
!
This extension could reduce the number of tokens used by 27% and repurpose them to bring more semantically meaningful words into the vocabulary. With more known individual words, there will be even less need to use the concatenation token.
I believe implementing this idea would make the tokenization process more efficient and consistent. Looking forward to hearing your thoughts!
I'm getting a "dataset is required" error with this command:
./getalltokens -capcode true -charset UTF-8 -chunk-size 100000 -dataset /Users/me/wikitext-103-raw/wikitest.raw -max-token-length 1000 -micro-chunks 10 -min-occur 30 -min-occur-chunk 3 -output string enwiki_dict.txt -workers 1
What format should the vocab file be? I got the above error when I tried the RWKV vocab, which is a text file:
https://raw.githubusercontent.com/BlinkDL/ChatRWKV/main/rwkv_pip_package/src/rwkv/rwkv_vocab_v20230424.txt
I gave 777 permissions to the vocab file.
Greetings. There are many situations where inference runs on machines that aren't very capable, where a Python or JS interpreter can't be assumed and something as simple as curl has to suffice.
In case it is of interest to you, I would like to kindly request that you wrap the Go library in a very basic CLI, which would make it usable as a small self-contained Go executable. I'm referring to the simplest, bare-minimum client; no need to think about interactive fuss like completions etc.
I'd normally go ahead and just do it, but I've never worked with Go, and dedicating a couple of days to getting up to speed is more or less not an option right now. I sincerely hope this is something you've thought about doing at some point and that you're interested and comfortable enough to make it happen with the least possible effort, without giving up any significant amount of your precious time =)
I'm looking forward to your answer; feel free to turn it down without any hesitation if you think it's appropriate to do so, it's totally understandable in any case! Thanks
Thanks for your work. Consider adding some Python tests and uploading a package to PyPI that works out of the box; this is crucial for potential adoption. Adding some Cython or pybind11 bindings may also make the code faster.
OS: Ubuntu 22.04
Python version: 3.11.8
PyTorch version: 2.2.1
Tokenmonster package version: 1.1.12
Other libraries: lightning==2.2.1, datasets==2.18.0
Like in the title, I load the tokenizer with load_multiprocess_safe; the dataset is just a bunch of plain text files to load and tokenize. I have tested each stage of loading and there are no problems until I wrap it in a DataLoader and use num_workers > 0; then it hangs forever.
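A minimal reproduction sketch; the text list stands in for my plain text files, everything else mirrors the setup described above:
import tokenmonster
from torch.utils.data import DataLoader

vocab = tokenmonster.load_multiprocess_safe("englishcode-32000-consistent-v1")
texts = ["first plain text file", "second plain text file"]  # stand-ins for my files

def collate(batch):
    # tokenization happens inside the worker process
    return [vocab.tokenize(t) for t in batch]

loader = DataLoader(texts, batch_size=1, num_workers=2, collate_fn=collate)
print(next(iter(loader)))  # hangs forever with num_workers > 0; fine with num_workers=0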
Great work! I noticed however there's no implementation in C or C++, only in higher-level languages which may make it difficult to integrate into projects like llama.cpp. Is this something being worked on?
Hi, your work looks great. Is it doing greedy tokenization, in the sense of always picking the longest possible token?
Here's my multilang greedy tokenization experiment FYI:
https://github.com/BlinkDL/ChatRWKV/blob/main/tokenizer/rwkv_tokenizer.py
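For clarity, by "greedy" I mean plain longest-prefix matching at every position, roughly like this sketch (my own illustration of the greedy baseline, not TokenMonster's algorithm):
def greedy_tokenize(text, vocab, max_len=16):
    # repeatedly take the longest vocab entry starting at the current position,
    # falling back to a single character when nothing matches
    tokens = []
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

vocab = {"hel", "hello", "lo", " wor", "ld", " world"}
print(greedy_tokenize("hello world", vocab))  # ['hello', ' world']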
Is there any update on the multilingual tokenizers? The project seems to be on pause.
Hi, I'm looking at this for tokenizing biological sequences: protein, DNA, RNA. These alphabets generally have between 4 and 22 letters. When I use the procedure, it only finds the base letters as tokens. The vocabulary that is produced consists of the base ASCII characters:
less vocab.txt
^A
^B
^C
^D
^E
^F
^G
^H
^K
^L
^M
^N
^O
^P
^Q
^R
^S
^T
^U
^V
^W
^X
^Y
^Z
^[
^\
^]
^^
^_
!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
0
1
2
3
4
5
6
7
8
9
:
;
<
=
>
?
@
A
B
C
D
...
Even though none of those characters, except for A, C, G and T, were in the input text. The commands I ran were these:
./getalltokens -charset utf8 -chunk-size 10000 -dataset my_data.txt -min-occur-chunk 2 -output vocab.vc -workers 2
./trainvocab -charset utf-8 -dataset my_data.txt -dir vocab_dir -dictionary vocab.vc -max-token-length 64 -vocab 4095
./exporttokens ./vocab_dir/794556_5737.zlib vocab -capcode -charset UTF-8 -txt
Can you advise?
Impressed by Your Project
Dear alasdairforsythe,
I am genuinely impressed by your wonderful project and appreciate your sharing it. Thank you sincerely.
I'm curious to know if there is any simple explanation or documentation about the entire development process of your project.
If not, could you please provide a brief description of the overall algorithm, even if it's very approximate? I am familiar with concepts like BPE, BBPE, unigram, ngram, and word piece, as well as various packages like SentencePiece, TikToken, tokenizers, and transformers. Therefore, feel free to skip any basic information and directly share what improvements you've made, the overall development process, your objectives, and the approaches you took to solve specific problems.
I read on Reddit that your focus was on speed improvements, but I noticed you also reduced the vocab size. Could you elaborate on your overall approach to this?
Additionally, I am curious about where to start with your package to develop an efficient tokenizer for Korean. While I'm considering the BBPE method for creating an efficient Korean vocab, your advanced work in this area has prompted me to reach out for guidance.
Thank you for your time and insights.
Sincerely,
Daniel
Thanks for this amazing library. Looking forward to actually training and adapting some models with it.
After creating my first vocabulary I noticed that a lot of the tokens contain uppercase C and uppercase D. Do those have a special meaning? I could also see them referenced in the code, but I could not find the meaning.
Thanks in advance
Example:
tokens:
- token: "D"
  id: 35
  score: 0.006828829
  encoded: true
- token: " und"
  id: 2657
  score: 0.0047021606
  encoded: true
- token: " der"
  id: 2099
  score: 0.0032128973
  encoded: true
- token: "C"
  id: 34
  score: 0.0031624683
  encoded: true
- token: " die"
  id: 2105
  score: 0.002436903
  encoded: true
- token: " von"
  id: 2684
  score: 0.0021727835
  encoded: true
- token: ".C"
  id: 271
  score: 0.0020115946
  encoded: true
- token: " für"
  id: 5997
  score: 0.0017581019
  encoded: true
- token: "-DC"
  id: 1163
  score: 0.0017092729
  encoded: true
- token: " des"
  id: 2100
  score: 0.0016576286
  encoded: true
- token: " mit"
  id: 2407
  score: 0.0014818916
  encoded: true
- token: " in"
  id: 993
  score: 0.0014810717
  encoded: true
- token: ",C"
  id: 259
  score: 0.0014182056
  encoded: true
- token: ","
Splendid to see this algorithm and the name is stellar. I've been excited to test it out since it was shared!
I finally processed my files and got a vocab file. I executed the command on a very small text size for testing purposes:
./getalltokens -charset utf8 -chunk-size 10000 -dataset my_data.txt -min-occur-chunk 2 -output vocab.vc -workers 2
It completes with seemingly no problems in the output:
...
2023/06/08 09:55:34 Tokens before trimming: 350634
2023/06/08 09:55:34 Trimming final tokens for min 100
2023/06/08 09:55:34 Tokens after trimming: 14770
2023/06/08 09:55:34 Saving tokens list
2023/06/08 09:55:34 Done
I then execute the next command:
./trainvocab -charset utf-8 -dataset my_data.txt -dir vocab_dir -dictionary vocab.vc -max-token-length 64 -vocab 1048575
With only 14770 tokens after trimming, -vocab 1048575 cannot be satisfied; it looks like I needed to set -vocab to less than the total number of tokens after trimming. Perhaps that should be documented in the notes.
Hey, I like TokenMonster a lot and have integrated it into our framework, Zeta, the framework for building the best multi-modality transformer models!
https://github.com/kyegomez/zeta
#!pip install zetascale
from zeta.tokenizers import TokenMonster
tokenizer = TokenMonster("englishcode-32000-consistent-v1")
tokens = tokenizer.tokenize("Hello world!")
print(tokens)