jmaczan / bpe.c
Byte-Pair Encoding tokenizer for training large language models on huge datasets. I don't know C, so most of the code comes from AI :D I hope to learn by rewriting it and making changes, fixes, etc.
License: GNU General Public License v3.0