Some popular subword tokenization algorithms are WordPiece, Byte-Pair Encoding (BPE), Unigram, and SentencePiece. The goal of this project is to dive deeper into BPE and WordPiece.
- Gutenberg book: http://www.gutenberg.org/cache/epub/16457/pg16457.txt
- WikiText-103: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
* Make sure you have the Sample.txt file and the wikitext-103-raw folder to run the code.
- Explaining BPE & WordPiece
- Comparing BPE & WordPiece
- Implementing BPE from scratch
- Implementing BPE & WordPiece with Hugging Face
- Applying these models to the Gutenberg book, English Wikipedia, and the given sample text file
- Explaining & analyzing the differences
- Analyzing the results
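The from-scratch BPE implementation can be sketched roughly as follows: represent each word as a sequence of characters plus an end-of-word marker, then repeatedly merge the most frequent adjacent symbol pair. Function names here are illustrative, not necessarily those used in the project code.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the word vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with the merged symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words."""
    # Words are space-separated characters plus an end-of-word marker.
    vocab = Counter(" ".join(w) + " </w>" for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges
```

On the classic toy corpus of `low`, `lower`, `newest`, `widest`, the first merges learned are `("e", "s")`, then `("es", "t")`, since `es` and `est` dominate the frequent words `newest` and `widest`.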
Report is available here.
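The Hugging Face implementations can be sketched with the `tokenizers` library; the helper functions and the `vocab_size` value below are illustrative assumptions, and `Sample.txt` refers to the sample file mentioned above.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer, WordPieceTrainer

def train_bpe(files, vocab_size=5000):
    """Train a BPE tokenizer on the given text files."""
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tokenizer.train(files, trainer)
    return tokenizer

def train_wordpiece(files, vocab_size=5000):
    """Train a WordPiece tokenizer on the given text files."""
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = WordPieceTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tokenizer.train(files, trainer)
    return tokenizer

# Example: tokenizer = train_bpe(["Sample.txt"])
#          print(tokenizer.encode("tokenization").tokens)
```

Comparing the `.tokens` output of the two trained tokenizers on the same text surfaces the differences analyzed in the report, e.g. WordPiece's `##` continuation prefix versus BPE's plain merges.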