An awesome list of data deduplication use cases, papers, tools, and methods.
| Paper | Dataset Name | Final Dataset Size | Method Name | Hardware | License |
|---|---|---|---|---|---|
| NA | RedPajama | 1.2T Tokens | SimHash (partial) | | Apache 2.0 |
| | SlimPajama | 627B Tokens | MinHash + LSH | | Apache 2.0 |
| Arxiv | CulturaX | 6.3T Tokens | MinHashLSH (per language)<sup>1</sup> | 600 AWS c5.24xlarge (96/192GB * 600) | |
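
For readers new to the MinHash + LSH entries in the table, here is a minimal sketch of the general idea: hash each document's word shingles into a fixed-length signature, split the signature into bands, and treat documents that share any band as duplicate candidates. This is not the pipeline used by any of the listed datasets. The function names, the 64-value signature, the 16 bands, and the 3-word shingles are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64  # signature length (assumed value, for illustration only)
BANDS = 16     # number of LSH bands (assumed); rows per band = NUM_PERM // BANDS


def shingles(text, n=3):
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def seeded_hash(seed, token):
    """64-bit hash of `token`; each seed acts as a different hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text):
    """MinHash signature: the minimum hash of the shingle set under each seed."""
    tokens = shingles(text)
    return [min(seeded_hash(seed, t) for t in tokens) for seed in range(NUM_PERM)]


def candidate_pairs(docs):
    """Return pairs of document ids whose signatures agree on at least one band."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        signature = minhash(text)
        for band in range(BANDS):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog",
    "c": "completely unrelated sentence about data pipelines and storage",
}
pairs = candidate_pairs(docs)
```

Candidate pairs are only candidates: a real pipeline would typically verify them (for example with an exact Jaccard check) before removing anything. More bands with fewer rows each catch less similar documents at the cost of more false positives.
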

Footnotes

1. This uses a variant of the Spark script from text-dedup.