Comments (2)
Hi @GeorgeS2019 ,
Thanks for your comments.
I'm not sure I understand why you need a .NET tokenization library. Could you describe the scenario in which you would use it?
For Seq2SeqSharp, you can use any tokenization library for data processing. Seq2SeqSharp only cares about tokens: it takes tokens as input and produces tokens as output. For example, if you open a test batch file in the Release Package, such as test_enu_chs.bat, you will find that it first calls "spm_encode.exe" to encode the input sentences into BPE tokens, then calls the Seq2SeqConsole tool, and finally calls "spm_decode.exe" to decode the BPE tokens back into sentences. Both "spm_encode" and "spm_decode" come from Google's SentencePiece project.
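To make the three stages concrete, here is a minimal sketch of the encode, translate, decode flow. Everything in it is a stand-in I made up for illustration: `toy_encode` and `toy_decode` are simple whitespace tokenizers, not SentencePiece, and `echo_model` only returns its input where the real model would translate. The point is just that the model stage sees token lists and nothing else, so the tokenizer can be swapped freely.

```python
from typing import Callable, List

# "▁" (U+2581) is the marker SentencePiece puts at the start of a word.
WORD_BOUNDARY = "\u2581"


def toy_encode(sentence: str) -> List[str]:
    """Hypothetical stand-in for spm_encode: one token per whitespace-separated word."""
    return [WORD_BOUNDARY + word for word in sentence.split()]


def toy_decode(tokens: List[str]) -> str:
    """Hypothetical stand-in for spm_decode: join pieces and restore spaces."""
    return "".join(tokens).replace(WORD_BOUNDARY, " ").strip()


def echo_model(tokens: List[str]) -> List[str]:
    """Hypothetical stand-in for the model stage: tokens in, tokens out."""
    return list(tokens)


def run_pipeline(
    sentence: str,
    encode: Callable[[str], List[str]],
    model: Callable[[List[str]], List[str]],
    decode: Callable[[List[str]], str],
) -> str:
    """Run the three stages in order; the model never sees raw text."""
    source_tokens = encode(sentence)
    target_tokens = model(source_tokens)
    return decode(target_tokens)


if __name__ == "__main__":
    print(toy_encode("hello world"))
    print(run_pipeline("hello world", toy_encode, echo_model, toy_decode))
```

Replacing `toy_encode` and `toy_decode` with calls into another tokenizer would leave `run_pipeline` unchanged, as long as the replacement still maps sentences to token lists and back.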
In addition, the release package so far includes vocabularies and models for 8 languages (Chinese, German, English, French, Italian, Japanese, Korean and Russian). They were all built with the SentencePiece library.
from seq2seqsharp.
@zhongkaifu Microsoft's BlingFire provides tokenizers very similar to those provided by HuggingFace, but with claimed better performance.
For example, the GPT2 tokenizer provided by BlingFire has exactly the same vocabulary size as the HuggingFace one. The library also provides information on how to create your own custom tokenizer based on a set of templates (close to complete) like those of HuggingFace.