Comments (11)
Does "token" mean word piece, and "alphabet" mean character?
from wenet.
Yes, you understand correctly.
Characters can be used as the model unit; just change the pipeline to use chars. Generally, char-based models have lower accuracy than word-piece models, so word piece is used as the default model unit.
With low-resource data, a character-based model shows better results than a word-piece model, based on my small research.
As far as I understand, the only thing I need to change is the format.data file for training. If I use characters to train, I have to change, for instance, "▁CHAPTER ▁ONE" to "CHAPTER ONE", along with the corresponding token IDs and, obviously, the token shape, and I don't need to decode text_bpe in the test stage. Am I right?
for "CHAPTER ONE", the format.data should be "C H A P T E R [space] O N E", and the dict should be A-Z plus [space].
Here, [space] can also be replaced with the special character "▁" to simplify data processing.
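A minimal sketch of that conversion, in case it helps. The function names are hypothetical and not part of wenet; it only assumes that words are separated by whitespace and that "▁" stands in for [space]:

```python
SPACE = "▁"  # assumed stand-in for the [space] symbol


def to_char_tokens(text: str, space: str = SPACE) -> str:
    """Turn a transcript into space-separated characters, marking word gaps."""
    words = text.split()
    # Spell out each word letter by letter, then join words with the space symbol.
    return f" {space} ".join(" ".join(word) for word in words)


def from_char_tokens(tokens: str, space: str = SPACE) -> str:
    """Inverse mapping: glue the characters back together and restore spaces."""
    return "".join(tokens.split()).replace(space, " ")


print(to_char_tokens("CHAPTER ONE"))
print(from_char_tokens("C H A P T E R ▁ O N E"))
```

With char units, getting readable text back from a hypothesis is just the second function, so no separate BPE decoding step should be needed for this part.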
So the dict is:
<blank> 0
<unk> 1
A 2
...
Z 27
<sos/eos> 28
Am I right?
Add one special symbol for space.
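For illustration, a small sketch that builds such a dict with the extra space symbol. The symbol spellings `<blank>`, `<unk>` and `<sos/eos>`, the choice of "▁" for space, and its position just before sos/eos are assumptions here; match whatever your existing dict uses:

```python
import string


def build_char_dict(space: str = "▁") -> dict:
    """Map each symbol to an integer id: specials first, then A-Z, then space."""
    symbols = (
        ["<blank>", "<unk>"]            # ids 0 and 1, as in the listing above
        + list(string.ascii_uppercase)  # A-Z
        + [space, "<sos/eos>"]          # space symbol, then sos/eos last
    )
    return {sym: idx for idx, sym in enumerate(symbols)}


char_dict = build_char_dict()
for sym, idx in char_dict.items():
    print(sym, idx)
```

Since the ids are assigned in one pass, adding or removing a symbol shifts everything after it consistently, which avoids off-by-one mistakes when writing the dict by hand.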
Okay, many thanks. Is that all I need to change?
- the dict
- the training corpus, which should be prepared as chars: for "CHAPTER ONE", the format.data should be "C H A P T E R [space] O N E"
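Putting the two changes together, a rough sketch of how the char-level fields of one format.data entry could be produced. The field names (`text`, `token`, `tokenid`, `token_shape`), the tab-separated `key:value` layout, and `token_shape` being "length,vocab size" are assumptions; check them against an entry in your own format.data. `char_fields` is a hypothetical helper:

```python
import string

SPACE = "▁"  # assumed space symbol
SYMBOLS = ["<blank>", "<unk>"] + list(string.ascii_uppercase) + [SPACE, "<sos/eos>"]
CHAR_DICT = {sym: idx for idx, sym in enumerate(SYMBOLS)}


def char_fields(text: str) -> dict:
    """Build char-level token fields for one utterance transcript."""
    tokens = []
    for i, word in enumerate(text.split()):
        if i > 0:
            tokens.append(SPACE)  # mark the gap between words
        tokens.extend(word)       # one token per character
    unk = CHAR_DICT["<unk>"]
    ids = [CHAR_DICT.get(tok, unk) for tok in tokens]  # unknown chars map to <unk>
    return {
        "text": text,
        "token": " ".join(tokens),
        "tokenid": " ".join(str(i) for i in ids),
        "token_shape": f"{len(ids)},{len(CHAR_DICT)}",
    }


fields = char_fields("CHAPTER ONE")
print("\t".join(f"{key}:{value}" for key, value in fields.items()))
```

The feature-related fields of the entry are left out, since switching the model unit should not affect them.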
Many thanks, and sorry for the dumb questions!