Comments (4)
Hi,
I made this example script, with references and predictions taken from https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn
from jiwer import wer, cer
ground_truths = [
"宋朝末年年间定居粉岭围。",
"渐渐行动不便",
"二十一年去世。",
"他们自称恰哈拉。",
"局部干涩的例子包括有口干、眼睛干燥、及阴道干燥。",
"嘉靖三十八年,登进士第三甲第二名。",
"这一名称一直沿用至今。",
"同时乔凡尼还得到包税合同和许多明矾矿的经营权。",
"为了惩罚西扎城和塞尔柱的结盟,盟军在抵达后将外城烧毁。",
"河内盛产黄色无鱼鳞的鳍射鱼。",
]
hypothesis = [
"宋朝末年年间定居分定为",
"建境行动不片",
"二十一年去世",
"他们自称家哈",
"菊物干寺的例子包括有口肝眼睛干照以及阴到干",
"嘉靖三十八年登进士第三甲第二名",
"这一名称一直沿用是心",
"同时桥凡妮还得到包税合同和许多民繁矿的经营权",
"为了曾罚西扎城和塞尔素的节盟盟军在抵达后将外曾烧毁",
"合类生场环色无鱼林的骑射鱼",
]
wer_score = wer(truth=ground_truths, hypothesis=hypothesis)
print(f"{wer_score=}")
cer_score = cer(truth=ground_truths, hypothesis=hypothesis)
print(f"{cer_score=}")
Which outputs:
wer_score=1.0
cer_score=0.30201342281879195
What would the correct answer be? Does word error rate (WER) even make sense for zh-cn, given that, as far as I know, each character is a word?
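A possible explanation for wer_score=1.0, sketched under the assumption that the word-level metric splits each sentence on whitespace (this is not confirmed anywhere in the thread): Chinese text has no spaces, so each sentence becomes a single "word", and any difference, even a missing full stop, counts as that one word being wrong.

```python
# Assumption: word-level scoring tokenizes on whitespace.
truth = "宋朝末年年间定居粉岭围。"
hyp = "宋朝末年年间定居分定为"

truth_words = truth.split()  # no whitespace, so one token per sentence
hyp_words = hyp.split()

print(len(truth_words), len(hyp_words))
print(truth_words == hyp_words)
```

Under that assumption every sentence pair in the script is one substituted word out of one, which would be consistent with a WER of exactly 1.0.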
Thanks for your reply, but wer_score=1.0 means that every word is wrong, and I do not think that is the case. For example, with ground truth "宋朝末年年间定居粉岭围" and hypothesis "宋朝末年年间定居分定为", only the last 3 words are wrong, so the WER must be < 1.0. If you improve the result, I can test it again. Thanks!
Semantically, I assume that you need the character error rate instead of the word error rate, since the two should be equivalent for Chinese. In that case, is the CER score of 0.3 correct?
Yes, the CER is correct. I only focused on WER yesterday, because CER is confusing for Chinese words.
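To make the character-level view concrete, here is a minimal sketch in plain Python that scores one sentence pair from the script by treating every character as a token. The helper names `edit_distance` and `error_rate` are hypothetical and written for this illustration; they are not part of jiwer, and a library may normalize text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance divided by the number of reference tokens."""
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


truth = "宋朝末年年间定居粉岭围。"
hyp = "宋朝末年年间定居分定为"

# One token per character, punctuation included.
with_punct = error_rate(list(truth), list(hyp))

# Same pair with the trailing full stop dropped from the reference.
without_punct = error_rate(list(truth.rstrip("。")), list(hyp))

print(with_punct, without_punct)
```

With the full stop kept, the pair differs in 4 of 12 reference characters; with it removed, in 3 of 11, which matches the "only the last 3 words are wrong" reading above.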