Don't support Chinese? about jiwer HOT 4 OPEN

jitsi commented on June 7, 2024

Does jiwer not support Chinese?

Comments (4)

nikvaessen commented on June 7, 2024

Hi,

I made this example script, with references and predictions taken from https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn

from jiwer import wer, cer

ground_truths = [
    "宋朝末年年间定居粉岭围。",
    "渐渐行动不便",
    "二十一年去世。",
    "他们自称恰哈拉。",
    "局部干涩的例子包括有口干、眼睛干燥、及阴道干燥。",
    "嘉靖三十八年，登进士第三甲第二名。",
    "这一名称一直沿用至今。",
    "同时乔凡尼还得到包税合同和许多明矾矿的经营权。",
    "为了惩罚西扎城和塞尔柱的结盟，盟军在抵达后将外城烧毁。",
    "河内盛产黄色无鱼鳞的鳍射鱼。",
]

hypothesis = [
    "宋朝末年年间定居分定为",
    "建境行动不片",
    "二十一年去世",
    "他们自称家哈",
    "菊物干寺的例子包括有口肝眼睛干照以及阴到干",
    "嘉靖三十八年登进士第三甲第二名",
    "这一名称一直沿用是心",
    "同时桥凡妮还得到包税合同和许多民繁矿的经营权",
    "为了曾罚西扎城和塞尔素的节盟盟军在抵达后将外曾烧毁",
    "合类生场环色无鱼林的骑射鱼",
]

# Pass references and hypotheses positionally; the keyword name of the
# first parameter has changed between jiwer releases.
wer_score = wer(ground_truths, hypothesis)
print(f"{wer_score=}")

cer_score = cer(ground_truths, hypothesis)
print(f"{cer_score=}")

Which outputs:

wer_score=1.0
cer_score=0.30201342281879195

What would the correct answer be? Does word error rate even make sense for zh-cn? As far as I know, each character is a word.
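To illustrate where a WER of 1.0 can come from, here is a minimal sketch in plain Python (it does not call jiwer). It assumes words are obtained by splitting on whitespace, which is my understanding of jiwer's default behaviour; `edit_distance` and `error_rate` are hypothetical helpers written for this example. Because the sentences contain no spaces, each one becomes a single "word", and any difference, even a missing full stop, makes that word wrong.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, tokenize):
    """Total edit distance divided by the total number of reference tokens."""
    errors = sum(edit_distance(tokenize(r), tokenize(h))
                 for r, h in zip(refs, hyps))
    total = sum(len(tokenize(r)) for r in refs)
    return errors / total


# Two of the sentence pairs from the script above.
refs = ["宋朝末年年间定居粉岭围。", "二十一年去世。"]
hyps = ["宋朝末年年间定居分定为", "二十一年去世"]

# Whitespace tokenization: each sentence is one token.
tokens_in_first_ref = refs[0].split()
whitespace_wer = error_rate(refs, hyps, str.split)

# Character tokenization: each character is one token.
char_rate = error_rate(refs, hyps, list)

print(len(tokens_in_first_ref), whitespace_wer, char_rate)
```

On these two pairs the whitespace-based rate is 1.0 (two wrong "words" out of two), while the character-based rate is 5/19, which is the kind of number the CER reports.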

haha010508 commented on June 7, 2024

Thanks for your reply. But wer_score=1.0 means every word is wrong, and I don't think that is right. For example, with ground_truths = "宋朝末年年间定居粉岭围" and hypothesis = "宋朝末年年间定居分定为", only the last 3 words are wrong, so the WER must be < 1.0. If you improve the result, I can test it again. Thanks!

nikvaessen commented on June 7, 2024

Semantically, I think you need the "character" error rate instead of the word error rate, as I assume the two should be equivalent for Chinese. So, is the CER score of 0.3 correct?
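As a rough check of the "only the last 3 words are wrong" observation, the sketch below compares the first sentence pair character by character using Python's standard `difflib`. It is only an illustration: `strip_punctuation` and `differing_spans` are hypothetical helpers, and the punctuation set is an assumption covering only the marks that appear in the sample references. `difflib` reports differing spans rather than a true edit distance, so treat the output as a diff, not as a CER.

```python
import difflib

reference = "宋朝末年年间定居粉岭围。"
hypothesis = "宋朝末年年间定居分定为"


def strip_punctuation(text, marks="。，、"):
    # `marks` is an illustrative set: only the punctuation seen in the
    # sample references of this thread.
    return "".join(ch for ch in text if ch not in marks)


def differing_spans(ref, hyp):
    """Return (tag, ref_span, hyp_span) for every non-matching region."""
    matcher = difflib.SequenceMatcher(None, ref, hyp, autojunk=False)
    return [
        (tag, ref[i1:i2], hyp[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


with_punct = differing_spans(reference, hypothesis)
without_punct = differing_spans(strip_punctuation(reference), hypothesis)

print(with_punct)
print(without_punct)
```

With the trailing "。" kept, the differing span in the reference is four characters long; with it removed, it is three. Since none of the hypotheses contain punctuation, part of the reported CER comes from punctuation marks alone, so it may be worth normalizing punctuation before scoring.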

haha010508 commented on June 7, 2024

Yes, CER is correct. I only focused on WER yesterday, because CER is confusing for Chinese words.
