Is Chinese not supported? (jiwer issue, open)

jitsi commented on June 7, 2024
Is Chinese not supported?

from jiwer.

Comments (4)

nikvaessen commented on June 7, 2024

Hi,

I made this example script, with references and predictions taken from https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn

from jiwer import wer, cer

ground_truths = [
    "宋朝末年年间定居粉岭围。",
    "渐渐行动不便",
    "二十一年去世。",
    "他们自称恰哈拉。",
    "局部干涩的例子包括有口干、眼睛干燥、及阴道干燥。",
    "嘉靖三十八年,登进士第三甲第二名。",
    "这一名称一直沿用至今。",
    "同时乔凡尼还得到包税合同和许多明矾矿的经营权。",
    "为了惩罚西扎城和塞尔柱的结盟,盟军在抵达后将外城烧毁。",
    "河内盛产黄色无鱼鳞的鳍射鱼。",
]

hypothesis = [
    "宋朝末年年间定居分定为",
    "建境行动不片",
    "二十一年去世",
    "他们自称家哈",
    "菊物干寺的例子包括有口肝眼睛干照以及阴到干",
    "嘉靖三十八年登进士第三甲第二名",
    "这一名称一直沿用是心",
    "同时桥凡妮还得到包税合同和许多民繁矿的经营权",
    "为了曾罚西扎城和塞尔素的节盟盟军在抵达后将外曾烧毁",
    "合类生场环色无鱼林的骑射鱼",
]

wer_score = wer(truth=ground_truths, hypothesis=hypothesis)
print(f"{wer_score=}")

cer_score = cer(truth=ground_truths, hypothesis=hypothesis)
print(f"{cer_score=}")

Which outputs:

wer_score=1.0
cer_score=0.30201342281879195

What would the correct answer be? Does a word error rate even make sense for zh-CN? As far as I know, each character is a word.
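A rough sketch of why the WER comes out as 1.0, written without jiwer so it runs anywhere. It assumes words are delimited by whitespace (my assumption about the default word-level handling, not something verified here). Under that assumption an unsegmented Chinese sentence is a single token, so any difference makes the whole "word" wrong.

```python
# Assumption: a "word" is whatever str.split() yields, i.e. whitespace-delimited.
reference = "宋朝末年年间定居粉岭围。"
hypothesis = "宋朝末年年间定居分定为"

ref_words = reference.split()
hyp_words = hypothesis.split()

# Neither sentence contains whitespace, so each is one token.
print(len(ref_words), len(hyp_words))  # 1 1

# The single reference token differs from the single hypothesis token,
# so 1 of 1 words is wrong.
word_errors = sum(r != h for r, h in zip(ref_words, hyp_words))
print(word_errors / len(ref_words))  # 1.0
```

The same holds for every pair in the script above, because each hypothesis differs from its reference in at least one character or punctuation mark.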


haha010508 commented on June 7, 2024

Thanks for your reply. However, wer_score=1.0 means every word is wrong, and I don't think that is right. For example, with ground_truths = "宋朝末年年间定居粉岭围" and hypothesis = "宋朝末年年间定居分定为", only the last 3 words are wrong, so the WER must be < 1.0. If you improve the result, I can test it again. Thanks!


nikvaessen commented on June 7, 2024

Semantically, I think you need the character error rate rather than the word error rate, since I assume the two are equivalent for Chinese. In that case, is the CER score of 0.3 correct?
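To make the character-level view concrete, here is a minimal sketch that treats each character as a token and computes a plain Levenshtein distance for the pair quoted above. The helper names `edit_distance` and `char_error_rate` are made up for this example and are not jiwer functions. The trailing punctuation is left out, as in the quoted pair, so the number is not directly comparable to the corpus-level cer_score.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def char_error_rate(ref, hyp):
    """Edit distance over characters, divided by the reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)


reference = "宋朝末年年间定居粉岭围"
hypothesis = "宋朝末年年间定居分定为"

print(edit_distance(reference, hypothesis))              # 3
print(round(char_error_rate(reference, hypothesis), 4))  # 0.2727
```

Three of the eleven reference characters are wrong, which matches the "only the last 3 are wrong" intuition and gives a rate below 1.0.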


haha010508 commented on June 7, 2024

Yes, the CER is correct. I only focused on WER yesterday, because CER is confusing for Chinese words.
