Comments (8)
We used the following pipeline (passing in a word_list):
pipeline = []
pipeline.append(jiwer.ToLowerCase())
pipeline.append(jiwer.RemoveMultipleSpaces())
pipeline.append(jiwer.RemoveWhiteSpace(replace_by_space=True))
pipeline.append(jiwer.SentencesToListOfWords(word_delimiter=" "))
pipeline.append(jiwer.RemoveSpecificWords(word_list.split(",")))
pipeline.append(jiwer.RemovePunctuation())
pipeline.append(jiwer.Strip())
pipeline.append(jiwer.RemoveEmptyStrings())
return jiwer.Compose(pipeline)
- Unclear whether we can just swap in ReduceToListOfListOfWords
- Our team must use an exact (older) version of this very nice library.
from jiwer.
I intended 2.2 to 2.3 to be a major change, and also didn't think hard enough about deprecation policies, my bad.
In order to fix the bug mentioned in #46, I had to change the data structure used to compute the WER. Your pipeline now needs to end with the ReduceToListOfListOfWords transform.
Do you have some unit tests over your pipeline? I think the following should lead to the same WER:
pipeline = []
pipeline.append(jiwer.ToLowerCase())
pipeline.append(jiwer.RemoveWhiteSpace(replace_by_space=True))
pipeline.append(jiwer.RemoveMultipleSpaces()) # note, I believe this must come after RemoveWhiteSpace(True)
pipeline.append(jiwer.RemoveSpecificWords(word_list.split(",")))
pipeline.append(jiwer.RemovePunctuation())
pipeline.append(jiwer.Strip())
pipeline.append(jiwer.RemoveEmptyStrings())
pipeline.append(jiwer.ReduceToListOfListOfWords(word_delimiter=" "))
return jiwer.Compose(pipeline)
from jiwer.
Thanks for the suggestion, I will try it out.
As for deprecation, I understand, I will likely do a forced dependency upgrade on our users.
from jiwer.
This pipeline is close but not quite correct. I built a pipeline as you suggested, but it ultimately fails in the compute_measures function.
cleaned_ref = pipeline(reference)
cleaned_hyp = pipeline(hypothesis)
measures = jiwer.compute_measures(cleaned_ref, cleaned_hyp)  # line 211 in the stack trace below
Traceback (most recent call last):
File "analyze.py", line 252, in <module>
main()
File "analyze.py", line 246, in main
results = analyzer.analyze()
File "analyze.py", line 211, in analyze
measures = jiwer.compute_measures(cleaned_ref, cleaned_hyp)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/jiwer/measures.py", line 210, in compute_measures
truth, hypothesis, truth_transform, hypothesis_transform
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/jiwer/measures.py", line 321, in _preprocess
transformed_truth = truth_transform(truth)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/jiwer/transforms.py", line 76, in __call__
text = tr(text)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/jiwer/transforms.py", line 55, in __call__
return self.process_list(sentences)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/jiwer/transforms.py", line 202, in process_list
return [self.process_string(s) for s in inp]
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/jiwer/transforms.py", line 202, in <listcomp>
return [self.process_string(s) for s in inp]
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/jiwer/transforms.py", line 199, in process_string
return re.sub(r"\s\s+", " ", s)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py", line 192, in sub
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
I use compute_measures for two reasons:
- Convenience of computing all measures at once
- So that I can later output the original & transformed strings (useful for analysis)
from jiwer.
I used a list comprehension to optionally flatten my lists before sending them to compute_measures.
This fails when the hypothesis contains insertions:
File "analyze.py", line 216, in analyze
measures = jiwer.compute_measures(cleaned_ref, cleaned_hyp)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/jiwer/measures.py", line 210, in compute_measures
truth, hypothesis, truth_transform, hypothesis_transform
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/jiwer/measures.py", line 329, in _preprocess
len(transformed_truth), len(transformed_hypothesis)
ValueError: number of ground truth inputs (8) and hypothesis inputs (9) must match.
My line 216 is:
measures = jiwer.compute_measures(cleaned_ref, cleaned_hyp)
from jiwer.
cleaned_ref = pipeline(reference)
cleaned_hyp = pipeline(hypothesis)
measures = jiwer.compute_measures(cleaned_ref, cleaned_hyp)  # line 211 in the stack trace below
In the above code, you apply the pipeline before calling compute_measures, while you should pass the pipeline as a keyword argument:
measures = compute_measures(
    reference,
    hypothesis,
    truth_transform=pipeline,
    hypothesis_transform=pipeline,
)
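Judging by the traceback, compute_measures runs its own transform on whatever it receives, so input that was already reduced to lists of words reaches a step that only accepts strings. The sketch below reproduces that failure with the standard library only. `collapse_spaces`, `to_word_lists` and `toy_pipeline` are hypothetical stand-ins, not jiwer's implementation; only the regex is taken from the traceback.

```python
import re


def collapse_spaces(sentences):
    # Stand-in for a string-only step; same regex as shown in the traceback.
    return [re.sub(r"\s\s+", " ", s) for s in sentences]


def to_word_lists(sentences):
    # Stand-in for a final reduction step: one list of words per sentence.
    return [s.split(" ") for s in sentences]


def toy_pipeline(sentences):
    # Hypothetical two-step pipeline: clean the strings, then tokenize them.
    return to_word_lists(collapse_spaces(sentences))


reference = ["hello   world", "good  morning"]

# Applying the pipeline once works: strings in, lists of words out.
once = toy_pipeline(reference)

# Applying it a second time hands lists of words to re.sub,
# which only accepts strings or bytes.
try:
    toy_pipeline(once)
    second_pass_error = None
except TypeError as exc:
    second_pass_error = str(exc)

print(once)
print(second_pass_error)
```

Passing the raw strings together with the pipeline as a transform, as in the snippet above, avoids the second pass.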
Also, does the length between reference and hypothesis differ? If that's the case, you must use ReduceToSingleSentence before ReduceToListOfListOfWords. It would be easiest to debug if you could share a complete, running example, including reference and hypothesis.
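The length mismatch can be illustrated without jiwer. In this minimal sketch, `to_single_sentence` and `to_word_lists` are hypothetical stand-ins for ReduceToSingleSentence and ReduceToListOfListOfWords. Their assumed behavior (join all sentences into one, then split each sentence into words) is a simplification, not the library's code.

```python
def to_single_sentence(sentences, delimiter=" "):
    # Assumed behavior: merge all non-empty sentences into a single sentence.
    joined = delimiter.join(s for s in sentences if s)
    return [joined] if joined else []


def to_word_lists(sentences, delimiter=" "):
    # Assumed behavior: one list of non-empty words per sentence.
    return [[w for w in s.split(delimiter) if w] for s in sentences]


# Same words, but segmented into a different number of sentences.
reference = ["the cat sat", "on the mat"]
hypothesis = ["the cat", "sat on", "the mat"]

# Without merging, the number of inputs differs (2 vs 3).
unmerged_ref = to_word_lists(reference)
unmerged_hyp = to_word_lists(hypothesis)

# Merging first leaves exactly one list of words on each side.
merged_ref = to_word_lists(to_single_sentence(reference))
merged_hyp = to_word_lists(to_single_sentence(hypothesis))

print(len(unmerged_ref), len(unmerged_hyp))
print(len(merged_ref), len(merged_hyp))
```

With the merge in place, both sides hold a single list of words, so the comparison no longer depends on how the text was segmented.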
from jiwer.
Aha, the pair of ReduceToSingleSentence and ReduceToListOfListOfWords was critical. Now my pipeline runs to completion, with accurate results.
To summarize, I used to run:
pipeline = []
pipeline.append(jiwer.ToLowerCase())
pipeline.append(jiwer.RemoveMultipleSpaces())
pipeline.append(jiwer.RemoveWhiteSpace(replace_by_space=True))
pipeline.append(jiwer.SentencesToListOfWords(word_delimiter=" "))
pipeline.append(jiwer.RemoveSpecificWords(word_list.split(",")))
pipeline.append(jiwer.RemovePunctuation())
pipeline.append(jiwer.Strip())
pipeline.append(jiwer.RemoveEmptyStrings())
return jiwer.Compose(pipeline)
I have updated it to run:
pipeline = []
pipeline.append(jiwer.ToLowerCase())
pipeline.append(jiwer.RemoveWhiteSpace(replace_by_space=True))
pipeline.append(jiwer.RemoveMultipleSpaces())
pipeline.append(jiwer.RemoveSpecificWords(word_list.split(",")))
pipeline.append(jiwer.RemovePunctuation())
pipeline.append(jiwer.Strip())
pipeline.append(jiwer.RemoveEmptyStrings())
pipeline.append(jiwer.ReduceToSingleSentence())
pipeline.append(jiwer.ReduceToListOfListOfWords(word_delimiter=" "))
return jiwer.Compose(pipeline)
I'm going to run a few more test cases but I think it's working. Thanks @nikvaessen for all the help!
from jiwer.
def chunked_cer(targets, predictions, chunk_size=None):
    _predictions = [char for seq in predictions for char in list(seq)]
    _targets = [char for seq in targets for char in list(seq)]
    if chunk_size is None: return jiwer.wer(_targets, _predictions)
    start = 0
    end = chunk_size
    H, S, D, I = 0, 0, 0, 0
    while start < len(targets):
        _predictions = [char for seq in predictions[start:end] for char in list(seq)]
        _targets = [char for seq in targets[start:end] for char in list(seq)]
        chunk_metrics = jiwer.compute_measures(_targets, _predictions)
        H = H + chunk_metrics["hits"]
        S = S + chunk_metrics["substitutions"]
        D = D + chunk_metrics["deletions"]
        I = I + chunk_metrics["insertions"]
        start += chunk_size
        end += chunk_size
    return float(S + D + I) / float(H + S + D)
I get an error when I use jiwer:
ValueError: number of ground truth inputs (15787) and hypothesis inputs (14746) must match.
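The counts in this error are consistent with every character being passed as its own input: the two flattened lists must then have equal length, which fails as soon as a prediction is longer or shorter than its target. One possible way around it, sketched here with the standard library only, is to count character edits per target/prediction pair. `levenshtein` and `char_error_rate` are hypothetical helpers, not part of jiwer, and the sketch assumes targets and predictions are equal-length lists of strings. Unlike the chunked version above, it aligns each pair separately rather than a concatenated stream.

```python
def levenshtein(ref, hyp):
    # Classic dynamic-programming edit distance over characters.
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def char_error_rate(targets, predictions):
    # Assumption: targets[i] and predictions[i] describe the same utterance.
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must pair up one-to-one")
    edits = sum(levenshtein(t, p) for t, p in zip(targets, predictions))
    total = sum(len(t) for t in targets)
    if total == 0:
        raise ValueError("targets contain no characters")
    # (S + D + I) / (H + S + D), since H + S + D is the reference length.
    return edits / total
```

Because each pair is aligned on its own, an extra or missing character in one prediction only affects that pair's edit count.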
from jiwer.