
Comments (4)

Lillliant commented on September 7, 2024

Hi @farinamhz, I've calculated the statistics for the twitter dataset and Google Translate's review files, which I have uploaded to the OneDrive paths under LADy0.2.0.1 > statistics.


farinamhz commented on September 7, 2024

Previous conversation:

[Thursday 9:16 PM] Christine Wong
Hi Farinam Hemmati Zadeh, I unfortunately cannot make it to Friday's progress meeting this week, but I've added my progress to the issue pages and have made a PR so it can be reviewed. In addition to the questions there, I've also noticed that the exact match metrics calculated using LADy0.2.0.0's semeval datasets are different from the semeval+ statistics. Would this be something to be concerned about?

[Friday 11:59 AM] Farinam Hemmati Zadeh
Hey Christine, no worries! Thank you very much for the update and your work! You mean the newly translated reviews are different from the previously translated ones, right? If so, that is expected, as the translated reviews for LADy0.2.0.1 come from a new translator (googletranslate).

[Friday 12:42 PM] Farinam Hemmati Zadeh
Christine Wong, I just realized that you said LADy0.2.0.0. LADy0.2.0.0 should be the same as before, as only twitter has been added to this version. However, I wanted you to calculate the metrics for LADy0.2.0.1, which holds the googletranslate results. I think there was some confusion between the two. Which one have you calculated?
[Friday 12:44 PM] Farinam Hemmati Zadeh
In fact, the previous results should not differ significantly from LADy0.2.0.0. Did they? Christine
[Friday 9:06 PM] Christine Wong
Hi Farinam Hemmati Zadeh, I've calculated the twitter metrics based on LADy0.2.0.0, which I put into the readme.md. I've also calculated all the datasets (semeval + twitter) for the googletranslate results in LADy0.2.0.1, which are not in the readme.md but are uploaded to OneDrive.

[Friday 9:16 PM] Christine Wong
The results aren't too different (around a 0.01 difference between LADy0.2.0.0 and the data in the readme.md), but I was wondering if it is alright to "mix" the results together, since the twitter metrics might look different if they were produced at the same time as the metrics in the readme.

[Friday 9:19 PM] Christine Wong
I've also attached a run with the semeval15/16 results for better comparison: it seems the newer version has a higher exact match (em) metric, which might be a good thing since it may suggest similar sentence structure.


farinamhz commented on September 7, 2024

Hey @Lillliant,
Let's continue here.
I appreciate the updates you've provided. Everything is going well, but I'm facing some confusion regarding the calculation of the BLEU and ROUGE scores. It appears that longer tweets with diverse contexts do not yield accurate exact match results compared with the semeval datasets, so perhaps we should explore alternative metrics like BLEU and ROUGE in this context. Before that, I want to make sure what the inputs of BLEU and ROUGE are.


Lillliant commented on September 7, 2024

Hi @farinamhz, sure! I've attached my run of the twitter (LADy0.2.0.0) dataset here:

For bleu metrics:

| dataset | pes_Arab_bleu | zho_Hans_bleu | deu_Latn_bleu | arb_Arab_bleu | fra_Latn_bleu | spa_Latn_bleu |
|---|---|---|---|---|---|---|
| twitter | 0.2110 | 0.1892 | 0.4025 | 0.33383 | 0.3891 | 0.4439 |
| semeval-2016-restaurant | 0.3746 | 0.3065 | 0.5435 | 0.4465 | 0.5314 | 0.5864 |
| semeval-2015-restaurant | 0.3787 | 0.3080 | 0.5514 | 0.4523 | 0.5318 | 0.5895 |

For rouge metrics:

| dataset | pes_Arab_rouge_f | zho_Hans_rouge_f | deu_Latn_rouge_f | arb_Arab_rouge_f | fra_Latn_rouge_f | spa_Latn_rouge_f |
|---|---|---|---|---|---|---|
| twitter | 0.1889 | 0.1677 | 0.3307 | 0.2589 | 0.3117 | 0.3596 |
| semeval-2016-restaurant | 0.2802 | 0.2224 | 0.4258 | 0.3360 | 0.4089 | 0.4628 |
| semeval-2015-restaurant | 0.2783 | 0.2224 | 0.4332 | 0.3387 | 0.4076 | 0.4661 |

Here, the bleu metrics were obtained by first computing, for each sentence, the mean of its bleu scores under the weights [(1.0,), (0.5, 0.5), (0.3333, 0.3333, 0.3333)], and then averaging those per-sentence means over the dataset.

The rouge metrics were obtained the same way: for each sentence, the mean of its F1 scores from rouge-1 to rouge-5, then the average over the dataset.
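
For reference, here is a minimal sketch of that averaging scheme. It assumes nltk's `sentence_bleu` for BLEU and uses a hand-rolled n-gram F1 as a stand-in for rouge-1 through rouge-5; the tokenization, smoothing choice, and toy data are illustrative assumptions, not necessarily what LADy actually runs:

```python
# A minimal sketch of the averaging described above; tokenization, smoothing,
# and the hand-rolled ROUGE-n stand-in are assumptions, not LADy's own code.
from collections import Counter
from statistics import mean
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

WEIGHTS = [(1.0,), (0.5, 0.5), (0.3333, 0.3333, 0.3333)]

def sentence_bleu_mean(reference, hypothesis):
    """Mean of the 1-, 2-, and 3-gram cumulative BLEU scores for one sentence."""
    smooth = SmoothingFunction().method1  # avoid zero scores on short sentences
    return mean(sentence_bleu([reference], hypothesis, weights=w,
                              smoothing_function=smooth) for w in WEIGHTS)

def rouge_n_f1(reference, hypothesis, n):
    """F1 over n-gram overlap (a simple stand-in for ROUGE-n)."""
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    hyp = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    overlap = sum((ref & hyp).values())
    if not overlap:
        return 0.0
    p, r = overlap / sum(hyp.values()), overlap / sum(ref.values())
    return 2 * p * r / (p + r)

def sentence_rouge_mean(reference, hypothesis):
    """Mean of the rouge-1 .. rouge-5 F1 scores for one sentence."""
    return mean(rouge_n_f1(reference, hypothesis, n) for n in range(1, 6))

# Dataset-level score: average the per-sentence means over all sentence pairs.
refs = [["the", "food", "was", "great"]]      # original reviews (tokenized); toy data
hyps = [["the", "food", "was", "fantastic"]]  # back-translated reviews; toy data
bleu = mean(sentence_bleu_mean(r, h) for r, h in zip(refs, hyps))
rouge = mean(sentence_rouge_mean(r, h) for r, h in zip(refs, hyps))
print(f"bleu={bleu:.4f} rouge_f={rouge:.4f}")
```

So the inputs in both cases are token lists for a reference sentence (the original review) and a hypothesis sentence (the back-translated review), scored pairwise and then averaged.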

