Do you all have any plans to make the artificially generated training corpus reference

Artificially generated triplets? (openkiwi issue, closed)

unbabel commented on July 20, 2024

Artificially generated triplets?

from openkiwi.

Comments (6)

captainvera commented on July 20, 2024

Closing this since the issue has been solved.

Feel free to reopen if you have any further questions @MiroFurtado

MiroFurtado commented on July 20, 2024

oops - didn't mean to tag as bug!

captainvera commented on July 20, 2024

Hello @MiroFurtado

First of all, thanks for your interest in OpenKiwi!

You are right, the artificially generated training corpus (with ~500k triplets) that we reference in our paper is not publicly available. ~~It is based on the in-domain German corpus provided by WMT, which is what we used to pre-train the predictors. We provide a link to this corpus in our Quickstart section here.~~

We had no plans for making this data available, but you raise a valid concern for reproducibility.
Thanks for bringing this to our attention!
We have to take some things into consideration but we will get back to you soon!

MiroFurtado commented on July 20, 2024

Great! Let me know what you end up deciding.

trenous commented on July 20, 2024

Actually the triplets are available in the data section on this site. The download there contains both the 500k triplets we used, and a larger corpus of 4 million triplets.

To generate QE tags from the triplets, you can follow the instructions in this repository.
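
As a rough illustration of what "generating QE tags from a triplet" means (this is not the code from the linked repository), here is a minimal Python sketch that marks each machine-translation token as OK or BAD by comparing it with the post-edit. It assumes whitespace-tokenized input and uses the standard-library `difflib` matcher as a stand-in for a proper TER alignment, so the tags will not match the official ones exactly. The function name `target_tags` is hypothetical.

```python
from difflib import SequenceMatcher


def target_tags(mt_tokens, pe_tokens):
    """Tag each MT token OK if it is matched in the post-edit, else BAD.

    Assumption: difflib's longest-matching-block alignment stands in for
    the TER-based alignment normally used to produce word-level QE tags.
    """
    tags = ["BAD"] * len(mt_tokens)
    matcher = SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
    for block in matcher.get_matching_blocks():
        # Every MT position inside a matching block is left unchanged
        # by the post-editor, so it counts as OK.
        for i in range(block.a, block.a + block.size):
            tags[i] = "OK"
    return tags


if __name__ == "__main__":
    mt = "a black cat".split()
    pe = "a white cat".split()
    print(list(zip(mt, target_tags(mt, pe))))
```

Only tokens the post-editor left untouched come out as OK; anything replaced or deleted is BAD.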

trenous commented on July 20, 2024

Hello @MiroFurtado,

I uploaded the missing tags and sentence scores for the artificial roundtrip data to our releases.

If you want to recreate these files using the repository I posted in my previous comment, you need to train a FastAlign model, which is used to generate the source tags. We trained FastAlign on the English-German in-domain corpus of 3 million parallel sentences. If you use a different FastAlign model, you will get slightly different source tags.
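
To make the role of the aligner a bit more concrete, here is a small Python sketch of the data handling around FastAlign. FastAlign reads one sentence pair per line in the form `source ||| target` and writes alignments as zero-indexed `i-j` pairs. The projection rule below (a source token becomes BAD if it is aligned to a BAD target token) is an assumption for illustration; the repository may derive source tags differently. All function names are hypothetical.

```python
def to_fast_align_line(src_tokens, tgt_tokens):
    """Format one sentence pair the way fast_align expects its input."""
    return " ".join(src_tokens) + " ||| " + " ".join(tgt_tokens)


def parse_alignment(line):
    """Parse fast_align output such as '0-0 1-2' into (src, tgt) index pairs."""
    return [tuple(int(x) for x in pair.split("-")) for pair in line.split()]


def source_tags(n_src, tgt_tags, alignment):
    """Project target tags onto the source side.

    Assumption: a source token is BAD if any target token it is aligned
    to is BAD; unaligned source tokens stay OK.
    """
    tags = ["OK"] * n_src
    for s, t in alignment:
        if tgt_tags[t] == "BAD":
            tags[s] = "BAD"
    return tags


if __name__ == "__main__":
    src = ["the", "house"]
    mt = ["das", "kleine", "Haus"]
    print(to_fast_align_line(src, mt))
    links = parse_alignment("0-0 1-2")
    print(source_tags(len(src), ["OK", "BAD", "BAD"], links))
```

Because the source tags depend entirely on the alignment links, a different alignment model shifts which source tokens inherit a BAD tag.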
