Comments (6)
Closing this since the issue has been solved.
Feel free to reopen if you have any further questions @MiroFurtado
from openkiwi.
oops - didn't mean to tag as bug!
Hello @MiroFurtado
First of all, thanks for your interest in OpenKiwi!
You are right, the artificially generated training corpus (with ~500k triplets) we reference in our paper is not publicly available. It is based on the in-domain German corpus provided by WMT, which is what we used to pre-train the predictors. We provide a link for this corpus in our Quickstart section here.
We had no plans for making this data available, but you raise a valid concern for reproducibility.
Thanks for bringing this to our attention!
We have to take a few things into consideration, but we will get back to you soon!
Great! Let me know what you end up deciding.
Actually the triplets are available in the data section on this site. The download there contains both the 500k triplets we used, and a larger corpus of 4 million triplets.
To generate QE tags from the triplets, you can follow the instructions in this repository.
Hello @MiroFurtado,
I uploaded the missing tags and sentence scores for the artificial roundtrip data to our releases.
If you want to recreate these files using the repository I posted in my previous comment, you need to train a FastAlign model, which is used to generate the source tags. We trained FastAlign on the English-German in-domain corpus of 3 million parallel sentences; if you use a different FastAlign model, you will get slightly different source tags.
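For anyone wondering why the alignment model affects the source tags, here is a rough sketch of the idea, not the code from the repository. It assumes a simplified rule where a source token is tagged BAD if it is aligned to at least one BAD token on the target side; the exact rule used for the released files may differ. The function names (`format_fast_align_line`, `parse_alignment`, `project_source_tags`) and the example sentence are hypothetical. The `source ||| target` input format and the `i-j` (Pharaoh-style) output format are what fast_align reads and writes.

```python
from typing import List, Set, Tuple


def format_fast_align_line(source: str, target: str) -> str:
    """Join a sentence pair into the `source ||| target` line that fast_align reads."""
    return f"{source.strip()} ||| {target.strip()}"


def parse_alignment(line: str) -> Set[Tuple[int, int]]:
    """Parse Pharaoh-style `i-j` pairs (source index i, target index j)."""
    pairs = set()
    for pair in line.split():
        src_idx, tgt_idx = pair.split("-")
        pairs.add((int(src_idx), int(tgt_idx)))
    return pairs


def project_source_tags(
    num_source_tokens: int,
    target_tags: List[str],
    alignment: Set[Tuple[int, int]],
) -> List[str]:
    """Assumed rule: a source token is BAD if any aligned target token is BAD."""
    tags = ["OK"] * num_source_tokens
    for src_idx, tgt_idx in alignment:
        if target_tags[tgt_idx] == "BAD":
            tags[src_idx] = "BAD"
    return tags


# Hypothetical example: the last target word is tagged BAD.
source = "the house is small"
target = "das Haus ist gross"
target_tags = ["OK", "OK", "OK", "BAD"]

fast_align_input = format_fast_align_line(source, target)
alignment = parse_alignment("0-0 1-1 2-2 3-3")
source_tags = project_source_tags(len(source.split()), target_tags, alignment)
print(fast_align_input)
print(" ".join(source_tags))
```

Since the source tags depend entirely on which target tokens each source token is linked to, a FastAlign model trained on different data can produce different links and therefore slightly different tags.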