Comments (6)

mikemoritz commented on May 21, 2024

I had also noticed that the English -> Chinese translation was just providing the English string back. Looking at multiple hypotheses for that shows that the "valid" translation is getting a lower score:

hypotheses=0
        batch=0:
                raw: {'tokens': ['▁H', 'ello', '▁world', '!'], 'score': -2.543808937072754}
                tokens: ['▁H', 'ello', '▁world', '!']
        debug translation:
                "Hello world!"
hypotheses=1
        batch=0:
                raw: {'tokens': ['▁', '希', '洛', '世界', '!'], 'score': -3.0267810821533203}
                tokens: ['▁', '希', '洛', '世界', '!']
        debug translation:
                "希洛世界!"
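To make the scoring concrete, here is a minimal sketch (plain Python, using the two hypotheses from the dump above) of why the untranslated copy wins: the scores are log-probabilities, so the hypothesis closest to zero is selected.

```python
# Toy reconstruction of the two hypotheses from the debug dump above.
# Scores are log-probabilities, so "best" means least negative.
hypotheses = [
    {"tokens": ["▁H", "ello", "▁world", "!"], "score": -2.543808937072754},
    {"tokens": ["▁", "希", "洛", "世界", "!"], "score": -3.0267810821533203},
]

def best_hypothesis(hyps):
    """Return the hypothesis with the highest (least negative) score."""
    return max(hyps, key=lambda h: h["score"])

best = best_hypothesis(hypotheses)
# The English copy wins (-2.54 > -3.03), matching the observed behavior.
text = "".join(best["tokens"]).replace("▁", " ").strip()
```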

from argos-translate.

mikemoritz commented on May 21, 2024

Thanks for the quick response.

Yes I ruled out the Stanza and SentencePiece steps as the source of non-determinism.

I didn't have any luck with CTranslate, but I did find this interesting paper from Google on their NMT and beam search implementation: https://arxiv.org/abs/1609.08144

Specifically this on length normalization/penalty:

With length normalization, we aim to account for the fact that we have to compare hypotheses of different length. Without some form of length-normalization regular beam search will favor shorter results over longer ones on average since a negative log-probability is added at each step, yielding lower (more negative) scores for longer sentences.

They recommend 0.2 as a default for both the length normalization/penalty and the coverage penalty. CTranslate defaults both to zero, but the forums do recommend them as tuning parameters. Setting both parameters to 0.2 gave me consistent results between my two hosts, so this could be something to consider, though the best values may need to be determined experimentally for your models.
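As a rough illustration of the effect, here is the GNMT length penalty from the linked paper, lp(Y) = ((5 + |Y|) / 6)^α, applied to the two hypotheses from the earlier debug dump (token counts 4 and 5). Note this is the paper's formula; CTranslate's length_penalty option may normalize slightly differently.

```python
import math

def length_penalty(length, alpha):
    """GNMT length penalty: lp(Y) = ((5 + |Y|) / 6) ** alpha."""
    return ((5 + length) / 6) ** alpha

def normalized_score(log_prob, length, alpha):
    """Divide the raw log-probability by the length penalty."""
    return log_prob / length_penalty(length, alpha)

# Raw scores from the debug dump above.
raw_english, raw_chinese = -2.5438, -3.0268
english = normalized_score(raw_english, 4, alpha=0.2)
chinese = normalized_score(raw_chinese, 5, alpha=0.2)
# With alpha > 0 the longer hypothesis is penalized less per token,
# narrowing the gap between the two scores (though in this particular
# example the English copy still ranks first).
```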

Thanks for the details on the Chinese translation (and sorry for conflating it in this issue...).

PJ-Finlay commented on May 21, 2024

I asked about length_penalty on the OpenNMT forum, and it sounds like ideally this would be fine-tuned for individual language pairs. For now I'm just setting it to 0.2, which seems like a better default than 0 based on the linked paper and keeps the simplicity of one pipeline for all languages.

I'm closing this issue now but leaving the other one open for Chinese translations.

PJ-Finlay commented on May 21, 2024

Thanks for the detailed bug report, and happy to hear you're considering using the library. I didn't attempt to make translations deterministic, but you're right that this would be a good feature to have for people who need it.

You said, "[you] didn't notice any difference in the actual argos-translate parsing logic", so you've ruled out Stanza sentence boundary detection and SentencePiece tokenization as the source of non-determinism? If not, it's possible that CTranslate is deterministic but the sentence boundary detection or tokenization is causing overall non-determinism.

I don’t think the random seed issue you linked is going to be the solution if this is a CTranslate issue. My understanding is that CTranslate does not depend on PyTorch and Argos Translate only requires PyTorch for Stanza. Also the models are trained using OpenNMT for TensorFlow.

I looked at the CTranslate documentation and I don’t see anything about it being deterministic or having the option of being deterministic. I'm not that familiar with the specifics of CTranslate and I would recommend making an issue on the CTranslate project. I've also gotten pretty good support from the OpenNMT forum in the past so you could post there too.

For debugging, the only thing I can think of is that CTranslate generates a random seed somewhere or somehow behaves differently with different amounts of memory available. If so, the best solution may be to use a container or virtual machine as a standard environment when you need deterministic results. It looks like a larger beam size uses more memory, so if CTranslate falls back to a smaller beam size when less memory is available, setting the beam size to 1 could give you deterministic translations (but potentially lower quality).

The Chinese translation is a separate issue that I am aware of, and it is caused by a lack of data. I got all of the translation data from the OPUS open parallel corpus, and there's a surprising lack of data available for the English-Chinese pair. There are only 333 million tokens available for English-Chinese, which isn't very many compared to other languages (there are 4.5 billion for English-German). This causes the model to often repeat the English input when translating to Chinese, especially with single words or short sentences. I've found that entering a longer sentence or putting a period after your input increases the likelihood of getting a real translation. This gets to your question of whether "Hello world!" is a good test string: it should be fine in most cases, but a full sentence is probably closer to the training data and may give you better results.

The ideas I've had to fix this are to either find more data (possibly including an English-Chinese dictionary to supplement what is mostly full-sentence data and help with short translations), or, if this is an over-training issue, train for fewer epochs than I do for languages with more data. It's interesting that if you generate multiple hypotheses you do get a real translation, just at a worse score. I hadn't thought to try this, and it seems like evidence for overfitting. There could be a hack where you use the second-best translation for short English to Chinese translations, but that's probably not an ideal solution.
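The fallback hack mentioned above could be sketched like this (plain Python, using the two detokenized hypotheses from the earlier debug dump; `pick_translation` is a hypothetical helper, not part of Argos Translate):

```python
def pick_translation(source, hypotheses):
    """If the top hypothesis is just a copy of the source text,
    fall back to the next-best hypothesis that differs from it.
    `hypotheses` is a list of detokenized strings, best score first."""
    for hyp in hypotheses:
        if hyp.strip() != source.strip():
            return hyp
    return hypotheses[0]  # everything was a copy; keep the best

# Hypotheses from the debug dump, ordered by score.
hyps = ["Hello world!", "希洛世界!"]
result = pick_translation("Hello world!", hyps)  # skips the copy
```

As noted, this is a workaround rather than a fix: it papers over the scoring problem and would misfire on inputs that legitimately translate to themselves (names, numbers, etc.).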

Please share if you find any good solutions; adding support for deterministic translations would be a nice feature to have. Ideally this could be done in a way that relies on a stable interface from CTranslate.

PJ-Finlay commented on May 21, 2024

Very interesting. The coverage penalty looks worth investigating as an enhancement: it could lead to improved translation quality with a small code change and no model retraining.

pierotofy commented on May 21, 2024

I had also noticed that the English -> Chinese translation was just providing the English string back.

Came to open an issue about this, but somebody already beat me to it. 🥂
