Bug in SeamlessM4T-Large: t2tt when target is 'yue' (seamless_communication, open issue, 13 comments)

facebookresearch commented on July 16, 2024

Comments (13)

freedomtan commented on July 16, 2024

Hello, how do I solve this problem !! ModuleNotFoundError: No module named 'seamless_communication' !!

pip install git+https://github.com/facebookresearch/seamless_communication

If you are in a Jupyter notebook or Colab, prefix the command with an exclamation mark:

!pip install git+https://github.com/facebookresearch/seamless_communication
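A ModuleNotFoundError after a seemingly successful install often means pip ran against a different Python environment than the one executing the code. A minimal sketch of a check that uses only the standard library follows; the helper name `is_installed` is made up for illustration and is not part of seamless_communication.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the current import path."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("seamless_communication"):
    # Hypothetical message; the fix is the pip command shown above,
    # run in the same environment as this interpreter.
    print("seamless_communication is not importable from this interpreter")
```

Running this in the notebook cell or script that raises the error shows whether the package is visible to that particular interpreter.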

dalyafaraj commented on July 16, 2024

Hello, how do I solve this problem !!
ModuleNotFoundError: No module named 'seamless_communication' !!

chenyunsai commented on July 16, 2024

Where can I find the language abbreviations, such as eng?

chenyunsai commented on July 16, 2024

ValueError: lang must be a supported language, but is 'zh-cn' instead.

freedomtan commented on July 16, 2024

Where can I find the language abbreviations, such as eng?

In the paper and code in this repo 😀

https://github.com/facebookresearch/seamless_communication/blob/main/src/seamless_communication/assets/cards/unity_nllb-100.yaml

https://github.com/facebookresearch/seamless_communication/blob/main/src/seamless_communication/assets/cards/unity_nllb-200.yaml

Basically, they are ISO 639-3 codes.

elbayadm commented on July 16, 2024

You can also check the list in https://github.com/facebookresearch/seamless_communication/tree/main/scripts/m4t/predict#supported-languages
For Mandarin Chinese, we have cmn (Hans script) and cmn_Hant (Hant script)
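For callers that start from locale-style tags such as 'zh-cn', a small lookup table can translate them before calling predict. This is only a sketch: the table and the `to_m4t_lang` helper are hypothetical and not part of seamless_communication, and every row is an assumption about what a caller most likely wants (in particular the 'zh-tw' and 'zh-hk' rows).

```python
# Hypothetical mapping from locale-style tags to the codes mentioned in this thread.
# Not part of seamless_communication; adjust it to the languages you need.
LOCALE_TO_M4T = {
    "en": "eng",
    "zh-cn": "cmn",       # Mandarin, Hans script
    "zh-tw": "cmn_Hant",  # Mandarin, Hant script (assumed intent)
    "zh-hk": "yue",       # Cantonese (assumed intent)
}


def to_m4t_lang(tag: str) -> str:
    """Normalise a tag like 'zh_CN' or 'zh-cn' and look up its thread-style code."""
    key = tag.strip().lower().replace("_", "-")
    try:
        return LOCALE_TO_M4T[key]
    except KeyError:
        raise ValueError(f"no known code for {tag!r}") from None
```

Failing loudly on unknown tags keeps a typo from silently falling back to the wrong language.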

freedomtan commented on July 16, 2024

You can also check the list in https://github.com/facebookresearch/seamless_communication/tree/main/scripts/m4t/predict#supported-languages For Mandarin Chinese, we have cmn (Hans script) and cmn_Hant (Hant script)

There is some inconsistency between this list and the YAML files for the medium and large models:

- cmn_Hant is only supported by the large model (or rather, by the large model's tokenizer).
- There is zho_Hant in the medium tokenizer, but it behaves more like a variant of yue than like cmn_Hant in the large one, e.g.:

translated_text, _, _ = translator_medium.predict(to_translate, "t2tt", 'zho_Hant', src_lang='eng')
print(translated_text)
translated_text, _, _ = translator_medium.predict(to_translate, "t2tt", 'yue', src_lang='eng')
print(translated_text)

Results (zho_Hant first, then yue):

敘利亞總統 Bashar al-Assad 的軍隊喺早上 2 點之後好快擊中. 達馬士革郊區嘅Ghouta 居民話畀記者知,佢哋聽到一個奇怪嘅聲音,就好似有人打開一個<unk>酒瓶一樣. 一位當地醫生,反抗眼淚, 解釋咗好多人喺地下尋求庇護,但氣體比空氣重,而且聚集喺地下室同地下室.
敘利亞總統 Bashar al-Assad 嘅軍隊喺早上 2 點之後好快就擊中咗 達馬士革郊區 Ghouta 嘅居民話畀記者知 佢哋聽到一個奇怪嘅聲音 就好似有人打開一個<unk>酒瓶 一位當地醫生反抗眼淚 佢話好多人喺地下尋求庇護 但氣體比空氣好重 佢哋喺地下室同地下室聚集

Some glyphs in the zho_Hant output, e.g., 喺 and 嘅, are normally used only in yue (Cantonese).
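The glyph observation can be turned into a rough automated check. The sketch below counts a handful of characters that appear in the outputs above and are typical of written Cantonese. The marker set, the function names, and the threshold are assumptions chosen for illustration; this is a crude heuristic and not a language identifier.

```python
# Characters taken from the outputs above that are typical of written Cantonese.
# The set is a small, hand-picked assumption and far from exhaustive.
CANTONESE_MARKERS = set("喺嘅咗佢哋")


def cantonese_marker_count(text: str) -> int:
    """Count how many characters in `text` belong to the marker set."""
    return sum(1 for ch in text if ch in CANTONESE_MARKERS)


def looks_cantonese(text: str, min_markers: int = 1) -> bool:
    """Heuristic: flag text containing at least `min_markers` marker characters."""
    return cantonese_marker_count(text) >= min_markers


# A fragment from the output above, and a hand-written Mandarin rendering for contrast.
print(looks_cantonese("佢哋聽到一個奇怪嘅聲音"))
print(looks_cantonese("他們聽到一個奇怪的聲音"))
```

A check like this only says that Cantonese-specific characters are present. It cannot confirm that a text without them is Mandarin.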

elbayadm commented on July 16, 2024

That's correct, the list in the README is for the large model only. The medium model's tokenizer actually supports more languages (it's the same tokenizer as NLLB-200 with the langs in https://github.com/facebookresearch/fairseq/blob/nllb/examples/nllb/modeling/scripts/flores200/langs.txt).
That said, cmn_Hant in the large should be the same as zho_Hant in the medium (speaking of training data).

freedomtan commented on July 16, 2024

That's correct, the list in the README is for the large model only. The medium model's tokenizer actually supports more languages (it's the same tokenizer as NLLB-200 with the langs in https://github.com/facebookresearch/fairseq/blob/nllb/examples/nllb/modeling/scripts/flores200/langs.txt). That said, cmn_Hant in the large should be the same as zho_Hant in the medium (speaking of training data).

Nope, something must be wrong. The text returned by translated_text, _, _ = translator_medium.predict(to_translate, "t2tt", 'zho_Hant', src_lang='eng') is closer to Cantonese (yue) than to Mandarin (cmn or cmn_Hant). A Mandarin speaker with no prior knowledge of Cantonese would probably ask why there are random gibberish glyphs in it :-)

elbayadm commented on July 16, 2024

Thanks @freedomtan for your observations. The major differences between the medium and large models are:

- Medium reuses the NLLB-600M distilled model from NLLB.
- Large uses a new version of NLLB that focuses on the 100 languages of SeamlessM4T (hence nllb-100).

In training this new version of NLLB, we trained another tokenizer while enforcing the addition of frequent Chinese characters (see ¶Training a Text Tokenizer, page 31 of the paper https://ai.meta.com/research/publications/seamless-m4t/), so it should be better for cmn, cmn_Hant and yue. If I'm to believe chrF++ scores, then large is 5 chrF++ points better than medium on Flores eng -> cmn_Hant/zho_Hant.

freedomtan commented on July 16, 2024

@elbayadm To summarise what I know:

- For NLLB-200-600M, the three Chinese languages (zho_Hans, zho_Hant, yue_Hant) work as expected.
- For SeamlessM4T medium: zho_Hant, which is supposed to return Mandarin in Traditional Han, returned Cantonese in Traditional Han.
- For SeamlessM4T large: yue, which is supposed to return Cantonese in Traditional Han, returned Mandarin in Simplified Han.

I guess you do get better chrF++ scores, but something at the character/glyph level is wrong.

elbayadm commented on July 16, 2024

@freedomtan
I tested with another example from FLORES-200:

import torch
from seamless_communication.models.inference import Translator

translator_medium = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
translator_large = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:0"), torch.float16)

message_to_translate = "If you visit the Arctic or Antarctic areas in the winter you will experience the polar night, which means that the sun doesn't rise above the horizon."
translated_text, _, _ = translator_medium.predict(message_to_translate, "t2tt", 'cmn_Hant', src_lang='eng')
print(f'from medium model: {translated_text}')
translated_text, _, _ = translator_large.predict(message_to_translate, "t2tt", 'cmn_Hant', src_lang='eng')
print(f'from large model: {translated_text}')

from medium model: 如果你喺冬天去訪北極或者南極, 你會經歷極夜, 意思係太陽唔會喺地平線上升.
from large model: 如果你在冬天去北極或南極地區, 你會體驗北極夜,

And:

translated_text, _, _ = translator_medium.predict(message_to_translate, "t2tt", 'yue', src_lang='eng')
print(f'from medium model: {translated_text}')
translated_text, _, _ = translator_large.predict(message_to_translate, "t2tt", 'yue', src_lang='eng')
print(f'from large model: {translated_text}')

from medium model: 如果你喺冬天去北極或者南極, 你會發現北極嘅夜晚, 即係話太陽唔會喺地平線上升.
from large model: 如果你喺冬天去北極或者南極你會經歷北極嘅夜晚即係話太陽唔喺地平線上面升起

AFAICT:

"for SeamlessM4T medium: zho_Hant, which is supposed to return Mandarin in Traditional Han, returned Cantonese in Traditional Han"

This is true even for the FLORES example.

"for SeamlessM4T large: yue, which is supposed to return Cantonese in Traditional Han, returned Mandarin in Simplified Han"

This one looks like it's correctly translated into Cantonese. If you agree with my assessment, then the issue in your first example could be caused by the code switching with English in the output. I'll investigate this further to see whether the training data is wrongly labeled.

freedomtan commented on July 16, 2024

@elbayadm Thanks for spending time checking this issue.

this one looks like it's correctly translated in Cantonese. if you agree with my assessment, then the issue in your first example could be caused by the code switching with English in the output. I'll investigate this further, to see if the training data is not wrongly labeled.

Yes, it turns out that yue in the large model is a bit tricky. I have a shorter example:

to_translate_1 = "The forces of Syria's president, Bashar al-Assad, fight back."
to_translate_2 = "The forces of Syria's president, Bashar al-Assad, fight back soon."

translated_text, _, _ = translator_large.predict(to_translate_1, "t2tt", 'yue', src_lang='eng')
print(f'from large model 1: {translated_text}')
translated_text, _, _ = translator_large.predict(to_translate_2, "t2tt", 'yue', src_lang='eng')
print(f'from large model 2: {translated_text}')

The result of to_translate_1 is Cantonese in Traditional Han. The result of to_translate_2 is Mandarin in Simplified Han.

from large model 1: 敘利亞總統巴沙爾·阿薩德嘅軍隊反擊
from large model 2: 叙利亚总统巴沙尔·阿萨德 (Bashar al-Assad) 的军队很快就会反击.
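The script flip between the two outputs can be detected mechanically. The sketch below compares counts of characters that differ between Simplified and Traditional forms, using only pairs that occur in the two outputs above. The character sets and the name `guess_han_script` are assumptions for illustration; a real check would need a much larger table or a dedicated conversion library.

```python
# Simplified/Traditional pairs taken from the two outputs above (same order in both strings).
SIMPLIFIED_ONLY = set("叙亚总统尔萨军队会击")
TRADITIONAL_ONLY = set("敘亞總統爾薩軍隊會擊")


def guess_han_script(text: str) -> str:
    """Return 'Hans', 'Hant', or 'unknown' based on which marker set dominates."""
    hans = sum(1 for ch in text if ch in SIMPLIFIED_ONLY)
    hant = sum(1 for ch in text if ch in TRADITIONAL_ONLY)
    if hans > hant:
        return "Hans"
    if hant > hans:
        return "Hant"
    return "unknown"  # no markers found, or a tie


outputs = {
    "large model 1": "敘利亞總統巴沙爾·阿薩德嘅軍隊反擊",
    "large model 2": "叙利亚总统巴沙尔·阿萨德 (Bashar al-Assad) 的军队很快就会反击.",
}
for name, text in outputs.items():
    print(name, guess_han_script(text))
```

Running a check like this over a batch of yue translations would show how often the large model switches script, which may help narrow down which inputs trigger the behaviour.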
