Comments (13)
Hello, how do I solve this problem? ModuleNotFoundError: No module named 'seamless_communication'
pip install git+https://github.com/facebookresearch/seamless_communication
If you are in a Jupyter notebook or Colab:
!pip install git+https://github.com/facebookresearch/seamless_communication
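If the error persists after installing, one possible cause is that pip installed the package into a different environment than the one running your code. The following is a minimal sketch for checking that from inside the interpreter; the helper name is_installed is made up for illustration and is not part of seamless_communication.

```python
import importlib.util
import sys


def is_installed(module_name: str) -> bool:
    """Return True if the top-level module can be found by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Shows which Python is running, useful when several environments exist.
    print("interpreter:", sys.executable)
    print("seamless_communication found:", is_installed("seamless_communication"))
```

If the second line prints False, installing with the interpreter shown on the first line (for example by running its pip module) is worth trying.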
Where can I see the language abbreviations, similar to eng?
ValueError: lang must be a supported language, but is 'zh-cn' instead.
In the paper and code in this repo 😀
Basically, they are from ISO 639-3
You can also check the list in https://github.com/facebookresearch/seamless_communication/tree/main/scripts/m4t/predict#supported-languages
For Mandarin Chinese, we have cmn (Hans script) and cmn_Hant (Hant script).
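Since the codes follow ISO 639-3 rather than tags like 'zh-cn', a small lookup can translate common tags before calling predict. This is only a sketch: the function to_m4t_lang and the alias table are hypothetical, and only the target codes (eng, cmn, cmn_Hant) come from this thread.

```python
# Hypothetical alias table. The keys are illustrative; only the values
# (eng, cmn, cmn_Hant) are codes mentioned in this thread.
ALIASES = {
    "en": "eng",
    "zh": "cmn",
    "zh-cn": "cmn",
    "zh-hans": "cmn",
    "zh-tw": "cmn_Hant",
    "zh-hant": "cmn_Hant",
}


def to_m4t_lang(code: str) -> str:
    """Map a common language tag to a SeamlessM4T-style code.

    Unknown codes are returned unchanged, so the library's own
    validation still decides whether they are supported.
    """
    key = code.strip().lower().replace("_", "-")
    return ALIASES.get(key, code)


if __name__ == "__main__":
    print(to_m4t_lang("zh-cn"))
```

The supported-languages list linked above remains the authority for which codes a given model accepts.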
There is some inconsistency between this and the YAML files for the medium and large models:
- cmn_Hant is only supported by the large model (or rather, the large tokenizer)
- there is zho_Hant in the medium tokenizer, but it behaves more like a variant of yue than like cmn_Hant in the large one, e.g.,
translated_text, _, _ = translator_medium.predict(to_translate, "t2tt", 'zho_Hant', src_lang='eng')
print(translated_text)
translated_text, _, _ = translator_medium.predict(to_translate, "t2tt", 'yue', src_lang='eng')
print(translated_text)
results:
敘利亞總統 Bashar al-Assad 的軍隊喺早上 2 點之後好快擊中. 達馬士革郊區嘅Ghouta 居民話畀記者知,佢哋聽到一個奇怪嘅聲音,就好似有人打開一個<unk>酒瓶一樣. 一位當地醫生,反抗眼淚, 解釋咗好多人喺地下尋求庇護,但氣體比空氣重,而且聚集喺地下室同地下室.
敘利亞總統 Bashar al-Assad 嘅軍隊喺早上 2 點之後好快就擊中咗 達馬士革郊區 Ghouta 嘅居民話畀記者知 佢哋聽到一個奇怪嘅聲音 就好似有人打開一個<unk>酒瓶 一位當地醫生反抗眼淚 佢話好多人喺地下尋求庇護 但氣體比空氣好重 佢哋喺地下室同地下室聚集
Some glyphs in the zho_Hant output, e.g., 喺 and 嘅, are usually used for yue only.
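This kind of check can be roughly automated by scanning the output for characters typical of written Cantonese. The sketch below is a heuristic under stated assumptions: 喺 and 嘅 are the glyphs named above, the other marker characters are added for illustration, and the function name cantonese_markers is made up.

```python
from typing import List

# 喺 and 嘅 are named in this thread; the remaining characters are an
# assumption about glyphs that are typical of written Cantonese.
CANTONESE_MARKERS = frozenset("喺嘅咗佢哋唔畀")


def cantonese_markers(text: str) -> List[str]:
    """Return the distinct marker characters in `text`, in order of first appearance."""
    seen: List[str] = []
    for ch in text:
        if ch in CANTONESE_MARKERS and ch not in seen:
            seen.append(ch)
    return seen


if __name__ == "__main__":
    # Fragment taken from the first result above.
    print(cantonese_markers("佢哋聽到一個奇怪嘅聲音"))
```

An empty list does not prove the text is Mandarin; it only means none of the listed glyphs appeared.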
That's correct, the list in the README is for the large model only. The medium model's tokenizer actually supports more languages (it's the same tokenizer as NLLB-200 with the langs in https://github.com/facebookresearch/fairseq/blob/nllb/examples/nllb/modeling/scripts/flores200/langs.txt).
That said, cmn_Hant in the large should be the same as zho_Hant in the medium (speaking of training data).
Nope, something must be wrong. What is returned by translated_text, _, _ = translator_medium.predict(to_translate, "t2tt", 'zho_Hant', src_lang='eng') is closer to Cantonese (yue) than to Mandarin (cmn or cmn_Hant). A Mandarin speaker with no prior knowledge of Cantonese would probably ask why there are some random gibberish glyphs :-)
Thanks @freedomtan for your observations. The major differences between the medium and large models are:
- Medium reuses the NLLB-600M distilled model from NLLB
- Large uses a new version of NLLB that focuses on the 100 languages of SeamlessM4T (hence nllb-100).
In training this new version of NLLB, we trained another tokenizer while enforcing the addition of frequent Chinese characters (see ¶Training a Text Tokenizer - page 31 of the paper https://ai.meta.com/research/publications/seamless-m4t/) so it should be better for cmn, cmn_Hant and yue. If I'm to believe chrF++ scores, then large is 5 chrF++ points better than medium on Flores eng->cmn_Hant/zho_Hant.
@elbayadm to summarise what I know
- for NLLB-200-600M, the three Chinese languages (zho_Hans, zho_Hant, yue_Hant) work as expected.
- for SeamlessM4T medium: zho_Hant, which is supposed to return Mandarin in Traditional Han, returned Cantonese in Traditional Han
- for SeamlessM4T large: yue, which is supposed to return Cantonese in Traditional Han, returned Mandarin in Simplified Han
I guess you do get better chrF++ points, but something at the character/glyph level is wrong.
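To illustrate why a character-overlap metric may not by itself reveal a wrong language variety, here is a simplified character n-gram F-score. It is a sketch in the spirit of chrF, not the chrF++ implementation used for the reported scores; the function names, the default max_n=2 and beta=2.0 are assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of `text`, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def char_fscore(hypothesis: str, reference: str, max_n: int = 2, beta: float = 2.0) -> float:
    """Simplified character n-gram F-beta score, averaged over orders 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is too short for this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta**2) * precision * recall / (beta**2 * precision + recall))
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # A Cantonese-style and a Mandarin-style rendering of the same clause
    # share most of their characters, so the score stays well above zero.
    print(char_fscore("如果你喺冬天去北極或者南極", "如果你在冬天去北極或南極地區"))
```

Because written Cantonese and Mandarin in Traditional Han share most characters, an output in the wrong variety can still overlap heavily with the reference, so a glyph-level inspection is a useful complement to the score.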
@freedomtan
I tested with another example from FLORES-200
import torch
from seamless_communication.models.inference import Translator
translator_medium = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
translator_large = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
message_to_translate = "If you visit the Arctic or Antarctic areas in the winter you will experience the polar night, which means that the sun doesn\'t rise above the horizon."
translated_text, _, _ = translator_medium.predict(message_to_translate, "t2tt", 'cmn_Hant', src_lang='eng')
print(f'from medium model: {translated_text}')
translated_text, _, _ = translator_large.predict(message_to_translate, "t2tt", 'cmn_Hant', src_lang='eng')
print(f'from large model: {translated_text}')
from medium model: 如果你喺冬天去訪北極或者南極, 你會經歷極夜, 意思係太陽唔會喺地平線上升.
from large model: 如果你在冬天去北極或南極地區, 你會體驗北極夜,
And:
translated_text, _, _ = translator_medium.predict(message_to_translate, "t2tt", 'yue', src_lang='eng')
print(f'from medium model: {translated_text}')
translated_text, _, _ = translator_large.predict(message_to_translate, "t2tt", 'yue', src_lang='eng')
print(f'from large model: {translated_text}')
from medium model: 如果你喺冬天去北極或者南極, 你會發現北極嘅夜晚, 即係話太陽唔會喺地平線上升.
from large model: 如果你喺冬天去北極或者南極 你會經歷北極嘅夜晚 即係話太陽唔喺地平線上面升起
AFAICT:
"for SeamlessM4T medium: zho_Hant, which is supposed to return Mandarin in Traditional Han, returned Cantonese in Traditional Han"
This is true even for the FLORES example.
"for SeamlessM4T large: yue, which is supposed to return Cantonese in Traditional Han, returned Mandarin in Simplified Han"
This one looks like it's correctly translated into Cantonese.
If you agree with my assessment, then the issue in your first example could be caused by the code switching with English in the output.
I'll investigate this further, to see whether the training data is wrongly labeled.
@elbayadm Thanks for spending time checking this issue.
Yes, it turned out that yue in the large model is a bit tricky. I have a shorter example.
to_translate_1 = "The forces of Syria's president, Bashar al-Assad, fight back."
to_translate_2 = "The forces of Syria's president, Bashar al-Assad, fight back soon."
translated_text, _, _ = translator_large.predict(to_translate_1, "t2tt", 'yue', src_lang='eng')
print(f'from large model 1: {translated_text}')
translated_text, _, _ = translator_large.predict(to_translate_2, "t2tt", 'yue', src_lang='eng')
print(f'from large model 2: {translated_text}')
The result of to_translate_1 is Cantonese in Traditional Han. The result of to_translate_2 is Mandarin in Simplified Han.
from large model 1: 敘利亞總統巴沙爾·阿薩德嘅軍隊反擊
from large model 2: 叙利亚总统巴沙尔·阿萨德 (Bashar al-Assad) 的军队很快就会反击.
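The script switch between the two outputs can be flagged with a rough heuristic that counts characters known to differ between Simplified and Traditional Han. The sketch below uses only a tiny hand-picked table built from the two outputs above, so it is an illustration rather than a general detector; guess_script is a made-up name.

```python
# (Simplified, Traditional) pairs picked from the two outputs above.
# A real check would need a full conversion table.
PAIRS = [("叙", "敘"), ("亚", "亞"), ("总", "總"), ("统", "統"), ("尔", "爾"),
         ("萨", "薩"), ("军", "軍"), ("队", "隊"), ("击", "擊")]
SIMPLIFIED = frozenset(s for s, _ in PAIRS)
TRADITIONAL = frozenset(t for _, t in PAIRS)


def guess_script(text: str) -> str:
    """Return 'Hans', 'Hant', or 'unknown' based on which known characters dominate."""
    simp = sum(ch in SIMPLIFIED for ch in text)
    trad = sum(ch in TRADITIONAL for ch in text)
    if simp > trad:
        return "Hans"
    if trad > simp:
        return "Hant"
    return "unknown"


if __name__ == "__main__":
    print(guess_script("敘利亞總統巴沙爾·阿薩德嘅軍隊反擊"))
    print(guess_script("叙利亚总统巴沙尔·阿萨德 (Bashar al-Assad) 的军队很快就会反击."))
```

Running such a check over a batch of inputs could help find which prompts trigger the switch, such as the added "soon" in the second sentence.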