facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation


Seamless is a family of AI models that enable more natural and authentic communication across languages. SeamlessM4T is a massive multilingual multimodal machine translation model supporting around 100 languages. SeamlessM4T serves as the foundation for SeamlessExpressive, a model that preserves elements of prosody and voice style across languages, and SeamlessStreaming, a model supporting simultaneous translation and streaming ASR for around 100 languages. SeamlessExpressive and SeamlessStreaming are combined into Seamless, a unified model featuring multilingual, real-time, and expressive translation.

Links

Demos

  • SeamlessM4T v2: Demo, 🤗 HuggingFace Space
  • SeamlessExpressive: Demo, 🤗 HuggingFace Space
  • SeamlessStreaming: 🤗 HuggingFace Space

Papers

Seamless

EMMA

SONAR

Blog

AI at Meta Blog

Tutorial

An exhaustive tutorial, given at the NeurIPS 2023 Seamless EXPO, is a one-stop shop for learning how to use the entire suite of Seamless models. Please feel free to play with the notebook.

SeamlessM4T

SeamlessM4T is our foundational all-in-one Massively Multilingual and Multimodal Machine Translation model delivering high-quality translation for speech and text in nearly 100 languages.

SeamlessM4T models support the tasks of:

  • Speech-to-speech translation (S2ST)
  • Speech-to-text translation (S2TT)
  • Text-to-speech translation (T2ST)
  • Text-to-text translation (T2TT)
  • Automatic speech recognition (ASR)

🌟 We are releasing SeamlessM4T v2, an updated version with our novel UnitY2 architecture. This new model improves over SeamlessM4T v1 in quality as well as inference latency in speech generation tasks.

To learn more about the collection of SeamlessM4T models, the approach used in each, their language coverage and their performance, visit the SeamlessM4T README or 🤗 Model Card.

Note

SeamlessM4T is also available in the 🤗 Transformers library. Visit this section for more details.
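
For instance, here is a minimal sketch of using the model through 🤗 Transformers (this assumes a recent transformers release that ships SeamlessM4Tv2Model; treat the exact names as assumptions and defer to the Transformers docs):

import torch
from transformers import AutoProcessor, SeamlessM4Tv2Model

# Assumes a transformers version with SeamlessM4T v2 support.
processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

# T2ST: translate English text into French speech.
inputs = processor(text="Hello, my dog is cute.", src_lang="eng", return_tensors="pt")
audio = model.generate(**inputs, tgt_lang="fra")[0].cpu().numpy().squeeze()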

SeamlessExpressive

SeamlessExpressive is a speech-to-speech translation model that captures certain underexplored aspects of prosody, such as speech rate and pauses, while preserving the style of one's voice and maintaining high content translation quality.

To learn more about SeamlessExpressive models, visit the SeamlessExpressive README or 🤗 Model Card.

SeamlessStreaming

SeamlessStreaming is a streaming translation model. The model supports speech as the input modality and speech/text as output modalities.

The SeamlessStreaming model supports the following tasks:

  • Speech-to-speech translation (S2ST)
  • Speech-to-text translation (S2TT)
  • Automatic speech recognition (ASR)

To learn more about SeamlessStreaming models, visit the SeamlessStreaming README or 🤗 Model Card.

Seamless

The Seamless model is the unified model for expressive, streaming speech-to-speech translation.

What's new

  • [12/18/2023] We are open-sourcing our Conformer-based W2v-BERT 2.0 speech encoder as described in Section 3.2.1 of the paper, which is at the core of our Seamless models.
  • [12/14/2023] We are releasing the Seamless tutorial given at NeurIPS 2023.

Quick Start

Installation

Note

One of the prerequisites is fairseq2, which has pre-built packages available only for Linux x86-64 and Apple silicon Mac computers. In addition, it depends on libsndfile, which might not be installed on your machine. If you experience any installation issues, please refer to its README for further instructions.

pip install .

Note

Transcribing inference audio for computing metrics uses Whisper, which is installed automatically. Whisper in turn requires the command-line tool ffmpeg to be installed on your system, which is available from most package managers.
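
For example, on Debian/Ubuntu:

sudo apt install ffmpeg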

Running inference

SeamlessM4T Inference

Here’s an example of using the CLI from the root directory to run inference.

S2ST task:

m4t_predict <path_to_input_audio> --task s2st --tgt_lang <tgt_lang> --output_path <path_to_save_audio>

T2TT task:

m4t_predict <input_text> --task t2tt --tgt_lang <tgt_lang> --src_lang <src_lang>
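
For example, to translate an English sentence into French:

m4t_predict "Hello, how are you?" --task t2tt --tgt_lang fra --src_lang eng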

Please refer to the inference README for detailed instructions on how to run inference and the list of supported languages on the source and target sides for the speech and text modalities.
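
Inference can also be driven from Python. Below is a minimal sketch using this package's Translator class; the model/vocoder card names follow the tables in the Resources section, but treat the exact signature as an assumption and defer to the inference README:

import torch
from seamless_communication.inference import Translator

# Model and vocoder card names as listed in the Resources section below.
translator = Translator(
    "seamlessM4T_v2_large",
    "vocoder_v2",
    device=torch.device("cuda:0"),
    dtype=torch.float16,
)

# S2ST: translate the input audio into Spanish speech (and text).
text_output, speech_output = translator.predict(
    input="input.wav",
    task_str="S2ST",
    tgt_lang="spa",
)
print(text_output[0])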

For running S2TT/ASR natively (without Python) using GGML, please refer to the unity.cpp section.

SeamlessExpressive Inference

Note

Please check the section on how to download the model.

Here’s an example of using the CLI from the root directory to run inference.

expressivity_predict <path_to_input_audio> --tgt_lang <tgt_lang> --model_name seamless_expressivity --vocoder_name vocoder_pretssel --output_path <path_to_save_audio>

SeamlessStreaming and Seamless Inference

The Streaming Evaluation README has detailed instructions for running evaluations for the SeamlessStreaming and Seamless models. The CLI has a --no-scoring option that can be used to skip the scoring part and just run inference.

Please check the inference README for more details.

Running SeamlessStreaming Demo

You can duplicate the SeamlessStreaming HF space to run the streaming demo.

You can also run the demo locally, by cloning the space from here. See the README of the SeamlessStreaming HF repo for more details on installation.

Running SeamlessM4T & SeamlessExpressive Gradio demos locally

To launch the same demo Space we host on Hugging Face locally:

cd demo
pip install -r requirements.txt
python app.py

Resources and usage

Model

SeamlessM4T models

Model Name #params checkpoint metrics
SeamlessM4T-Large v2 2.3B 🤗 Model card - checkpoint metrics
SeamlessM4T-Large (v1) 2.3B 🤗 Model card - checkpoint metrics
SeamlessM4T-Medium (v1) 1.2B 🤗 Model card - checkpoint metrics

SeamlessExpressive models

🤗 Model card

To access and download SeamlessExpressive, please request the model artifacts through this request form. Upon approval, you will then receive an email with download links to each model artifact.

Please note that SeamlessExpressive is made available under its own License and Acceptable Use Policy.

SeamlessStreaming models

Model Name #params checkpoint metrics
SeamlessStreaming 2.5B 🤗 Model card - monotonic decoder checkpoint - streaming UnitY2 checkpoint metrics

Seamless models

The Seamless model is simply the SeamlessStreaming model with the non-expressive vocoder_v2 swapped out for the expressive vocoder_pretssel. Please check out the above section on how to acquire the vocoder_pretssel checkpoint.

W2v-BERT 2.0 speech encoder

Model Name #params checkpoint
W2v-BERT 2.0 600M 🤗 Model card - checkpoint

Here's how to run a forward pass through the speech encoder:

import torch

from fairseq2.data.audio import AudioDecoder, WaveformToFbankConverter
from fairseq2.memory import MemoryBlock
from fairseq2.nn.padding import get_seqs_and_padding_mask
from fairseq2.data import Collater
from pathlib import Path
from seamless_communication.models.conformer_shaw import load_conformer_shaw_model


# Fill in your own audio path, device, and dtype before running.
audio_wav_path, device, dtype = ...

audio_decoder = AudioDecoder(dtype=torch.float32, device=device)
fbank_converter = WaveformToFbankConverter(
    num_mel_bins=80,
    waveform_scale=2**15,
    channel_last=True,
    standardize=True,
    device=device,
    dtype=dtype,
)
collater = Collater(pad_value=1)

model = load_conformer_shaw_model("conformer_shaw", device=device, dtype=dtype)
model.eval()

# Read the raw audio bytes and decode them into a waveform.
with Path(audio_wav_path).open("rb") as fb:
    block = MemoryBlock(fb.read())

decoded_audio = audio_decoder(block)

# Convert the waveform into log-mel filterbank features and batch them.
src = collater(fbank_converter(decoded_audio))["fbank"]
seqs, padding_mask = get_seqs_and_padding_mask(src)

# Forward pass through the frontend and the Conformer encoder.
with torch.inference_mode():
    seqs, padding_mask = model.encoder_frontend(seqs, padding_mask)
    seqs, padding_mask = model.encoder(seqs, padding_mask)
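
The encoder outputs frame-level features in seqs (shaped batch x time x dim). If you need a single utterance-level vector, one common heuristic (an assumption here, not something the model prescribes) is to mean-pool over the time axis:

# Mean-pool frame features into one embedding per utterance.
utterance_embedding = seqs.mean(dim=1)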

Evaluation

SeamlessM4T Evaluation

To reproduce our results, or to evaluate using the same metrics over your own test sets, please check out the README here.

SeamlessExpressive Evaluation

Below is the script for efficient batched evaluation.

export MODEL_DIR="/path/to/SeamlessExpressive/model"
export TEST_SET_TSV="input.tsv" # Your dataset in a TSV file, with headers "id", "audio"
export TGT_LANG="spa" # Target language to translate into; options include "fra", "deu", "eng" ("cmn" and "ita" are experimental)
export OUTPUT_DIR="tmp/" # Output directory for generated text/unit/waveform
export TGT_TEXT_COL="tgt_text" # The column in your ${TEST_SET_TSV} with the reference target text for calculating the BLEU score. You can skip this argument.
export DFACTOR="1.0" # Duration factor to tune the predicted duration (preddur=DFACTOR*preddur) per position, which affects the output speech rate. Greater values mean a slower speech rate (defaults to 1.0). See the expressive evaluation README for details on the duration factors we used.
expressivity_evaluate ${TEST_SET_TSV} \
  --gated-model-dir ${MODEL_DIR} --task s2st --tgt_lang ${TGT_LANG} \
  --audio_root_dir "" --output_path ${OUTPUT_DIR} --ref_field ${TGT_TEXT_COL} \
  --model_name seamless_expressivity --vocoder_name vocoder_pretssel \
  --text_unk_blocking True --duration_factor ${DFACTOR}

Please check out this README section.

SeamlessStreaming and Seamless Evaluation

Streaming Evaluation README has detailed instructions for running evaluations on the SeamlessStreaming and Seamless models.

Unity.cpp

To enable Seamless Communication Everywhere, we implemented unity.cpp so that users can run SeamlessM4T models in GGML, a C tensor library that allows easier integration on a wide variety of platforms.

To transcribe/translate a given audio file:

./ggml/bin/unity --model seamlessM4T_medium.ggml input.wav

For build details and more usage, please check out unity.cpp.

Expressive Datasets

We created two expressive speech-to-speech translation datasets, mExpresso and mDRAL, between English and five other languages: French, German, Italian, Mandarin, and Spanish. We currently open-source the speech-to-text portion of mExpresso for out-of-English directions, and we will open-source the remaining parts of the datasets soon. For details, please check out the README.

SeamlessAlignExpressive

We’re introducing the first expressive speech alignment procedure. Starting with raw data, the expressive alignment procedure automatically discovers pairs of audio segments sharing not only the same meaning, but also the same overall expressivity. To showcase this procedure, we are making metadata available to create a benchmarking dataset called SeamlessAlignExpressive, which can be used to validate the quality of our alignment method. SeamlessAlignExpressive is the first large-scale (11k+ hours) collection of multilingual audio alignments for expressive translation. More details can be found on the SeamlessAlignExpressive README.

Converting raw audio to units

Please check out the README here. Note that the SeamlessM4T v1 model uses reduced units, while other models use non-reduced units.
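
As a sketch of what that README covers, unit extraction looks roughly like this (the UnitExtractor arguments, checkpoint names, and layer index below are assumptions based on this repository; defer to the linked README):

import torch
from seamless_communication.models.unit_extractor import UnitExtractor

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Assumed feature-extractor card and k-means codebook URL; see the README.
unit_extractor = UnitExtractor(
    "xlsr2_1b_v2",
    "https://dl.fbaipublicfiles.com/seamlessM4T/models/unit_extraction/kmeans_10k.npy",
    device=device,
)
# Discretize features from the 35th encoder layer (0-indexed as 34).
units = unit_extractor.predict("input.wav", 35 - 1)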

Libraries

Seamless Communication depends on four libraries developed by Meta.

fairseq2 is our next-generation open-source library of sequence modeling components that provides researchers and developers with building blocks for machine translation, language modeling, and other sequence generation tasks. All SeamlessM4T models in this repository are powered by fairseq2.

SONAR (Sentence-level multimOdal and laNguage-Agnostic Representations) is a new multilingual and multimodal sentence embedding space that outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks. SONAR provides text and speech encoders for many languages. SeamlessAlign was mined based on SONAR embeddings.
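
As an illustrative sketch of SONAR text embeddings (pipeline and card names follow the SONAR repository's examples; treat them as assumptions and see that repo for the definitive API):

from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

# Assumed encoder/tokenizer card names from the SONAR repository.
t2vec = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
)
embeddings = t2vec.predict(["My name is SONAR."], source_lang="eng_Latn")
print(embeddings.shape)  # one fixed-size sentence embedding per input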

BLASER 2.0 is our latest model-based evaluation metric for multimodal translation. It is an extension of BLASER, supporting both speech and text. It operates directly on the source signal, and as such, does not require any intermediate ASR system like ASR-BLEU. As in the first version, BLASER 2.0 leverages the similarity between input and output sentence embeddings. SONAR is the underlying embedding space for BLASER 2.0. Scripts to run evaluation with BLASER 2.0 can be found in the SONAR repo.

As part of the Seamless Communication project, we've extended the stopes library. Version 1 provided a text-to-text mining tool to build training datasets for translation models. Version 2 has been extended, thanks to SONAR, to support tasks around training large speech translation models. In particular, we provide tools to read/write the fairseq audiozip datasets and a new mining pipeline that can do speech-to-speech, text-to-speech, speech-to-text, and text-to-text mining, all based on the new SONAR embedding space.

SimulEval is a library used for evaluating simultaneous translation models. SimulEval also provides a backend for generation using partial/incremental inputs with flexible/extensible states, which is used to implement streaming inference. Users define agents that implement SimulEval's interface and that can be connected together in a pipeline. You can find agents implemented for SeamlessStreaming here.
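
To give a flavor of that interface, here is a toy agent sketch (the class, action, and state names follow SimulEval's documented interface; the policy itself is a hypothetical placeholder, not a real model):

from simuleval.agents import SpeechToTextAgent
from simuleval.agents.actions import ReadAction, WriteAction

class ToyWaitAgent(SpeechToTextAgent):
    """Toy policy: buffer about 2 seconds of audio, then emit a placeholder."""

    def policy(self):
        buffered_seconds = len(self.states.source) / self.states.source_sample_rate
        if not self.states.source_finished and buffered_seconds < 2:
            return ReadAction()  # keep consuming incremental source audio
        # A real agent would run streaming inference on the buffer here.
        return WriteAction("<partial hypothesis>", finished=self.states.source_finished)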

[Legacy] SeamlessM4T v1 instructions

Finetuning SeamlessM4T v1 models

Please check out the README here.

On-device models

Apart from the SeamlessM4T-Large (2.3B) and SeamlessM4T-Medium (1.2B) models, we are also releasing a small model (281M) targeted at on-device inference. To learn more about its usage and model details, check out the README here.

SeamlessAlign mined dataset

We open-source the metadata for SeamlessAlign, the largest open dataset for multimodal translation, totaling 270k+ hours of aligned speech and text data. The dataset can be rebuilt by the community based on the SeamlessAlign README.

Citation

If you use Seamless in your work or any models/datasets/artifacts published with Seamless, please cite:

@inproceedings{seamless2023,
  title = {Seamless: Multilingual Expressive and Streaming Speech Translation},
  author = {{Seamless Communication}, Lo{\"i}c Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, Yilin Yang, Ethan Ye, Ivan Evtimov, Pierre Fernandez, Cynthia Gao, Prangthip Hansanti, Elahe Kalbassi, Amanda Kallet, Artyom Kozhevnikov, Gabriel Mejia, Robin San Roman, Christophe Touret, Corinne Wong, Carleigh Wood, Bokai Yu, Pierre Andrews, Can Balioglu, Peng-Jen Chen, Marta R. Costa-juss{\`a}, Maha Elbayad, Hongyu Gong, Francisco Guzm{\'a}n, Kevin Heffernan, Somya Jain, Justine Kao, Ann Lee, Xutai Ma, Alex Mourachko, Benjamin Peloquin, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Anna Sun, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang, Mary Williamson},
  journal = {ArXiv},
  year = {2023}
}

License

We have three license categories.

The following non-generative components are MIT licensed as found in MIT_LICENSE:

The following models are CC-BY-NC 4.0 licensed as found in the LICENSE:

  • SeamlessM4T models (v1 and v2).
  • SeamlessStreaming models.

The following models are Seamless licensed as found in SEAMLESS_LICENSE:

  • Seamless models.
  • SeamlessExpressive models.


seamless_communication's Issues

Error with running predict.py

I have a problem while executing this script:

❯ python3 predict.py /home/jambo/Downloads/bbb.mp3 s2st ukr --output_path au.mp3 --model_name seamlessM4T_large
INFO:__main__:Running inference on the GPU.
Downloading the checkpoint of the model 'seamlessM4T_large'...
100%|████████████████████████████████████████████████████████████████████████████████| 10.7G/10.7G [31:36<00:00, 6.03MB/s]
Downloading the tokenizer of the model 'seamlessM4T_large'...
100%|████████████████████████████████████████████████████████████████████████████████| 4.93M/4.93M [00:00<00:00, 6.06MB/s]
Downloading the checkpoint of the model 'vocoder_36langs'...
100%|██████████████████████████████████████████████████████████████████████████████████| 160M/160M [00:27<00:00, 6.06MB/s]
Traceback (most recent call last):
  File "/home/jambo/Documents/myown/python/projects/seamless_communication/src/predict.py", line 86, in <module>
    main()
  File "/home/jambo/Documents/myown/python/projects/seamless_communication/src/predict.py", line 67, in main
    translated_text, wav, sr = translator.predict(
  File "/home/jambo/Documents/myown/python/envs/seamless/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jambo/Documents/myown/python/projects/seamless_communication/src/seamless_communication/models/inference/translator.py", line 209, in predict
    result = self.get_prediction(
  File "/home/jambo/Documents/myown/python/projects/seamless_communication/src/seamless_communication/models/inference/translator.py", line 120, in get_prediction
    return generator(
  File "/home/jambo/Documents/myown/python/envs/seamless/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jambo/Documents/myown/python/projects/seamless_communication/src/seamless_communication/models/unity/generator.py", line 173, in __call__
    text_output = self.s2t_generator.generate_ex(source_seqs, source_seq_lens)
  File "/home/jambo/Documents/myown/python/envs/seamless/lib/python3.10/site-packages/fairseq2/generation/text.py", line 155, in generate_ex
    return self._do_generate(source_seqs, source_seq_lens)
  File "/home/jambo/Documents/myown/python/envs/seamless/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jambo/Documents/myown/python/envs/seamless/lib/python3.10/site-packages/fairseq2/generation/text.py", line 71, in _do_generate
    encoder_output, encoder_padding_mask = self.model.encode(
  File "/home/jambo/Documents/myown/python/projects/seamless_communication/src/seamless_communication/models/unity/model.py", line 191, in encode
    return self.encoder(seqs, padding_mask)  # type: ignore[no-any-return]
  File "/home/jambo/Documents/myown/python/envs/seamless/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jambo/Documents/myown/python/projects/seamless_communication/src/seamless_communication/models/unity/adaptor_block.py", line 104, in forward
    seqs, padding_mask = self.inner(seqs, padding_mask, layer_output_hook)
  File "/home/jambo/Documents/myown/python/envs/seamless/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jambo/Documents/myown/python/envs/seamless/lib/python3.10/site-packages/fairseq2/nn/transformer/encoder.py", line 155, in forward
    seqs, padding_mask = layer(seqs, padding_mask)
  File "/home/jambo/Documents/myown/python/envs/seamless/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jambo/Documents/myown/python/envs/seamless/lib/python3.10/site-packages/fairseq2/models/conformer/block.py", line 124, in forward
    seqs = self._forward_conv(seqs, padding_mask)
  File "/home/jambo/Documents/myown/python/envs/seamless/lib/python3.10/site-packages/fairseq2/models/conformer/block.py", line 169, in _forward_conv
    seqs = self.conv(seqs, padding_mask)
  File "/home/jambo/Documents/myown/python/envs/seamless/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jambo/Documents/myown/python/envs/seamless/lib/python3.10/site-packages/fairseq2/models/conformer/convolution.py", line 113, in forward
    seqs = self.pointwise_conv1(seqs)
  File "/home/jambo/Documents/myown/python/envs/seamless/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jambo/Documents/myown/python/envs/seamless/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/jambo/Documents/myown/python/envs/seamless/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (0). Kernel size: (1). Kernel size can't be greater than actual input size

Runtime error

$m4t_predict hello how are you t2st eng --src_lang eng --output_path .
m4t_predict: command not found

I am trying to run this, but I am not able to get the result. How do I get this working?

libsndfile Error

I first cloned the repo and then created a conda environment.

Afterwards, I ran the following two commands as stated in the Installation section:

pip install .
conda install -y -c conda-forge libsndfile

The installation completes without an error. However, if I run the following code:

python3 scripts/m4t/predict/predict.py "Teknolojiyi merkezine alan yenilikçi yapımızla hayatını kolaylaştırmaya devam ediyoruz." t2tt eng --src_lang tur

I get the following output:

Traceback (most recent call last):
  File "/home/guvenc/seamless_communication/scripts/m4t/predict/predict.py", line 10, in <module>
    from seamless_communication.models.inference import Translator
  File "/home/guvenc/.local/lib/python3.10/site-packages/seamless_communication/models/inference/__init__.py", line 6, in <module>
    from seamless_communication.models.inference.translator import Translator as Translator
  File "/home/guvenc/.local/lib/python3.10/site-packages/seamless_communication/models/inference/translator.py", line 12, in <module>
    from fairseq2.data import Collater
  File "/home/guvenc/.local/lib/python3.10/site-packages/fairseq2/data/__init__.py", line 7, in <module>
    from fairseq2.data.cstring import CString as CString
  File "/home/guvenc/.local/lib/python3.10/site-packages/fairseq2/data/cstring.py", line 58, in <module>
    from fairseq2n.bindings.data.string import CString as CString
  File "/home/guvenc/.local/lib/python3.10/site-packages/fairseq2n/__init__.py", line 90, in <module>
    _load_sndfile()
  File "/home/guvenc/.local/lib/python3.10/site-packages/fairseq2n/__init__.py", line 80, in _load_sndfile
    raise OSError(
OSError: libsndfile is not found!. Use your system package manager to install it (e.g. `apt install libsndfile1`).

I use the Ubuntu-based Pop!_OS.

sudo apt install libsndfile1 outputs:

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libsndfile1 is already the newest version (1.0.31-2build1).
The following packages were automatically installed and are no longer required:
  golang-1.18-go golang-1.18-src golang-src
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 250 not upgraded.

Installing libsndfile (https://github.com/libsndfile/libsndfile) from source didn't help either.

I wonder what I might be doing wrong.
Thanks!

Fine-tuning on single GPU

Is it possible to fine-tune this in a limited hardware environment, like a single 3090?

Any thoughts on a LoRA implementation?

CPU

Can I run this on CPU only, on a Windows machine?

Out of RAM with on-device models

Hello. My computer is a fanless Linux Mint (latest version) machine with 6 GB of RAM, with no other programs running.

free -h:
              total        used        free      shared  buff/cache   available
Mem:          5,6Gi       398Mi       4,2Gi       109Mi       1,0Gi       4,9Gi
Swap:         2,0Gi       350Mi       1,7Gi

The code below gets killed, and htop shows the RAM and swap usage going to 100%. Any help, please?

import torchaudio
import torch

TEST_AUDIO_PATH = "jfk.wav"
TGT_LANG = "eng"
audio_input, _ = torchaudio.load(TEST_AUDIO_PATH) # Load waveform using torchaudio
s2t_model = torch.jit.load("unity_on_device_s2t.ptl") # Load exported S2T model
text = s2t_model(audio_input, tgt_lang=TGT_LANG) # Forward call with tgt_lang specified for ASR or S2TT
print(text)

Bug in SeamlessM4T-Large: t2tt when target is 'yue'

When the seamlessM4T_large model is used for t2tt with tgt_lang='yue', src_lang='eng', the returned results are in Mandarin with Simplified Han glyphs (the expected results are in Cantonese with Traditional Han glyphs).

import torch
from seamless_communication.models.inference import Translator

translator_medium = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
translator_large = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:0"), torch.float16)

message_to_translate = 'The forces of Syria’s president, Bashar al-Assad, struck soon after 2am. Residents of Ghouta, a Damascus suburb, told reporters that they heard a strange noise, as if someone was opening a bottle of Pepsi. A local doctor, fighting back tears, explained that many people had sought shelter underground, but the gas was heavier than air and it pooled in basements and cellars. Had they climbed the stairs instead, they would have lived.'
translated_text, _, _ = translator_medium.predict(message_to_translate, "t2tt", 'yue', src_lang='eng')
# with medium size model, we got expected Cantonese contents in Traditional Han glyphs
print(f'from medium model: {translated_text}')
translated_text, _, _ = translator_large.predict(message_to_translate, "t2tt", 'yue', src_lang='eng')
# with large size model, we got Mandarin contents in Simplified Han glyphs (NOT expected yue in Traditional Han script)
print(f'from large model: {translated_text}')

The results

from medium model: 敘利亞總統 Bashar al-Assad 嘅軍隊喺早上 2 點之後好快就擊中咗 達馬士革郊區 Ghouta 嘅居民話畀記者知 佢哋聽到一個奇怪嘅聲音 就好似有人打開一個<unk>酒瓶 一位當地醫生反抗眼淚 佢話好多人喺地下尋求庇護 但氣體比空氣好重 佢哋喺地下室同地下室聚集
from large model: 叙利亚总统巴沙尔·阿萨德 (Bashar al-Assad) 的军队在凌晨 2 点袭击. 大马士革郊区古塔 (Ghouta) 的居民告诉记者,他们听到一个奇怪的噪音,好像有人在打开百事可乐的瓶子. 一个当地医生,控制着眼泪,解释说许多人寻求地下避难所,但气体比空气更重,它聚集在地下室和地下室.如果他们爬上楼梯,他们会活下来.

Error when fine-tuning the model: ModuleNotFoundError: No module named 'fairseq2.models.unity'

Hello, I am following the fine-tuning guide with the following command:

torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1 --nproc-per-node=8 --no-python python finetune.py --mode SPEECH_TO_SPEECH --train_dataset ./m4t_dataset/train_manifest.json --eval_dataset ./m4t_dataset/validation_manifest.json --learning_rate 1e-6 --warmup_steps 100 --max_epochs 10 --patience 3 --model_name seamlessM4T_large --save_model_to ./m4t_dataset/checkpoint.pt

However, this happens:



Traceback (most recent call last):
  File "/home/privateserver/Coding/seamless_communication/scripts/m4t/finetune/finetune.py", line 16, in <module>
    import trainer
  File "/home/privateserver/Coding/seamless_communication/scripts/m4t/finetune/trainer.py", line 21, in <module>
    from fairseq2.models.unity import UnitYModel
ModuleNotFoundError: No module named 'fairseq2.models.unity'
(the same traceback is repeated for each of the eight ranks)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 87402) of binary: python
Traceback (most recent call last):
  File "/home/privateserver/Coding/seamless_communication/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/privateserver/Coding/seamless_communication/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/privateserver/Coding/seamless_communication/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/privateserver/Coding/seamless_communication/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/privateserver/Coding/seamless_communication/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/privateserver/Coding/seamless_communication/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
python FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-08-23_01:55:36
  host      : privateserver
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 87403)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-08-23_01:55:36
  host      : privateserver
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 87404)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-08-23_01:55:36
  host      : privateserver
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 87405)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2023-08-23_01:55:36
  host      : privateserver
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 87406)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2023-08-23_01:55:36
  host      : privateserver
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 87407)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2023-08-23_01:55:36
  host      : privateserver
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 87408)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2023-08-23_01:55:36
  host      : privateserver
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 87409)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-23_01:55:36
  host      : privateserver
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 87402)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Can someone tell me how to fix this? Thank you

Missing asset card

After several adjustments and adaptations in the repository, I was able to compile the source code on Ubuntu.
However, the execution gives an error because the program does not find the CARD of the model.
I imagine that the documentation is missing some step on how to set up for inference only.

(voz) astro@ubuntu:~/dev/voz/fix/seamless_communication$ m4t_predict  "seu tolo, nao sei de nada" t2tt en --src_lang pt
2023-08-22 22:46:12,134 INFO -- m4t_scripts.predict.predict: Running inference on the CPU.
Traceback (most recent call last):
  File "/home/astro/dev/voz/fix/fairseq2/src/fairseq2/assets/card_storage.py", line 76, in load_card
    fp = open(pathname)
FileNotFoundError: [Errno 2] No such file or directory: '/home/astro/dev/voz/voz/lib/python3.8/site-packages/seamless_communication/assets/cards/seamlessM4T_large.yaml'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/astro/dev/voz/voz/bin/m4t_predict", line 8, in <module>
    sys.exit(main())
  File "/home/astro/dev/voz/voz/lib/python3.8/site-packages/m4t_scripts/predict/predict.py", line 69, in main
    translator = Translator(args.model_name, args.vocoder_name, device)
  File "/home/astro/dev/voz/voz/lib/python3.8/site-packages/seamless_communication/models/inference/translator.py", line 60, in __init__
    self.model: UnitYModel = load_unity_model(
  File "/home/astro/dev/voz/fix/fairseq2/src/fairseq2/models/utils/model_loader.py", line 175, in __call__
    card = self.asset_store.retrieve_card(model_name_or_card)
  File "/home/astro/dev/voz/fix/fairseq2/src/fairseq2/assets/store.py", line 101, in retrieve_card
    data = self._storage.load_card(name)
  File "/home/astro/dev/voz/fix/fairseq2/src/fairseq2/assets/card_storage.py", line 78, in load_card
    raise AssetCardNotFoundError(
fairseq2.assets.card_storage.AssetCardNotFoundError: An asset card with the name 'seamlessM4T_large' cannot be found.

Failed in ASR task

I tried to test the ASR task in the CLI, but it failed. Am I missing anything?

$m4t_predict --model seamlessM4T_medium 16k.wav asr eng
2023-08-23 16:17:41,203 INFO -- m4t_scripts.predict.predict: Running inference on the GPU.
Using the cached checkpoint of the model 'seamlessM4T_medium'. Set force=True to download again.
Using the cached tokenizer of the model 'seamlessM4T_medium'. Set force=True to download again.
Using the cached checkpoint of the model 'vocoder_36langs'. Set force=True to download again.
Traceback (most recent call last):
....
File "/home/kaisermac/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kaisermac/miniconda3/lib/python3.11/site-packages/fairseq2/nn/transformer/relative_attention.py", line 293, in forward
raise ValueError(
ValueError: The input sequence length must be less than or equal to the maximum sequence length (4096), but is 16272 instead.

Albanian language support

Hi there,
I see that Albanian is not on the list of supported languages. What can be done by anyone from the Albanian-speaking community to have our language supported?

OSError: libtorch_cuda_cpp.so: cannot open shared object file: No such file or directory

OS: ubuntu 18.04
Python: 3.9
PyTorch: 2.0.1+cu117

The installation completed successfully, as follows:

pip install .
conda install -y -c conda-forge libsndfile

Running it gives an error:

m4t_predict <input_text> t2tt <tgt_lang> --src_lang <src_lang>

Traceback (most recent call last):
File "/data/seamless_communication/build/lib/m4t_scripts/predict/predict.py", line 9, in <module>
import torchaudio
File "/home/ai/anaconda3/envs/cpm39/lib/python3.9/site-packages/torchaudio/__init__.py", line 1, in <module>
from torchaudio import ( # noqa: F401
File "/home/ai/anaconda3/envs/cpm39/lib/python3.9/site-packages/torchaudio/_extension.py", line 135, in <module>
_init_extension()
File "/home/ai/anaconda3/envs/cpm39/lib/python3.9/site-packages/torchaudio/_extension.py", line 105, in _init_extension
_load_lib("libtorchaudio")
File "/home/ai/anaconda3/envs/cpm39/lib/python3.9/site-packages/torchaudio/_extension.py", line 52, in _load_lib
torch.ops.load_library(path)
File "/home/ai/anaconda3/envs/cpm39/lib/python3.9/site-packages/torch/_ops.py", line 643, in load_library
ctypes.CDLL(path)
File "/home/ai/anaconda3/envs/cpm39/lib/python3.9/ctypes/__init__.py", line 382, in __init__
self._handle = _dlopen(self._name, mode)
OSError: libtorch_cuda_cpp.so: cannot open shared object file: No such file or directory

CoreML version

Has anyone tried to optimize this PyTorch model for the Apple Neural Engine by creating a CoreML model?

Hardware Requirements for Deploying seamless_communication on a Linux Server

Hello,

I'm interested in deploying the seamless_communication project on my own Linux server. Before proceeding, I'd like to ensure that my server meets the necessary hardware requirements.

Could you please provide details on the recommended or minimum hardware specifications for running this project? Specifically, I'm looking for information on:

  • CPU requirements (e.g., number of cores, speed)
  • RAM size
  • GPU requirements (if applicable)
  • Disk space
I've gone through the documentation, but I couldn't find specific details regarding these hardware aspects. Any guidance or best practices related to hardware would be greatly appreciated.

Thank you for your time and assistance!

Best regards,

seamless communication on raspberry pi 4 or Nvidia Jetson

Is it possible to use a pretrained model for inference on a Raspberry Pi 4 or an NVIDIA Jetson?
What are the hardware requirements for use on such boards?
Can anyone help me, or does anyone have experience using a pretrained model like this on such boards?

m4t s2tt produces bad-quality transcription

On a Colab GPU instance, I set up the M4T runtime environment and tried an S2TT task. It produces a bad-quality transcription, shown below, compared to Whisper. I wonder if I have been doing something wrong in setting up Seamless M4T. My source voice is attached as a zip to this post.

/content/seamless_communication# m4t_predict japanweather.wav s2tt jpn
2023-08-23 12:32:11,231 INFO -- m4t_scripts.predict.predict: Running inference on the GPU.
Using the cached checkpoint of the model 'seamlessM4T_large'. Set force=True to download again.
Using the cached tokenizer of the model 'seamlessM4T_large'. Set force=True to download again.
Using the cached checkpoint of the model 'vocoder_36langs'. Set force=True to download again.
2023-08-23 12:33:46,030 INFO -- m4t_scripts.predict.predict: Translated text in jpn: 台風の最新情報は二十八日三時頃に**の西海を伴う注意必要な状況もありました ⁇ 特に台風に向かって強い風力が続く状況もあります ⁇
/content/seamless_communication#

While using Whisper:

whisper japanweather.wav
Detecting language using up to the first 30 seconds. Use --language to specify the language
Detected language: Japanese
[00:00.000 --> 00:03.680] 予防センターから台風の最新情報をお伝えいたします
[00:03.680 --> 00:07.240] 大型で強い台風5号は28日3時現在
[00:07.240 --> 00:11.040] **の西の海上を北に時速20キロで済んでいます
[00:11.040 --> 00:14.440] 中心の気圧は955ヘクトパスカル
[00:14.440 --> 00:17.240] 中心吹きの最大風速は40メートルです
[00:17.240 --> 00:19.760] この後も北上を続けまして
[00:19.760 --> 00:23.840] 28日のうちに**大陸に上陸する見通しです
[00:23.840 --> 00:27.760] 大陸に上陸した後は急速に成力を弱めまして
[00:27.760 --> 00:32.400] 29日には熱帯的やつに変わると見られます
[00:32.400 --> 00:35.440] まだ強い制御庫を保っているということもありまして
[00:35.440 --> 00:40.320] 沖縄周辺、特に先島方面では風が強まるような状況です
[00:40.320 --> 00:45.680] 平均で15メートルを超えるような風の強い状況となることも考えられます
[00:45.680 --> 00:50.320] 恐怖や高波などには引き続き注意が必要といった状況です
[00:50.320 --> 00:53.640] また台風に向かって湿った空気が流れ込む影響で
[00:53.640 --> 00:55.920] 沖縄方面、先島だけではなくて
[00:55.920 --> 01:00.000] 沖縄は本当エリアにも甘くものがかかりやすい状況が続きます
[01:00.000 --> 01:01.760] 短時間ではありますけれども
[01:01.760 --> 01:03.880] 雨がざっと降るようなこともありますし
[01:03.880 --> 01:06.040] 雷の友だう心配もありますので
[01:06.040 --> 01:11.680] 雨や風、そして高波には引き続き注意が必要と言えそうです
[01:11.680 --> 01:14.680] 以上台風に関する情報をお伝えいたしました

japanweather.wav.zip

The installation cannot be completed

PIP version: pip 23.2.1
Python version: python 3.10
Error message after running pip install fairseq2==0.1:

Collecting fairseq2==0.1
  Obtaining dependency information for fairseq2==0.1 from https://files.pythonhosted.org/packages/cd/27/46c14e28e8cb0aa602660ce64d4547a37f460d382e4fcf94f2a53d47e5b0/fairseq2-0.1.0-py3-none-any.whl.metadata
  Using cached fairseq2-0.1.0-py3-none-any.whl.metadata (1.2 kB)
INFO: pip is looking at multiple versions of fairseq2 to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement fairseq2n==0.1.0 (from fairseq2) (from versions: none)
ERROR: No matching distribution found for fairseq2n==0.1.0

Finetuning on custom dataset for ASR

Hi, I have a custom dataset for one language, with a CSV file of labels and file paths, as well as a directory of audio files.

Can anyone suggest what steps they took to fine-tune the model (especially for a monolingual ASR task)? I am unclear on how to prepare the local dataset, given that I would not have the manifest and many other details present in the FLEURS dataset used in the example fine-tuning.

Any help is greatly appreciated!

Install issue on ARM64 / embedded GPU

I have the same installation issue on ARM64 with an NVIDIA GPU, 16 GB or 64 GB, CUDA 11.2.

pip install --verbose --trusted-host fair-package-repo.s3-website-us-east-1.amazonaws.com --extra-index-url http://fair-package-repo.s3-website-us-east-1.amazonaws.com/fairseq2/whl/stable/pt2.0.1/cu118 fairseq2 --verbose
Using pip 23.2.1 from /opt/ssd700gb/venv/lib/python3.8/site-packages/pip (python 3.8)
Non-user install because user site-packages disabled
Created temporary directory: /tmp/pip-build-tracker-1winxm5u
Initialized build tracking at /tmp/pip-build-tracker-1winxm5u
Created build tracker: /tmp/pip-build-tracker-1winxm5u
Entered build tracker: /tmp/pip-build-tracker-1winxm5u
Created temporary directory: /tmp/pip-install-0uco4p7o
Created temporary directory: /tmp/pip-ephem-wheel-cache-t8of0zzg
Looking in indexes: https://pypi.org/simple, http://fair-package-repo.s3-website-us-east-1.amazonaws.com/fairseq2/whl/stable/pt2.0.1/cu118
2 location(s) to search for versions of fairseq2:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/opt/ssd700gb/venv/lib/python3.8/site-packages/pip/_internal/cli/base_command.py", line 180, in exc_logging_wrapper
status = run_func(*args)
File "/opt/ssd700gb/venv/lib/python3.8/site-packages/pip/_internal/cli/req_command.py", line 248, in wrapper
return func(self, options, args)
File "/opt/ssd700gb/venv/lib/python3.8/site-packages/pip/_internal/commands/install.py", line 377, in run
requirement_set = resolver.resolve(
File "/opt/ssd700gb/venv/lib/python3.8/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 101, in resolve
raise error from e
pip._internal.exceptions.DistributionNotFound: No matching distribution found for fairseq2n==0.1.0
Remote version of pip: 23.2.1
Local version of pip: 23.2.1
Was pip installed by pip? True
Removed build tracker: '/tmp/pip-build-tracker-1winxm5u'

Predict using code

I wanted to ask: is it possible to run inference using code?

I am trying to do S2ST, but I don't know what to put in the speech units part of the code given below.

wav, sr = translator.synthesize_speech(<speech_units>, <tgt_lang>)

# Save the translated audio generation.
torchaudio.save(
    <path_to_save_audio>,
    wav[0].cpu(),
    sample_rate=sr,
)

Streaming audio S2TT Example/Guidance

I am interested in streaming audio chunks and performing continuous S2TT. Is this possible with the current code? If so, any guidance would be much appreciated!

Commercial License

#28

I agree with this. Please make it a commercial license. LLaMA 2 is now everywhere; I hope this model can have similar success. Without a commercial license, there is not much use for this model. I support @bitnom's comments in the above issue. Since it is closed, I am raising a new one to get some support on this.

KeyError: 'generator'

Due to network issues, I downloaded multitask_unity_large.pt and tokenizer.model myself from https://huggingface.co/facebook/seamless-m4t-large/tree/main, and I modified the pathnames in fairseq2/models/utils/model_loader.py and fairseq2/models/nllb/loader.py for the model and tokenizer.

But it runs with an error:

m4t_predict <input_text> t2tt <tgt_lang> --src_lang <src_lang>

Traceback (most recent call last):
File "/data/anaconda3/envs/seamless/lib/python3.10/site-packages/fairseq2/models/utils/checkpoint_loader.py", line 83, in load_checkpoint
checkpoint = converter(checkpoint)
File "/data/anaconda3/envs/seamless/lib/python3.10/site-packages/seamless_communication/models/vocoder/loader.py", line 29, in _upgrade_checkpoint
old_state_dict = checkpoint["generator"]
KeyError: 'generator'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/data/anaconda3/envs/seamless/bin/m4t_predict", line 8, in
sys.exit(main())
File "/data/anaconda3/envs/seamless/lib/python3.10/site-packages/m4t_scripts/predict/predict.py", line 72, in main
translator = Translator(args.model_name, args.vocoder_name, device, dtype)
File "/data/anaconda3/envs/seamless/lib/python3.10/site-packages/seamless_communication/models/inference/translator.py", line 79, in init
self.vocoder: Vocoder = self.load_model_for_inference(
File "/data/anaconda3/envs/seamless/lib/python3.10/site-packages/seamless_communication/models/inference/translator.py", line 90, in load_model_for_inference
model = load_model_fn(model_name_or_card, device=device, dtype=dtype)
File "/data/anaconda3/envs/seamless/lib/python3.10/site-packages/fairseq2/models/utils/model_loader.py", line 188, in call
checkpoint = load_checkpoint(
File "/data/anaconda3/envs/seamless/lib/python3.10/site-packages/fairseq2/models/utils/checkpoint_loader.py", line 85, in load_checkpoint
raise_error(ex)
File "/data/anaconda3/envs/seamless/lib/python3.10/site-packages/fairseq2/models/utils/checkpoint_loader.py", line 70, in raise_error
raise RuntimeError(
RuntimeError: The load of the checkpoint of the model 'vocoder_36langs' has failed. See nested exception for details.

RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

root@Ubuntu-2204-jammy-amd64-base ~/seamless/seamless_communication # python3 scripts/m4t/predict/predict.py привет t2tt eng --src_lang rus
INFO:__main__:Running inference on the CPU.
Using the cached checkpoint of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached tokenizer of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached checkpoint of the model 'vocoder_36langs'. Set `force=True` to download again.
Traceback (most recent call last):
  File "/root/seamless/seamless_communication/scripts/m4t/predict/predict.py", line 86, in <module>
    main()
  File "/root/seamless/seamless_communication/scripts/m4t/predict/predict.py", line 67, in main
    translated_text, wav, sr = translator.predict(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/seamless_communication/models/inference/translator.py", line 209, in predict
    result = self.get_prediction(
  File "/usr/local/lib/python3.10/dist-packages/seamless_communication/models/inference/translator.py", line 120, in get_prediction
    return generator(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/seamless_communication/models/unity/generator.py", line 175, in __call__
    text_output = self.t2t_generator.generate_ex(source_seqs, source_seq_lens)
  File "/usr/local/lib/python3.10/dist-packages/fairseq2/generation/text.py", line 155, in generate_ex
    return self._do_generate(source_seqs, source_seq_lens)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fairseq2/generation/text.py", line 71, in _do_generate
    encoder_output, encoder_padding_mask = self.model.encode(
  File "/usr/local/lib/python3.10/dist-packages/seamless_communication/models/unity/model.py", line 191, in encode
    return self.encoder(seqs, padding_mask)  # type: ignore[no-any-return]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fairseq2/nn/transformer/encoder.py", line 155, in forward
    seqs, padding_mask = layer(seqs, padding_mask)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fairseq2/nn/transformer/encoder_layer.py", line 175, in forward
    seqs = self._forward_self_attn(seqs, padding_mask)
  File "/usr/local/lib/python3.10/dist-packages/fairseq2/nn/transformer/encoder_layer.py", line 187, in _forward_self_attn
    seqs = self.self_attn_layer_norm(seqs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fairseq2/nn/normalization.py", line 107, in forward
    return layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2515, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

Hello. I have a problem running:

python3 scripts/m4t/predict/predict.py привет t2tt eng --src_lang rus

Any ideas on how to solve this?

precision error

(seamless) root@55d07513038c:~/seamless_communication# m4t_predict mirror.wav s2tt eng --model_name seamlessM4T_large
2023-08-23 04:07:17,769 INFO -- m4t_scripts.predict.predict: Running inference on the CPU.
Using the cached checkpoint of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached tokenizer of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached checkpoint of the model 'vocoder_36langs'. Set `force=True` to download again.
Traceback (most recent call last):
  File "/root/miniconda3/envs/seamless/bin/m4t_predict", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/seamless/lib/python3.10/site-packages/m4t_scripts/predict/predict.py", line 70, in main
    translated_text, wav, sr = translator.predict(
  File "/root/miniconda3/envs/seamless/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/seamless/lib/python3.10/site-packages/seamless_communication/models/inference/translator.py", line 209, in predict
    result = self.get_prediction(
  File "/root/miniconda3/envs/seamless/lib/python3.10/site-packages/seamless_communication/models/inference/translator.py", line 120, in get_prediction
    return generator(
  File "/root/miniconda3/envs/seamless/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/seamless/lib/python3.10/site-packages/seamless_communication/models/unity/generator.py", line 173, in __call__
    text_output = self.s2t_generator.generate_ex(source_seqs, source_seq_lens)
  File "/root/miniconda3/envs/seamless/lib/python3.10/site-packages/fairseq2/generation/text.py", line 155, in generate_ex
    return self._do_generate(source_seqs, source_seq_lens)
  File "/root/miniconda3/envs/seamless/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/seamless/lib/python3.10/site-packages/fairseq2/generation/text.py", line 71, in _do_generate
    encoder_output, encoder_padding_mask = self.model.encode(
  File "/root/miniconda3/envs/seamless/lib/python3.10/site-packages/seamless_communication/models/unity/model.py", line 190, in encode
    seqs, padding_mask = self.encoder_frontend(seqs, seq_lens)
  File "/root/miniconda3/envs/seamless/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/seamless/lib/python3.10/site-packages/fairseq2/models/wav2vec2/frontend.py", line 130, in forward
    seqs, seq_lens = self.extract_features(seqs, seq_lens)
  File "/root/miniconda3/envs/seamless/lib/python3.10/site-packages/fairseq2/models/wav2vec2/frontend.py", line 163, in extract_features
    seqs = self.post_extract_layer_norm(seqs)
  File "/root/miniconda3/envs/seamless/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/seamless/lib/python3.10/site-packages/fairseq2/nn/normalization.py", line 107, in forward
    return layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
  File "/root/miniconda3/envs/seamless/lib/python3.10/site-packages/torch/nn/functional.py", line 2515, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

Any ideas?

Hosting the model on a notebook

Hi there. I tried to install the package with:

pip install fairseq2==0.1
pip install .

However, I got the following error:

"ERROR: Cannot uninstall 'TBB'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.
ERROR: Directory '.' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.".

May I ask how to solve this issue? Thank you very much in advance.

Voice translated incorrectly, into generic phrases

The voice is translated into generic phrases that have nothing in common with the original text: "I don't know", "I don't know what to do", etc.

Command used: m4t_predict test.wav s2st eng --output_path res.wav
Source language: Russian; target language: English.

3 test runs, using the same input data:

  1. m4t_scripts.predict.predict: Translated text in eng: I don't know what to do. I don't know what to do. I don't know what to do.
  2. m4t_scripts.predict.predict: Translated text in eng: I'm not sure I'm going to be able to do it, but I'm going to do it.
  3. Translated text in eng: I don't know. I don't know. I don't know.

OS: Ubuntu 22.04.3 LTS
Python: 3.11.4
Conda env
GPU: RTX 2060

Licensing and speech APIs (Dear Mark)

I just want to put it on the record here that achieving anything close to what this model provides is prohibitively expensive and prone to technical issues for any startup. We could really benefit from a conditional, commercial-friendly license for this model.

At least LLaMA 2 gave us a pathway to bringing useful things to market. Open source will eventually surpass this model anyway. It seems like it would have been super cool to lock startups into a LLaMA-2-style license, which would pay off down the road.

I got really excited when I saw the news headline for this model. ElevenLabs and Bark are not fun to build around (not throwing shade).

Please Mark, liberate us from them.

m4t_predict: command not found

Unable to find the m4t_predict command. Can you please advise?

ubuntu 20.04.1

pip --version
pip 23.0.1 from /usr/local/lib/python3.8/dist-packages/pip (python 3.8)
