Comments (12)

sarapapi commented on May 27, 2024

Hi, you should remove --prefix-size 1 --prefix-token "nomt" if you are not using the IWSLT 2023 models (which were trained with the language id prepended as the first token). Please remove these flags and rerun the code.
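For instance, reusing the v1.1 command from later in this thread, the only change is dropping the two prefix flags (everything else stays as-is):

simuleval \
    --agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \
    --source /workspace/source.txt \
    --target /workspace/target.txt \
    --data-bin /workspace/FBK-fairseq/checkpoint/ \
    --config config_simul.yaml \
    --model-path /workspace/FBK-fairseq/checkpoint/checkpoint_avg7.pt \
    --extract-attn-from-layer 3 --frame-num 4 \
    --source-segment-size 1000 \
    --device cuda:0 \
    --quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware \
    --output /content/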

Regarding your issue with --user-dir, I am not able to replicate it locally at the moment. Can you please share your environment?

Regarding our models, they have not been developed to be competitive with production systems: to build strong models you need to train them on thousands of hours of audio, while MuST-C consists of only 200/300 hours of high-quality, clean audio (with no background noise).


sarapapi commented on May 27, 2024

Hi, can you please install Python 3.8 and rerun the code? Thanks for your interest in our work.
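For example, with conda (just one option; any Python 3.8 environment should work):

conda create -n fbk-fairseq python=3.8
conda activate fbk-fairseq
# then reinstall the repository and its requirements inside this environment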


kaahan commented on May 27, 2024

Downgrading to Python 3.8 worked (though I had to install praat-parselmouth and torchaudio). However, on running the model, I'm getting poor results. For the attached audio file and this command:

!simuleval \
    --agent examples/speech_to_text/simultaneous_translation/agents/v1_0/simul_offline_alignatt.py \
    --source /workspace/source.txt \
    --target /workspace/target.txt \
    --config config_simul.yaml \
    --model-path /workspace/FBK-fairseq/checkpoint/checkpoint_avg7.pt \
    --extract-attn-from-layer 3 \
    --frame-num 4 \
    --speech-segment-factor 10 \
    --output /content/ \
    --port 8000 \
    --gpu \
    --scores

instances.log has this to say:

{"index": 0, "prediction": "\u266b So bu le : O o h , o o h , o o h , o o h . \u266b O o h , o o h , o o h , o o h . \u266b </s>", "delays": [800.0, 1200.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 3600.0, 4800.0, 4800.0, 4800.0, 4800.0, 10800.0, 10800.0, 10800.0, 10800.0, 10800.0, 10800.0, 11956.825396825398, 11956.825396825398, 11956.825396825398, 11956.825396825398, 11956.825396825398, 11956.825396825398, 11956.825396825398, 11956.825396825398, 11956.825396825398], "elapsed": [3245.3799724578857, 3832.2

195529937744, 4439.988946914673, 4442.562913894653, 4444.907760620117, 4448.003625869751, 4450.927114486694, 4453.659391403198, 4456.332778930664, 4459.060287475586, 4461.990690231323, 4464.865064620972, 4467.783546447754, 4470.6488609313965, 4473.50058555603, 4476.436710357666, 4479.426717758179, 4482.651567459106, 4485.540246963501, 4488.520240783691, 7730.323886871338, 9779.305267333984, 9782.306957244873, 9785.19778251648, 9788.330364227295, 20424.693155288696, 20427.18677520752, 20429.49514389038, 20431.778955459595, 20434.066343307495, 20435.98656654358, 22892.254254628744, 22894.257447530355, 22896.099946309652, 22897.93910722884, 22899.78947382125, 22901.618144322958, 22903.44562273177, 22905.49459200057, 22907.07912187728], "prediction_length": 40, "reference": "Esta es una prueba de localizaci\u00f3n de video. Probablemente tengan ese gui\u00f3n ah\u00ed, as\u00ed que intentemos algo m\u00e1s. Estamos en 1 Culver, debajo de WeWork en la oficina de Apple.", "source": ["/workspace/one_culver_audio.wav", "samplerate: 44100 Hz", "channels: 1", "duration: 11.957 s", "format: WAV (Microsoft) [WAV]", "subtype: Signed 16 bit PCM [PCM_16]"], "source_length": 11956.825396825398, "reference_length": 30, "metric": {"sentence_bleu": 1.205256842736819, "latency": {"AL": -1892.7086181640625, "AP": 0.47255739569664, "DAL": 1866.387451171875}, "latency_ca": {"AL": 1129.2596435546875, "AP": 0.9680058360099792, "DAL": 7387.2373046875}}}

The prediction seems quite nonsensical :-/

one_culver_audio.mp4: https://github.com/hlt-mt/FBK-fairseq/assets/106453090/8c6a585e-b049-41ca-8b53-5485e73e58da


sarapapi commented on May 27, 2024

Hi, that’s strange. Can you please show me the log file (the stdout SimulEval produces)?
Also, SimulEval and our models work with wav files with 1 channel and a 16kHz sampling rate (the standard conversion). Can you please try to convert the audio, which is in mp4, to these settings and rerun the script?
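For example, with ffmpeg (assuming it is installed; the filenames are just the ones used in this thread):

ffmpeg -i one_culver_audio.mp4 -ac 1 -ar 16000 one_culver_audio_16khz.wav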
Thanks


kaahan commented on May 27, 2024

Updated instances.log after changing to 16kHz, 1 channel (I was already using a wav file, GitHub would only let me upload mp4 🙃):

{"index": 0, "prediction": "\u266b en la tierra , en el campo , en el cielo , en el cielo , en la tierra . \u266b \u266b \u266b \u266b \u266b \u266b \u266b \u266b \u266b \u266b \u266b \u266b \u266b en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo ,en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , \u266b </s>", "delays": [2000.0, 2000.0, 2000.0, 2000.0, 2000.0, 2000.0, 2000.0, 2400.0, 2400.0, 3600.0, 3600.0, 3600.0, 3600.0, 3600.0, 5200.0, 5200.0, 5200.0, 5200.0, 5200.0, 5200.0, 5200.0, 10000.0, 10000.0, 10000.0, 10000.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 11200.0, 11200.0, 11200.0, 11200.0, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645,11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645], "elapsed": [4843.686819076538, 4845.778942108154, 4848.869800567627, 4851.950168609619, 4854.800462722778, 4857.587575912476, 4860.589981079102, 5412.688636779785, 5415.485048294067, 7128.985023498535, 7131.087636947632, 7132.9320430755615, 7135.399436950684, 7138.24520111084, 9567.100238800049, 9569.137287139893, 9571.778964996338, 9574.652862548828, 9577.43592262268, 9580.456686019897, 9583.590698242188, 17199.337005615234, 17201.39741897583, 17203.248262405396, 17205.052614212036, 18548.866176605225, 18551.240348815918, 18553.430461883545, 18556.083583831787, 18558.77676010132, 18561.42511367798, 18564.08109664917, 18566.757345199585, 18569.596672058105, 18572.455549240112, 18575.34852027893, 18578.041458129883, 21721.597385406494, 21724.094104766846, 21726.03816986084, 21729.12187576294, 24012.092241145067, 24014.087804652147, 24015.938171244554, 24017.74323830693, 24019.595273829393, 24022.0617140302, 24025.21432290166, 24028.136381007127, 24030.958064890794, 24033.733257151536, 
24036.878713465623, 24039.005884028367, 24040.777334071092, 24042.553790903978, 24044.304021693162, 24046.05115304082, 24049.00205979436, 24051.829704142503, 24054.84212288945, 24907.369502879075, 24909.839996195726, 24912.148126460008, 24914.335616923265, 24917.053350306443, 24919.854768610887, 24922.658332682542, 24925.372489787034, 24928.38824639409, 24931.31292710393, 24934.21090493291, 24936.883338786058, 24939.5204866895, 24942.180761195115, 24945.013650752, 24947.85035500615, 24950.693973399095, 24953.515418864183, 24956.343540049485, 24959.166416026048, 24961.803087092332, 24964.39588914006, 24967.00776467412, 24969.601281977586, 24972.40246186345, 24975.23964295476, 24978.075393534593, 24980.874427653245, 24983.71232400029, 24986.548789835862, 24989.34925446599, 24992.193111277513, 24995.02266297429, 24997.637876368455, 25000.252136088304, 25002.96033272832, 25005.57888398259, 25008.380302287034, 25011.21819863408, 25014.04178986638, 25016.85250649541, 25019.682058192186, 25022.505887843065, 25025.308498240403, 25027.948268748216, 25030.544170237474, 25033.13029656499, 25035.96151719182, 25038.79512200444, 25041.599401331834, 25044.409164286546, 25047.05918679326, 25049.724229670457, 25052.331336832933, 25055.132993555955, 25057.92654404729, 25060.739883280687, 25063.449272013597, 25066.04612717717, 25068.717607356004, 25071.35928521245, 25074.152120448045, 25077.24011788457, 25079.786189890794], "prediction_length": 124, "reference": "Esta es una prueba de localizaci\u00f3n de video. Probablemente tengan ese gui\u00f3n ah\u00ed, as\u00ed que intentemos algo m\u00e1s. Estamos en 1 Culver, debajo de WeWork en la oficina de Apple.", "source": ["/workspace/one_culver_audio_16khz.wav", "samplerate: 16100 Hz", "channels: 1", "duration: 11.957 s", "format: WAV (Microsoft) [WAV]", "subtype: Signed 16 bit PCM [PCM_16]"], "source_length": 11956.832298136645, "reference_length": 30, "metric": {"sentence_bleu": 0.7415472433597086, "latency": {"AL": -282.8108215332031, "AP": 0.8600947260856628, "DAL": 7117.89599609375}, "latency_ca": {"AL": 3435.63134765625, "AP": 1.7415765523910522, "DAL": 17131.087890625}}}

Here's the stdout from SimulEval:

(workspace-3.8) root@05ee56face0f:/workspace/FBK-fairseq# simuleval     --agent examples/speech_to_text/simultaneous_translation/agents/v1_0/simul_offline_alignatt.py     --source /workspace/source.txt     --target /workspace/target.txt --data-bin /workspace/FBK-fairseq/checkpoint/     --config config_simul.yaml     --model-path /workspace/FBK-fairseq/checkpoint/checkpoint_avg7.pt     --extract-attn-from-layer 3     --frame-num 4     --speech-segment-factor 10     --output /content/     --port 8000     --gpu     --scores
2023-10-26 22:55:28 | INFO     | simuleval.scorer | Evaluating on speech
2023-10-26 22:55:28 | INFO     | simuleval.scorer | Source: /workspace/source.txt
2023-10-26 22:55:28 | INFO     | simuleval.scorer | Target: /workspace/target.txt
2023-10-26 22:55:28 | INFO     | simuleval.scorer | Number of sentences: 1
2023-10-26 22:55:28 | INFO     | simuleval.server | Evaluation Server Started (process id 3964). Listening to port 8000
2023-10-26 22:55:31 | WARNING  | simuleval.scorer | Resetting scorer
2023-10-26 22:55:31 | INFO     | simuleval.cli    | Output dir: /content/
2023-10-26 22:55:31 | INFO     | simuleval.cli    | Start data writer (process id 3970)
2023-10-26 22:55:31 | INFO     | simuleval.cli    | Evaluating AlignAttSTAgent (process id 3902) on instances from 0 to 0
2023-10-26 22:55:37 | INFO     | examples.speech_to_text.tasks.speech_to_text_ctc | target dictionary size (/workspace/FBK-fairseq/checkpoint/spm_unigram8000_st_target.txt): 8,000
2023-10-26 22:55:37 | INFO     | examples.speech_to_text.tasks.speech_to_text_ctc | source dictionary size (/workspace/FBK-fairseq/checkpoint/spm_unigram.en.txt): 5,002
2023-10-26 22:55:54 | INFO     | simuleval.cli    | Evaluation results:
{
    "Quality": {
        "BLEU": 0.7659623558516302
    },
    "Latency": {
        "AL": -282.8108215332031,
        "AL_CA": 3435.63134765625,
        "AP": 0.8600947260856628,
        "AP_CA": 1.7415765523910522,
        "DAL": 7117.89599609375,
        "DAL_CA": 17131.087890625
    }
}
2023-10-26 22:55:54 | INFO     | simuleval.cli    | Evaluation finished
2023-10-26 22:55:54 | INFO     | simuleval.cli    | Close data writer
2023-10-26 22:55:54 | INFO     | simuleval.cli    | Shutdown server


kaahan commented on May 27, 2024

Here is my configuration if that's helpful:

bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: /workspace/FBK-fairseq/checkpoint/spm_unigram8000_st_target.model
bpe_tokenizer_src:
  bpe: sentencepiece
  sentencepiece_model: /workspace/FBK-fairseq/checkpoint/spm_unigram.en.model
global_cmvn:
  stats_npz_path: /workspace/FBK-fairseq/checkpoint/gcmvn.npz
input_channels: 1
input_feat_per_channel: 80
sampling_alpha: 1.0
specaugment:
  freq_mask_F: 27
  freq_mask_N: 1
  time_mask_N: 1
  time_mask_T: 100
  time_mask_p: 1.0
  time_wrap_W: 0
transforms:
  '*':
  - global_cmvn
  _train:
  - global_cmvn
  - specaugment
vocab_filename: /workspace/FBK-fairseq/checkpoint/spm_unigram8000_st_target.txt
vocab_filename_src: /workspace/FBK-fairseq/checkpoint/spm_unigram.en.txt


sarapapi commented on May 27, 2024

Hi, I noticed an error in the README (--speech-segment-factor has to be 25) and in the scripts that work with the "old" version of SimulEval. I'm working on fixing them, thanks for pointing this out.
By the way, we have a new version of the code that works with the new SimulEval; you can find it here. I tried it on your audio file and our model works as expected, but the performance is poor, mostly because it has been trained only on MuST-C and it is not intended to be robust out of domain.
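Concretely, in the v1.0 command above that means replacing

    --speech-segment-factor 10 \

with

    --speech-segment-factor 25 \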


kaahan commented on May 27, 2024

Hm, when trying to run the instructions on the new version of SimulEval, I'm running into the following error:

/root/.local/share/pdm/venvs/workspace-6rDWGpm2-fairseq/lib/python3.8/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
  warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
Traceback (most recent call last):
  File "/root/.local/share/pdm/venvs/workspace-6rDWGpm2-fairseq/bin/simuleval", line 33, in <module>
    sys.exit(load_entry_point('simuleval', 'console_scripts', 'simuleval')())
  File "/workspace/SimulEval/simuleval/cli.py", line 47, in main
    system, args = build_system_args()
  File "/workspace/SimulEval/simuleval/utils/agent.py", line 138, in build_system_args
    system_class.add_args(parser)
  File "/workspace/FBK-fairseq/examples/speech_to_text/simultaneous_translation/agents/v1_1/simul_offline_edatt.py", line 51, in add_args
    BaseSimulSTAgent.add_args(parser)
  File "/workspace/FBK-fairseq/examples/speech_to_text/simultaneous_translation/agents/base_simulst_agent.py", line 84, in add_args
    parser.add_argument("--user-dir", type=str, default="examples/simultaneous_translation",
  File "/usr/lib/python3.8/argparse.py", line 1398, in add_argument
    return self._add_action(action)
  File "/usr/lib/python3.8/argparse.py", line 1761, in _add_action
    self._optionals._add_action(action)
  File "/usr/lib/python3.8/argparse.py", line 1602, in _add_action
    action = super(_ArgumentGroup, self)._add_action(action)
  File "/usr/lib/python3.8/argparse.py", line 1412, in _add_action
    self._check_conflict(action)
  File "/usr/lib/python3.8/argparse.py", line 1551, in _check_conflict
    conflict_handler(action, confl_optionals)
  File "/usr/lib/python3.8/argparse.py", line 1560, in _handle_conflict_error
    raise ArgumentError(action, message % conflict_string)
argparse.ArgumentError: argument --user-dir: conflicting option string: --user-dir

This is with the following run command:

simuleval \
    --agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \
    --source /workspace/source.txt \
    --target /workspace/target.txt \
    --data-bin /workspace/FBK-fairseq/checkpoint/ \
    --config config_simul.yaml \
    --model-path /workspace/FBK-fairseq/checkpoint/checkpoint_avg7.pt --prefix-size 1 --prefix-token "nomt" \
    --extract-attn-from-layer 3 --frame-num 4 \
    --source-segment-size 1000 \
    --device cuda:0 \
    --quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware \
    --output /content/

performance is poor, mostly because it has been trained only on MuST-C and it is not intended to be robust out of domain.

I'm a bit confused by this -- isn't MuST-C a TED-based dataset? It should have reverb, some crowd noise, etc., which would appear to be harder than the audio I've sent.


kaahan commented on May 27, 2024

Removing the --user-dir argument from base_simulst_agent.py fixed this (though that seems a little suspect). I'm getting the following result:

{"index": 0, "prediction": "fuerte como el video de un ni\u00f1o, probablemente nunca lo hemos escrito tan pronto como sea posible, estamos fuera de un solo mundo, aunque no est\u00e1bamos en ninguno de nosotros.", "delays": [2000.0, 2000.0, 2000.0, 3000.0, 3000.0, 4000.0, 4000.0, 4000.0, 4000.0, 4000.0, 4000.0, 6000.0, 6000.0, 7000.0, 7000.0, 7000.0, 9000.0, 9000.0, 10000.0, 10000.0, 10000.0, 10000.0, 10000.0, 10000.0, 11000.0, 11000.0, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645], "elapsed": [4337.5279903411865, 4337.5279903411865, 4337.5279903411865, 5465.068101882935, 5465.068101882935, 6640.028238296509, 6640.028238296509, 6640.028238296509, 6640.028238296509, 6640.028238296509, 6640.028238296509, 9043.581485748291, 9043.581485748291, 10249.344110488892, 10249.344110488892, 10249.344110488892, 12713.966131210327, 12713.966131210327, 14004.980564117432, 14004.980564117432, 14004.980564117432, 14004.980564117432, 14004.980564117432, 14004.980564117432, 15320.341110229492, 15320.341110229492, 16591.46321663945, 16591.46321663945, 16591.46321663945, 16591.46321663945], "prediction_length": 30, "reference": "Esta es una prueba de localizaci\u00f3n de video. Probablemente tengan ese gui\u00f3n ah\u00ed, as\u00ed que intentemos algo m\u00e1s. Estamos en 1 Culver, debajo de WeWork en la oficina de Apple.", "source": ["/workspace/one_culver_audio_16khz.wav", "samplerate: 16100 Hz", "channels: 1", "duration: 11.957 s", "format: WAV (Microsoft) [WAV]", "subtype: Signed 16 bit PCM [PCM_16]"], "source_length": 11956.832298136645}

which appears a little more reasonable, but quite poor still.

Running it on cleaned audio (i.e., with the background noise removed) gives better results, though it does seem to struggle with proper nouns :-)

{"index": 0, "prediction": "fuerte como el video de un ni\u00f1o, probablemente nunca lo hemos escrito tan pronto como sea posible, estamos fuera de un solo mundo, aunque no est\u00e1bamos en ninguno de nosotros.", "delays": [2000.0, 2000.0, 2000.0, 3000.0, 3000.0, 4000.0, 4000.0, 4000.0, 4000.0, 4000.0, 4000.0, 6000.0, 6000.0, 7000.0, 7000.0, 7000.0, 9000.0, 9000.0, 10000.0, 10000.0, 10000.0, 10000.0, 10000.0, 10000.0, 11000.0, 11000.0, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645], "elapsed": [4260.651111602783, 4260.651111602783, 4260.651111602783, 5385.631322860718, 5385.631322860718, 6557.546615600586, 6557.546615600586, 6557.546615600586, 6557.546615600586, 6557.546615600586, 6557.546615600586, 8954.18643951416, 8954.18643951416, 10157.024145126343, 10157.024145126343, 10157.024145126343, 12613.478422164917, 12613.478422164917, 13899.338483810425, 13899.338483810425, 13899.338483810425, 13899.338483810425, 13899.338483810425, 13899.338483810425, 15209.887981414795, 15209.887981414795, 16476.026662684373, 16476.026662684373, 16476.026662684373, 16476.026662684373], "prediction_length": 30, "reference": "Esta es una prueba de localizaci\u00f3n de video. Probablemente tengan ese gui\u00f3n ah\u00ed, as\u00ed que intentemos algo m\u00e1s. Estamos en 1 Culver, debajo de WeWork en la oficina de Apple.", "source": ["/workspace/one_culver_audio_16khz.wav", "samplerate: 16100 Hz", "channels: 1", "duration: 11.957 s", "format: WAV (Microsoft) [WAV]", "subtype: Signed 16 bit PCM [PCM_16]"], "source_length": 11956.832298136645}
{"index": 1, "prediction": "fuerte: Esta es una prueba de la globalizaci\u00f3n de video, probablemente tiene ese gui\u00f3n ah\u00ed, as\u00ed que vamos a probar otra cosa. Estamos en un octubre. \u00bfPor qu\u00e9 trabajamos en la oficina de Apple?", "delays": [2000.0, 2000.0, 2000.0, 2000.0, 2000.0, 2000.0, 4000.0, 4000.0, 4000.0, 4000.0, 4000.0, 4000.0, 5000.0, 5000.0, 5000.0, 5000.0, 6000.0, 6000.0, 6000.0, 7000.0, 7000.0, 8000.0, 9000.0, 9000.0, 9000.0, 10000.0, 10000.0, 10000.0, 11000.0, 11000.0, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645], "elapsed": [2247.722625732422, 2247.722625732422, 2247.722625732422, 2247.722625732422, 2247.722625732422, 2247.722625732422, 4582.084178924561, 4582.084178924561, 4582.084178924561, 4582.084178924561, 4582.084178924561, 4582.084178924561, 5820.873022079468, 5820.873022079468, 5820.873022079468, 5820.873022079468, 7076.664209365845, 7076.664209365845, 7076.664209365845, 8353.463888168335, 8353.463888168335, 9646.378993988037, 10938.165664672852, 10938.165664672852, 10938.165664672852, 12264.750719070435, 12264.750719070435, 12264.750719070435, 13640.005826950073, 13640.005826950073, 14967.794545985156, 14967.794545985156, 14967.794545985156, 14967.794545985156], "prediction_length": 34, "reference": "Esta es una prueba de localizaci\u00f3n de video. Probablemente tengan ese gui\u00f3n ah\u00ed, as\u00ed que intentemos algo m\u00e1s. Estamos en 1 Culver, debajo de WeWork en la oficina de Apple.", "source": ["/workspace/one_culver_audio_cleaned_16khz.wav", "samplerate: 16100 Hz", "channels: 1", "duration: 11.957 s", "format: WAV (Microsoft) [WAV]", "subtype: Signed 16 bit PCM [PCM_16]"], "source_length": 11956.832298136645}

It does have this odd property of adding "fuerte:" in front of the translations -- is this an artifact of MuST-C?


kaahan commented on May 27, 2024

Removing the prefix-related arguments removed the fuertes, thanks!

Regarding your issue with --user-dir, I am not able to replicate it locally at the moment. Can you please share your environment?

I'm not sure which parts of my environment you'd like replicated, but my pip freeze looks like:

antlr4-python3-runtime==4.8
bitarray==2.6.0
Brotli==1.1.0
certifi==2023.7.22
cffi==1.16.0
charset-normalizer==3.3.1
colorama==0.4.6
coverage==7.3.2
ctc-segmentation==1.7.4
Cython==3.0.4
exceptiongroup==1.1.3
fairseq==1.0.0a0+4b7966b
filelock==3.12.4
flake8==6.1.0
fsspec==2023.10.0
hydra-core==1.0.7
idna==3.4
importlib-resources==6.1.0
iniconfig==2.0.0
Jinja2==3.1.2
lxml==4.9.3
MarkupSafe==2.1.3
mccabe==0.7.0
mpmath==1.3.0
mutagen==1.47.0
networkx==3.1
numpy==1.24.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.52
nvidia-nvtx-cu12==12.1.105
omegaconf==2.0.6
packaging==23.2
pandas==2.0.3
pluggy==1.3.0
portalocker==2.0.0
praat-parselmouth==0.4.3
pycodestyle==2.11.1
pycparser==2.21
pycryptodomex==3.19.0
pydub==0.25.1
pyflakes==3.1.0
pytest==7.4.3
pytest-cov==4.1.0
pytest-flake8==1.1.1
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
regex==2023.10.3
requests==2.31.0
sacrebleu==2.3.1
-e git+https://github.com/facebookresearch/SimulEval.git@411a73d60d0626d8519f58d02a284fb53a263cad#egg=simuleval
six==1.16.0
soundfile==0.12.1
srt==3.5.3
sympy==1.12
tabulate==0.9.0
TextGrid==1.5
tomli==2.0.1
torch==2.1.0
torchaudio==2.1.0
tornado==6.3.3
tqdm==4.64.1
triton==2.1.0
typing_extensions==4.8.0
tzdata==2023.3
urllib3==2.0.7
websockets==12.0
yt-dlp==2023.10.13
zipp==3.17.0

Regarding our models, they have not been developed to be competitive with production systems, for building strong models you need to train them with thousands of hours of audio, while MuST-C is only composed of 200/300 hours of high-quality and clean audio (with no background noise).

And that's totally reasonable on the "competitive with production systems" side -- do you feel like the model architecture, as is, would scale well to thousands of hours of audio?


sarapapi commented on May 27, 2024

Hi, I believe the error is related to the version of SimulEval. If you install the tool from the commit specified in the guide, you should be able to resolve the --user-dir issue.
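A minimal sketch of that install (the commit hash is deliberately left as a placeholder: use the one pinned in our guide):

git clone https://github.com/facebookresearch/SimulEval.git
cd SimulEval
git checkout <commit-from-the-guide>
pip install -e .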

And totally reasonable on the competitive with production systems side -- do you feel like model architecture, as is, would scale well to thousands of hours of audio?

I think that models trained on thousands of hours of data, such as Whisper, are not much different from our model architecture. Whisper Small has 12 encoder layers, just like our model, even if we have a Conformer instead of a Transformer. Of course, if you want to scale to much more data, bigger models are better in general.


sarapapi commented on May 27, 2024

I am closing this as it has been stale for a while. Feel free to reopen if anything else is needed.
