
qwen-audio's Introduction

中文  |   English  




Qwen-Audio 🤖 | 🤗  | Qwen-Audio-Chat 🤖 | 🤗  |    Demo 🤖 | 🤗 
  Homepage  |   Paper   |    WeChat   |   Discord  




Qwen-Audio (Qwen Large Audio Language Model) is the multimodal version of the large model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-Audio accepts diverse audio (human speech, natural sound, music, and song) and text as inputs, and outputs text. The contributions of Qwen-Audio include:

  • Fundamental audio models: Qwen-Audio is a fundamental multi-task audio-language model that supports various tasks, languages, and audio types, serving as a universal audio understanding model. Building upon Qwen-Audio, we develop Qwen-Audio-Chat through instruction fine-tuning, enabling multi-turn dialogues and supporting diverse audio-oriented scenarios.
  • Multi-task learning framework for all types of audios: To scale up audio-language pre-training, we address the challenge of variation in textual labels associated with different datasets by proposing a multi-task training framework, enabling knowledge sharing and avoiding one-to-many interference. Our model incorporates more than 30 tasks and extensive experiments show the model achieves strong performance.
  • Strong Performance: Experimental results show that Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Specifically, Qwen-Audio achieves state-of-the-art results on the test sets of Aishell1, CochlScene, ClothoAQA, and VocalSound.
  • Flexible multi-turn chat from audio and text input: Qwen-Audio supports multiple-audio analysis, sound understanding and reasoning, music appreciation, and tool usage.


We release two models of the Qwen-Audio series:

  • Qwen-Audio: The pre-trained multi-task audio understanding model uses Qwen-7B as the initialization of the LLM, and Whisper-large-v2 as the initialization of the audio encoder.
  • Qwen-Audio-Chat: A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-Audio-Chat supports more flexible interaction, such as multiple audio inputs, multi-round question answering, and creative capabilities.

News and Updates

  • 2023.11.30 🔥 We have released the checkpoints of both Qwen-Audio and Qwen-Audio-Chat on ModelScope and Hugging Face.
  • 2023.11.15 🎉 We released the paper detailing the Qwen-Audio and Qwen-Audio-Chat models, including training details and model performance.

Evaluation

We evaluated Qwen-Audio's abilities on 12 standard benchmarks as follows:

Below is the overall performance:

The details of evaluation are as follows:

Automatic Speech Recognition

Results (WER):

| Dataset     | Model            | dev-clean | dev-other | test-clean | test-other |
|-------------|------------------|-----------|-----------|------------|------------|
| Librispeech | SpeechT5         | 2.1       | 5.5       | 2.4        | 5.8        |
|             | SpeechNet        | -         | -         | 30.7       | -          |
|             | SLM-FT           | -         | -         | 2.6        | 5.0        |
|             | SALMONN          | -         | -         | 2.1        | 4.9        |
|             | Qwen-Audio       | 1.8       | 4.0       | 2.0        | 4.2        |

| Dataset  | Model            | dev        | test       |
|----------|------------------|------------|------------|
| Aishell1 | MMSpeech-base    | 2.0        | 2.1        |
|          | MMSpeech-large   | 1.6        | 1.9        |
|          | Paraformer-large | -          | 2.0        |
|          | Qwen-Audio       | 1.2 (SOTA) | 1.3 (SOTA) |

| Dataset  | Model            | Mic | iOS | Android |
|----------|------------------|-----|-----|---------|
| Aishell2 | MMSpeech-base    | 4.5 | 3.9 | 4.0     |
|          | Paraformer-large | -   | 2.9 | -       |
|          | Qwen-Audio       | 3.3 | 3.1 | 3.3     |

Speech-to-Text Translation

Results (BLEU):

| Dataset | Model       | en-de | de-en | en-zh | zh-en | es-en | fr-en | it-en |
|---------|-------------|-------|-------|-------|-------|-------|-------|-------|
| CoVoST2 | SALMONN     | 18.6  | -     | 33.1  | -     | -     | -     | -     |
|         | SpeechLLaMA | -     | 27.1  | -     | 12.3  | 27.9  | 25.2  | 25.9  |
|         | BLSP        | 14.1  | -     | -     | -     | -     | -     | -     |
|         | Qwen-Audio  | 25.1  | 33.9  | 41.5  | 15.7  | 39.7  | 38.5  | 36.0  |

Automatic Audio Caption

| Dataset | Model      | CIDEr | SPICE | SPIDEr |
|---------|------------|-------|-------|--------|
| Clotho  | Pengi      | 0.416 | 0.126 | 0.271  |
|         | Qwen-Audio | 0.441 | 0.136 | 0.288  |

Speech Recognition with Word-level Timestamp

| Dataset         | Model               | AAC (ms)    |
|-----------------|---------------------|-------------|
| Industrial Data | Force-aligner       | 60.3        |
|                 | Paraformer-large-TP | 65.3        |
|                 | Qwen-Audio          | 51.5 (SOTA) |

Acoustic Scene Classification

| Dataset    | Model      | ACC          |
|------------|------------|--------------|
| Cochlscene | Cochlscene | 0.669        |
|            | Qwen-Audio | 0.795 (SOTA) |
| TUT2017    | Pengi      | 0.353        |
|            | Qwen-Audio | 0.649        |

Speech Emotion Recognition

| Dataset | Model       | ACC   |
|---------|-------------|-------|
| Meld    | WavLM-large | 0.542 |
|         | Qwen-Audio  | 0.557 |

Audio Question & Answer

| Dataset   | Model      | ACC   | ACC (binary) |
|-----------|------------|-------|--------------|
| ClothoAQA | ClothoAQA  | 0.542 | 0.627        |
|           | Pengi      | -     | 0.645        |
|           | Qwen-Audio | 0.579 | 0.749        |

Vocal Sound Classification

| Dataset    | Model      | ACC           |
|------------|------------|---------------|
| VocalSound | CLAP       | 0.4945        |
|            | Pengi      | 0.6035        |
|            | Qwen-Audio | 0.9289 (SOTA) |

Music Note Analysis

| Dataset | Model      | NS. Qualities (MAP) | NS. Instrument (ACC) |
|---------|------------|---------------------|----------------------|
| NSynth  | Pengi      | 0.3860              | 0.5007               |
|         | Qwen-Audio | 0.4742              | 0.7882               |

We have provided all evaluation scripts to reproduce our results. Please refer to eval_audio/EVALUATION.md for details.

Evaluation of Chat

To evaluate the chat abilities of Qwen-Audio-Chat, we provide TUTORIAL and demo for users.

Requirements

  • python 3.8 and above
  • pytorch 1.12 and above, 2.0 and above are recommended
  • CUDA 11.4 and above are recommended (for GPU users)
  • FFmpeg

Quickstart

Below, we provide simple examples to show how to use Qwen-Audio and Qwen-Audio-Chat with 🤖 ModelScope and 🤗 Transformers.

Before running the code, make sure you have set up the environment and installed the required packages: confirm you meet the requirements above, then install the dependent libraries.

pip install -r requirements.txt

Now you can start with ModelScope or Transformers. For more usage, please refer to the tutorial. Qwen-Audio models currently perform best with audio clips under 30 seconds.
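
If your recordings are longer than that, one simple option is to trim or segment them before inference. Below is a minimal sketch that calls the FFmpeg binary (already listed in the requirements) through Python's subprocess; the file names are placeholders:

import subprocess

def trim_to_30s(src, dst, max_seconds=30.0):
    """Write at most `max_seconds` of audio from src to dst using ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-t", str(max_seconds), dst],
        check=True,
    )

trim_to_30s("long_recording.wav", "clip_30s.wav")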

🤗 Transformers

To use Qwen-Audio-Chat for inference, all you need to do is run a few lines of code as demonstrated below. However, please make sure that you are using the latest code.

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio-Chat", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio-Chat", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation (No need to do this if you are using transformers>4.32.0)
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)

# 1st dialogue turn
query = tokenizer.from_list_format([
    {'audio': 'assets/audio/1272-128104-0000.flac'}, # Either a local path or an url
    {'text': 'what does the person say?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# The person says: "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel".

# 2nd dialogue turn
response, history = model.chat(tokenizer, 'Find the start time and end time of the word "middle classes"', history=history)
print(response)
# The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.
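
The timestamps in the reply are special tokens of the form <|2.33|>. If you need them as numbers, a small helper like the one below can pull them out of the response string (purely illustrative post-processing, not part of the Qwen-Audio API):

import re

def extract_timestamps(response):
    """Return every <|x.xx|> timestamp token in the response as a float."""
    return [float(t) for t in re.findall(r"<\|(\d+(?:\.\d+)?)\|>", response)]

print(extract_timestamps('The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.'))
# [2.33, 3.26]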

Running the Qwen-Audio pretrained base model is also simple.

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation (No need to do this if you are using transformers>4.32.0)
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-Audio", trust_remote_code=True)
audio_url = "assets/audio/1272-128104-0000.flac"
sp_prompt = "<|startoftranscription|><|en|><|transcribe|><|en|><|notimestamps|><|wo_itn|>"
query = f"<audio>{audio_url}</audio>{sp_prompt}"
audio_info = tokenizer.process_audio(query)
inputs = tokenizer(query, return_tensors='pt', audio_info=audio_info)
inputs = inputs.to(model.device)
pred = model.generate(**inputs, audio_info=audio_info)
response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False, audio_info=audio_info)
print(response)
# <audio>assets/audio/1272-128104-0000.flac</audio><|startoftranscription|><|en|><|transcribe|><|en|><|notimestamps|><|wo_itn|>mister quilting is the apostle of the middle classes and we are glad to welcome his gospel<|endoftext|>
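
As shown above, the base model echoes the prompt and its special tokens along with the transcription. If you only want the plain transcript, a simple post-processing step (again illustrative, not part of the official API) is to strip the <audio>...</audio> block and all <|...|> tokens:

import re

def strip_special_tokens(decoded):
    """Remove the <audio>...</audio> block and every <|...|> token from the decoded output."""
    text = re.sub(r"<audio>.*?</audio>", "", decoded)
    text = re.sub(r"<\|[^|]*\|>", "", text)
    return text.strip()

print(strip_special_tokens(response))
# mister quilting is the apostle of the middle classes and we are glad to welcome his gospel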

If you run into network issues while downloading model checkpoints and code from Hugging Face, an alternative is to first fetch the checkpoint from ModelScope and then load it from the local directory, as outlined below:

from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Downloading model checkpoint to a local dir model_dir
model_id = 'qwen/Qwen-Audio-Chat'
revision = 'master'
model_dir = snapshot_download(model_id, revision=revision)

# Loading local checkpoints
# trust_remote_code is still set to True since we load the model code from the local dir instead of transformers
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="cuda",
    trust_remote_code=True
).eval()
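
After loading from the local directory, usage is the same as in the Transformers example above, for instance:

query = tokenizer.from_list_format([
    {'audio': 'assets/audio/1272-128104-0000.flac'}, # Either a local path or a URL
    {'text': 'what does the person say?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)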

🤖 ModelScope

ModelScope is an open-source platform for Model-as-a-Service (MaaS), which provides flexible and cost-effective model services to AI developers. Similarly, you can run the models with ModelScope as shown below:

from modelscope import (
    snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
)
import torch
model_id = 'qwen/Qwen-Audio-Chat'
revision = 'master'

model_dir = snapshot_download(model_id, revision=revision)
torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
if not hasattr(tokenizer, 'model_dir'):
    tokenizer.model_dir = model_dir
# use bf16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True).eval()
# use CPU
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval()
# use gpu
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()

# 1st dialogue turn
query = tokenizer.from_list_format([
    {'audio': 'assets/audio/1272-128104-0000.flac'}, # Either a local path or an url
    {'text': 'what does the person say?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# The person says: "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel".

# 2nd dialogue turn
response, history = model.chat(tokenizer, 'Find the start time and end time of the word "middle classes"', history=history)
print(response)
# The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.

Demo

Web UI

We provide code for users to build a web UI demo. Before you start, make sure you install the following packages:

pip install -r requirements_web_demo.txt

Then run the command below and click on the generated link:

python web_demo_audio.py

FAQ

If you run into problems, please refer to the FAQ and the existing issues to search for a solution before opening a new issue.

We Are Hiring

If you are interested in joining us as a full-time employee or intern, please contact us at [email protected].

License Agreement

Researchers and developers are free to use the code and model weights of both Qwen-Audio and Qwen-Audio-Chat. We also allow their commercial use. Check our license at LICENSE for more details.

Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :)

@article{Qwen-Audio,
  title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},
  author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie  and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2311.07919},
  year={2023}
}

Contact Us

If you would like to leave a message for either our research team or our product team, feel free to send an email to [email protected].


qwen-audio's Issues

SFT: use LoRA, or finetune all parameters?

Thanks for your great work!

I'm interested in adding a new task to the pretraining stage; can you offer some advice or references?

Also, I want to know whether you fine-tuned all LLM parameters during SFT, or only LoRA.

If not, why did you use model parallelism = 2?

low gender classification accuracy

The model does not even seem to get the gender right; a few samples:

question = 'Recognize the gender, age, accent, emotion, and speaking content of the person in the audio, and combine these to answer his/her questions while explaining the reasons for these answers.' # same question as in homepage
query = tokenizer.from_list_format([
    {'audio': audio},
    {'text': question},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
  • https://vocaroo.com/11gfffDeXNmQ output: The speaker of this audio is a man speaking, in a English, saying, "you know as well as i do the kind of life you offer her.".
  • https://vocaroo.com/12xbA5EZX60M output: The audio is of a man speaking, in a neutral emotion, saying, "he says no word of happiness.".
  • https://vocaroo.com/19cMpEhrfHye output: The audio is of a man speaking, in a neutral emotion, saying, "the boy‘s face was very pale as he dropped his hands from penny’s shoulders ; but dundee, from behind the portieres, was not troubling to spy for the moment.".
  • https://vocaroo.com/12hdkCS6fhYx output: The audio is of a woman speaking, in a neutral emotion, saying, "when zarathustra once told this to his disciples they asked him, and what, o zarathustra, is the moral of thy story? and zarathustra answered them thus.".

The gender classification accuracy with this model is around 75%, which is lower than a simple F0 cutoff.
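
For reference, here is a minimal sketch of the kind of F0-cutoff baseline mentioned above, using librosa's pyin pitch tracker; the 165 Hz threshold and the file path are illustrative assumptions, not anything from this repository:

import numpy as np
import librosa

def f0_cutoff_gender(path, threshold_hz=165.0):
    """Guess speaker gender from the median F0: above the threshold -> 'female', otherwise 'male'."""
    y, sr = librosa.load(path, sr=16000)
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'), sr=sr
    )
    median_f0 = np.nanmedian(f0)  # f0 is NaN for unvoiced frames
    return 'female' if median_f0 > threshold_hz else 'male'

print(f0_cutoff_gender("sample.wav"))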

What is EvaluationTokenizer?

In evaluate_asr.py, I found that EvaluationTokenizer is imported from evaluate_tokenizer, but where can I find this package?

Performance degradation on languages known to Whisper

I was trying to chat over an audio clip. The Whisper-v2 transcription was near perfect, but the transcription output of Qwen did not capture the whole content. I was trying it on Hindi audio, assuming Whisper's performance on Hindi is very good. Next I tried summarising the audio, but that also did not give proper results. Has anyone tried Qwen for languages supported by Whisper?

Question about Output Instructions

Hi, the paper mentions: "Output Instruction: Lastly, we provide output instruction to further specify the task and desired format for different subtasks, and then the text output begins."

How are the output instructions listed below used during training and inference?
My understanding is that the output instruction is placed at the end of the prompt, e.g.:
query = f"<audio>{audio_url}</audio>{sp_prompt}"
where sp_prompt is "<|startofanalysis|><|unknown|><|keyword|><|zh|><|notimestamps|><|wo_itn|><|audioset_ontology|>"
Is this understanding correct?

Output Instruction

        "<|caption_audiocaps|>",  # Audiocaps caption style
        "<|caption_clotho|>",  # Clotho caption style
        "<|audioset_ontology|>",  # Audioset ontology style
        "<|caption_plain|>",  # plain caption
        "<|itn|>",  # inversed text normalized
        "<|wo_itn|>",  # without inversed text normalized
        "<|startofentityvalue|>",
        "<|endofentityvalue|>",
        "<|startofentitytype|>",
        "<|endofentitytype|>",
        "<|named_entity_recognition|>",  # named entity recognition task
        "<|audio_grounding|>",
        "<|startofword|>",
        "<|endofword|>",
        "<|delim|>",  # delimiter of timestamps pair in audio grounding
        "<|emotion_recognition|>",  # emotion recognition
        "<|music_description|>",  # music description
        "<|note_analysis|>",  # note analysis
        "<|pitch|>",  # note analysis: pitch
        *[f"<|midi_pitch_{i}|>" for i in range(128)],  # midi pitch 0-127
        "<|velocity|>",  # note analysis: velocity
        *[f"<|midi_velocity_{i}|>" for i in range(128)],  # midi velocity 0-127
        "<|sonic|>",  # note analysis:  sonic
        "<|instrument|>",  # note analysis:  instrument
        "<|speaker_meta|>",  # meta information of speaker
        "<|song_meta|>",  # meta information of song
        "<|question|>",  # AQA: question
        "<|answer|>",  # AQA: answer
        "<|choice|>",  # AQA: answer choice
        "<|scene|>",  # scene recognition
        "<|event|>",  # sound event
        "<|vocal_classification|>",  # vocal classification
        "<|speech_understanding|>",  # speech language understanding
        "<|scenario|>",  # speech language understanding: scenario
        "<|action|>",  # speech language understanding: action
        "<|entities|>",  # speech language understanding: entities
        "<|speech_edit|>",  # speech edit

allow_pickle=False

Why do I get the following error when running the example code in the README?
ValueError: Cannot load file containing pickled data when allow_pickle=False
From tutorials I found that the code producing the cached file needs to be modified, but after modifying it and re-running, my change is overwritten by a newly generated cache. I no longer know how to fix this.

Mac M1 runs painfully slow

I wrote a script to make an inference on an audio file, and it took 30 minutes to get a response:

script:

import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# Initialize model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio-Chat", device_map="cpu", trust_remote_code=True).eval()

# Set up audio file
audio_file = "file.mp3"

# Define helper validation function
def validate_response(text):
    if ',' not in text:
        print("Error - response did not contain comma delimited list")
        return False
    return True

# Time full execution
start = time.time()

# Attempt audio analysis
try:
    query = tokenizer.from_list_format([
        {'audio': audio_file},
        {'text': 'Describe this audio with 5 comma-separated adjectives'},
    ])
    response, history = model.chat(tokenizer, query=query, history=None)

    # Print full response and validate
    print(response)
    valid = validate_response(response)

except Exception as e:
    print(f"Error: {e}")
    valid = False

end = time.time()
elapsed = end - start

# Print total time
print(f"Elapsed time: {elapsed} seconds")

# Check if the response was valid
if not valid:
    print("Issues with audio analysis")

result:

python caption_audio.py
audio_start_id: 155164, audio_end_id: 155165, audio_pad_id: 151851.
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|██████████████| 9/9 [00:00<00:00, 17.97it/s]
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Audio 1:file.mp3
Describe this audio with 5 comma-separated adjectives<|im_end|>
<|im_start|>assistant

Funky, groovy, energetic, hip-hop, dance
Elapsed time: 1812.089405298233 seconds

I also got inconsistent results when running the same script a subsequent time:

second result:

python caption_audio.py
audio_start_id: 155164, audio_end_id: 155165, audio_pad_id: 151851.
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|██████████████| 9/9 [00:00<00:00, 9.99it/s]
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Audio 1:file.mp3
Describe this audio with 5 comma-separated adjectives<|im_end|>
<|im_start|>assistant

This is a funky, groovy, upbeat, electronic, electro, electronic beats, electronic music, electronic dance, electronic instrumental, electronic background, electronic soundtrack, electronic soundtrack, electro hip hop, electro hip hop beats, electro hip hop instrumental, electro rap, electro rap beats, electro rap instrumental, electronic music for tv, electronic music for radio, electronic music for advertising, electronic music for games, electronic music for movies, electronic music for youtube, electronic music for corporate use, electronic music for commercial use, electronic music for business, electronic music for presentations, electronic music for websites, electronic music for apps, electronic music for software, electronic music for advertising, electronic music for marketing, electronic music for product presentation, electronic music for corporate presentations, electronic music for business videos, electronic music for product videos, electronic music for commercials, electronic music for radio, electronic music for tv, electronic music for films, electronic music for viral marketing, electronic music for web advertisements, electronic music for youtube videos, electronic music for social media, electronic music for apps, electronic music for mobile, electronic music for corporate, electronic music for background, electronic music for fashion, electronic music for lifestyle, electronic music for beauty, electronic music for food, electronic music for travel, electronic music for health, electronic music for fitness, electronic music for meditation, electronic music for relaxation, electronic music for studying, electronic music for working, electronic music for sleeping, electronic music for dancing, electronic music for fun, electronic music for playing, electronic music for singing, electronic music for podcast, electronic music for video games, electronic music for background music, electronic music for club, electronic music for rapping, electronic music for beat-making, electronic music for producing, electronic music for composing, electronic music for singing, electronic music for playing, electronic music for dancing, electronic music for fun, electronic music for playing, electronic music for singing, electronic music for producing, electronic music for composing, electronic music for radio, electronic music for tv, electronic music for films, electronic music for viral marketing, electronic music for web advertisements, electronic music for youtube videos, electronic music for social media, electronic music for apps, electronic music for mobile, electronic music for corporate, electronic music for background, electronic music for fashion, electronic music for lifestyle, electronic music for beauty, electronic music for food, electronic music for travel, electronic music for health, electronic music for fitness, electronic music for meditation, electronic music for relaxation, electronic music for studying, electronic music for working, electronic music for sleeping, electronic music for
Elapsed time: 60488.493457078934 seconds

Any tips to make this run more efficiently?

Gap when reproducing the reported results

When reproducing the results on Aishell1, I saw some instruction-following failures (for example, I did not ask for timestamps, but the output still contains them). Without removing the special tokens, my final result is worse by 0.63 absolute points; after removing them, it is worse by 0.24. Here are examples of the model not following my instructions:
[screenshots omitted]
Finally, could I get a QR code for the WeChat group? My WeChat ID: royd99

The example demo given for Qwen-Audio does not produce a transcription for a local audio file. Can you provide a corresponding example?

Here is my code:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
import re
import os
import glob
import time

torch.manual_seed(1234)

model_path = "/home/wzp/.cache/modelscope/hub/qwen/Qwen-Audio"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# bf16 precision: recommended for A100, H100, RTX3060, RTX3070, etc. to save memory
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="auto", trust_remote_code=True, bf16=True).eval()

# fp16 precision: recommended for V100, P100, T4, etc. to save memory
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="auto", trust_remote_code=True, fp16=True).eval()

# CPU inference: requires about 32GB of RAM
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="cpu", trust_remote_code=True).eval()

# Default GPU inference: requires about 24GB of VRAM
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cuda", trust_remote_code=True, bf16=True).eval()

# Specify generation length, top_p and other hyperparameters (not needed for transformers 4.32.0 and above)
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-Audio", trust_remote_code=True)

audio_url = "/home/wzp/project/yolov8/modelscope/output.wav"
sp_prompt = "<|startoftranscription|><|cn|><|transcribe|><|cn|><|notimestamps|><|wo_itn|>"
query = f"<audio>{audio_url}</audio>{sp_prompt}"
audio_info = tokenizer.process_audio(query)
inputs = tokenizer(query, return_tensors='pt', audio_info=audio_info)
inputs = inputs.to(model.device)
pred = model.generate(**inputs, audio_info=audio_info)
response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False, audio_info=audio_info)
print(response)

Here is the terminal output:

Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:10<00:00, 1.17s/it]
/home/wzp/project/yolov8/modelscope/output.wav<|startoftranscription|><|cn|><|transcribe|><|cn|><|notimestamps|><|wo_itn|><|notimestamps|><|itn|>Hello, please you need to to handle what business.<|endoftext|>

Few-shot Examples

Hi, is it possible to provide audio-text examples as prompts and then ask questions about a test audio clip? Thanks.

Question about the training data

I could not find specific information about the training data in either the paper or the project. Could you share details of the training datasets? Thanks.

Question about the open-source timeline

Great work! Is there a timetable for open-sourcing the code to reproduce the results and the checkpoints?

How should the prompt be written to get the output of a single task, or of a specific desired task?

When I try emotion recognition with prompt = f"<audio>{audio_url}</audio><|emotion_recognition|>", the result is:
assets/audio/1.wav<|emotion_recognition|><|zh|><|transcribe|><|zh|><|notimestamps|><|speaker_meta|>普通话, 女声, 31岁<|startoftranscript|><|zh|><|transcribe|><|zh|><|notimestamps|><|wo_itn|>今天天气真好<|endoftext|>
As you can see, the result contains the language (Mandarin), gender, age, and transcript, but no emotion. How should the prompt be written so that I get the output of a single task, or of the specific task I want?

Was anything updated on December 7?

On December 7, 2023, we found during testing that the Hugging Face model no longer ran. It turned out that mel_filters.npz was missing and could not be downloaded because we could not reach the Hugging Face servers. We then git-cloned the entire qwen-audio-chat repository from ModelScope and pointed the Hugging Face test code at the cloned path to use the local model, and found the results were far worse than with the model downloaded directly from Hugging Face before December 7; the outputs were garbled.
When we copied the git-cloned mel_filters.npz into the Hugging Face download cache directory and kept testing with the Hugging Face cache path, the results were normal again.

We compared the MD5 values of the git-cloned and Hugging Face models and found some differences.

However, if we download the Hugging Face model manually, its MD5 matches that of the ModelScope git clone.

So, did you replace the model on all hosting platforms on December 7? And is the pre-December-7 model no longer reproducible?
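
For anyone comparing checkpoints the same way, here is a small hashing sketch (the shard file name is a placeholder):

import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 of a file in chunks so large checkpoint shards do not need to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(md5sum("pytorch_model-00001-of-00009.bin"))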

About the ClothoAQA dataset

I noticed that in the ClothoAQA dataset each question is asked three times, and the answers to the three copies can differ. This seems to come from three different annotators labeling the original dataset, which was not further processed. Did you do any processing of this, or do you still treat them as three separate questions?

Training question

In the first training stage, is the embedding layer of the Qwen (Tongyi) LLM trainable?

End of sentence id

Thanks for the great work! I have a small question and it would be great if someone can help.

I was wondering how you deal with the EOS token during pre-training. As far as I know, the Qwen LLM does not have a dedicated EOS token. Since you are freezing the parameters of the LLM, how does the model know when to terminate (i.e., produce the EOS token)?

Error: requests.exceptions.HTTPError: Response details: 404 page not found, Request id: ab8a478639c847c6bbb41438e4d8606e

Using the ModelScope example code, the call fails with an error.
I downloaded the model locally in advance and then changed the path, and then it errored out.

This is the example code from GitHub:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

from modelscope import (
    snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
)
import torch

model_id = '/mnt1/wp/damo_download/Qwen-Audio-Chat'
# model_id = 'qwen/Qwen-Audio-Chat'

revision = 'master'

model_dir = snapshot_download(model_id, revision=revision)
torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
if not hasattr(tokenizer, 'model_dir'):
    tokenizer.model_dir = model_dir

# bf16 precision: recommended for A100, H100, RTX3060, RTX3070, etc. to save memory
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval()

# fp16 precision: recommended for V100, P100, T4, etc. to save memory
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True).eval()

# CPU inference: requires about 32GB of RAM
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval()

# Default GPU inference: requires about 24GB of VRAM
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()

# 1st dialogue turn
query = tokenizer.from_list_format([
    {'audio': 'assets/audio/1272-128104-0000.flac'}, # Either a local path or an url
    {'text': 'what does the person say?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# The person says: "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel".

# 2nd dialogue turn
response, history = model.chat(tokenizer, 'Find the start time and end time of the word "middle classes"', history=history)
print(response)
# The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.

Are Chinese prompts not supported?

With the Chongqing-dialect wav file as input, both the default prompt 'what does the person say?' and the modified 'what is the content' work, but with the Chinese prompt '那个人说了什么' (what did that person say) the model produces no output and rejects the input.

WeChat group full

The newly updated WeChat group is full. Is there a QR code for a new WeChat group?

Could you provide some training data examples?

Thank you very much for your work! Our group would like to use qwen-audio-chat to produce fine-grained transcriptions of academic conference talks, but, perhaps limited by the base model, many technical terms are still transcribed poorly. We would therefore like to try further fine-tuning on domain-specific data. Could we see the format of the data you used for training? Thanks!

use of whisper audio encoder

Dear team, I understand that you're using only the encoder part of Whisper as the audio encoder in the model.

I'm trying to understand how you're doing it, but I am kind of lost. Running git grep -i whisper only gives me references from the READMEs. Where / how do you load the pretrained weights? Any pointers appreciated, thank you.

ground

How should the terms "grounding-based" and "ground" in the paper be translated, and what do they mean?
to support grounding of speech and audio, and grounding-based QA tasks in Qwen-Audio-Chat, such as
finding the starting and ending time of an audio segment mentioning a person’s name or identifying whether
a sound occurs in the given audio

Running the multi-task eval scripts under the eval_audio directory, the model's decoding performance degrades quickly with batching. Is there a problem with how the attention mask or tokenizer padding was handled during training?

AQA task:

Infer batch size 1 : clothoaqa_test ACC_score: 0.5789 clothoaqa_test bi_ACC_score: 0.7522

Infer batch size 8 : clothoaqa_test ACC_score: 0.1811 clothoaqa_test bi_ACC_score: 0.221

With tokenizer padding changed to right:

Infer batch size 8 : clothoaqa_test ACC_score: 0.2354 clothoaqa_test bi_ACC_score: 0.2966

emotion task:

Infer batch size 1: emotion/_test ACC_score: 0.5509

Infer batch size 8 : emotion/_test ACC_score: 0.067

I'd like to confirm the following:

  1. During training, was tokenizer padding left or right?
  2. Was the attention mask effectively ignored during training?
  3. Would you consider updating the scripts under the eval_audio directory?
  4. How should this situation be resolved? Has the open-source model been updated?

Plus: the open-source chat demos all seem to be broken.

Are you sure the provided local models are fine?

On both ModelScope and Hugging Face, the model files under the Qwen-Audio and Qwen-Audio-Chat links are the same.

Moreover, no matter which one I pull down and point to with an absolute path, it cannot be used and errors out immediately.

Questions about training hyperparameters

I have a few questions about training:
1. In the pre-training and fine-tuning stages, what does it mean that the SpecAugment policy is "LibriSpeech Basic"?
2. In the pre-training stage, the audio encoder learning rate decay is set separately to 0.95. Does this apply only to the audio encoder, or to all trainable parameters in pre-training (audio encoder + adapter)?
3. How many machines were used in the pre-training stage, and how long did training take in total?
