
qwen-audio's Introduction

中文  |   English  




Qwen-Audio 🤖 | 🤗  | Qwen-Audio-Chat 🤖 | 🤗  |    Demo 🤖 | 🤗 
  Homepage  |   Paper   |    WeChat   |   Discord  




Qwen-Audio (Qwen Large Audio Language Model) is the multimodal version of the large model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-Audio accepts diverse audio (human speech, natural sound, music, and song) and text as inputs, and outputs text. The contributions of Qwen-Audio include:

  • Fundamental audio models: Qwen-Audio is a fundamental multi-task audio-language model that supports various tasks, languages, and audio types, serving as a universal audio understanding model. Building upon Qwen-Audio, we develop Qwen-Audio-Chat through instruction fine-tuning, enabling multi-turn dialogues and supporting diverse audio-oriented scenarios.
  • Multi-task learning framework for all types of audios: To scale up audio-language pre-training, we address the challenge of variation in textual labels associated with different datasets by proposing a multi-task training framework, enabling knowledge sharing and avoiding one-to-many interference. Our model incorporates more than 30 tasks and extensive experiments show the model achieves strong performance.
  • Strong Performance: Experimental results show that Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Specifically, Qwen-Audio achieves state-of-the-art results on the test sets of Aishell1, CochlScene, ClothoAQA, and VocalSound.
  • Flexible multi-turn chat from audio and text input: Qwen-Audio supports multiple-audio analysis, sound understanding and reasoning, music appreciation, and tool usage.


We release two models of the Qwen-Audio series:

  • Qwen-Audio: The pre-trained multi-task audio understanding model uses Qwen-7B as the initialization of the LLM, and Whisper-large-v2 as the initialization of the audio encoder.
  • Qwen-Audio-Chat: A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-Audio-Chat supports more flexible interaction, such as multiple audio inputs, multi-round question answering, and creative capabilities.

News and Updates

  • 2023.11.30 🔥 We have released the checkpoints of both Qwen-Audio and Qwen-Audio-Chat on ModelScope and Hugging Face.
  • 2023.11.15 🎉 We released the paper detailing the Qwen-Audio and Qwen-Audio-Chat models, including training details and model performance.

Evaluation

We evaluated Qwen-Audio's abilities on 12 standard benchmarks as follows:

Below is the overall performance:

The details of evaluation are as follows:

Automatic Speech Recognition

Results (WER):

| Dataset     | Model            | dev-clean | dev-other | test-clean | test-other |
|-------------|------------------|-----------|-----------|------------|------------|
| Librispeech | SpeechT5         | 2.1       | 5.5       | 2.4        | 5.8        |
|             | SpeechNet        | -         | -         | 30.7       | -          |
|             | SLM-FT           | -         | -         | 2.6        | 5.0        |
|             | SALMONN          | -         | -         | 2.1        | 4.9        |
|             | Qwen-Audio       | 1.8       | 4.0       | 2.0        | 4.2        |

| Dataset  | Model            | dev        | test       |
|----------|------------------|------------|------------|
| Aishell1 | MMSpeech-base    | 2.0        | 2.1        |
|          | MMSpeech-large   | 1.6        | 1.9        |
|          | Paraformer-large | -          | 2.0        |
|          | Qwen-Audio       | 1.2 (SOTA) | 1.3 (SOTA) |

| Dataset  | Model            | Mic | iOS | Android |
|----------|------------------|-----|-----|---------|
| Aishell2 | MMSpeech-base    | 4.5 | 3.9 | 4.0     |
|          | Paraformer-large | -   | 2.9 | -       |
|          | Qwen-Audio       | 3.3 | 3.1 | 3.3     |

Speech-to-Text Translation

Results (BLEU):

| Dataset | Model       | en-de | de-en | en-zh | zh-en | es-en | fr-en | it-en |
|---------|-------------|-------|-------|-------|-------|-------|-------|-------|
| CoVoST2 | SALMONN     | 18.6  | -     | 33.1  | -     | -     | -     | -     |
|         | SpeechLLaMA | -     | 27.1  | -     | 12.3  | 27.9  | 25.2  | 25.9  |
|         | BLSP        | 14.1  | -     | -     | -     | -     | -     | -     |
|         | Qwen-Audio  | 25.1  | 33.9  | 41.5  | 15.7  | 39.7  | 38.5  | 36.0  |

Automatic Audio Caption

| Dataset | Model      | CIDEr | SPICE | SPIDEr |
|---------|------------|-------|-------|--------|
| Clotho  | Pengi      | 0.416 | 0.126 | 0.271  |
|         | Qwen-Audio | 0.441 | 0.136 | 0.288  |

Speech Recognition with Word-level Timestamp

| Dataset         | Model               | AAC (ms)    |
|-----------------|---------------------|-------------|
| Industrial Data | Force-aligner       | 60.3        |
|                 | Paraformer-large-TP | 65.3        |
|                 | Qwen-Audio          | 51.5 (SOTA) |

Acoustic Scene Classification

| Dataset    | Model      | ACC          |
|------------|------------|--------------|
| Cochlscene | Cochlscene | 0.669        |
|            | Qwen-Audio | 0.795 (SOTA) |
| TUT2017    | Pengi      | 0.353        |
|            | Qwen-Audio | 0.649        |

Speech Emotion Recognition

| Dataset | Model       | ACC   |
|---------|-------------|-------|
| Meld    | WavLM-large | 0.542 |
|         | Qwen-Audio  | 0.557 |

Audio Question & Answer

| Dataset   | Model      | ACC   | ACC (binary) |
|-----------|------------|-------|--------------|
| ClothoAQA | ClothoAQA  | 0.542 | 0.627        |
|           | Pengi      | -     | 0.645        |
|           | Qwen-Audio | 0.579 | 0.749        |

Vocal Sound Classification

| Dataset    | Model      | ACC           |
|------------|------------|---------------|
| VocalSound | CLAP       | 0.4945        |
|            | Pengi      | 0.6035        |
|            | Qwen-Audio | 0.9289 (SOTA) |

Music Note Analysis

| Dataset | Model      | NS. Qualities (MAP) | NS. Instrument (ACC) |
|---------|------------|---------------------|----------------------|
| NSynth  | Pengi      | 0.3860              | 0.5007               |
|         | Qwen-Audio | 0.4742              | 0.7882               |

We have provided all evaluation scripts to reproduce our results. Please refer to eval_audio/EVALUATION.md for details.

Evaluation of Chat

To evaluate the chat abilities of Qwen-Audio-Chat, we provide TUTORIAL and demo for users.

Requirements

  • python 3.8 and above
  • pytorch 1.12 and above, 2.0 and above are recommended
  • CUDA 11.4 and above are recommended (for GPU users)
  • FFmpeg

Quickstart

Below, we provide simple examples to show how to use Qwen-Audio and Qwen-Audio-Chat with 🤖 ModelScope and 🤗 Transformers.

Before running the code, make sure you have set up the environment and installed the required packages: confirm you meet the requirements above, then install the dependent libraries.

pip install -r requirements.txt

Now you can start with ModelScope or Transformers. For more usage, please refer to the tutorial. Qwen-Audio models currently perform best with audio clips under 30 seconds.
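
If your recordings are longer than that, one simple option is to trim or segment them before inference. Below is a minimal sketch that calls the FFmpeg binary (already listed in the requirements) through Python's subprocess; the file names are placeholders:

import subprocess

def trim_to_30s(src, dst, max_seconds=30.0):
    """Write at most `max_seconds` of audio from src to dst using ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-t", str(max_seconds), dst],
        check=True,
    )

trim_to_30s("long_recording.wav", "clip_30s.wav")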

🤗 Transformers

To use Qwen-Audio-Chat for inference, all you need to do is run a few lines of code as demonstrated below. However, please make sure that you are using the latest code.

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio-Chat", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio-Chat", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation (No need to do this if you are using transformers>4.32.0)
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)

# 1st dialogue turn
query = tokenizer.from_list_format([
    {'audio': 'assets/audio/1272-128104-0000.flac'}, # Either a local path or an url
    {'text': 'what does the person say?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# The person says: "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel".

# 2nd dialogue turn
response, history = model.chat(tokenizer, 'Find the start time and end time of the word "middle classes"', history=history)
print(response)
# The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.
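
The timestamps in the reply are special tokens of the form <|2.33|>. If you need them as numbers, a small helper like the one below can pull them out of the response string (purely illustrative post-processing, not part of the Qwen-Audio API):

import re

def extract_timestamps(response):
    """Return every <|x.xx|> timestamp token in the response as a float."""
    return [float(t) for t in re.findall(r"<\|(\d+(?:\.\d+)?)\|>", response)]

print(extract_timestamps('The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.'))
# [2.33, 3.26]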

Running the Qwen-Audio pretrained base model is also simple.

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation (No need to do this if you are using transformers>4.32.0)
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-Audio", trust_remote_code=True)
audio_url = "assets/audio/1272-128104-0000.flac"
sp_prompt = "<|startoftranscription|><|en|><|transcribe|><|en|><|notimestamps|><|wo_itn|>"
query = f"<audio>{audio_url}</audio>{sp_prompt}"
audio_info = tokenizer.process_audio(query)
inputs = tokenizer(query, return_tensors='pt', audio_info=audio_info)
inputs = inputs.to(model.device)
pred = model.generate(**inputs, audio_info=audio_info)
response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False, audio_info=audio_info)
print(response)
# <audio>assets/audio/1272-128104-0000.flac</audio><|startoftranscription|><|en|><|transcribe|><|en|><|notimestamps|><|wo_itn|>mister quilting is the apostle of the middle classes and we are glad to welcome his gospel<|endoftext|>
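
As shown above, the base model echoes the prompt and its special tokens along with the transcription. If you only want the plain transcript, a simple post-processing step (again illustrative, not part of the official API) is to strip the <audio>...</audio> block and all <|...|> tokens:

import re

def strip_special_tokens(decoded):
    """Remove the <audio>...</audio> block and every <|...|> token from the decoded output."""
    text = re.sub(r"<audio>.*?</audio>", "", decoded)
    text = re.sub(r"<\|[^|]*\|>", "", text)
    return text.strip()

print(strip_special_tokens(response))
# mister quilting is the apostle of the middle classes and we are glad to welcome his gospel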

If you run into network issues while downloading model checkpoints and code from Hugging Face, an alternative is to first fetch the checkpoint from ModelScope and then load it from the local directory, as outlined below:

from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Downloading model checkpoint to a local dir model_dir
model_id = 'qwen/Qwen-Audio-Chat'
revision = 'master'
model_dir = snapshot_download(model_id, revision=revision)

# Loading local checkpoints
# trust_remote_code is still set to True since we load the model code from the local dir instead of transformers
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="cuda",
    trust_remote_code=True
).eval()
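
After loading from the local directory, usage is the same as in the Transformers example above, for instance:

query = tokenizer.from_list_format([
    {'audio': 'assets/audio/1272-128104-0000.flac'}, # Either a local path or a URL
    {'text': 'what does the person say?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)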

🤖 ModelScope

ModelScope is an open-source platform for Model-as-a-Service (MaaS), which provides flexible and cost-effective model services to AI developers. Similarly, you can run the models with ModelScope as shown below:

from modelscope import (
    snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
)
import torch
model_id = 'qwen/Qwen-Audio-Chat'
revision = 'master'

model_dir = snapshot_download(model_id, revision=revision)
torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
if not hasattr(tokenizer, 'model_dir'):
    tokenizer.model_dir = model_dir
# use bf16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True).eval()
# use CPU
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval()
# use gpu
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()

# 1st dialogue turn
query = tokenizer.from_list_format([
    {'audio': 'assets/audio/1272-128104-0000.flac'}, # Either a local path or an url
    {'text': 'what does the person say?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# The person says: "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel".

# 2nd dialogue turn
response, history = model.chat(tokenizer, 'Find the start time and end time of the word "middle classes"', history=history)
print(response)
# The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.

Demo

Web UI

We provide code for users to build a web UI demo. Before you start, make sure you install the following packages:

pip install -r requirements_web_demo.txt

Then run the command below and click on the generated link:

python web_demo_audio.py

FAQ

If you run into problems, please refer to the FAQ and the existing issues to search for a solution before opening a new issue.

We Are Hiring

If you are interested in joining us as a full-time employee or intern, please contact us at [email protected].

License Agreement

Researchers and developers are free to use the code and model weights of both Qwen-Audio and Qwen-Audio-Chat. We also allow their commercial use. Check our license at LICENSE for more details.

Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :)

@article{Qwen-Audio,
  title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},
  author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie  and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2311.07919},
  year={2023}
}

Contact Us

If you would like to leave a message for either our research team or our product team, feel free to send an email to [email protected].


qwen-audio's Issues

SFT: use LoRA, or finetune all parameters?

Thanks for your great work!

I'm interested in adding a new task to the pretraining stage; can you offer some advice or references?

Also, I want to know whether you fine-tuned all LLM parameters during SFT, or only LoRA.

If not, why did you use model parallelism = 2?

low gender classification accuracy

The model does not even seem to get the gender right; a few samples:

question = 'Recognize the gender, age, accent, emotion, and speaking content of the person in the audio, and combine these to answer his/her questions while explaining the reasons for these answers.' # same question as in homepage
query = tokenizer.from_list_format([
    {'audio': audio},
    {'text': question},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
  • https://vocaroo.com/11gfffDeXNmQ output: The speaker of this audio is a man speaking, in a English, saying, "you know as well as i do the kind of life you offer her.".
  • https://vocaroo.com/12xbA5EZX60M output: The audio is of a man speaking, in a neutral emotion, saying, "he says no word of happiness.".
  • https://vocaroo.com/19cMpEhrfHye output: The audio is of a man speaking, in a neutral emotion, saying, "the boy‘s face was very pale as he dropped his hands from penny’s shoulders ; but dundee, from behind the portieres, was not troubling to spy for the moment.".
  • https://vocaroo.com/12hdkCS6fhYx output: The audio is of a woman speaking, in a neutral emotion, saying, "when zarathustra once told this to his disciples they asked him, and what, o zarathustra, is the moral of thy story? and zarathustra answered them thus.".

The gender classification accuracy with this model is around 75%, which is lower than a simple F0 cutoff.
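
For reference, here is a minimal sketch of the kind of F0-cutoff baseline mentioned above, using librosa's pyin pitch tracker; the 165 Hz threshold and the file path are illustrative assumptions, not anything from this repository:

import numpy as np
import librosa

def f0_cutoff_gender(path, threshold_hz=165.0):
    """Guess speaker gender from the median F0: above the threshold -> 'female', otherwise 'male'."""
    y, sr = librosa.load(path, sr=16000)
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'), sr=sr
    )
    median_f0 = np.nanmedian(f0)  # f0 is NaN for unvoiced frames
    return 'female' if median_f0 > threshold_hz else 'male'

print(f0_cutoff_gender("sample.wav"))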

What is EvaluationTokenizer?

In evaluate_asr.py, I found that EvaluationTokenizer is imported from evaluate_tokenizer, but where can I find this package?

Performance degradation on languages known to Whisper

I was trying to chat over an audio clip. The Whisper-v2 transcription was near perfect, but the transcription output of Qwen did not capture the whole content. I was trying it on Hindi audio, assuming Whisper's performance on Hindi is very good. Next I tried summarising the audio, but that also did not give proper results. Has anyone tried Qwen for languages supported by Whisper?

Question about Output Instructions

Hi, the paper mentions: "Output Instruction: Lastly, we provide output instruction to further specify the task and desired format for different subtasks, and then the text output begins."

How are the output instructions listed below used during training and inference?
My understanding is that the output instruction is placed at the end of the prompt, e.g.:
query = f"<audio>{audio_url}</audio>{sp_prompt}"
where sp_prompt is "<|startofanalysis|><|unknown|><|keyword|><|zh|><|notimestamps|><|wo_itn|><|audioset_ontology|>"
Is this understanding correct?

Output Instruction

        "<|caption_audiocaps|>",  # Audiocaps caption style
        "<|caption_clotho|>",  # Clotho caption style
        "<|audioset_ontology|>",  # Audioset ontology style
        "<|caption_plain|>",  # plain caption
        "<|itn|>",  # inversed text normalized
        "<|wo_itn|>",  # without inversed text normalized
        "<|startofentityvalue|>",
        "<|endofentityvalue|>",
        "<|startofentitytype|>",
        "<|endofentitytype|>",
        "<|named_entity_recognition|>",  # named entity recognition task
        "<|audio_grounding|>",
        "<|startofword|>",
        "<|endofword|>",
        "<|delim|>",  # delimiter of timestamps pair in audio grounding
        "<|emotion_recognition|>",  # emotion recognition
        "<|music_description|>",  # music description
        "<|note_analysis|>",  # note analysis
        "<|pitch|>",  # note analysis: pitch
        *[f"<|midi_pitch_{i}|>" for i in range(128)],  # midi pitch 0-127
        "<|velocity|>",  # note analysis: velocity
        *[f"<|midi_velocity_{i}|>" for i in range(128)],  # midi velocity 0-127
        "<|sonic|>",  # note analysis:  sonic
        "<|instrument|>",  # note analysis:  instrument
        "<|speaker_meta|>",  # meta information of speaker
        "<|song_meta|>",  # meta information of song
        "<|question|>",  # AQA: question
        "<|answer|>",  # AQA: answer
        "<|choice|>",  # AQA: answer choice
        "<|scene|>",  # scene recognition
        "<|event|>",  # sound event
        "<|vocal_classification|>",  # vocal classification
        "<|speech_understanding|>",  # speech language understanding
        "<|scenario|>",  # speech language understanding: scenario
        "<|action|>",  # speech language understanding: action
        "<|entities|>",  # speech language understanding: entities
        "<|speech_edit|>",  # speech edit

allow_pickle=False

Why do I get the following error when running the example code in the README?
ValueError: Cannot load file containing pickled data when allow_pickle=False
From tutorials I found that the code producing the cached file needs to be modified, but after modifying it and re-running, my change is overwritten by a newly generated cache. I no longer know how to fix this.

Mac M1 runs painfully slow

I wrote a script to make an inference on an audio file, and it took 30 minutes to get a response:

script:

import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# Initialize model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio-Chat", device_map="cpu", trust_remote_code=True).eval()

# Set up audio file
audio_file = "file.mp3"

# Define helper validation function
def validate_response(text):
    if ',' not in text:
        print("Error - response did not contain comma delimited list")
        return False
    return True

# Time full execution
start = time.time()

# Attempt audio analysis
try:
    query = tokenizer.from_list_format([
        {'audio': audio_file},
        {'text': 'Describe this audio with 5 comma-separated adjectives'},
    ])
    response, history = model.chat(tokenizer, query=query, history=None)

    # Print full response and validate
    print(response)
    valid = validate_response(response)

except Exception as e:
    print(f"Error: {e}")
    valid = False

end = time.time()
elapsed = end - start

# Print total time
print(f"Elapsed time: {elapsed} seconds")

# Check if the response was valid
if not valid:
    print("Issues with audio analysis")

result:

python caption_audio.py
audio_start_id: 155164, audio_end_id: 155165, audio_pad_id: 151851.
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|██████████████| 9/9 [00:00<00:00, 17.97it/s]
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Audio 1:file.mp3
Describe this audio with 5 comma-separated adjectives<|im_end|>
<|im_start|>assistant

Funky, groovy, energetic, hip-hop, dance
Elapsed time: 1812.089405298233 seconds

I also got inconsistent results when running the same script a subsequent time:

second result:

python caption_audio.py
audio_start_id: 155164, audio_end_id: 155165, audio_pad_id: 151851.
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|██████████████| 9/9 [00:00<00:00, 9.99it/s]
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Audio 1:file.mp3
Describe this audio with 5 comma-separated adjectives<|im_end|>
<|im_start|>assistant

This is a funky, groovy, upbeat, electronic, electro, electronic beats, electronic music, electronic dance, electronic instrumental, electronic background, electronic soundtrack, electronic soundtrack, electro hip hop, electro hip hop beats, electro hip hop instrumental, electro rap, electro rap beats, electro rap instrumental, electronic music for tv, electronic music for radio, electronic music for advertising, electronic music for games, electronic music for movies, electronic music for youtube, electronic music for corporate use, electronic music for commercial use, electronic music for business, electronic music for presentations, electronic music for websites, electronic music for apps, electronic music for software, electronic music for advertising, electronic music for marketing, electronic music for product presentation, electronic music for corporate presentations, electronic music for business videos, electronic music for product videos, electronic music for commercials, electronic music for radio, electronic music for tv, electronic music for films, electronic music for viral marketing, electronic music for web advertisements, electronic music for youtube videos, electronic music for social media, electronic music for apps, electronic music for mobile, electronic music for corporate, electronic music for background, electronic music for fashion, electronic music for lifestyle, electronic music for beauty, electronic music for food, electronic music for travel, electronic music for health, electronic music for fitness, electronic music for meditation, electronic music for relaxation, electronic music for studying, electronic music for working, electronic music for sleeping, electronic music for dancing, electronic music for fun, electronic music for playing, electronic music for singing, electronic music for podcast, electronic music for video games, electronic music for background music, electronic music for club, electronic music for rapping, electronic music for beat-making, electronic music for producing, electronic music for composing, electronic music for singing, electronic music for playing, electronic music for dancing, electronic music for fun, electronic music for playing, electronic music for singing, electronic music for producing, electronic music for composing, electronic music for radio, electronic music for tv, electronic music for films, electronic music for viral marketing, electronic music for web advertisements, electronic music for youtube videos, electronic music for social media, electronic music for apps, electronic music for mobile, electronic music for corporate, electronic music for background, electronic music for fashion, electronic music for lifestyle, electronic music for beauty, electronic music for food, electronic music for travel, electronic music for health, electronic music for fitness, electronic music for meditation, electronic music for relaxation, electronic music for studying, electronic music for working, electronic music for sleeping, electronic music for
Elapsed time: 60488.493457078934 seconds

Any tips to make this run more efficiently?

Gap when reproducing the reported results

When reproducing the results on Aishell1, I saw some instruction-following failures (for example, I did not ask for timestamps, but the output still contains them). Without removing the special tokens, my final result is worse by 0.63 absolute points; after removing them, it is worse by 0.24. Here are examples of the model not following my instructions:
[screenshots omitted]
Finally, could I get a QR code for the WeChat group? My WeChat ID: royd99

The example demo given for Qwen-Audio does not produce a transcription for a local audio file. Can you provide a corresponding example?

Here is my code:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
import re
import os
import glob
import time

torch.manual_seed(1234)

model_path = "/home/wzp/.cache/modelscope/hub/qwen/Qwen-Audio"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# bf16 precision: recommended for A100, H100, RTX3060, RTX3070, etc. to save memory
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="auto", trust_remote_code=True, bf16=True).eval()

# fp16 precision: recommended for V100, P100, T4, etc. to save memory
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="auto", trust_remote_code=True, fp16=True).eval()

# CPU inference: requires about 32GB of RAM
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="cpu", trust_remote_code=True).eval()

# Default GPU inference: requires about 24GB of VRAM
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cuda", trust_remote_code=True, bf16=True).eval()

# Specify generation length, top_p and other hyperparameters (not needed for transformers 4.32.0 and above)
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-Audio", trust_remote_code=True)

audio_url = "/home/wzp/project/yolov8/modelscope/output.wav"
sp_prompt = "<|startoftranscription|><|cn|><|transcribe|><|cn|><|notimestamps|><|wo_itn|>"
query = f"<audio>{audio_url}</audio>{sp_prompt}"
audio_info = tokenizer.process_audio(query)
inputs = tokenizer(query, return_tensors='pt', audio_info=audio_info)
inputs = inputs.to(model.device)
pred = model.generate(**inputs, audio_info=audio_info)
response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False, audio_info=audio_info)
print(response)

Here is the terminal output:

Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:10<00:00, 1.17s/it]
/home/wzp/project/yolov8/modelscope/output.wav<|startoftranscription|><|cn|><|transcribe|><|cn|><|notimestamps|><|wo_itn|><|notimestamps|><|itn|>Hello, please you need to to handle what business.<|endoftext|>

Few-shot Examples

Hi, is it possible to provide audio-text examples as prompts and then ask questions about a test audio clip? Thanks.

Question about the training data

I could not find specific information about the training data in either the paper or the project. Could you share details of the training datasets? Thanks.

Question about the open-source timeline

Great work! Is there a timetable for open-sourcing the code to reproduce the results and the checkpoints?

How should the prompt be written to get the output of a single task, or of a specific desired task?

When I try emotion recognition with prompt = f"<audio>{audio_url}</audio><|emotion_recognition|>", the result is:
assets/audio/1.wav<|emotion_recognition|><|zh|><|transcribe|><|zh|><|notimestamps|><|speaker_meta|>普通话, 女声, 31岁<|startoftranscript|><|zh|><|transcribe|><|zh|><|notimestamps|><|wo_itn|>今天天气真好<|endoftext|>
As you can see, the result contains the language (Mandarin), gender, age, and transcript, but no emotion. How should the prompt be written so that I get the output of a single task, or of the specific task I want?

Was anything updated on December 7?

On December 7, 2023, we found during testing that the Hugging Face model no longer ran. It turned out that mel_filters.npz was missing and could not be downloaded because we could not reach the Hugging Face servers. We then git-cloned the entire qwen-audio-chat repository from ModelScope and pointed the Hugging Face test code at the cloned path to use the local model, and found the results were far worse than with the model downloaded directly from Hugging Face before December 7; the outputs were garbled.
When we copied the git-cloned mel_filters.npz into the Hugging Face download cache directory and kept testing with the Hugging Face cache path, the results were normal again.

We compared the MD5 values of the git-cloned and Hugging Face models and found some differences.

However, if we download the Hugging Face model manually, its MD5 matches that of the ModelScope git clone.

So, did you replace the model on all hosting platforms on December 7? And is the pre-December-7 model no longer reproducible?
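
For anyone comparing checkpoints the same way, here is a small hashing sketch (the shard file name is a placeholder):

import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 of a file in chunks so large checkpoint shards do not need to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(md5sum("pytorch_model-00001-of-00009.bin"))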

About the ClothoAQA dataset

I noticed that in the ClothoAQA dataset each question is asked three times, and the answers to the three copies can differ. This seems to come from three different annotators labeling the original dataset, which was not further processed. Did you do any processing of this, or do you still treat them as three separate questions?

Training question

In the first training stage, is the embedding layer of the Qwen (Tongyi) LLM trainable?

End of sentence id

Thanks for the great work! I have a small question and it would be great if someone can help.

I was wondering how you deal with the EOS token during pre-training. As far as I know, the Qwen LLM does not have a dedicated EOS token. Since you are freezing the parameters of the LLM, how does the model know when to terminate (i.e., produce the EOS token)?

Error: requests.exceptions.HTTPError: Response details: 404 page not found, Request id: ab8a478639c847c6bbb41438e4d8606e

Using the ModelScope example code, the call fails with an error.
I downloaded the model locally in advance and then changed the path, and then it errored out.

This is the example code from GitHub:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

from modelscope import (
    snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
)
import torch

model_id = '/mnt1/wp/damo_download/Qwen-Audio-Chat'
# model_id = 'qwen/Qwen-Audio-Chat'

revision = 'master'

model_dir = snapshot_download(model_id, revision=revision)
torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
if not hasattr(tokenizer, 'model_dir'):
    tokenizer.model_dir = model_dir

# bf16 precision: recommended for A100, H100, RTX3060, RTX3070, etc. to save memory
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval()

# fp16 precision: recommended for V100, P100, T4, etc. to save memory
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True).eval()

# CPU inference: requires about 32GB of RAM
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval()

# Default GPU inference: requires about 24GB of VRAM
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()

# 1st dialogue turn
query = tokenizer.from_list_format([
    {'audio': 'assets/audio/1272-128104-0000.flac'}, # Either a local path or an url
    {'text': 'what does the person say?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# The person says: "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel".

# 2nd dialogue turn
response, history = model.chat(tokenizer, 'Find the start time and end time of the word "middle classes"', history=history)
print(response)
# The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.

Are Chinese prompts not supported?

With the Chongqing-dialect wav file as input, both the default prompt 'what does the person say?' and the modified 'what is the content' work, but with the Chinese prompt '那个人说了什么' (what did that person say) the model produces no output and rejects the input.

WeChat group full

The newly updated WeChat group is full. Is there a QR code for a new WeChat group?

Could you provide some training data examples?

Thank you very much for your work! Our group would like to use qwen-audio-chat to produce fine-grained transcriptions of academic conference talks, but, perhaps limited by the base model, many technical terms are still transcribed poorly. We would therefore like to try further fine-tuning on domain-specific data. Could we see the format of the data you used for training? Thanks!

use of whisper audio encoder

Dear team, I understand that you're using only the encoder part of Whisper as the audio encoder in the model.

I'm trying to understand how you're doing it, but I am kind of lost. Running git grep -i whisper only gives me references from the READMEs. Where / how do you load the pretrained weights? Any pointers appreciated, thank you.

ground

How should the terms "grounding-based" and "ground" in the paper be translated, and what do they mean?
to support grounding of speech and audio, and grounding-based QA tasks in Qwen-Audio-Chat, such as
finding the starting and ending time of an audio segment mentioning a person’s name or identifying whether
a sound occurs in the given audio

Running the multi-task eval scripts under the eval_audio directory, the model's decoding performance degrades quickly with batching. Is there a problem with how the attention mask or tokenizer padding was handled during training?

AQA task:

Infer batch size 1 : clothoaqa_test ACC_score: 0.5789 clothoaqa_test bi_ACC_score: 0.7522

Infer batch size 8 : clothoaqa_test ACC_score: 0.1811 clothoaqa_test bi_ACC_score: 0.221

With tokenizer padding changed to right:

Infer batch size 8 : clothoaqa_test ACC_score: 0.2354 clothoaqa_test bi_ACC_score: 0.2966

emotion task:

Infer batch size 1: emotion/_test ACC_score: 0.5509

Infer batch size 8 : emotion/_test ACC_score: 0.067

I'd like to confirm the following:

  1. During training, was tokenizer padding left or right?
  2. Was the attention mask effectively ignored during training?
  3. Would you consider updating the scripts under the eval_audio directory?
  4. How should this situation be resolved? Has the open-source model been updated?

Plus: the open-source chat demos all seem to be broken.

Are you sure the provided local models are fine?

On both ModelScope and Hugging Face, the model files under the Qwen-Audio and Qwen-Audio-Chat links are the same.

Moreover, no matter which one I pull down and point to with an absolute path, it cannot be used and errors out immediately.

Questions about training hyperparameters

I have a few questions about training:
1. In the pre-training and fine-tuning stages, what does it mean that the SpecAugment policy is "LibriSpeech Basic"?
2. In the pre-training stage, the audio encoder learning rate decay is set separately to 0.95. Does this apply only to the audio encoder, or to all trainable parameters in pre-training (audio encoder + adapter)?
3. How many machines were used in the pre-training stage, and how long did training take in total?
