rookiejunchen / fullsubnet-plus

The official PyTorch implementation of "FullSubNet+: Channel Attention FullSubNet with Complex Spectrograms for Speech Enhancement".

License: Apache License 2.0

Python 97.94% Shell 2.06%

speechenhancement

fullsubnet-plus's Introduction

Hi, I'm Jun Chen 👋

🎓 I’m a student of Tsinghua University.
📚 Working on Speech Enhancement recently.
💼 Worked as Research Intern at MSRA and Huya.
🔭 Currently working at Tencent Ethereal Audio Lab.
📫 How to reach me: [email protected]

fullsubnet-plus's Issues

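Real-time guarantee of AdaptiveAvgPool1d

Hi, the attention module contains an AdaptiveAvgPool1d operation that pools [B, num_channels, T] down to [B, num_channels, 1]. This step uses information from the entire time axis. Doesn't that mean the model cannot run in real time?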

While trying to install the requirements, I got the error shown below. I'm using Ubuntu 22.04 with an Nvidia GeForce 2080 Ti, and I followed the instructions on the page. How can I get past this? Is there a specific version I need to install? I already tried downgrading pypesq, and that still failed.

Collecting pypesq==1.0
Downloading pypesq-1.0.tar.gz (24 kB)
Requirement already satisfied: numpy in ./anaconda3/envs/speech_enhance/lib/python3.6/site-packages (from pypesq==1.0) (1.19.2)
Building wheels for collected packages: pypesq
Building wheel for pypesq (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /home/username/anaconda3/envs/speech_enhance/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-zzkoowzc/pypesq_c363aa2277764f5585f4781bcbe8b6fc/setup.py'"'"'; file='"'"'/tmp/pip-install-zzkoowzc/pypesq_c363aa2277764f5585f4781bcbe8b6fc/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(file) if os.path.exists(file) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-qeext5mj
cwd: /tmp/pip-install-zzkoowzc/pypesq_c363aa2277764f5585f4781bcbe8b6fc/
Complete output (26 lines):
running bdist_wheel
running build
running build_py
file numpy.py (for module numpy) not found
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/pypesq
copying pypesq/__init__.py -> build/lib.linux-x86_64-3.6/pypesq
file numpy.py (for module numpy) not found
running build_ext
building 'pesq_core' extension
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/pypesq
gcc -pthread -B /home/username/anaconda3/envs/speech_enhance/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/username/anaconda3/envs/speech_enhance/lib/python3.6/site-packages/numpy/core/include/numpy -I/home/username/anaconda3/envs/speech_enhance/include/python3.6m -c pypesq/pesq.c -o build/temp.linux-x86_64-3.6/pypesq/pesq.o
In file included from /home/username/anaconda3/envs/speech_enhance/lib/python3.6/site-packages/numpy/core/include/numpy/ndarraytypes.h:1822,
from /home/username/anaconda3/envs/speech_enhance/lib/python3.6/site-packages/numpy/core/include/numpy/ndarrayobject.h:12,
from /home/username/anaconda3/envs/speech_enhance/lib/python3.6/site-packages/numpy/core/include/numpy/arrayobject.h:4,
from pypesq/pesq.c:2:
/home/username/anaconda3/envs/speech_enhance/lib/python3.6/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
17 | #warning "Using deprecated NumPy API, disable it with "
| ^
pypesq/pesq.c:5:10: fatal error: pesq.h: No such file or directory
5 | #include "pesq.h"
| ^~
compilation terminated.
error: command 'gcc' failed with exit status 1

ERROR: Failed building wheel for pypesq
Running setup.py clean for pypesq
Failed to build pypesq
Installing collected packages: pypesq
Running setup.py install for pypesq ... error
ERROR: Command errored out with exit status 1:
command: /home/username/anaconda3/envs/speech_enhance/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-zzkoowzc/pypesq_c363aa2277764f5585f4781bcbe8b6fc/setup.py'"'"'; file='"'"'/tmp/pip-install-zzkoowzc/pypesq_c363aa2277764f5585f4781bcbe8b6fc/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(file) if os.path.exists(file) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-zcyemj5u/install-record.txt --single-version-externally-managed --compile --install-headers /home/username/anaconda3/envs/speech_enhance/include/python3.6m/pypesq
cwd: /tmp/pip-install-zzkoowzc/pypesq_c363aa2277764f5585f4781bcbe8b6fc/
Complete output (26 lines):
running install
running build
running build_py
file numpy.py (for module numpy) not found
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/pypesq
copying pypesq/__init__.py -> build/lib.linux-x86_64-3.6/pypesq
file numpy.py (for module numpy) not found
running build_ext
building 'pesq_core' extension
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/pypesq
gcc -pthread -B /home/choi1022linux/anaconda3/envs/speech_enhance/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/choi1022linux/anaconda3/envs/speech_enhance/lib/python3.6/site-packages/numpy/core/include/numpy -I/home/choi1022linux/anaconda3/envs/speech_enhance/include/python3.6m -c pypesq/pesq.c -o build/temp.linux-x86_64-3.6/pypesq/pesq.o
In file included from /home/username/anaconda3/envs/speech_enhance/lib/python3.6/site-packages/numpy/core/include/numpy/ndarraytypes.h:1822,
from /home/username/anaconda3/envs/speech_enhance/lib/python3.6/site-packages/numpy/core/include/numpy/ndarrayobject.h:12,
from /home/username/anaconda3/envs/speech_enhance/lib/python3.6/site-packages/numpy/core/include/numpy/arrayobject.h:4,
from pypesq/pesq.c:2:
/home/username/anaconda3/envs/speech_enhance/lib/python3.6/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
17 | #warning "Using deprecated NumPy API, disable it with "
| ^~~~~~~
pypesq/pesq.c:5:10: fatal error: pesq.h: No such file or directory
5 | #include "pesq.h"
| ^~~~~~~~
compilation terminated.
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /home/username/anaconda3/envs/speech_enhance/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-zzkoowzc/pypesq_c363aa2277764f5585f4781bcbe8b6fc/setup.py'"'"'; file='"'"'/tmp/pip-install-zzkoowzc/pypesq_c363aa2277764f5585f4781bcbe8b6fc/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(file) if os.path.exists(file) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-zcyemj5u/install-record.txt --single-version-externally-managed --compile --install-headers /home/username/anaconda3/envs/speech_enhance/include/python3.6m/pypesq Check the logs for full command output.

failed to load pre-trained model

I tried the quick-start usage with the pre-trained checkpoint, but got this error:
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: A load persistent id instruction was encountered,
but no persistent_load function was specified.
My torch version is 1.7.1 and Python is 3.8. The command is "python -m speech_enhance.tools.inference -C config/inference.toml -M archive/data.pkl -I ~/mini_test_set/test.wav -O ~/fullout".
What can I do to run the demo successfully?
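
One possible cause, offered as an assumption rather than a confirmed diagnosis: since PyTorch 1.6, `torch.save` writes a zip archive by default, and `data.pkl` is only one member of that archive. If the downloaded checkpoint was unpacked and the inner `archive/data.pkl` is passed to `-M`, `torch.load` sees a bare pickle whose tensor storages live in sibling files, which would match this error message. The sketch below uses only the standard library to check whether a path looks like the whole zip-based checkpoint. The helper name `looks_like_torch_zip_checkpoint` is hypothetical and not part of the repository.

```python
import zipfile


def looks_like_torch_zip_checkpoint(path):
    """Return True if `path` is a zip archive that has a `data.pkl` member.

    Hypothetical helper: it only inspects the container and never unpickles.
    """
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        return any(name.endswith("data.pkl") for name in archive.namelist())
```

If this returns False for the file given to `-M`, it may be worth pointing `-M` at the original downloaded checkpoint file instead of a file extracted from it.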

Model causality

Is the model causal? It seems that ChannelTimeSenseSELayer is used during both training and inference, and it takes an average pool along the frame axis. Or am I supposed to process the audio chunk by chunk to get an honest result that uses only a limited amount of look-ahead?

https://github.com/hit-thusz-RookieCJ/FullSubNet-plus/blob/81e84b43d4f716cda1cd065d608f6c7b6758e791/speech_enhance/audio_zen/model/module/attention_model.py#L57-L71

AttributeError: module 'fullsubnet_plus.model.fullsubnet_plus' has no attribute 'Sub_FullSubNet_Plus'

Hello, and thank you for your work.
I ran into the following problem when trying to train:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/lee/Documents/software/anaconda3/envs/fsnetplus/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/lee/Desktop/workspace/project/se/ns/tests/FullSubNet-plus/speech_enhance/tools/train.py", line 58, in entry
model = initialize_module(config["model"]["path"], args=config["model"]["args"])
File "/home/lee/Desktop/workspace/project/se/ns/tests/FullSubNet-plus/speech_enhance/audio_zen/utils.py", line 91, in initialize_module
class_or_function = getattr(module, class_or_function_name)
AttributeError: module 'fullsubnet_plus.model.fullsubnet_plus' has no attribute 'Sub_FullSubNet_Plus'

I searched the whole project and could not find this function or file. How can I fix this?

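Error in train.toml

Hello, while reproducing your work I found that train.toml raises an error. The cause seems to be the lines snr_range = [-5,20] and metrics = ["WB_PESQ", "NB_PESQ", "STOI", "SI_SDR"]; commenting out these two lines makes the error go away. I could not find a solution online. Have you run into this problem? Looking forward to your reply.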

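Question about num_groups_in_drop_band

Hi, I have a question: if num_groups_in_drop_band is not 1, doesn't the F dimension of the output mask change? Then, when reconstructing the speech, it would no longer match the size of the original magnitude spectrogram.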

About the license for this model

Thank you for sharing your great code. 😺

What is the license for this model? I'd like to cite it to the repository I'm working on if possible, but I want to post the license correctly.
https://github.com/PINTO0309/PINTO_model_zoo
https://github.com/PINTO0309/PINTO_model_zoo/tree/main/254_FullSubNet-plus

Thank you.

small code fix needed in clipping detection

Hey,
first of all, I'd like to thank you for this great model and for sharing it on GitHub!
Here is a small bug I found.

As we know, the cIRM isn't bounded, so mask amplitudes can be larger than 1.
This can cause clipping in the enhanced signal.

The code checks for this with:

if abs(enhanced).any() > 1:
    print(f"Warning: enhanced is not in the range [-1, 1], {name}")

I think you meant:

if (abs(enhanced) > 1).any():
    print(f"Warning: enhanced is not in the range [-1, 1], {name}")

After fixing this, I see quite a lot of clipping...

https://github.com/hit-thusz-RookieCJ/FullSubNet-plus/blob/a6c89083cd083e729ca3def9a291743e8c3b516b/speech_enhance/audio_zen/inferencer/base_inferencer.py#L148
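
The difference between the two conditions can be reproduced without NumPy. In the original check the reduction runs first and yields a single boolean, and a boolean is never greater than 1, so the warning can never fire. The sketch below mirrors both forms on a plain list; the function names and sample values are made up for illustration.

```python
def clipped_buggy(samples):
    # Mirrors `abs(enhanced).any() > 1`: the reduction happens first and
    # yields a bool, and a bool (0 or 1) can never exceed 1.
    return any(abs(s) for s in samples) > 1


def clipped_fixed(samples):
    # Mirrors `(abs(enhanced) > 1).any()`: compare each sample, then reduce.
    return any(abs(s) > 1 for s in samples)


enhanced = [0.2, -1.7, 0.9]  # made-up samples, one outside [-1, 1]
print(clipped_buggy(enhanced))  # False
print(clipped_fixed(enhanced))  # True
```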

When I'm training the test loss is zero

The test loss is zero while I'm training. Do you know the reason for this? Thank you.

Error occurred while training the FullSubNet-Plus model

Hi,
I encountered the error below after training FullSubNet-Plus for about 100 epochs on the DNS dataset. What is the reason, and how can I solve it?

error:
File "/venv/py365/lib/python3.6/site-packages/librosa/util/utils.py", line 310, in valid_audio
raise ParameterError("Audio buffer is not finite everywhere")
librosa.util.exceptions.ParameterError: Audio buffer is not finite everywhere.

Looking forward to your reply!!

torch.istft requires a complex-valued input tensor matching the output from stft with return_complex=True

I understand this happens because, from torch 2.0 onward, real-valued inputs are no longer supported. Is there a solution for this problem?

Check the issue

Dear author,
Thank you for this interesting solution.
Please check the comment from PINTO0309 in:
PINTO0309/PINTO_model_zoo#187

"The sound transformation model depends on the audio length, but how would you like the input width to be fixed? ONNX with a variable input width will cause an error in Reshape operation and cannot be used."

Additionally, PINTO0309 converted the model in different formats:
https://github.com/PINTO0309/PINTO_model_zoo/tree/main/254_FullSubNet-plus

Thank you.

Will the use of AdaptiveAvgPool1d in MulCA lead to loss of causality?

Hello, I have read the fullsubnet_plus code and have a question.
Although look_ahead=2 means the model uses information from two future frames, MulCA uses AdaptiveAvgPool1d to compute the weights for each frequency band. Does that mean it uses information from all frames?
Could AvgPool1d be used instead, over only a limited number of past frames, so that the whole model could be deployed in streaming mode? Would performance degrade?

Thank you.
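
The idea raised in this question can be illustrated with a minimal sketch that treats one channel as a plain list of per-frame values. Global average pooling gives every frame the same value computed from all T frames, while a cumulative average at frame t uses only frames 0..t. This is one possible causal substitute with zero look-ahead, not the author's implementation; the function names are hypothetical, and nothing here says how enhancement quality would change.

```python
def global_mean_pool(frames):
    """Non-causal: every output depends on all T frames."""
    mean = sum(frames) / len(frames)
    return [mean] * len(frames)


def cumulative_mean_pool(frames):
    """Causal: the output at frame t averages only frames 0..t."""
    out, running = [], 0.0
    for t, value in enumerate(frames, start=1):
        running += value
        out.append(running / t)
    return out
```

Changing only the last frame of the input changes every output of `global_mean_pool`, but leaves all earlier outputs of `cumulative_mean_pool` untouched, which is the property a streaming deployment needs.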

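Exploring streaming inference

Hello, I have been exploring streaming inference on top of your work. I have also read the issues discussing how the lack of causality prevents real-time use, ran some experiments, and would like to ask for your advice.

I first took a 60 s recording and, based on the following command:
ffmpeg -i input.wav -ss 00:00:xx -t 00:00:01 output.wav
wrote a bash script that cuts it into 60 .wav files, enhanced each one through inference, and then concatenated them with ffmpeg.

However, I found a problem: the 1 s segments that contain speech are still enhanced, but some segments where the raw audio is silent produce howling.
The following three spectrograms show, from top to bottom, the original audio, direct enhancement, and enhancement on 1 s segments followed by concatenation:

Some impulses are also visible in the spectrograms.

However, not every silent segment produces howling. To check, I ran the following experiment:
wav = 0.0000001*np.random.randn(100000,) generates white noise with very low energy.
The sample rate is 16 kHz. I saved it as a .wav file and enhanced it, and also tried splitting it before enhancing, but there was no howling; only the white noise itself was enhanced.

From the standpoint of the algorithm, what is your view as the author on this kind of problem?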

About the efficiency of the MulCA module

I am trying to reproduce the FullSubNet+ on some speech enhancement datasets. The results are amazing, the noise suppression ability of this method is so good, and very impressive! 🤩🤩🤩

I came across an implementation detail in the paper and code that piqued my curiosity. It concerns the MulCA module (if I understand correctly, it is implemented in ChannelTimeSenseSELayer). It uses three parallel nn.Sequential branches, each containing a depthwise Conv1d, AdaptiveAvgPool1d, and ReLU in turn. Their outputs are then merged by fully connected layers.

What puzzles me is this: if Conv1d is applied first and AdaptiveAvgPool1d second, then by the distributive law of multiplication the whole process can basically be approximated by the following simplified form (stride=1):
AvgPool1d(Conv1d(A,weight,bias)) ≈ A.sum(-1)*weight.sum(-1)/(A.shape[-1]-kernel_size+1)+bias + small_sided_error (may be similar to the above formula for subband_num>1)

We might be able to define weight.sum(-1)/(A.shape[-1]-kernel_size+1) and bias as two float32 parameters, then use the Maxout activation function to combine the three-way convolution into a simple channel summation.

Are there any special considerations for the design of MulCA through Conv1d? I think the simplified implementation is very similar to ChannelSELayer.
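
The identity behind this question can be checked numerically for a single channel. The sketch below assumes stride 1 and no padding (the padding used in the repository is not stated here), and all function names are hypothetical. It compares the pooled convolution with an exact rewrite in which each kernel tap multiplies the sum of its own shifted window, and with the edge-free approximation from the formula above. It covers only Conv1d followed by average pooling, not the ReLU that comes after.

```python
def conv1d_valid(signal, weight, bias):
    """Single-channel cross-correlation with stride 1 and no padding."""
    k = len(weight)
    return [
        sum(weight[j] * signal[t + j] for j in range(k)) + bias
        for t in range(len(signal) - k + 1)
    ]


def pooled_conv(signal, weight, bias):
    """Average the convolution output over time (global average pooling)."""
    out = conv1d_valid(signal, weight, bias)
    return sum(out) / len(out)


def pooled_conv_closed_form(signal, weight, bias):
    """Exact rewrite: each tap sees the sum of its own shifted window."""
    k = len(weight)
    n_out = len(signal) - k + 1
    total = sum(weight[j] * sum(signal[j:j + n_out]) for j in range(k))
    return total / n_out + bias


def pooled_conv_approx(signal, weight, bias):
    """The approximation from the question: ignore the window edges."""
    n_out = len(signal) - len(weight) + 1
    return sum(signal) * sum(weight) / n_out + bias
```

The closed form matches the pooled convolution up to floating-point rounding. The approximation differs only by the samples at the two ends of the signal, so the gap shrinks as the number of frames grows relative to the kernel size.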

Question about the output of Quick Start.

I followed the steps in the Quick Start section, with the following commands:

The program runs without errors, but the output is odd: there is only a .toml file, and it contains information about the model only, nothing about the wav.

Thanks!

Query regarding Causality of Model

First of all, thank you so much for making your implementation public. I have a question about the causality of the published model.

The paper describes the proposed architecture as real-time, and I can see the Inferencer code dealing with chunks of audio. Yet one of the comments says that the model published in the paper and implemented here on GitHub is non-causal.
If it is indeed non-causal, would it be possible to list the changes needed to make it causal? Thanks.

soundfile.LibsndfileError: Error opening 'xx/xx/xx.wav': File contains data in an unknown format.

I hit a problem when training my model:

soundfile.LibsndfileError: Error opening 'xx/xx/xx.wav': File contains data in an unknown format.

I am running this on Ubuntu.

I have tried many methods. What can I do about this problem? Thank you!

How do I create my own dataset?

If I want to train on my own dataset, what should its structure be? Should the file names in the clean and noise folders correspond one to one? Thank you!

Problem in loading your pre-trained checkpoint

I tried to use your pre-trained checkpoint, data.pkl, to run inference on noisy signals, but torch.load() fails to load the .pkl file. Searching online suggested installing an appropriate torch version. I tried a few versions, but it still doesn't work.

Inference is getting Killed automatically.

When I run inference on my sample audio (.wav) file, the process gets killed.

My Directory Structure

Executing Command

python3 -m speech_enhance.tools.inference \
       -C config/inference.toml   -M /home/arvik/Desktop/ParadisoAI/FullSubNet-plus/best_model.tar \
       -I /home/arvik/Desktop/ParadisoAI/FullSubNet-plus/input \
       -O /home/arvik/Desktop/ParadisoAI/FullSubNet-plus/output

Hello, how can I see the full paper?

Control the strength of Enhancement

Can we control the enhancement strength of FullSubNet through the config?

OSError: [Errno 22] Invalid argument: '.../FullSubNet-plus/result/output/2022-11-11 01:25:28.toml'

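Hello, could you help me figure out what causes this? I don't understand what the initial "use specified dataset_dir_list: ['result/data'], instead of in config" message means. I tried both absolute and relative paths and neither works. Thank you for your help.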
`(speech_enhance) 123@123-Lenovo-Legion-R7000P2020H:/media/123/Axuan2/FullSubNet-plus$ ./A.sh
use specified dataset_dir_list: ['result/data'], instead of in config
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[2022-11-11 01:25:27.873] Using CPU in the experiment.
[2022-11-11 01:25:27.874] Loading inference dataset...
[2022-11-11 01:25:27.883] Loading model...
[2022-11-11 01:25:27.996] 当前正在处理 tar 格式的模型断点，其 epoch 为：194.
[2022-11-11 01:25:28.019] Configurations are as follows:
[2022-11-11 01:25:28.020] [acoustics]
n_fft = 512
win_length = 512
sr = 16000
hop_length = 256

[inferencer]
path = "fullsubnet_plus.inferencer.inferencer.Inferencer"
type = "mag_complex_full_band_crm_mask"

[dataset]
path = "fullsubnet.dataset.dataset_inference.Dataset"

[model]
path = "fullsubnet_plus.model.fullsubnet_plus.FullSubNet_Plus"

[inferencer.args]
n_neighbor = 15

[dataset.args]
dataset_dir_list = [ "result/data",]
sr = 16000

[model.args]
sb_num_neighbors = 15
fb_num_neighbors = 0
num_freqs = 257
look_ahead = 2
sequence_model = "LSTM"
fb_output_activate_function = "ReLU"
sb_output_activate_function = false
channel_attention_model = "TSSE"
fb_model_hidden_size = 512
sb_model_hidden_size = 384
weight_init = false
norm_type = "offline_laplace_norm"
num_groups_in_drop_band = 2
kersize = [ 3, 5, 10,]
subband_num = 1

Traceback (most recent call last):
File "/home/123/anaconda3/envs/speech_enhance/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/123/anaconda3/envs/speech_enhance/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/media/123/Axuan2/FullSubNet-plus/speech_enhance/tools/inference.py", line 37, in <module>
main(configuration, checkpoint_path, output_dir)
File "/media/123/Axuan2/FullSubNet-plus/speech_enhance/tools/inference.py", line 13, in main
inferencer = inferencer_class(
File "/media/123/Axuan2/FullSubNet-plus/speech_enhance/fullsubnet_plus/inferencer/inferencer.py", line 54, in __init__
super().__init__(config, checkpoint_path, output_dir)
File "/media/123/Axuan2/FullSubNet-plus/speech_enhance/audio_zen/inferencer/base_inferencer.py", line 59, in __init__
with open((root_dir / f"{time.strftime('%Y-%m-%d %H:%M:%S')}.toml").as_posix(), "w") as handle:
OSError: [Errno 22] Invalid argument: '/media/123/Axuan2/FullSubNet-plus/result/output/2022-11-11 01:25:28.toml'`
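
A likely cause, stated as an assumption: the output path sits under /media/, which suggests an external drive, and filesystems such as NTFS, exFAT, and FAT do not accept ':' in file names. The file name in the traceback is built with time.strftime('%Y-%m-%d %H:%M:%S'), so it always contains colons. The sketch below builds the same kind of name with a colon-free time format. The helper name `config_dump_path` is hypothetical and not part of the repository.

```python
import time
from pathlib import Path


def config_dump_path(root_dir, now=None):
    """Build a `<timestamp>.toml` path whose name avoids ':' characters."""
    # `now` is an optional time.struct_time; default to the current local time.
    stamp = time.strftime("%Y-%m-%d %H-%M-%S", now or time.localtime())
    return Path(root_dir) / f"{stamp}.toml"
```

Writing the output to a directory on the native Linux filesystem would be another way to test this assumption without touching the code.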

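About causality in the paper

Hi, a question about causality: I see that the normalization used in TCNBlock in the code is groupnorm(1, channel), but the data fed into the norm has shape (B, channel, T). Is that causal? It seems the statistics for each channel would include information from all frames.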
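
For context, a GroupNorm with a single group computes one mean and variance over all channels and all frames of a sample, so each output frame depends on future frames. One known causal alternative is cumulative layer normalization, where the statistics at frame t cover all channels but only frames 0..t. The sketch below is a plain-Python illustration of that technique for one sample. It is not the repository's code, it omits the learnable gain and bias, and the function name is hypothetical.

```python
import math


def cumulative_layer_norm(x, eps=1e-8):
    """Normalize x[c][t] with statistics over all channels and frames 0..t.

    x is one sample shaped (channels, frames) as nested lists.
    """
    channels, frames = len(x), len(x[0])
    out = [[0.0] * frames for _ in range(channels)]
    total = total_sq = 0.0
    for t in range(frames):
        column = [x[c][t] for c in range(channels)]
        # Update running sums so frame t never sees frames after t.
        total += sum(column)
        total_sq += sum(v * v for v in column)
        count = channels * (t + 1)
        mean = total / count
        var = max(total_sq / count - mean * mean, 0.0)
        for c in range(channels):
            out[c][t] = (x[c][t] - mean) / math.sqrt(var + eps)
    return out
```

With this form, changing a later frame leaves the normalized values of all earlier frames unchanged.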