Hierarchical Diffusion Models for Singing Voice Neural Vocoder

Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji

Sony Group Corporation

Introduction

Despite progress in neural vocoders, generating a high-quality singing voice remains challenging owing to the wide variety of musical expression in pitch, loudness, and pronunciation. In this work, we propose a hierarchical diffusion model for singing voice neural vocoders. The proposed method consists of multiple diffusion models operating at different sampling rates: the model at the lowest sampling rate focuses on generating accurate low-frequency components such as pitch, and the other models progressively generate the waveform at higher sampling rates, conditioned on the data at the lower sampling rate and on acoustic features. On this demo page, we present audio samples.
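The coarse-to-fine control flow described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not our implementation: `generate_hierarchically`, `make_stand_in_stage`, and the 80-dimensional feature vector are hypothetical names and values, and each stage is a stand-in that returns silence where a real stage would run a full reverse diffusion process. The 6 kHz and 24 kHz rates are those of the two-stage model in the sample tables on this page.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Waveform = List[float]
# A stage maps (num_samples, acoustic features, lower-rate waveform or None) to a waveform.
StageModel = Callable[[int, Sequence[float], Optional[Waveform]], Waveform]


def generate_hierarchically(
    stages: Sequence[Tuple[int, StageModel]],
    acoustic_features: Sequence[float],
    duration_s: float,
) -> Dict[int, Waveform]:
    """Run the stages from the lowest to the highest sampling rate.

    Every stage sees the acoustic features; every stage except the lowest
    also sees the waveform produced at the next lower sampling rate.
    """
    outputs: Dict[int, Waveform] = {}
    lower_rate_audio: Optional[Waveform] = None
    for rate, model in sorted(stages, key=lambda stage: stage[0]):
        num_samples = int(round(duration_s * rate))
        lower_rate_audio = model(num_samples, acoustic_features, lower_rate_audio)
        outputs[rate] = lower_rate_audio
    return outputs


calls = []  # records (rate, length of the conditioning waveform) for each stage


def make_stand_in_stage(rate: int) -> StageModel:
    """Hypothetical placeholder for one diffusion model of the hierarchy."""
    def stage(num_samples, acoustic_features, lower_rate_audio):
        calls.append((rate, None if lower_rate_audio is None else len(lower_rate_audio)))
        # Assumption: silence stands in for the generated waveform.
        return [0.0] * num_samples
    return stage


# Stages may be listed in any order; they are sorted by sampling rate.
stages = [(24000, make_stand_in_stage(24000)), (6000, make_stand_in_stage(6000))]
outputs = generate_hierarchically(stages, acoustic_features=[0.0] * 80, duration_s=0.5)
```

In this sketch, `outputs` holds one waveform per sampling rate, which mirrors how the page presents intermediate results at 6 kHz alongside the final 24 kHz output.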

Audio samples

Singer	Ground Truth	Parallel WaveGAN [1]	PriorGrad [2]	HPG-2stage (Ours)
F02
NJAT
Elvis
ADIZ
M02
F04
M03
MPUR

Outputs at each sampling rate

The proposed HPG generates samples at each sampling rate of the hierarchy. Below are example outputs and the corresponding ground truth at each sampling rate. “HPG w/ GT 6k” is generated by using the ground-truth 6 kHz data as the low-sampling-rate conditioning signal.

Singer	GT @ 6k	GT @ 24k	HPG-2stage @ 6k	HPG-2stage @ 24k	HPG w/ GT 6k
F02
M02

The effect of the anti-aliasing filter

As discussed in Section 3.1 of the paper, we apply the anti-aliasing filter to the prediction at the lower sampling rate and use the filtered signal to condition the model at the higher sampling rate. Here are the ablation results for the anti-aliasing filter.
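To illustrate what such a filter does to a signal at the lower sampling rate, here is a minimal pure-Python sketch. The filter design is an assumption made for illustration only (a 63-tap Hamming-windowed sinc with a cutoff at 0.4 times the sampling rate); it is not necessarily the filter used in the paper, and `lowpass_kernel`, `fir_filter`, and `rms` are hypothetical helper names. The sketch passes a 220 Hz tone and a 2900 Hz tone, both sampled at 6 kHz, through the filter and compares their levels before and after.

```python
import math


def lowpass_kernel(cutoff, num_taps=63):
    """Hamming-windowed sinc low-pass FIR.

    cutoff is a fraction of the sampling rate (0 < cutoff < 0.5);
    num_taps is assumed to be odd so the kernel has a centre tap.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        # Ideal low-pass impulse response, with the k == 0 limit handled explicitly.
        ideal = 2 * cutoff if k == 0 else math.sin(2 * math.pi * cutoff * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalize to unity gain at DC


def fir_filter(signal, taps):
    """Zero-padded convolution that keeps the output aligned with the input."""
    half = len(taps) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, t in enumerate(taps):
            idx = i + half - j
            if 0 <= idx < len(signal):
                acc += t * signal[idx]
        out.append(acc)
    return out


def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))


sr_low = 6000  # lower sampling rate, as in the tables on this page
n = 600        # 0.1 s of audio
pitch_tone = [math.sin(2 * math.pi * 220 * i / sr_low) for i in range(n)]
near_nyquist = [math.sin(2 * math.pi * 2900 * i / sr_low) for i in range(n)]

taps = lowpass_kernel(cutoff=0.4)  # assumed cutoff: 0.4 * 6000 Hz = 2400 Hz
edge = len(taps) // 2              # skip samples affected by zero padding

# Ratio of output level to input level for each tone, measured away from the edges.
kept = rms(fir_filter(pitch_tone, taps)[edge:-edge]) / rms(pitch_tone[edge:-edge])
removed = rms(fir_filter(near_nyquist, taps)[edge:-edge]) / rms(near_nyquist[edge:-edge])
print(f"220 Hz level ratio: {kept:.3f}, 2900 Hz level ratio: {removed:.3f}")
```

With these assumed settings, the low-frequency tone passes almost unchanged while the component just below the 3 kHz Nyquist frequency is strongly attenuated, so only the band below the cutoff reaches the higher-rate model as conditioning.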

w/o filter	w/ filter

References

[1] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in Proc. ICASSP, 2020.
[2] S.-g. Lee, H. Kim, C. Shin, X. Tan, C. Liu, Q. Meng, T. Qin, W. Chen, S. Yoon, and T.-Y. Liu, “PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior,” in Proc. ICLR, 2022.