Hierarchical Diffusion Models for Singing Voice Neural Vocoder

Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji


Sony Group Corporation


Introduction

Despite progress in neural vocoders, generating high-quality singing voices remains challenging due to the wide variety of musical expression in pitch, loudness, and pronunciation. In this work, we propose a hierarchical diffusion model for singing voice neural vocoding. The proposed method consists of multiple diffusion models operating at different sampling rates: the model at the lowest sampling rate focuses on generating accurate low-frequency components such as pitch, and the other models progressively generate the waveform at higher sampling rates, conditioned on the data at the lower sampling rate and the acoustic features. On this demo page, we present audio samples.
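The coarse-to-fine generation loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the linear upsampler, and the toy per-rate "models" are all assumptions standing in for the actual diffusion samplers.

```python
import numpy as np

def upsample(x, factor):
    """Linearly interpolate a waveform to a higher sampling rate (toy upsampler)."""
    n = len(x) * factor
    return np.interp(np.arange(n) / factor, np.arange(len(x)), x)

def generate_hierarchical(stages, features):
    """Run the stages from lowest to highest sampling rate.

    stages: list of (rate, model) pairs sorted by rate, where
    model(features, cond) -> waveform at that rate. The lowest stage is
    conditioned only on acoustic features; each higher stage also receives
    the upsampled output of the previous stage as a conditioning signal.
    """
    cond, prev_rate = None, None
    for rate, model in stages:
        if cond is not None:
            cond = upsample(cond, rate // prev_rate)  # bring to this stage's rate
        cond = model(features, cond)
        prev_rate = rate
    return cond

# Toy stand-ins for the per-rate samplers: the low stage synthesizes a pitch-like
# tone from a scalar "feature"; the high stage refines the upsampled signal.
lowest = lambda feats, cond: np.sin(2 * np.pi * feats * np.arange(6000) / 6000)
highest = lambda feats, cond: cond + 0.01 * np.random.randn(len(cond))

wave = generate_hierarchical([(6000, lowest), (24000, highest)], features=5.0)
print(len(wave))  # 24000
```

The point of the hierarchy is that the 6 kHz stage only has to model low-frequency structure (pitch, loudness), while the 24 kHz stage can concentrate on adding detail on top of an already-correct coarse waveform.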

Audio samples

Singer Ground Truth Parallel WaveGAN [1] PriorGrad [2] HPG-2stage (Ours)
F02
NJAT
Elvis
ADIZ
M02
F04
M03
MPUR


Outputs at each sampling rate

The proposed HPG generates samples at each sampling rate of the hierarchy. Below are example outputs and the corresponding ground truth at each sampling rate. “HPG w/ GT 6k” is generated using the ground-truth data as the low-sampling-rate conditioning signal.

Singer GT @ 6k GT @ 24k HPG-2stage @ 6k HPG-2stage @ 24k HPG w/ GT 6k
F02
M02


The effect of the anti-aliasing filter

As discussed in Section 3.1, we apply an anti-aliasing filter to the prediction at the lower sampling rate and use it to condition the model at the higher sampling rate. Below are ablation results for the anti-aliasing filter.
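A minimal sketch of this step is shown below. The function name, filter length, and cutoff are illustrative assumptions, not the paper's exact design: the idea is simply that the lower-rate prediction is low-pass filtered before it conditions the higher-rate model, so the conditioning signal stays band-limited like the signals seen during training.

```python
import numpy as np

def lowpass_fir(x, cutoff, numtaps=101):
    """Windowed-sinc low-pass FIR; `cutoff` is a fraction of the Nyquist rate."""
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = cutoff * np.sinc(cutoff * n) * np.hamming(numtaps)  # ideal LPF x window
    h /= h.sum()                                            # unit gain at DC
    return np.convolve(x, h, mode="same")

# Stand-in for a 6 kHz prediction: a clean tone plus a high-frequency artifact
# near the Nyquist limit (0.48 cycles/sample).
t = np.arange(6000)
pred_6k = np.sin(2 * np.pi * 0.02 * t) + 0.3 * np.sin(2 * np.pi * 0.48 * t)

# Filter just below Nyquist before handing the signal to the 24 kHz stage.
cond_6k = lowpass_fir(pred_6k, cutoff=0.9)
```

Without this step, energy near the low-rate Nyquist frequency would be upsampled along with the signal and could mislead the higher-rate model, which is what the ablation below probes.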

w/o filter w/ filter

Reference

[1] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in Proc. ICASSP, 2020.
[2] S.-g. Lee, H. Kim, C. Shin, X. Tan, C. Liu, Q. Meng, T. Qin, W. Chen, S. Yoon, and T.-Y. Liu, “PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior,” in Proc. ICLR, 2022.