Hierarchical Diffusion Models for Singing Voice Neural Vocoder
Despite progress in neural vocoders, generating a high-quality singing voice remains challenging because singing exhibits a wider variety of musical expression in pitch, loudness, and pronunciation. In this work, we propose a hierarchical diffusion model for singing voice neural vocoders. The proposed method consists of multiple diffusion models operating at different sampling rates; the model at the lowest sampling rate focuses on generating accurate low-frequency components such as pitch, and the other models progressively generate the waveform at higher sampling rates, conditioned on the data at the lower sampling rate and on acoustic features. On this demo page, we present some audio samples.
|Singer||Ground Truth||Parallel WaveGAN||PriorGrad||HPG-2stage (Ours)|
Outputs at each sampling rate
The proposed HPG generates samples at each sampling rate of the hierarchy. Below are examples of the outputs and the ground truth at each sampling rate. “HPG w/ GT 6k” is the output generated by using the ground-truth 6 kHz data as the low-sampling-rate conditioning signal.
|Singer||GT @ 6k||GT @ 24k||HPG-2stage @ 6k||HPG-2stage @ 24k||HPG w/ GT 6k|
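As a rough illustration of how the columns above relate to one another, the following minimal sketch runs a hierarchy from the lowest to the highest sampling rate and optionally substitutes ground-truth data for the lowest-rate prediction, as in the “HPG w/ GT 6k” setting. The names `generate_hierarchical`, `dummy_sampler`, and `lowest_override` are hypothetical, and `dummy_sampler` is only a placeholder that stands in for a full reverse-diffusion loop; it is not the model used to produce the samples on this page.

```python
import numpy as np

def dummy_sampler(mel, lower, num_samples, rng):
    """Placeholder for a diffusion sampler (assumption: NOT a real model).

    mel:   acoustic features (unused by this placeholder)
    lower: waveform from the previous (lower-rate) stage, or None
    """
    noise = 0.01 * rng.standard_normal(num_samples)
    if lower is None:
        return noise
    # Naive sample repetition brings the conditioning signal to the
    # target length; a real system would use a proper upsampler.
    factor = num_samples // len(lower)
    return np.repeat(lower, factor)[:num_samples] + noise

def generate_hierarchical(stages, mel, duration_s, rng, lowest_override=None):
    """Generate one waveform per stage, from low to high sampling rate.

    stages: list of (sampling_rate_hz, sampler), ordered low to high.
    lowest_override: optional ground-truth waveform at the lowest rate,
        used instead of the first stage's prediction.
    Returns a dict mapping sampling rate to waveform.
    """
    outputs = {}
    lower = None
    for i, (rate, sampler) in enumerate(stages):
        num_samples = int(round(duration_s * rate))
        if i == 0 and lowest_override is not None:
            wave = np.asarray(lowest_override, dtype=np.float64)
        else:
            wave = sampler(mel, lower, num_samples, rng)
        outputs[rate] = wave
        lower = wave  # condition the next stage on this output
    return outputs

# Two-stage example at 6 kHz and 24 kHz with dummy acoustic features.
rng = np.random.default_rng(0)
mel = np.zeros((80, 100))
stages = [(6000, dummy_sampler), (24000, dummy_sampler)]
outputs = generate_hierarchical(stages, mel, duration_s=1.0, rng=rng)
```

Passing a 6 kHz ground-truth waveform as `lowest_override` skips the first stage's prediction, so any remaining degradation in the 24 kHz output can be attributed to the higher-rate stage alone.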
The effect of the anti-aliasing filter
As discussed in Section 3.1 of the paper, we apply the anti-aliasing filter to the prediction at the lower sampling rate and use the filtered signal to condition the model at the higher sampling rate. Here are the ablation results of the anti-aliasing filter.
|w/o filter||w/ filter|
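To make the filtering step concrete, here is a minimal sketch of low-pass filtering a low-sampling-rate prediction before it is used as a conditioning signal. The filter design is an assumption chosen for illustration: a 101-tap Hamming-windowed sinc FIR with its cutoff at 0.9 of the Nyquist frequency. The actual filter used for the samples above is not specified on this page, and `lowpass_fir` and `antialias` are hypothetical helper names.

```python
import numpy as np

def lowpass_fir(num_taps, cutoff):
    """Hamming-windowed sinc low-pass FIR filter.

    cutoff is a fraction of the Nyquist frequency (0 < cutoff < 1).
    num_taps should be odd so the filter is symmetric about its center.
    """
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = cutoff * np.sinc(cutoff * n) * np.hamming(num_taps)
    return h / h.sum()  # normalize to unity gain at DC

def antialias(pred_low, num_taps=101, cutoff=0.9):
    """Suppress content near the Nyquist frequency of a low-rate signal.

    num_taps and cutoff are illustrative assumptions, not values from
    the paper. mode="same" keeps the output aligned with the input.
    """
    h = lowpass_fir(num_taps, cutoff)
    return np.convolve(pred_low, h, mode="same")

# Example: a 6 kHz "prediction" with a pitch-range tone plus a spurious
# component just below the 3 kHz Nyquist frequency.
sr = 6000
t = np.arange(sr) / sr
pitch_tone = np.sin(2 * np.pi * 220 * t)
near_nyquist = np.sin(2 * np.pi * 2950 * t)
filtered = antialias(pitch_tone + near_nyquist)
```

In this example the 220 Hz tone passes essentially unchanged while the 2950 Hz component is strongly attenuated, so `filtered` rather than the raw prediction would be passed on as the conditioning signal for the higher-rate model.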
R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in Proc. ICASSP, 2020.
S.-g. Lee, H. Kim, C. Shin, X. Tan, C. Liu, Q. Meng, T. Qin, W. Chen, S. Yoon, and T.-Y. Liu, “PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior,” in Proc. ICLR, 2022.