
Hierarchical Diffusion Models for Singing Voice Neural Vocoder

Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji

Sony Group Corporation


Despite progress in neural vocoders, generating a high-quality singing voice remains challenging due to the wider variety of musical expression in pitch, loudness, and pronunciation. In this work, we propose a hierarchical diffusion model for singing voice neural vocoders. The proposed method consists of multiple diffusion models operating at different sampling rates; the model at the lowest sampling rate focuses on generating accurate low-frequency components such as pitch, and the other models progressively generate the waveform at higher sampling rates on the basis of the data at the lower sampling rate and the acoustic features. On this demo page, we present some audio samples.
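The progressive generation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the diffusion model internals are replaced by a placeholder denoising callable, and the upsampling of the lower-rate output uses naive linear interpolation; the function names and interfaces are assumptions.

```python
import numpy as np

def sample_stage(cond_signal, acoustic_features, target_len, denoise_fn, steps=50):
    """One diffusion stage (hypothetical interface): iteratively denoise
    Gaussian noise into a waveform at this stage's sampling rate,
    conditioned on the lower-rate signal and the acoustic features."""
    x = np.random.randn(target_len)
    for t in reversed(range(steps)):
        x = denoise_fn(x, t, cond_signal, acoustic_features)
    return x

def hierarchical_vocode(acoustic_features, denoise_fns, rates=(6_000, 24_000), duration=1.0):
    """Generate progressively: the lowest-rate model captures pitch and other
    low-frequency structure; each higher-rate model is conditioned on the
    (upsampled) output of the stage below."""
    cond = None
    for rate, denoise_fn in zip(rates, denoise_fns):
        n = int(rate * duration)
        if cond is not None:
            # Naively upsample the lower-rate output to this stage's length.
            cond = np.interp(np.linspace(0, len(cond) - 1, n),
                             np.arange(len(cond)), cond)
        cond = sample_stage(cond, acoustic_features, n, denoise_fn)
    return cond
```

With two stages at 6 kHz and 24 kHz (the HPG-2stage configuration), a one-second output has 24,000 samples regardless of what the placeholder denoisers compute.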

Audio samples

Singer | Ground Truth | Parallel WaveGAN [1] | PriorGrad [2] | HPG-2stage (Ours)

Outputs at each sampling rate

The proposed HPG generates samples at each sampling rate of the hierarchy. Below are examples of the outputs and the ground truth at each sampling rate. “HPG w/ GT 6k” denotes samples generated using the ground-truth data as the low-sampling-rate conditioning signal.

Singer | GT @ 6k | GT @ 24k | HPG-2stage @ 6k | HPG-2stage @ 24k | HPG w/ GT 6k

The effect of the anti-aliasing filter

As discussed in Section 3.1, we apply an anti-aliasing filter to the prediction at the lower sampling rate and use it to condition the model at the higher sampling rate. Below are the ablation results for the anti-aliasing filter.
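A minimal sketch of this filtering step, under assumptions: the paper does not specify this exact filter here, so a zero-phase Butterworth low-pass with an illustrative cutoff just below the lower rate's Nyquist frequency stands in for it.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def anti_alias(pred_low, low_rate=6_000, cutoff_ratio=0.9, order=8):
    """Low-pass the lower-sampling-rate prediction below its Nyquist
    frequency before it conditions the higher-rate model. The filter
    type, order, and cutoff ratio are illustrative choices, not the
    paper's exact design."""
    nyquist = low_rate / 2
    # Zero-phase filtering avoids introducing phase distortion into
    # the conditioning signal.
    sos = butter(order, cutoff_ratio * nyquist, btype="low",
                 fs=low_rate, output="sos")
    return sosfiltfilt(sos, pred_low)
```

Applied to a 6 kHz prediction, this suppresses energy near 3 kHz, where the low-rate model's output is least reliable, while leaving the well-modeled low-frequency band intact.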

w/o filter | w/ filter


[1] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in Proc. ICASSP, 2020.
[2] S.-g. Lee, H. Kim, C. Shin, X. Tan, C. Liu, Q. Meng, T. Qin, W. Chen, S. Yoon, and T.-Y. Liu, “PriorGrad: Improving conditional denoising diffusion models with data-dependent adaptive prior,” in Proc. ICLR, 2022.