Abstract

Many existing works on singing voice conversion (SVC) require clean recordings of the target singer’s voice for training. However, such recordings are often difficult to collect in advance, and singing voices are frequently distorted by reverb and accompaniment music. In this work, we propose a robust one-shot SVC (ROSVC) that performs any-to-any SVC robustly even on such distorted singing voices, using less than 10 s of a reference voice. To this end, we propose a two-stage training method called Robustify. In the first stage, a novel one-shot SVC model based on a generative adversarial network is trained on clean data to ensure high-quality conversion. In the second stage, enhancement modules are introduced into the encoders of the model to improve robustness against distortions in the feature space. Experimental results show that the proposed method outperforms one-shot SVC baselines for both seen and unseen singers and greatly improves robustness against distortions.


One-shot singing voice conversion on distorted data

In this section, we show the robustness of our model against distortion. We perform one-shot SVC on samples that contain reverb and accompaniment music.
All singers are unseen during the model training.

MUSDB18

The source and reference singers are from the MUSDB18 dataset. Singing voices are extracted from the music by the music source separation model D3Net. Note that the models are trained using the NHSS and NUS48E datasets, so these samples come from an unseen domain.

  Sample 1 (AM Contra → Carlos Gonzalez) Sample 2 (Mu → AM Contra) Sample 3 (Cristina Vane → Mu)
Source
Target
Separated Source
Separated Target
w/o Robustify
ROSVC (Ours)

Synthetic Distortion on Reference

The source and reference singers are unseen singers from the NHSS and NUS48E datasets. We synthetically distort the reference singing voices by applying reverb and mixing in music.
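
As an illustration of what "applying reverb and mixing music" can look like in practice, the following is a minimal sketch and not the procedure used to create these samples. It assumes a synthetic impulse response (exponentially decaying noise), a hypothetical reverberation time `rt60`, and a hypothetical voice-to-music ratio `snr_db`; the function names are made up for this example.

```python
import numpy as np


def add_reverb(voice, sample_rate, rt60=0.5, seed=0):
    """Convolve a mono signal with a synthetic impulse response.

    The impulse response is exponentially decaying white noise, an
    assumption made for illustration only (not a measured room response).
    """
    rng = np.random.default_rng(seed)
    n = int(rt60 * sample_rate)
    t = np.arange(n) / sample_rate
    # The amplitude envelope drops by 60 dB at t = rt60.
    impulse_response = rng.standard_normal(n) * 10.0 ** (-3.0 * t / rt60)
    impulse_response[0] = 1.0  # direct path
    wet = np.convolve(voice, impulse_response)[: len(voice)]
    # Rescale so the reverberant signal keeps the peak level of the input.
    peak_in = np.max(np.abs(voice))
    peak_wet = max(np.max(np.abs(wet)), 1e-12)
    return wet * (peak_in / peak_wet)


def mix_at_snr(voice, music, snr_db):
    """Add music to the voice at a given voice-to-music power ratio in dB."""
    # Loop or truncate the music so that it matches the voice length.
    music = np.resize(music, len(voice))
    voice_power = np.mean(voice ** 2)
    music_power = max(np.mean(music ** 2), 1e-12)
    gain = np.sqrt(voice_power / (music_power * 10.0 ** (snr_db / 10.0)))
    return voice + gain * music


if __name__ == "__main__":
    sample_rate = 16000
    t = np.arange(sample_rate) / sample_rate
    voice = 0.5 * np.sin(2 * np.pi * 220.0 * t)  # stand-in for a singing voice
    music = np.random.default_rng(1).standard_normal(sample_rate)  # stand-in for accompaniment
    distorted = mix_at_snr(add_reverb(voice, sample_rate), music, snr_db=5.0)
    print(distorted.shape)
```

Whether reverb is applied before or after mixing, and which impulse responses and mixing levels are used, are choices of this sketch rather than details stated on this page.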

  Sample 1 (M05(NHSS) → F05(NHSS)) Sample 2 (F05(NHSS) → M05(NHSS)) Sample 3 (F05(NHSS) → PMAR(NUS48E))
Source
Target
Separated Target
w/o Robustify
ROSVC (Ours)

Synthetic Distortion on Source

The source and reference singers are unseen singers from the NHSS and NUS48E datasets. We synthetically distort the source singing voices by applying reverb and mixing in music.

  Sample 1 (F05(NHSS) → PMAR(NUS48E)) Sample 2 (ZHIY(NUS48E) → PMAR(NUS48E)) Sample 3 (PMAR(NUS48E) → ZHIY(NUS48E))
Source
Target
Separated Source
w/o Robustify
ROSVC (Ours)

Synthetic Distortion on Source and Reference

The source and reference singers are unseen singers from the NHSS and NUS48E datasets. We synthetically distort both the source and the reference singing voices by applying reverb and mixing in music.

  Sample 1 (M05(NHSS) → F05(NHSS)) Sample 2 (F05(NHSS) → M05(NHSS)) Sample 3 (PMAR(NUS48E) → M05(NHSS))
Source
Target
Separated Source
Separated Target
w/o Robustify
ROSVC (Ours)

Comparison against baselines

In this section, we compare our proposed model, ROSVC, with the baseline, an extension of UCDSVC [1]. We use clean singing voices of unseen singers from the NHSS and NUS48E datasets.

Female to Female

  Sample 1 (F05(NHSS) → PMAR(NUS48E)) Sample 2 (PMAR(NUS48E) → F05(NHSS))
Source
Target
UCDSVC
ROSVC (Ours)

Female to Male

  Sample 1 (F05(NHSS) → ZHIY(NUS48E)) Sample 2 (PMAR(NUS48E) → M05(NHSS))
Source
Target
UCDSVC
ROSVC (Ours)

Male to Female

  Sample 1 (M05(NHSS) → F05(NHSS)) Sample 2 (ZHIY(NUS48E) → PMAR(NUS48E))
Source
Target
UCDSVC
ROSVC (Ours)

Male to Male

  Sample 1 (M05(NHSS) → ZHIY(NUS48E)) Sample 2 (ZHIY(NUS48E) → M05(NHSS))
Source
Target
UCDSVC
ROSVC (Ours)

Reference

[1] A. Polyak, L. Wolf, Y. Adi, and Y. Taigman, “Unsupervised cross-domain singing voice conversion,” in Proc. ICASSP, 2020.