Abstract
Many existing works on singing voice conversion (SVC) require clean recordings of the target singer’s voice for training. However, such recordings are often difficult to collect in advance, and singing voices are frequently distorted by reverb and accompaniment music. In this work, we propose robust one-shot SVC (ROSVC), which performs any-to-any SVC robustly even on such distorted singing voices using less than 10 s of reference voice. To this end, we propose a two-stage training method called Robustify. In the first stage, a novel one-shot SVC model based on a generative adversarial network is trained on clean data to ensure high-quality conversion. In the second stage, enhancement modules are introduced into the encoders of the model to improve robustness against distortions in the feature space. Experimental results show that the proposed method outperforms one-shot SVC baselines for both seen and unseen singers and greatly improves robustness against the distortions.
One-Shot Singing Voice Conversion on Distorted Data
In this section, we show the robustness of our model against distortion. We perform one-shot SVC on samples that contain reverb and accompaniment music.
All singers are unseen during the model training.
MUSDB18
The source and reference singers are from the MUSDB18 dataset. Singing voices are extracted from the music mixtures with the music source separation model D3Net. Note that the models are trained using the NHSS and NUS48E datasets, so these samples come from an unseen domain.
Sample 1 AM Contra → Carlos Gonzalez | Sample 2 Mu → AM Contra | Sample 3 Cristina Vane → Mu | |
---|---|---|---|
Source | |||
Target | |||
Separated Source | |||
Separated Target | |||
W/O Robustify | |||
ROSVC (Ours) |
Synthetic Distortion on Reference
The source and reference singers are unseen singers from the NHSS and NUS48E datasets. We synthetically distort the reference singing voices by applying reverb and mixing in accompaniment music.
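As a rough illustration of this kind of synthetic distortion, the sketch below applies reverb by convolving a dry voice with an impulse response and then mixes in accompaniment at a chosen signal-to-noise ratio. This page does not state which impulse responses, music tracks, or mixing levels were used, so the exponentially decaying noise impulse response, the `rt60` and `snr_db` values, and all function names here are assumptions made for the example, not the authors' pipeline.

```python
import numpy as np


def synthetic_impulse_response(sr, rt60=0.5, seed=0):
    """Crude room impulse response: white noise with a 60 dB decay over rt60 seconds.

    This is an illustrative stand-in (assumption); measured or simulated
    room impulse responses could be used instead.
    """
    rng = np.random.default_rng(seed)
    n = int(sr * rt60)
    t = np.arange(n) / sr
    ir = rng.standard_normal(n) * 10.0 ** (-3.0 * t / rt60)
    ir[0] = 1.0  # direct path
    return ir / np.sqrt(np.sum(ir ** 2))  # normalize to unit energy


def apply_reverb(voice, ir):
    """Convolve the dry voice with the impulse response, trimmed to the input length."""
    return np.convolve(voice, ir)[: len(voice)]


def mix_at_snr(voice, music, snr_db):
    """Scale the music so the voice-to-music power ratio equals snr_db, then add it.

    Returns the mixture and the scaled music.
    """
    music = np.resize(music, len(voice))  # loop or truncate the accompaniment
    voice_power = np.mean(voice ** 2)
    music_power = np.mean(music ** 2)
    gain = np.sqrt(voice_power / (music_power * 10.0 ** (snr_db / 10.0)))
    scaled = gain * music
    return voice + scaled, scaled


def distort(voice, music, sr, rt60=0.5, snr_db=5.0, seed=0):
    """Reverberate the voice, then mix in accompaniment (hypothetical default values)."""
    wet = apply_reverb(voice, synthetic_impulse_response(sr, rt60, seed))
    mixture, _ = mix_at_snr(wet, music, snr_db)
    return mixture
```

For instance, `distort(voice, music, sr=8000)` returns an array of the same length as `voice`. Lowering `snr_db` makes the accompaniment louder relative to the voice, and increasing `rt60` lengthens the reverb tail.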
Sample 1 M05(NHSS) → F05(NHSS) | Sample 2 F05(NHSS) → M05(NHSS) | Sample 3 F05(NHSS) → PMAR(NUS48E) | |
---|---|---|---|
Source | |||
Target | |||
Separated Target | |||
W/O Robustify | |||
ROSVC (Ours) |
Synthetic Distortion on Source
The source and reference singers are unseen singers from the NHSS and NUS48E datasets. We synthetically distort the source singing voices by applying reverb and mixing in accompaniment music.
Sample 1 F05(NHSS) → PMAR(NUS48E) | Sample 2 ZHIY(NUS48E) → PMAR(NUS48E) | Sample 3 PMAR(NUS48E) → ZHIY(NUS48E) | |
---|---|---|---|
Source | |||
Target | |||
Separated Source | |||
W/O Robustify | |||
ROSVC (Ours) |
Synthetic Distortion on Source and Reference
The source and reference singers are unseen singers from the NHSS and NUS48E datasets. We synthetically distort both the source and reference singing voices by applying reverb and mixing in accompaniment music.
Sample 1 M05(NHSS) → F05(NHSS) | Sample 2 F05(NHSS) → M05(NHSS) | Sample 3 PMAR(NUS48E) → M05(NHSS) | |
---|---|---|---|
Source | |||
Target | |||
Separated Source | |||
Separated Target | |||
W/O Robustify | |||
ROSVC (Ours) |
Comparison against baselines
In this section, we compare our proposed model, ROSVC, with the baseline, an extension of UCDSVC [1]. We use clean singing voices of unseen singers from the NHSS and NUS48E datasets.
Female to Female
Sample 1 (F05(NHSS) → PMAR(NUS48E)) | Sample 2 (PMAR(NUS48E) → F05(NHSS)) | |
---|---|---|
Source | ||
Target | ||
UCDSVC | ||
ROSVC (Ours) |
Female to Male
Sample 1 (F05(NHSS) → ZHIY(NUS48E)) | Sample 2 (PMAR(NUS48E) → M05(NHSS)) | |
---|---|---|
Source | ||
Target | ||
UCDSVC | ||
ROSVC (Ours) |
Male to Female
Sample 1 (M05(NHSS) → F05(NHSS)) | Sample 2 (ZHIY(NUS48E) → PMAR(NUS48E)) | |
---|---|---|
Source | ||
Target | ||
UCDSVC | ||
ROSVC (Ours) |
Male to Male
Sample 1 (M05(NHSS) → ZHIY(NUS48E)) | Sample 2 (ZHIY(NUS48E) → M05(NHSS)) | |
---|---|---|
Source | ||
Target | ||
UCDSVC | ||
ROSVC (Ours) |
Reference
[1] A. Polyak, L. Wolf, Y. Adi, and Y. Taigman, “Unsupervised cross-domain singing voice conversion,” in Proc. ICASSP, 2020.