Introduction
This page provides the audio and image samples shown in the paper, along with some additional samples. All results on audio-guided image translation, image-guided voice conversion, and latent-guided face & voice generation are produced by a single proposed model.
Audio-guided image translation
Audio samples for the results shown in Fig. 1a of the paper.
Source | ![]() | ![]() | ![]() | ![]() |
---|---|---|---|---|
Reference | ![]() | ![]() | ![]() | ![]() |
Output | ![]() | ![]() | ![]() | ![]() |
Additional results shown in the supplemental material.
Source | ![]() | ![]() | ![]() | ![]() |
---|---|---|---|---|
Reference | ![]() | ![]() | ![]() | ![]() |
Output | ![]() | ![]() | ![]() | ![]() |
Image-guided voice conversion
The outputs are produced by converting the source voice using reference images. The images come from the image-only dataset (CelebA-HQ), so there is no ground-truth voice. The first sample corresponds to the results shown in Fig. 1b of the paper; the rest appear in the supplemental material.
Source | ![]() | | | |
---|---|---|---|---|
Reference | ![]() | ![]() | ![]() | ![]() |
Output | ![]() | ![]() | ![]() | ![]() |
Source | ![]() | | | |
---|---|---|---|---|
Reference | ![]() | ![]() | ![]() | ![]() |
Output | ![]() | ![]() | ![]() | ![]() |
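The conversion pipeline described above (a style encoder maps the reference image to a style vector, which then conditions the conversion of the source voice) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the function names, dimensions, and the random-projection "encoder" are all hypothetical stand-ins for learned networks.

```python
import numpy as np

rng = np.random.default_rng(1)
STYLE_DIM = 8  # hypothetical style-vector size


def image_style_encoder(image):
    """Toy stand-in for a learned style encoder: reference image -> style vector.
    Here it is just a fixed average-pooling projection of the flattened image."""
    proj = np.ones((STYLE_DIM, image.size)) / image.size
    return proj @ image.ravel()


def convert_voice(source_features, style):
    """Toy stand-in for a learned decoder: shift the source's per-frame
    features by a style-dependent bias, preserving the frame count."""
    return source_features + style  # broadcasts style over all frames


ref_image = rng.random((4, 4))            # stand-in reference face image
source = rng.random((10, STYLE_DIM))      # stand-in source voice features (frames x dims)
converted = convert_voice(source, image_style_encoder(ref_image))
assert converted.shape == source.shape    # conversion keeps the source's length
```

The key structural point the sketch mirrors is that the style vector comes only from the image, so any face (even one with no paired audio, as with CelebA-HQ) can guide the conversion.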
Latent-guided face and voice generation
We sample four latent codes, compute style vectors from the codes using the mapping network, and generate faces and voices from two source faces and voices using the style vectors. The samples are the results shown in Fig. 6 of the paper.
Source | Output 1 | Output 2 | Output 3 | Output 4 |
---|---|---|---|---|
![]() | ![]() | ![]() | ![]() | ![]() |
![]() | ![]() | ![]() | ![]() | ![]() |
![]() | ![]() | ![]() | ![]() | ![]() |
![]() | ![]() | ![]() | ![]() | ![]() |
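The sampling procedure above (latent code → mapping network → style vector → generator) can be sketched as a few lines of code. This is a minimal illustration under assumed dimensions and with random stand-in weights; the actual mapping network and generators are learned models whose architecture this sketch does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, HIDDEN_DIM, STYLE_DIM = 16, 32, 64  # hypothetical sizes


def mapping_network(z, w1, w2):
    """Toy 2-layer MLP standing in for the learned mapping network:
    latent code z -> style vector s shared by the face and voice generators."""
    h = np.maximum(w1 @ z, 0.0)  # ReLU hidden layer
    return w2 @ h


# Random stand-in weights (the real mapping network is trained).
w1 = rng.standard_normal((HIDDEN_DIM, LATENT_DIM))
w2 = rng.standard_normal((STYLE_DIM, HIDDEN_DIM))

# Four latent codes -> four style vectors, as in Fig. 6.
styles = [mapping_network(rng.standard_normal(LATENT_DIM), w1, w2)
          for _ in range(4)]

# Each style vector would then condition the face generator and the voice
# generator for each of the two sources (2 sources x 4 styles of outputs).
assert all(s.shape == (STYLE_DIM,) for s in styles)
```

Because one style vector drives both generators, each column of the table above shows a face and a voice produced from the same latent code.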
Results on out-of-domain samples
To test the generalizability of the model to out-of-domain samples, we use the FFHQ dataset, which is not used during training, as the source images for audio-guided image translation. For audio, we use VCTK, LRS3, and Wav2Lip, as well as VoxCeleb2, which is unseen during training.