Naoya Takahashi, Mayank K. Singh, Yuki Mitsufuji

Introduction

This page provides the audio and image samples shown in the paper, as well as some additional samples. All results for audio-guided image translation, image-guided voice conversion, and latent-guided face and voice generation are obtained with a single proposed model.

Audio-guided image translation

The audio samples for the results shown in Fig. 1a in the paper.

Source | Reference | Output

Additional results shown in the supplemental material.

Source | Reference | Output
Image-guided voice conversion

The outputs are produced by converting the source voice using reference images. The images are from an image-only dataset (CelebA-HQ), so there are no ground-truth voices. The first sample corresponds to the results shown in Fig. 1b in the paper, and the rest are from the supplemental material.
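As a rough illustration of this conversion step, the Python sketch below encodes a reference face into a style vector and re-synthesizes the source voice with that style. The module names (image_style_encoder, generator, vocoder) are hypothetical placeholders, not the released code's API.

```python
import torch

@torch.no_grad()
def image_guided_voice_conversion(source_mel, reference_image,
                                  image_style_encoder, generator, vocoder):
    """Convert a source utterance to the voice style inferred from a reference face.

    image_style_encoder, generator, and vocoder are hypothetical stand-ins for the
    corresponding networks; shapes and names are assumptions, not the actual API.
    """
    # Encode the reference face image into a style vector of shape (1, style_dim).
    style = image_style_encoder(reference_image.unsqueeze(0))
    # Re-synthesize the source mel-spectrogram conditioned on the image-derived style.
    converted_mel = generator(source_mel.unsqueeze(0), style)
    # Invert the converted mel-spectrogram back to a waveform with a neural vocoder.
    return vocoder(converted_mel).squeeze(0)
```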

Source | Reference | Output

Source | Reference | Output

Latent-guided face and voice generation

We sample four latent codes, compute style vectors from the codes using the mapping network, and generate faces and voices from two source faces and voices using the style vectors. The samples are the results shown in Fig. 6 in the paper.
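A minimal sketch of this procedure, assuming hypothetical mapping_network and generator modules and an assumed latent dimension, is shown below: sample latent codes, map them to style vectors, and render every source with every style.

```python
import torch

@torch.no_grad()
def latent_guided_generation(sources, mapping_network, generator,
                             num_styles=4, latent_dim=16):
    """Render each source (face and/or voice) in styles drawn from random latent codes.

    mapping_network and generator are hypothetical placeholders for the networks in
    the proposed model; latent_dim is an assumed value.
    """
    # Sample latent codes z ~ N(0, I) and map them to style vectors.
    z = torch.randn(num_styles, latent_dim)
    styles = mapping_network(z)
    # Apply every style vector to every source input.
    outputs = []
    for src in sources:  # e.g. the two source face/voice pairs of Fig. 6
        outputs.append([generator(src.unsqueeze(0), s.unsqueeze(0)) for s in styles])
    return outputs  # outputs[i][j] is source i generated with style j
```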

Source | Output 1 | Output 2 | Output 3 | Output 4

Results on out-of-domain samples

To test the generalizability of the model to out-of-domain samples, we use the FFHQ dataset, which is not included in training, as the source images for audio-guided image translation. For audio, we use VCTK, LRS3, and Wav2Lip, as well as VoxCeleb2, which is unseen during training.