ALO-VC: Any-to-any Low-latency One-shot Voice Conversion


Converted audio samples

Notation
  • All source and target speech samples were unseen during training.
  • ALO-VC-R is ALO-VC with the d-vector speaker encoder from Resemblyzer.
  • ALO-VC-E is ALO-VC with the ECAPA-TDNN speaker encoder, which can improve performance.

VCTK

Female (p301) → Male (p256)

Source | Target | VQMIVC (Baseline) | DiffVC (Baseline) | ALO-VC-R (Proposed) | ALO-VC-E (Proposed)
Sample

Female (p268) → Female (p301)

Source | Target | VQMIVC (Baseline) | DiffVC (Baseline) | ALO-VC-R (Proposed) | ALO-VC-E (Proposed)
Sample

Male (p252) → Female (p268)

Source | Target | VQMIVC (Baseline) | DiffVC (Baseline) | ALO-VC-R (Proposed) | ALO-VC-E (Proposed)
Sample

Male (p256) → Male (p252)

Source | Target | VQMIVC (Baseline) | DiffVC (Baseline) | ALO-VC-R (Proposed) | ALO-VC-E (Proposed)
Sample

LibriSpeech

Female (1988) → Male (251)

Source | Target | VQMIVC (Baseline) | DiffVC (Baseline) | ALO-VC-R (Proposed) | ALO-VC-E (Proposed)
Sample

Female (1988) → Female (2412)

Source | Target | VQMIVC (Baseline) | DiffVC (Baseline) | ALO-VC-R (Proposed) | ALO-VC-E (Proposed)
Sample

Male (652) → Female (2412)

Source | Target | VQMIVC (Baseline) | DiffVC (Baseline) | ALO-VC-R (Proposed) | ALO-VC-E (Proposed)
Sample

Male (652) → Male (251)

Source | Target | VQMIVC (Baseline) | DiffVC (Baseline) | ALO-VC-R (Proposed) | ALO-VC-E (Proposed)
Sample

Internal Speakers

Female (female 1) → Male (male 2)

Source | Target | VQMIVC (Baseline) | DiffVC (Baseline) | ALO-VC-R (Proposed) | ALO-VC-E (Proposed)
Sample

Female (female 1) → Female (female 2)

Source | Target | VQMIVC (Baseline) | DiffVC (Baseline) | ALO-VC-R (Proposed) | ALO-VC-E (Proposed)
Sample

Male (male 1) → Female (female 2)

Source | Target | VQMIVC (Baseline) | DiffVC (Baseline) | ALO-VC-R (Proposed) | ALO-VC-E (Proposed)
Sample

Male (male 1) → Male (male 2)

Source | Target | VQMIVC (Baseline) | DiffVC (Baseline) | ALO-VC-R (Proposed) | ALO-VC-E (Proposed)
Sample