Zero-Shot Singing Voice Conversion: Built upon Clustering-Based Phoneme Representations

Ablation study 1

A dataset with more samples per speaker improves the model's ability to suppress timbre leakage. In large-scale datasets where each speaker contributes only 1-2 samples, however, timbre leakage becomes much more pronounced.

Target Speaker | Source Wav | Result with pitch conversion | Result without pitch conversion
Dataset A
Dataset B
KM4096 + Dataset B

Ablation study 2

Our proposed method can resolve the timbre leakage issue, but the HuBERT layer whose features are used must still be chosen.
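This page does not spell out the clustering step, so the following is only a minimal sketch of one plausible reading: the label "KM4096" is assumed here to mean k-means with 4096 clusters, and each frame of HuBERT features is assumed to be replaced by its nearest cluster centroid so that speaker-specific detail is discarded. The function names `quantize_frames` and `replace_with_centroids` are hypothetical, and the toy 2-D vectors stand in for real HuBERT features.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_frames(frames: Sequence[Vector], centroids: Sequence[Vector]) -> List[int]:
    """Return, for each frame, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(frame, centroids[k]))
        for frame in frames
    ]


def replace_with_centroids(frames: Sequence[Vector], centroids: Sequence[Vector]) -> List[List[float]]:
    """Swap every frame for its nearest centroid (assumed de-timbring step)."""
    return [list(centroids[i]) for i in quantize_frames(frames, centroids)]


# Toy data: 3 centroids in 2-D stand in for 4096 centroids in HuBERT feature space.
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
frames = [(0.1, -0.2), (0.9, 1.2), (4.0, 4.5), (1.1, 0.8)]

ids = quantize_frames(frames, centroids)
quantized = replace_with_centroids(frames, centroids)
```

In this reading, comparing H22 with H24 amounts to changing which layer supplies `frames` before the same quantization is applied; in practice the centroids would be fitted beforehand with a k-means implementation on features from the chosen layer.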

Source Wav | Singing reconstructed only from HuBERT features
H22
H24
Target Speaker | Source Wav | Result with pitch conversion | Result without pitch conversion
H22+KM4096
H24+KM4096

Final Result

Target Speaker | Source Wav | Result with pitch conversion | Result without pitch conversion
Our Proposal (H22+KM4096)
ContentVC + SO-VITS-SVC
FreeVC