Zero-Shot Singing Voice Conversion Built upon Clustering-Based Phoneme Representations

Ablation study 1

A dataset with more samples per speaker improves the model's ability to suppress timbre leakage. In large datasets where each speaker contributes only 1-2 samples, however, timbre leakage becomes pronounced.

	Target Speaker	Source Wav	Result with pitch conversion	Result without pitch conversion
Dataset A
Dataset A
Dataset B
Dataset B
KM4096+Dataset B
KM4096+Dataset B

Ablation study 2

Our proposed method resolves the timbre leakage issue, but the result depends on which layer of HuBERT the features are taken from, so we compare features from two layers, denoted H22 and H24.
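As a rough illustration of the clustering step, the sketch below replaces each frame-level feature with its nearest cluster centroid. It assumes that "KM4096" denotes a k-means codebook with 4096 clusters fitted on HuBERT features; this page does not spell that out, and the function name `quantize_to_centroids` and the toy codebook are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np


def quantize_to_centroids(features, centroids):
    """Map each frame feature to its nearest centroid (hypothetical helper).

    features:  (T, D) array of frame-level features, e.g. from one HuBERT layer.
    centroids: (K, D) array of cluster centres, e.g. K = 4096 (assumed).
    Returns the cluster index per frame and the quantized (T, D) features.
    """
    # Squared Euclidean distance between every frame and every centroid: (T, K).
    d2 = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ centroids.T
        + (centroids ** 2).sum(axis=1)
    )
    ids = d2.argmin(axis=1)
    return ids, centroids[ids]


# Toy codebook: 8 centroids in 4 dimensions, centroid k = [k, k, k, k].
centroids = np.arange(8, dtype=float)[:, None] * np.ones((1, 4))
# Three frames lying close to centroids 3, 3 and 5.
features = centroids[[3, 3, 5]] + np.array([[0.1], [-0.2], [0.05]])

ids, quantized = quantize_to_centroids(features, centroids)
```

Because every frame assigned to the same cluster is mapped to the same vector, fine-grained variation within a cluster is discarded; under the assumption above, this is the mechanism by which quantization would remove speaker-specific detail from the content features.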

	Source Wav	Singing reconstructed from HuBERT features only
H22
H22
H24
H24

	Target Speaker	Source Wav	Result with pitch conversion	Result without pitch conversion
H22+KM4096
H22+KM4096
H24+KM4096
H24+KM4096

Final Result

	Target Speaker	Source Wav	Result with pitch conversion	Result without pitch conversion
Our Proposal H22+KM4096
Our Proposal H22+KM4096
ContentVC+SO-VITS-SVC
ContentVC+SO-VITS-SVC
FreeVC
FreeVC