Ablation study 1
A dataset with more samples per speaker improves the model's ability to suppress timbre leakage. In large-scale user datasets where each speaker contributes only 1-2 samples, however, timbre leakage becomes much more prominent.
Configuration | Target Speaker | Source Wav | Result with pitch conversion | Result without pitch conversion |
---|---|---|---|---|
Dataset A | ||||
Dataset B | ||||
KM4096+Dataset B | ||||
Ablation study 2
Our proposed method resolves the timbre leakage issue, but the HuBERT layer from which the features are taken must be chosen carefully.
HuBERT layer | Source Wav | Singing reconstructed from HuBERT features only |
---|---|---|
H22 | ||
H24 | ||
Configuration | Target Speaker | Source Wav | Result with pitch conversion | Result without pitch conversion |
---|---|---|---|---|
H22+KM4096 | ||||
H24+KM4096 | ||||
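
The labels in the tables above can be read as a two-step recipe, under two assumptions that this page does not spell out: that H22 and H24 mean features taken from HuBERT layer 22 or 24, and that KM4096 means a k-means codebook with 4096 centroids. The minimal sketch below illustrates that reading on toy data only. The helper names (`select_layer`, `quantize`, `dequantize`), the two-dimensional features, and the three-entry codebook are hypothetical stand-ins; a real pipeline would use actual HuBERT hidden states and a codebook trained on them.

```python
from typing import List, Sequence

Frame = Sequence[float]


def select_layer(hidden_states: Sequence[Sequence[Frame]], layer: int) -> Sequence[Frame]:
    """Pick the frame features of one layer from a per-layer stack (assumed layout)."""
    return hidden_states[layer]


def squared_distance(a: Frame, b: Frame) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Frame], codebook: Sequence[Frame]) -> List[int]:
    """Map every frame to the index of its nearest centroid."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(ids: Sequence[int], codebook: Sequence[Frame]) -> List[List[float]]:
    """Replace each index by its centroid, discarding per-frame residual detail."""
    return [list(codebook[i]) for i in ids]


# Toy stand-ins: 2 "layers", 2 frames each, 2-dim features, 3 centroids.
# (The real setting is assumed to use 4096 centroids.)
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
hidden_states = [
    [[0.1, -0.1], [0.9, 1.2]],  # toy "layer 0"
    [[3.8, 0.2], [0.2, 0.1]],   # toy "layer 1"
]

frames = select_layer(hidden_states, layer=1)
ids = quantize(frames, codebook)
content = dequantize(ids, codebook)
```

Because each frame is replaced by a shared centroid, whatever made the original frame differ from that centroid is dropped. This is one plausible reason quantization would reduce timbre leakage, although the page itself does not state the mechanism.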
Final Result
System | Target Speaker | Source Wav | Result with pitch conversion | Result without pitch conversion |
---|---|---|---|---|
Our proposal (H22+KM4096) | ||||
ContentVC + SO-VITS-SVC | ||||
FreeVC | ||||