Ablation study 1
Training on a dataset with more samples per speaker improves the model's ability to suppress timbre leakage. In large datasets with many speakers but only 1-2 samples per speaker, however, timbre leakage becomes much more pronounced.
| | Target Speaker | Source Wav | Result with pitch-convert | Result without pitch-convert |
|---|---|---|---|---|
| Dataset A | ||||
| Dataset B | ||||
| KM4096+DatasetB | ||||
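
To see which regime a dataset falls into, one can count utterances per speaker. The sketch below assumes a hypothetical manifest of `(speaker_id, wav_path)` pairs; the function names and the threshold of 2 samples are illustrative and are not taken from the project's code.

```python
from collections import Counter


def samples_per_speaker(manifest):
    """Count utterances per speaker in a list of (speaker_id, wav_path) pairs."""
    return Counter(speaker for speaker, _ in manifest)


def sparse_speaker_ratio(manifest, max_samples=2):
    """Return the fraction of speakers with at most `max_samples` utterances."""
    counts = samples_per_speaker(manifest)
    if not counts:
        return 0.0
    sparse = sum(1 for n in counts.values() if n <= max_samples)
    return sparse / len(counts)


# Hypothetical manifest: speaker "a" has 3 utterances, "b" and "c" have 1 each.
manifest = [
    ("a", "a_001.wav"), ("a", "a_002.wav"), ("a", "a_003.wav"),
    ("b", "b_001.wav"),
    ("c", "c_001.wav"),
]
print(sparse_speaker_ratio(manifest))
```

A ratio close to 1 would indicate the "1-2 samples per speaker" situation described above, where timbre leakage is reported to be most prominent.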
Ablation study 2
Our proposed method can resolve the timbre leakage issue, but the HuBERT layer from which the features are extracted still has to be chosen. The tables below compare layer 22 (H22) and layer 24 (H24).
| | Source Wav | Singing reconstructed only from HuBERT features |
|---|---|---|
| H22 | ||
| H24 | ||
| | Target Speaker | Source Wav | Result with pitch-convert | Result without pitch-convert |
|---|---|---|---|---|
| H22+KM4096 | ||||
| H24+KM4096 | ||||
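
Assuming "KM4096" denotes k-means quantization with 4096 clusters applied to the HuBERT features of the chosen layer (an interpretation, not stated explicitly on this page), the assignment step maps each feature frame to the index of its nearest centroid. The following is a minimal sketch with toy 2-D frames and three centroids. `quantize` and all data are hypothetical; a real setup would use layer-22 or layer-24 HuBERT frames and a trained 4096-entry codebook.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames, centroids):
    """Return, for each frame, the index of its nearest centroid."""
    return [
        min(range(len(centroids)),
            key=lambda k: squared_distance(frame, centroids[k]))
        for frame in frames
    ]


# Toy codebook and frames (illustrative values only).
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
frames = [(0.1, -0.2), (0.9, 1.2), (4.0, 4.5), (0.6, 0.6)]

units = quantize(frames, centroids)
print(units)
```

Each frame is reduced to a single discrete unit index, so downstream components only see the codebook entry rather than the continuous feature vector.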
Final Result
| | Target Speaker | Source Wav | Result with pitch-convert | Result without pitch-convert |
|---|---|---|---|---|
| Our Proposal H22+KM4096 | ||||
| ContentVec + SO-VITS-SVC | ||||
| FreeVC | ||||