Speech samples supporting the ICASSP 2017 paper "Adapting and Controlling DNN-based Speech Synthesis using Input Codes".
Seven models, each using a different input-code strategy, were used to generate the speech samples:
Model | Speaker Code (S) Type | Speaker Code (S) Size | Gender Code (G) Type | Gender Code (G) Size | Age Code (A) Type | Age Code (A) Size |
---|---|---|---|---|---|---|
ONE-S | One-hot | 112 | N/A | N/A | N/A | N/A |
ONE-SGA' | One-hot | 112 | One-hot | 2 | One-hot | 7 |
ONE-SGA | One-hot | 112 | Numeric | 1 | Numeric | 1 |
RND112-SGA | Random | 112 | Numeric | 1 | Numeric | 1 |
RND008-SGA | Random | 8 | Numeric | 1 | Numeric | 1 |
DCC112-SGA | DCC | 112 | Numeric | 1 | Numeric | 1 |
DCC008-SGA | DCC | 8 | Numeric | 1 | Numeric | 1 |
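As a rough illustration of how the code sizes in the table add up, the sketch below builds the input code for the ONE-SGA and ONE-SGA' configurations. The function names, the concatenation order, and the concrete gender and age values are assumptions made for illustration only; the page does not specify them.

```python
import numpy as np

def one_hot(index, size):
    """Return a 1-of-k vector of the given size with a 1 at `index`."""
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def build_code(speaker_idx, gender, age, numeric=True):
    """Concatenate speaker, gender, and age codes (order is an assumption).

    numeric=True  -> ONE-SGA style:  112 + 1 + 1 values
    numeric=False -> ONE-SGA' style: 112 + 2 + 7 values
    """
    speaker = one_hot(speaker_idx, 112)
    if numeric:
        gender_code = np.array([gender], dtype=np.float32)
        age_code = np.array([age], dtype=np.float32)
    else:
        gender_code = one_hot(gender, 2)
        age_code = one_hot(age, 7)
    return np.concatenate([speaker, gender_code, age_code])

# Hypothetical speaker 3; the gender and age values are placeholders.
code_sga = build_code(3, 1.0, 45.0, numeric=True)
code_sga_prime = build_code(3, 1, 3, numeric=False)
print(code_sga.shape, code_sga_prime.shape)  # (114,) (121,)
```

A numeric code can take any real value, including values outside the range seen in training, whereas a one-hot code can only select one of its predefined classes.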
Samples for speakers in the training set. 'Natural' is recorded speech. 'a' indicates an average value for that feature, while 'c' indicates the correct value.
System | 1 | 2 | 3 | 4 |
---|---|---|---|---|
Natural | ► Play | ► Play | ► Play | ► Play |
ONE-a | ► Play | ► Play | ► Play | ► Play |
ONE-c | ► Play | ► Play | ► Play | ► Play |
ONE-ccc' | ► Play | ► Play | ► Play | ► Play |
ONE-ccc | ► Play | ► Play | ► Play | ► Play |
RND112-ccc | ► Play | ► Play | ► Play | ► Play |
RND008-ccc | ► Play | ► Play | ► Play | ► Play |
DCC112-ccc | ► Play | ► Play | ► Play | ► Play |
DCC008-ccc | ► Play | ► Play | ► Play | ► Play |
Samples for speakers not included in the training set. 'e' indicates the estimated value of that feature, found using the back-propagation algorithm.
System | 1 | 2 | 3 | 4 |
---|---|---|---|---|
Natural | ► Play | ► Play | ► Play | ► Play |
ONE-a | ► Play | ► Play | ► Play | ► Play |
ONE-e | ► Play | ► Play | ► Play | ► Play |
ONE-ecc' | ► Play | ► Play | ► Play | ► Play |
ONE-ecc | ► Play | ► Play | ► Play | ► Play |
RND112-ecc | ► Play | ► Play | ► Play | ► Play |
RND008-ecc | ► Play | ► Play | ► Play | ► Play |
DCC112-ecc | ► Play | ► Play | ► Play | ► Play |
DCC008-ecc | ► Play | ► Play | ► Play | ► Play |
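The idea of estimating a code by back-propagation can be sketched with a toy example. The sketch assumes the common approach of freezing the model weights and updating only the input code by gradient descent on the output error. A single fixed linear layer stands in for the trained DNN here, so the gradient is written out by hand; all dimensions, data, and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained network: one frozen linear layer (assumption).
in_dim, code_dim, out_dim, num_frames = 10, 8, 20, 50
W_x = rng.normal(size=(in_dim, out_dim))    # weights for the linguistic input
W_c = rng.normal(size=(code_dim, out_dim))  # weights for the speaker code

def forward(x, code):
    # The same code vector is applied to every frame of the utterance.
    return x @ W_x + code @ W_c

# Synthetic adaptation data from a hypothetical unseen speaker.
x = rng.normal(size=(num_frames, in_dim))
true_code = rng.normal(size=code_dim)
target = forward(x, true_code)

# Start from an all-zero code and update only the code; weights stay frozen.
code = np.zeros(code_dim)
step = 1.0 / (2.0 * np.linalg.norm(W_c, 2) ** 2)  # stable step for this quadratic loss
for _ in range(2000):
    error = forward(x, code) - target         # shape (num_frames, out_dim)
    grad = 2.0 * error.mean(axis=0) @ W_c.T   # gradient of the loss w.r.t. the code
    code -= step * grad

# Loss: squared error summed over outputs, averaged over frames.
final_loss = np.mean(np.sum((forward(x, code) - target) ** 2, axis=1))
print(final_loss)
```

With a real network, the same gradient with respect to the input code would be obtained by back-propagating the output error through all layers rather than through a single matrix.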
Samples for speakers in the training set, with the gender code switched from male to female and vice versa. For model ONE-SGA', an extreme value cannot be used because its gender code is a one-hot vector.
Male | | Female | |
---|---|---|---|
Natural | ► Play | Natural | ► Play |
ONE-c | ► Play | ONE-c | ► Play |
ONE-cFc' | ► Play | ONE-cMc' | ► Play |
ONE-cFc | ► Play | ONE-cMc | ► Play |
RND112-cFc | ► Play | RND112-cMc | ► Play |
RND008-cFc | ► Play | RND008-cMc | ► Play |
DCC112-cFc | ► Play | DCC112-cMc | ► Play |
DCC008-cFc | ► Play | DCC008-cMc | ► Play |
Samples for a speaker in the training set with the age code changed. A male speaker in the 41-50 age range was chosen. For model ONE-SGA', the values 15 and 75 were used because its age code is a 1-of-k vector and cannot be assigned an extreme value.
Male (41-50) | |||
---|---|---|---|
Natural | ► Play | ||
ONE-c | ► Play | ||
ONE-cc15' | ► Play | ONE-cc75' | ► Play |
ONE-cc05 | ► Play | ONE-cc85 | ► Play |
RND112-cc05 | ► Play | RND112-cc85 | ► Play |
RND008-cc05 | ► Play | RND008-cc85 | ► Play |
DCC112-cc05 | ► Play | DCC112-cc85 | ► Play |
DCC008-cc05 | ► Play | DCC008-cc85 | ► Play |
The remaining samples were generated using the DCC008-SGA model. In this section, the values of the Speaker Code, Age Code, and Gender Code are interpolated from one value to another within a single utterance.
21-30 years old Male | |
---|---|
Sample 1 | |
Sample 2 |
61-70 years old Female | |
---|---|
Sample 1 | |
Sample 2 |
The Gender Code or Age Code was interpolated from one extreme value to another, while the other two codes were kept at their correct values.
Sample 1 | |
---|---|
DCC008-ccc | |
Gender (-2->3) | |
Age (-50->200) |
Sample 2 | |
---|---|
DCC008-ccc | |
Gender (-2->3) | |
Age (-50->200) |
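Interpolating a code within a single utterance can be sketched as follows, using the gender and age ranges listed in the tables above. The sketch assumes linear interpolation with one code value per frame; the frame count, the start and end speaker codes, and the function name are hypothetical.

```python
import numpy as np

def interpolate_code(start, end, num_frames):
    """Linearly interpolate a scalar or vector code over `num_frames` frames."""
    return np.linspace(start, end, num_frames, dtype=np.float32)

num_frames = 500  # hypothetical utterance length in frames

# Scalar codes, using the ranges from the tables: Gender (-2->3), Age (-50->200).
gender_track = interpolate_code(-2.0, 3.0, num_frames)
age_track = interpolate_code(-50.0, 200.0, num_frames)

# Vector codes work the same way, e.g. two placeholder 8-dimensional speaker codes.
speaker_a = np.zeros(8)
speaker_b = np.ones(8)
speaker_track = interpolate_code(speaker_a, speaker_b, num_frames)

print(gender_track.shape, age_track.shape, speaker_track.shape)
```

Each row of a track would then be attached to the corresponding frame of the network input, so the voice characteristics drift gradually over the utterance.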