Yukiya Hono (1, 2), Kazuna Tsuboi (2), Kei Sawada (2), Kei Hashimoto (1), Keiichiro Oura (1), Yoshihiko Nankaku (1), Keiichi Tokuda (1)
(1) Nagoya Institute of Technology
(2) Microsoft Development Co., Ltd.
This paper proposes a novel framework for expressive speech synthesis based on a hierarchical multi-grained generative model that models fine-grained latent variables while taking into account the hierarchical linguistic structure, temporal coherence, and the input text.
The word-level latent variables are sampled step by step, conditioned on the phrase-level latent variables, which are in turn conditioned on the utterance-level latent variables, so that the speaking style of the entire utterance can be controlled despite the use of fine-grained latent variables.
In our experiments, the latent variables were two-dimensional, so the utterance-level latent variables can be visualized directly and their values specified manually at the synthesis stage.
The speaking style of the synthesized speech can thus be controlled intuitively while viewing the latent space.
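
The step-by-step sampling described above can be illustrated with a minimal toy sketch. It is not the proposed model: it assumes each finer-grained latent variable is drawn from a simple isotropic Gaussian centred on its parent, whereas the paper's conditional priors are not specified here. The function names and the `phrase_scale` and `word_scale` parameters are hypothetical and chosen only for illustration.

```python
import random

LATENT_DIM = 2  # the latent variables are two-dimensional in the experiments


def sample_around(center, scale, rng):
    """Draw one latent vector from a toy Gaussian centred on `center` (assumption)."""
    return [c + scale * rng.gauss(0.0, 1.0) for c in center]


def sample_hierarchy(z_utt, words_per_phrase, rng,
                     phrase_scale=0.3, word_scale=0.1):
    """Sample phrase-level, then word-level latents from an utterance-level latent.

    z_utt: manually specified utterance-level latent (length LATENT_DIM).
    words_per_phrase: number of words in each phrase of the utterance.
    """
    # One phrase-level latent per phrase, conditioned on the utterance level.
    z_phrases = [sample_around(z_utt, phrase_scale, rng)
                 for _ in words_per_phrase]
    # One word-level latent per word, conditioned on its own phrase.
    z_words = [[sample_around(z_p, word_scale, rng) for _ in range(n)]
               for z_p, n in zip(z_phrases, words_per_phrase)]
    return z_phrases, z_words


rng = random.Random(0)
z_utt = [1.0, -0.5]  # a point picked by hand in the 2-D latent space
z_phrases, z_words = sample_hierarchy(z_utt, [3, 2, 4], rng)
```

Because every word-level latent is derived from the single hand-picked `z_utt`, moving that one point shifts all phrase-level and word-level latents together, which is the sense in which the utterance-level variable governs the style of the whole utterance in this toy setting.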