Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis

Yukiya Hono¹,², Kazuna Tsuboi², Kei Sawada², Kei Hashimoto¹, Keiichiro Oura¹, Yoshihiko Nankaku¹, Keiichi Tokuda¹
¹ Nagoya Institute of Technology
² Microsoft Development Co., Ltd.

This paper proposes a novel framework for expressive speech synthesis based on a hierarchical multi-grained generative model. The model captures fine-grained latent variables while taking into account the hierarchical linguistic structure, temporal coherency, and the input text. The word-level latent variables are sampled step by step, first from the utterance-level latent variables and then from the phrase-level latent variables, so the speaking style of the entire utterance can be controlled even though the latent variables themselves are fine-grained. In this experiment, the latent variables were set to two dimensions, so the utterance-level latent variables can be visualized directly and their values specified manually at the synthesis stage. As a result, you can control the speaking style of the synthesized speech intuitively while viewing the latent space.
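
The step-by-step sampling described above can be illustrated with a toy sketch. This is not the authors' model: the paper's conditional priors depend on the input text and are presumably learned networks, whereas the sketch below assumes simple Gaussian priors with fixed, made-up standard deviations. The function name `sample_hierarchy` and the parameters `phrase_std` and `word_std` are hypothetical, and the averaging with the previous word's latent is only a stand-in for temporal coherency. Only the two-dimensional latent size comes from the text.

```python
import numpy as np

LATENT_DIM = 2  # the text states that the latent variables are 2-dimensional


def sample_hierarchy(z_utt, words_per_phrase, rng, phrase_std=0.3, word_std=0.1):
    """Toy ancestral sampling: utterance -> phrase -> word.

    Assumption: every conditional prior is a Gaussian centred on its parent,
    with a fixed standard deviation. The real model also conditions on the
    input text, which this sketch ignores.
    """
    z_utt = np.asarray(z_utt, dtype=float)
    word_latents = []
    for n_words in words_per_phrase:
        # Phrase-level latent, drawn around the utterance-level latent.
        z_phrase = z_utt + phrase_std * rng.standard_normal(LATENT_DIM)
        prev = z_phrase
        for _ in range(n_words):
            # Word-level latent, drawn around the midpoint of the phrase latent
            # and the previous word latent (a stand-in for temporal coherency).
            mean = 0.5 * (z_phrase + prev)
            z_word = mean + word_std * rng.standard_normal(LATENT_DIM)
            word_latents.append(z_word)
            prev = z_word
    return np.stack(word_latents)


rng = np.random.default_rng(0)
# A manually chosen utterance-level point, as if picked from the scatter plot.
z_words = sample_hierarchy([1.0, -0.5], words_per_phrase=[3, 2], rng=rng)
print(z_words.shape)  # (5, 2): one 2-D latent per word
```

Because every word-level latent is drawn around values that descend from the single utterance-level point, moving that point shifts all word-level latents together, which is the intuition behind controlling the style of the whole utterance from one location in the latent space.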

You can play the synthesized speech samples by clicking on the dots in the scatter plots.