Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis

Yukiya Hono (1, 2), Kazuna Tsuboi (2), Kei Sawada (2), Kei Hashimoto (1), Keiichiro Oura (1), Yoshihiko Nankaku (1), Keiichi Tokuda (1)
(1) Nagoya Institute of Technology
(2) Microsoft Development Co., Ltd.

This paper proposes a novel framework for expressive speech synthesis based on a hierarchical multi-grained generative model. The model represents fine-grained latent variables while accounting for hierarchical linguistic structure, temporal coherence, and the input text. Word-level latent variables are sampled step by step, conditioned on the phrase-level latent variables, which are in turn conditioned on the utterance-level latent variables. As a result, the speaking style of the entire utterance can be controlled even though the latent variables are fine-grained. In this experiment, the latent variables were set to 2 dimensions, so the utterance-level latent variables can be visualized directly and their values can be specified manually at the synthesis stage. Thus, you can control the speaking style of synthesized speech intuitively while viewing the latent space.
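The step-by-step sampling order described above (utterance, then phrase, then word) can be illustrated with a minimal sketch. This is not the authors' model: the function names, the `scale` parameter, and the isotropic Gaussian priors centred on the parent latent are all assumptions made for illustration, and the averaging with the previous word is only a crude stand-in for temporal coherence. The actual model would use learned conditional priors that also depend on the input text.

```python
import random

LATENT_DIM = 2  # the latent variables are 2-dimensional in this experiment


def sample_around(mean, scale, rng):
    """Draw one latent vector from an isotropic Gaussian centred on `mean`."""
    return [rng.gauss(m, scale) for m in mean]


def sample_hierarchy(utterance_latent, words_per_phrase, scale=0.1, seed=0):
    """Ancestral sampling in the order utterance -> phrase -> word.

    ASSUMPTION: every level is a Gaussian centred on its parent latent, and
    each word is also pulled toward the previous word. These are illustrative
    placeholders, not the learned, text-conditioned priors of the paper.
    """
    rng = random.Random(seed)
    phrases = []
    prev_word = None
    for n_words in words_per_phrase:
        # Phrase-level latent is sampled given the utterance-level latent.
        phrase_latent = sample_around(utterance_latent, scale, rng)
        words = []
        for _ in range(n_words):
            if prev_word is None:
                mean = phrase_latent
            else:
                # Mix the phrase latent with the previous word latent.
                mean = [0.5 * (p + w) for p, w in zip(phrase_latent, prev_word)]
            prev_word = sample_around(mean, scale, rng)
            words.append(prev_word)
        phrases.append({"phrase": phrase_latent, "words": words})
    return phrases


# A manually chosen utterance-level point, e.g. read off a 2-D scatter plot.
result = sample_hierarchy([1.0, -0.5], words_per_phrase=[3, 2])
```

Because every lower-level latent is drawn around its parent, moving the single utterance-level point shifts all phrase-level and word-level latents with it, which is the intuition behind controlling the whole utterance from one 2-dimensional value.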

You can play the synthesized speech samples by clicking on the dots in the scatter plots.