Authors

Suparna De
Email: s.de@surrey.ac.uk

Ionut Bostan
Email: ionut@nquiringminds.com

Nishanth Sastry
Email: n.sastry@surrey.ac.uk


Abstract

Recent studies have outlined the accessibility challenges that blind and visually impaired people face when interacting with social networks, from monotone text-to-speech (TTS) screen readers to the audio narration of visual elements such as emojis. Emotional speech generation traditionally relies on human input of the expected emotion together with the text to synthesise, with additional challenges around data simplification (causing information loss) and duration inaccuracy, leading to a lack of expressive emotional rendering. In real-life communication, phoneme durations can vary because the same sentence might be spoken in a variety of ways depending on the speaker's emotional state or accent (referred to as the one-to-many problem of text-to-speech generation). As a result, an advanced voice synthesis system is required to account for this variability. We propose an end-to-end context-aware TTS synthesis system that derives the conveyed emotion from the text input and synthesises audio focused on emotion and speaker features for natural and expressive speech, integrating advanced natural language processing (NLP) and speech synthesis techniques for real-time applications. The proposed system has two core components: an emotion classifier and a speech synthesiser. The emotion classifier uses a classification model to extract sentiment information from the input text. Leveraging a non-autoregressive neural TTS model, the speech synthesiser generates Mel-spectrograms by incorporating speaker and emotion embeddings derived from the classifier's output. A Generative Adversarial Network (GAN)-based vocoder then converts the Mel-spectrograms into audible waveforms. A key contribution lies in effectively incorporating emotional characteristics into TTS synthesis. Our system also shows competitive inference-time performance when benchmarked against state-of-the-art TTS models, making it suitable for real-time accessibility applications.

This work has been accepted for presentation at the 16th International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2024), which will be held from September 2-5, 2024, in Calabria, Italy.
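To make the pipeline concrete, the sketch below outlines the flow described in the abstract: an emotion classifier predicts an emotion label from the input text, a non-autoregressive acoustic model conditions on speaker and emotion embeddings to produce a Mel-spectrogram, and a GAN-based vocoder converts the spectrogram into a waveform. This is a minimal structural illustration in PyTorch with hypothetical module names and placeholder hyper-parameters, not the system's released implementation.

```python
import torch
import torch.nn as nn

# Illustrative hyper-parameters (assumptions, not the paper's actual values).
VOCAB, N_EMOTIONS, N_SPEAKERS, DIM, N_MELS = 10_000, 5, 10, 256, 80


class EmotionClassifier(nn.Module):
    """Stand-in for the text-based emotion classifier."""

    def __init__(self):
        super().__init__()
        self.embed = nn.EmbeddingBag(VOCAB, DIM)
        self.head = nn.Linear(DIM, N_EMOTIONS)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        return self.head(self.embed(token_ids))   # emotion logits: (batch, N_EMOTIONS)


class AcousticModel(nn.Module):
    """Stand-in for a FastSpeech 2-style non-autoregressive synthesiser
    conditioned on speaker and emotion embeddings."""

    def __init__(self):
        super().__init__()
        self.text_embed = nn.Embedding(VOCAB, DIM)
        self.emotion_embed = nn.Embedding(N_EMOTIONS, DIM)
        self.speaker_embed = nn.Embedding(N_SPEAKERS, DIM)
        self.to_mel = nn.Linear(DIM, N_MELS)

    def forward(self, token_ids, emotion_id, speaker_id):
        cond = self.emotion_embed(emotion_id) + self.speaker_embed(speaker_id)
        return self.to_mel(self.text_embed(token_ids) + cond)  # (seq_len, N_MELS)


class Vocoder(nn.Module):
    """Stand-in for a GAN-based vocoder (e.g. HiFi-GAN) mapping Mel frames to audio."""

    def __init__(self, hop_length=256):
        super().__init__()
        self.upsample = nn.Linear(N_MELS, hop_length)

    def forward(self, mel):
        return self.upsample(mel).flatten()        # crude 1-D waveform placeholder


def synthesise(token_ids, speaker_id, classifier, acoustic, vocoder):
    """Text -> predicted emotion -> Mel-spectrogram -> waveform."""
    with torch.no_grad():
        emotion_id = classifier(token_ids.unsqueeze(0)).argmax(dim=-1)
        mel = acoustic(token_ids, emotion_id, speaker_id)
        return vocoder(mel)


# Example with random weights and dummy token ids.
waveform = synthesise(torch.randint(0, VOCAB, (12,)), torch.tensor([0]),
                      EmotionClassifier(), AcousticModel(), Vocoder())
print(waveform.shape)  # torch.Size([3072])
```

In the actual system, the classifier's output conditions the acoustic model so that the rendered prosody follows the emotion inferred from the text, rather than requiring the user to supply an emotion label alongside the input.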


Demo

Welcome to the demonstration page of our Emotion-Aware Text-to-Speech Models. Below, you can listen to audio samples from different TTS models.

Each sentence below is synthesised with FastSpeech 2 [1], TEMOTTS [2], and our model:

  1. Bikes are fun to ride
  2. Dreams can come true
  3. Friends make life more fun

Emotion-Aware Samples

The following emotion-laden sentences are synthesised with the same three models:

  1. Blowing out birthday candles makes me feel special!
  2. Her heart felt heavy with sorrow
  3. I am feeling sad

References


  1. Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech," in International Conference on Learning Representations, 2021.
  2. Shreeram Suresh Chandra, Zongyang Du, and Berrak Sisman, "TEMOTTS: Text-aware Emotional Text-to-Speech with no labels," Speech & Machine Learning Lab, The University of Texas at Dallas, TX, USA, 2024.