CLAS · Jul 4, 2022

BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model

arXiv:2207.01718v1 · 11 citations · h-index: 42
Originality: Incremental advance
AI Analysis

This work addresses a specific prosodic prediction problem for TTS systems, but it is incremental as it builds on existing transformer-based methods for prosody.

The paper tackles the prediction of contrastive focus on personal pronouns in text-to-speech synthesis, a challenging task that requires semantic and pragmatic knowledge. It finetunes a BERT model to predict quantized acoustic prominence features, evaluates its accuracy on a collected corpus of contrastive-focus utterances, and tests how controllable pronoun prominence is in a TTS model.

Several recent studies have tested the use of transformer language model representations to infer prosodic features for text-to-speech synthesis (TTS). While these studies have explored prosody in general, in this work, we look specifically at the prediction of contrastive focus on personal pronouns. This is a particularly challenging task as it often requires semantic, discursive and/or pragmatic knowledge to predict correctly. We collect a corpus of utterances containing contrastive focus and we evaluate the accuracy of a BERT model, finetuned to predict quantized acoustic prominence features, on these samples. We also investigate how past utterances can provide relevant information for this prediction. Furthermore, we evaluate the controllability of pronoun prominence in a TTS model conditioned on acoustic prominence features.
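The abstract mentions predicting "quantized acoustic prominence features" without giving the quantization scheme here. As a minimal sketch of what such a step might look like, the snippet below bins continuous per-word prominence scores into a few discrete classes that a token classifier could be trained on. The quantile-based binning, the number of bins, the example sentence, and the scores are all illustrative assumptions, not the authors' method or data.

```python
from bisect import bisect_right


def quantile_edges(values, n_bins):
    """Return n_bins - 1 cut points splitting the values into roughly equal-sized bins."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]


def quantize(value, edges):
    """Map a continuous value to a bin index in 0 .. len(edges)."""
    return bisect_right(edges, value)


# Hypothetical per-word prominence scores (made up for illustration).
words = ["I", "didn't", "take", "it", "SHE", "did"]
prominence = [0.2, 0.9, 0.6, 0.1, 1.8, 0.7]

# Assumption: three classes (low / medium / high prominence).
edges = quantile_edges(prominence, n_bins=3)
labels = [quantize(p, edges) for p in prominence]

for word, label in zip(words, labels):
    print(f"{word}\t{label}")
```

In this toy example the contrastively focused pronoun "SHE" lands in the highest bin, which is the kind of discrete target a finetuned language model could then be asked to predict from text alone.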
