The Binding Effect: Analyzing How Multi-Dimensional Cues Form Gender Bias in Instruction TTS
This work addresses bias risks in generative speech that affect both users and developers, though its contribution is incremental: it extends existing bias analysis with a compositional approach.
The study examines gender bias in Instruction Text-to-Speech by modeling prompts as combinations of multi-dimensional social cues. It reveals systematic interaction effects that univariate methods miss and shows that generic diversity prompting fails to override these biases.
Current bias evaluations in Instruction Text-to-Speech (ITTS) often rely on univariate testing, overlooking the compositional structure of social cues. In this work, we investigate gender bias by modeling prompts as combinations of Social Status, Career stereotypes, and Persona descriptors. Analyzing open-source ITTS models, we uncover systematic interaction effects where social dimensions modulate one another, creating complex bias patterns missed by univariate baselines. Crucially, our findings indicate that these biases extend beyond surface-level artifacts, demonstrating strong associations with the semantic priors of pre-trained text encoders and the skewed distributions inherent in training data. We further demonstrate that generic diversity prompting is insufficient to override these entrenched patterns, underscoring the need for compositional analysis to diagnose latent risks in generative speech.
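To make the contrast between univariate and compositional testing concrete, the following is a minimal sketch, not the authors' pipeline. It builds a full factorial grid of prompts over the three dimensions named above and computes a difference-in-differences interaction contrast from per-prompt male-voice rates. The cue levels, the prompt template, and the function names are all hypothetical, and the rates are expected to come from some external gender classifier run on the synthesized audio, which is assumed here and not shown.

```python
from itertools import product

# Hypothetical cue levels: the abstract names the three dimensions,
# but these specific values are illustrative assumptions.
STATUS = ["low-status", "high-status"]
CAREER = ["nurse", "engineer"]
PERSONA = ["gentle", "assertive"]


def build_prompts(template="A {status} {career} with a {persona} manner is speaking."):
    """Return {(status, career, persona): prompt} for the full factorial grid."""
    return {
        combo: template.format(status=combo[0], career=combo[1], persona=combo[2])
        for combo in product(STATUS, CAREER, PERSONA)
    }


def marginal_rate(rates, axis, level):
    """Univariate view: mean rate over all cells where dimension `axis` equals `level`."""
    cells = [r for combo, r in rates.items() if combo[axis] == level]
    return sum(cells) / len(cells)


def interaction_contrast(rates, axis_a, levels_a, axis_b, levels_b):
    """Difference-in-differences: how much the effect of A changes across levels of B.

    A value near zero means the two cues combine additively; a value far
    from zero means one cue modulates the other.
    """
    def cell_mean(a, b):
        cells = [r for c, r in rates.items() if c[axis_a] == a and c[axis_b] == b]
        return sum(cells) / len(cells)

    a0, a1 = levels_a
    b0, b1 = levels_b
    return (cell_mean(a1, b1) - cell_mean(a0, b1)) - (cell_mean(a1, b0) - cell_mean(a0, b0))
```

Here `rates` maps each `(status, career, persona)` key from `build_prompts()` to the fraction of generated samples classified as male-voiced. Calling `interaction_contrast(rates, 0, STATUS, 1, CAREER)` then asks whether the Career effect differs between status levels, which is the kind of pattern a comparison of `marginal_rate` values alone cannot reveal.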