The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation
For researchers in speech synthesis and voice conversion, this work reveals a critical flaw in a widely used evaluation metric, potentially invalidating prior results that relied on it.
The paper challenges the assumption that emotion embedding similarity (e.g., computed with emotion2vec) is a valid metric for evaluating emotional expressiveness in speech generation. Using adversarial tasks and human tests, it shows that the metric misaligns with human perception because of linguistic and speaker interference.
Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and in voice conversion that requires emotional prosody transfer. To quantify expressiveness, the field widely relies on emotion similarity between reference and generated samples. This approach computes the cosine similarity of embeddings from encoders such as emotion2vec, assuming that they capture affective cues despite linguistic and speaker variations. We challenge this assumption through controlled adversarial tasks and human alignment tests. Although these encoders achieve high classification accuracy, their latent spaces are unsuitable for zero-shot similarity evaluation. Representational limitations allow linguistic and speaker interference to overshadow emotional features, degrading the metric's discriminative ability. Consequently, the metric misaligns with human perception. This vulnerability shows that the metric rewards acoustic mimicry over genuine emotional synthesis.
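
As a rough illustration of the metric under discussion, the sketch below computes the cosine similarity between two utterance-level embeddings and runs one simple three-way comparison in the spirit of a controlled adversarial check. It assumes the embeddings have already been extracted by some emotion encoder (for example emotion2vec) and are available as plain vectors. The function names `emotion_similarity` and `prefers_emotion_match`, and the toy vectors, are hypothetical and chosen only for illustration; this is not the paper's actual protocol.

```python
import numpy as np


def emotion_similarity(ref_emb, gen_emb, eps=1e-12):
    """Cosine similarity between two utterance-level embeddings.

    Assumption: both inputs are 1-D vectors of equal length that some
    emotion encoder has already produced; no encoder is called here.
    """
    ref = np.asarray(ref_emb, dtype=np.float64).ravel()
    gen = np.asarray(gen_emb, dtype=np.float64).ravel()
    if ref.shape != gen.shape:
        raise ValueError("embeddings must have the same dimensionality")
    denom = np.linalg.norm(ref) * np.linalg.norm(gen)
    # Guard against division by zero for all-zero embeddings.
    return float(np.dot(ref, gen) / max(denom, eps))


def prefers_emotion_match(anchor, same_emotion, same_speaker_text):
    """Hypothetical three-way check (illustrative, not the paper's task).

    same_emotion:      shares the anchor's emotion but differs in
                       speaker and/or linguistic content.
    same_speaker_text: shares speaker and/or content with the anchor
                       but carries a different emotion.
    Returns True when the metric scores the emotion-matched sample higher.
    """
    return (emotion_similarity(anchor, same_emotion)
            > emotion_similarity(anchor, same_speaker_text))


if __name__ == "__main__":
    # Toy vectors standing in for real encoder outputs.
    anchor = [1.0, 0.0]
    emotion_matched = [1.0, 1.0]
    speaker_matched = [0.0, 1.0]
    print(emotion_similarity(anchor, emotion_matched))
    print(prefers_emotion_match(anchor, emotion_matched, speaker_matched))
```

If a check of this kind frequently returned `False` on real embeddings, that would be consistent with the interference described above: speaker or linguistic overlap would be contributing more to the similarity score than the shared emotion.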