SD AI AS | May 14, 2025

Introducing voice timbre attribute detection

Jinghao He, Zhengyan Sheng, Liping Chen, Kong Aik Lee, Zhen-Hua Ling

arXiv:2505.09661v29.34 citations | h-index: 4 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of interpreting voice timbre in speech, with applications in audio processing and human-computer interaction. It is an incremental advance because it builds on existing speaker embedding methods.

The paper introduces voice timbre attribute detection (vTAD), a task that explains the timbre of speech signals through sensory attributes describing how humans perceive it. Experiments show that the ECAPA-TDNN speaker encoder performs better on seen speakers, while the FACodec speaker encoder performs better on unseen speakers, indicating that FACodec generalizes better.

This paper focuses on explaining the timbre conveyed by speech signals and introduces a task termed voice timbre attribute detection (vTAD). In this task, voice timbre is explained with a set of sensory attributes describing its human perception. A pair of speech utterances is processed, and their intensity is compared in a designated timbre descriptor. Moreover, a framework is proposed, which is built upon the speaker embeddings extracted from the speech utterances. The investigation is conducted on the VCTK-RVA dataset. Experimental examinations on the ECAPA-TDNN and FACodec speaker encoders demonstrated that: 1) the ECAPA-TDNN speaker encoder was more capable in the seen scenario, where the testing speakers were included in the training set; 2) the FACodec speaker encoder was superior in the unseen scenario, where the testing speakers were not part of the training, indicating enhanced generalization capability. The VCTK-RVA dataset and open-source code are available on the website https://github.com/vTAD2025-Challenge/vTAD.
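The pairwise comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `compare_intensity`, the embedding size, the number of descriptors, and the single linear scoring layer are all assumptions made for illustration, and the random vectors stand in for speaker embeddings that would come from an encoder such as ECAPA-TDNN or FACodec.

```python
import math
import random


def compare_intensity(emb_a, emb_b, weights, bias):
    """Hypothetical scorer: for each timbre descriptor, return the
    probability that utterance B is stronger than utterance A."""
    # Concatenate the two speaker embeddings into one pair vector.
    pair = list(emb_a) + list(emb_b)
    probs = []
    for row, b in zip(weights, bias):
        # One linear unit per descriptor, followed by a sigmoid.
        logit = sum(w * x for w, x in zip(row, pair)) + b
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


# Assumed sizes, chosen only for the example.
dim, num_descriptors = 8, 3
rng = random.Random(0)

# Stand-ins for speaker embeddings of the two utterances.
emb_a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
emb_b = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Untrained, randomly initialised parameters (a real model would learn these).
weights = [[rng.gauss(0.0, 0.1) for _ in range(2 * dim)]
           for _ in range(num_descriptors)]
bias = [0.0] * num_descriptors

probs = compare_intensity(emb_a, emb_b, weights, bias)
for i, p in enumerate(probs):
    print(f"descriptor_{i}: P(B stronger than A) = {p:.3f}")
```

In a setup like this, a probability above 0.5 for a designated descriptor would be read as "utterance B is stronger than utterance A in that descriptor". The seen and unseen scenarios would differ only in whether the test speakers also contributed utterances to training.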
