Towards an Interpretable Representation of Speaker Identity via Perceptual Voice Qualities
This work addresses the challenge of interpretable speech analysis for researchers and practitioners. It offers a novel intermediary representation between high-level demographics and low-level features, although the contribution is incremental in that it builds on the existing CAPE-V protocol.
The paper tackles the problem of interpreting speech by proposing a perceptual voice quality (PQ) representation of speaker identity. It demonstrates that ensembles of non-experts can perceive these qualities and that the qualities are predictable from various speech representations.
Unlike other data modalities such as text and vision, speech does not lend itself to easy interpretation. While lay people can describe an image or a sentence from perception alone, non-expert descriptions of speech often end at high-level demographic information, such as gender or age. In this paper, we propose a possible interpretable representation of speaker identity based on perceptual voice qualities (PQs). By adding gendered PQs to the pathology-focused Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) protocol, our PQ-based approach provides a perceptual latent space of the character of adult voices that sits at an intermediate level of abstraction between high-level demographics and low-level acoustic, physical, or learned representations. Contrary to prior belief, we demonstrate that these PQs are hearable by ensembles of non-experts, and further demonstrate that the information encoded in a PQ-based representation is predictable from various speech representations.