Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation
This work addresses speaker verification in speech-aware LLMs, targeting applications that require both natural-language interaction and speaker-identity processing. It is an incremental improvement that augments existing models.
The paper asks whether speech-aware large language models (LLMs) encode speaker identity, and proposes a model-agnostic scoring protocol for evaluation together with a lightweight augmentation method. Baseline LLMs show weak speaker discrimination, with equal error rates (EERs) above 20% on VoxCeleb1, while the augmented ECAPA-LLM achieves 1.03% EER on VoxCeleb1-E, approaching the performance of a dedicated system.
Speech-aware large language models (LLMs) can accept speech inputs, yet their training objectives largely emphasize linguistic content or specific attributes such as emotion or speaker gender, leaving it unclear whether they encode speaker identity. First, we propose a model-agnostic scoring protocol that produces continuous verification scores for both API-only and open-weight models, using confidence scores or log-likelihood ratios from the Yes/No token probabilities. Using this protocol, we benchmark recent speech-aware LLMs and observe weak speaker discrimination (EERs above 20% on VoxCeleb1). Second, we introduce a lightweight augmentation that equips an LLM with automatic speaker verification (ASV) capability by injecting frozen ECAPA-TDNN speaker embeddings through a learned projection and training only LoRA adapters. On TinyLLaMA-1.1B, the resulting ECAPA-LLM achieves 1.03% EER on VoxCeleb1-E, approaching a dedicated speaker verification system while preserving a natural-language interface.
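
As an illustration of the scoring protocol, the sketch below turns the Yes/No answer-token log-probabilities of a single verification trial into a continuous score and then computes an EER over a set of trials. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (llr_score, yes_confidence, equal_error_rate) are hypothetical, the log-probabilities are assumed to have already been read from the model's output for the prompt "same speaker? Yes/No", and renormalizing over the two answer tokens is only one plausible way to obtain a confidence-style score.

```python
import math

def llr_score(logp_yes: float, logp_no: float) -> float:
    """Log-likelihood ratio between the 'Yes' and 'No' answer tokens.

    Assumption: both values are natural-log probabilities taken from the
    model's next-token distribution at the answer position.
    """
    return logp_yes - logp_no

def yes_confidence(logp_yes: float, logp_no: float) -> float:
    """Probability of 'Yes' renormalized over the two answer tokens.

    Equivalent to a sigmoid of the LLR; computed with a max shift so the
    exponentials cannot overflow.
    """
    m = max(logp_yes, logp_no)
    e_yes = math.exp(logp_yes - m)
    e_no = math.exp(logp_no - m)
    return e_yes / (e_yes + e_no)

def equal_error_rate(scores, labels) -> float:
    """Approximate EER by sweeping a threshold over the observed scores.

    labels: 1 for same-speaker (target) trials, 0 for different-speaker
    (non-target) trials. A trial is accepted when score >= threshold.
    Returns the mean of the false-accept and false-reject rates at the
    threshold where the two rates are closest.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer = float("inf"), 1.0
    for thr in sorted(set(scores)) + [float("inf")]:
        false_accepts = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= thr)
        false_rejects = sum(1 for s, l in zip(scores, labels) if l == 1 and s < thr)
        far = false_accepts / n_nontarget
        frr = false_rejects / n_target
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy trials: (log P('Yes'), log P('No'), label). Values are made up.
trials = [
    (math.log(0.8), math.log(0.2), 1),
    (math.log(0.6), math.log(0.4), 1),
    (math.log(0.3), math.log(0.7), 0),
    (math.log(0.1), math.log(0.9), 0),
]
scores = [llr_score(y, n) for y, n, _ in trials]
labels = [label for _, _, label in trials]
toy_eer = equal_error_rate(scores, labels)
```

The threshold sweep is quadratic in the number of trials, which is acceptable for a sketch; a full benchmark on VoxCeleb1 trial lists would more likely use a sorted single-pass or ROC-based computation.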