TICL: Text-Embedding KNN For Speech In-Context Learning Unlocks Speech Recognition Abilities of Large Multimodal Models
This work addresses the underexplored problem of selecting in-context examples for speech recognition and offers a simple method that improves performance on accented, multilingual, and children's speech. The contribution is incremental, however, because it builds on existing SICL and KNN techniques.
The paper tackles the problem of selecting effective in-context examples for Speech In-Context Learning (SICL) by proposing TICL, a text-embedding KNN pipeline that improves the speech recognition of off-the-shelf large multimodal models without fine-tuning, achieving up to 84.7% relative WER reduction over zero-shot performance across challenging tasks.
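To make the headline metric concrete: relative WER reduction is conventionally computed as the drop in word error rate divided by the baseline (here, zero-shot) word error rate. The sketch below assumes that standard definition. The function name and the numbers in the example are illustrative and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the fractional WER reduction relative to the baseline.

    Assumes the conventional definition (baseline - new) / baseline.
    Both arguments must use the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative numbers only (not results from the paper):
# a zero-shot WER of 20.0% falling to 5.0% is a 75% relative reduction.
example_reduction = relative_wer_reduction(20.0, 5.0)
print(f"{example_reduction:.1%}")
```

Under this definition, a relative reduction of 84.7% means the remaining WER is about 15.3% of the zero-shot WER for that task.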
Speech foundation models have recently demonstrated the ability to perform Speech In-Context Learning (SICL). Selecting effective in-context examples is crucial for SICL performance, yet selection methodologies remain underexplored. In this work, we propose Text-Embedding KNN for SICL (TICL), a simple pipeline that uses semantic context to enhance off-the-shelf large multimodal models' speech recognition ability without fine-tuning. Across challenging automatic speech recognition tasks, including accented English, multilingual speech, and children's speech, our method enables models to surpass zero-shot performance with up to 84.7% relative WER reduction. We conduct ablation studies to show the robustness and efficiency of our method.
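The retrieval step named in the abstract can be sketched as a nearest-neighbour search over text embeddings. This is a minimal illustration, not the authors' implementation. It assumes that each candidate example's transcript has already been embedded, that a query embedding is available for the test utterance (for instance from a draft transcript, which is an assumption here), and that cosine similarity is the ranking measure. The function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_in_context_examples(
    query_embedding: Sequence[float],
    candidate_embeddings: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return indices of the k candidates most similar to the query.

    Ties are broken by the lower candidate index so the result is deterministic.
    """
    scored = [
        (cosine_similarity(query_embedding, emb), idx)
        for idx, emb in enumerate(candidate_embeddings)
    ]
    # Sort by descending similarity, then ascending index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy 2-D "embeddings" standing in for real text-embedding vectors (assumption).
candidates = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]
selected = select_in_context_examples(query, candidates, k=2)
print(selected)
```

The returned indices would then pick the audio-transcript pairs placed in the model's prompt ahead of the test utterance. In practice the vectors would come from a text-embedding model rather than being written by hand, and a vectorized or approximate nearest-neighbour library would likely replace the pure-Python loop for large candidate pools.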