AS · CL · LG · SD · Apr 14

In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions

arXiv:2604.22817
AI Analysis

For ASR applications that require precise word timestamps (e.g., captioning, media search), this work provides an efficient, unified approach that needs no external alignment tools.

This work extends a speech-aware language model to predict word-level timestamps directly alongside transcripts, introducing lightweight training strategies that improve alignment robustness and ASR performance. Experiments show gains in both timestamp accuracy and overall recognition quality.

Recent advances in speech-aware language models have coupled strong acoustic encoders with large language models, enabling systems that move beyond transcription to produce richer outputs. Among these, word-level timestamp prediction is critical for applications such as captioning, media search, and multimodal synchronization, yet it is often handled by external alignment tools. In this work, we extend an existing speech-aware language model to predict timestamps directly alongside transcripts. We introduce a set of novel lightweight training strategies that improve alignment robustness while preserving recognition quality. Experiments across multiple datasets show that these strategies not only enhance timestamp accuracy, but also yield gains in overall ASR performance. Together, they demonstrate an efficient and unified approach to speech recognition with precise timestamp prediction.
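To make "timestamp accuracy" concrete, here is a minimal sketch of one common way to score word-level timestamps. It is not the paper's evaluation protocol: the `TimedWord` type, the `timestamp_accuracy` function, the position-wise word matching, and the 50 ms default tolerance are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedWord:
    """Hypothetical container for one word and its boundaries in seconds."""
    word: str
    start: float
    end: float


def boundary_errors(ref, hyp):
    """Absolute start/end errors for words that match at the same position.

    Assumption: ref and hyp are already aligned word-by-word. Positions where
    the words differ (recognition errors) are skipped, since a timing error
    is not well defined for them.
    """
    errors = []
    for r, h in zip(ref, hyp):
        if r.word.lower() != h.word.lower():
            continue
        errors.append(abs(r.start - h.start))
        errors.append(abs(r.end - h.end))
    return errors


def timestamp_accuracy(ref, hyp, tolerance=0.05):
    """Return (mean absolute boundary error, fraction of boundaries within tolerance)."""
    errs = boundary_errors(ref, hyp)
    if not errs:
        return 0.0, 0.0
    mean_err = sum(errs) / len(errs)
    within = sum(e <= tolerance for e in errs) / len(errs)
    return mean_err, within


# Illustrative values only, not data from the paper.
reference = [TimedWord("hello", 0.0, 0.5), TimedWord("world", 0.5, 1.0)]
predicted = [TimedWord("hello", 0.0, 0.5), TimedWord("world", 0.75, 1.0)]
mean_err, within = timestamp_accuracy(reference, predicted)
print(f"mean boundary error: {mean_err:.4f} s, within tolerance: {within:.0%}")
```

In this toy case three of the four boundaries match exactly and one start time is off by 0.25 s. A real evaluation would also need to align the predicted and reference word sequences first, for example by edit distance, before comparing boundaries.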

