In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions

Xulin Fan, Vishal Sunder, Samuel Thomas, Mark Hasegawa-Johnson, Brian Kingsbury, George Saon

arXiv:2604.22817

AI Analysis

For ASR applications requiring precise word timestamps (e.g., captioning, media search), this work provides an efficient, unified approach that needs no external alignment tools.

This work extends a speech-aware language model to predict word-level timestamps directly alongside transcripts, introducing lightweight training strategies that improve alignment robustness and ASR performance. Experiments show gains in both timestamp accuracy and overall recognition quality.

Recent advances in speech-aware language models have coupled strong acoustic encoders with large language models, enabling systems that move beyond transcription to produce richer outputs. Among these, word-level timestamp prediction is critical for applications such as captioning, media search, and multimodal synchronization, yet it is often handled by external alignment tools. In this work, we extend an existing speech-aware language model to predict timestamps directly alongside transcripts. We introduce a set of novel lightweight training strategies that improve alignment robustness while preserving recognition quality. Experiments across multiple datasets show that these strategies not only enhance timestamp accuracy, but also yield gains in overall ASR performance. Together, they demonstrate an efficient and unified approach to speech recognition with precise timestamp prediction.
