CL SD AS Sep 20, 2024

Target word activity detector: An approach to obtain ASR word boundaries without lexicon

Sunit Sivasankaran, Eric Sun, Jinyu Li, Yan Huang, Jing Pan

arXiv:2409.13913v1 1.0 h-index: 15

Originality: Incremental advance

AI Analysis

This paper addresses scalable word boundary estimation in multilingual ASR for applications that require precise timing. The advance is incremental, as it builds on existing embedding and alignment techniques.

The paper tackles the challenge of obtaining word timestamps from end-to-end ASR models without lexicons, proposing a method that uses word embeddings from sub-word tokens and a pretrained model, validated on a multilingual ASR model across five languages with competitive results.

Obtaining word timestamp information from end-to-end (E2E) ASR models remains challenging due to the lack of explicit time alignment during training. This issue is further complicated in multilingual models. Existing methods either rely on lexicons or introduce additional tokens, leading to scalability issues and increased computational costs. In this work, we propose a new approach to estimate word boundaries without relying on lexicons. Our method leverages word embeddings from sub-word token units and a pretrained ASR model, requiring only word alignment information during training. Our proposed method can scale up to any number of languages without incurring any additional cost. We validate our approach using a multilingual ASR model trained on five languages and demonstrate its effectiveness against a strong baseline.
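
The abstract does not say how sub-word embeddings are combined into a word embedding or how a detector's output becomes a boundary, so the following is only a minimal sketch under stated assumptions: mean pooling over sub-word token embeddings, and a boundary taken as the longest run of frames whose activity score for the target word exceeds a threshold. The function names, the threshold, and the frame shift are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Tuple


def pool_subword_embeddings(token_embeddings: List[List[float]]) -> List[float]:
    """Average sub-word token embeddings into one word embedding.

    Mean pooling is an assumption; the abstract does not name the pooling method.
    """
    if not token_embeddings:
        raise ValueError("a word needs at least one sub-word token")
    dim = len(token_embeddings[0])
    count = len(token_embeddings)
    return [sum(vec[i] for vec in token_embeddings) / count for i in range(dim)]


def activity_to_boundary(
    activity: List[float],
    threshold: float = 0.5,      # hypothetical decision threshold
    frame_shift_s: float = 0.04,  # hypothetical frame shift in seconds
) -> Optional[Tuple[float, float]]:
    """Return (start, end) in seconds of the longest active run, or None.

    `activity` holds one score per frame for a single target word.
    """
    best_start, best_len = -1, 0
    run_start = -1
    # A sentinel below any threshold closes a run that reaches the last frame.
    for i, score in enumerate(activity + [float("-inf")]):
        if score >= threshold:
            if run_start < 0:
                run_start = i
        elif run_start >= 0:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = -1
    if best_len == 0:
        return None
    return best_start * frame_shift_s, (best_start + best_len) * frame_shift_s
```

For example, a word split into two sub-word tokens would be pooled into a single vector with `pool_subword_embeddings`, and the per-frame scores produced for that word would then be passed to `activity_to_boundary` to read off a start and end time. Neither step needs a pronunciation lexicon, which is the property the abstract emphasizes.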
