Streaming Sequence Transduction through Dynamic Compression
This work addresses the challenge of jointly optimizing latency, memory footprint, and quality in streaming tasks such as speech-to-text, though it appears incremental because it builds on existing Transformer-based methods.
The paper tackles efficient sequence-to-sequence transduction over streams by introducing STAR, a Transformer-based model that dynamically segments input streams into compressed anchor representations. It achieves nearly lossless 12x compression in Automatic Speech Recognition (ASR) and outperforms existing methods.
We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. STAR dynamically segments input streams to create compressed anchor representations, achieving nearly lossless compression (12x) in Automatic Speech Recognition (ASR) and outperforming existing methods. Moreover, STAR demonstrates superior segmentation and latency-quality trade-offs in simultaneous speech-to-text tasks, optimizing latency, memory footprint, and quality.
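To make the idea of segmenting a stream and keeping one compressed representation per segment concrete, here is a minimal sketch. It is not STAR's actual mechanism, which the abstract does not specify. The accumulate-and-fire boundary rule, the mean-pooled anchor, and the names `segment_stream` and `compress_to_anchors` are all assumptions made for illustration. The toy scores are chosen so that every 12 frames collapse into one anchor, mirroring the 12x ratio reported above.

```python
from typing import List, Sequence


def segment_stream(scores: Sequence[float], threshold: float) -> List[int]:
    """Return the index of the last frame of each segment.

    Assumption: a segment closes once the running sum of per-frame
    scores reaches `threshold` (an accumulate-and-fire rule chosen
    for illustration, not taken from the paper).
    """
    boundaries: List[int] = []
    accumulated = 0.0
    for index, score in enumerate(scores):
        accumulated += score
        if accumulated >= threshold:
            boundaries.append(index)
            accumulated -= threshold  # carry any overshoot into the next segment
    return boundaries


def compress_to_anchors(
    frames: Sequence[Sequence[float]],
    scores: Sequence[float],
    threshold: float,
) -> List[List[float]]:
    """Collapse each closed segment into one anchor vector.

    Assumption: the anchor is the mean of the segment's frames.
    Trailing frames that have not yet closed a segment are left out,
    as they would still be buffered in a streaming setting.
    """
    anchors: List[List[float]] = []
    start = 0
    for end in segment_stream(scores, threshold):
        segment = frames[start : end + 1]
        dim = len(segment[0])
        anchors.append(
            [sum(frame[d] for frame in segment) / len(segment) for d in range(dim)]
        )
        start = end + 1
    return anchors


# Toy input: 24 two-dimensional frames with a constant hypothetical score.
frames = [[float(i), float(2 * i)] for i in range(24)]
scores = [0.25] * 24  # 12 frames * 0.25 reaches the threshold of 3.0

anchors = compress_to_anchors(frames, scores, threshold=3.0)
compression_ratio = len(frames) / len(anchors)
```

In this toy setting the downstream decoder would attend to 2 anchors instead of 24 frames, which is where the memory and latency savings would come from. In a learned model the scores would be predicted from the input, so segment lengths would vary with the content.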