CL, SD, AS · Jul 5, 2024

TokenVerse: Towards Unifying Speech and NLP Tasks via Transducer-based ASR

arXiv:2407.04444v2 · 25 citations · h-index: 31 · Has Code
AI Analysis

This work targets the inefficiency of cascaded conversational AI pipelines, in which speech is first transcribed and then passed through separate NLP models, by folding transcription and downstream tagging into a single model.

The paper introduces TokenVerse, a single Transducer-based model that unifies ASR with three NLP-style tasks: speaker change detection, endpointing, and NER. It improves ASR by up to 7.7% relative WER and outperforms the cascaded pipeline on the individual tasks.
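
For reference, a "relative" WER improvement is conventionally the reduction in word error rate expressed as a fraction of the baseline WER, not in absolute percentage points. The block below sketches that standard definition; the symbol names and the example numbers are illustrative assumptions and are not taken from the paper.

```latex
\[
\Delta_{\text{rel}}
  = \frac{\mathrm{WER}_{\text{baseline}} - \mathrm{WER}_{\text{new}}}
         {\mathrm{WER}_{\text{baseline}}} \times 100\%
\]
% Illustrative numbers only (not from the paper):
% a drop from 10.0% to 9.0% WER is 1.0 point absolute, i.e. 10% relative.
```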

In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achieved by integrating task-specific tokens into the reference text during ASR model training, streamlining the inference and eliminating the need for separate NLP models. In addition to ASR, we conduct experiments on 3 different tasks: speaker change detection, endpointing, and NER. Our experiments on a public and a private dataset show that the proposed method improves ASR by up to 7.7% in relative WER while outperforming the cascaded pipeline approach in individual task performance. Our code is publicly available: https://github.com/idiap/tokenverse-unifying-speech-nlp
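
The idea of integrating task-specific tokens into the reference text can be illustrated with a minimal sketch. It is not the authors' implementation: the token spellings (`[SCD]`, `[ENDP]`, `[NE]`, `[/NE]`), the input layout, the helper names `augment_reference` and `parse_hypothesis`, and the simplification of one endpoint per speaker turn are all assumptions made for illustration. The first function builds an augmented training reference; the second recovers the plain transcript and the task outputs from a decoded hypothesis.

```python
# Hypothetical token inventory, chosen for illustration only.
SCD = "[SCD]"        # speaker change
ENDP = "[ENDP]"      # end of utterance (endpoint)
NE_OPEN = "[NE]"     # start of a named entity
NE_CLOSE = "[/NE]"   # end of a named entity


def augment_reference(turns):
    """Build one training reference with task tokens inserted.

    `turns` is a list of dicts (an assumed layout), each with:
      "words":    list of word strings for one speaker turn
      "entities": list of (start, end) word-index spans, end exclusive
    """
    out = []
    for i, turn in enumerate(turns):
        if i > 0:
            out.append(SCD)  # a new turn implies a speaker change
        starts = {s for s, _ in turn["entities"]}
        ends = {e for _, e in turn["entities"]}
        for j, word in enumerate(turn["words"]):
            if j in starts:
                out.append(NE_OPEN)
            out.append(word)
            if j + 1 in ends:
                out.append(NE_CLOSE)
        out.append(ENDP)  # simplification: one endpoint per turn
    return " ".join(out)


def parse_hypothesis(hyp):
    """Split a decoded hypothesis into plain text and task outputs."""
    words, entities, speaker_changes, endpoints = [], [], [], []
    current = None  # words of the entity being read, or None
    for tok in hyp.split():
        if tok == SCD:
            speaker_changes.append(len(words))  # word index of the change
        elif tok == ENDP:
            endpoints.append(len(words))
        elif tok == NE_OPEN:
            current = []
        elif tok == NE_CLOSE:
            if current is not None:
                entities.append(" ".join(current))
            current = None
        else:
            words.append(tok)
            if current is not None:
                current.append(tok)
    return {
        "text": " ".join(words),
        "entities": entities,
        "speaker_changes": speaker_changes,
        "endpoints": endpoints,
    }


turns = [
    {"words": ["hi", "this", "is", "john", "smith"], "entities": [(3, 5)]},
    {"words": ["how", "can", "i", "help"], "entities": []},
]
reference = augment_reference(turns)
parsed = parse_hypothesis(reference)
```

Because the task labels live in the same token stream as the words, a single decoding pass yields the transcript and the task outputs together, which is what removes the need for separate downstream NLP models in this setup.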
