CL HC AS Jun 13, 2024

Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

arXiv:2406.09617v1, 7 citations
Originality: Incremental advance
AI Analysis

This work addresses the problem of efficient multimodal adaptation for device-directed speech detection, offering a parameter-efficient solution with robustness to missing data, though it is incremental in improving existing adaptation methods.

The paper tackles the challenge of adapting pre-trained unimodal large language models to multimodal tasks like device-directed speech detection, proposing a Fusion Low Rank Adaptation (FLoRA) technique that achieves a 22% relative reduction in equal error rate over text-only methods and matches full fine-tuning performance with fewer parameters.
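Both headline numbers are stated in terms of equal error rate, so it may help to see how that metric and a relative reduction are typically computed. The following is a minimal sketch using a simple threshold sweep; the function names `equal_error_rate` and `relative_reduction` are illustrative and not taken from the paper, and the paper's exact evaluation procedure may differ.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds over the observed scores.

    scores: higher means "more likely device-directed".
    labels: 1/True for device-directed utterances, 0/False otherwise.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = np.inf, 1.0
    # Include +inf so the "reject everything" operating point is considered.
    for t in np.append(np.unique(scores), np.inf):
        accept = scores >= t
        far = np.mean(accept[~labels])   # negatives wrongly accepted
        frr = np.mean(~accept[labels])   # positives wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline, new):
    """Relative reduction of an error metric with respect to a baseline."""
    return (baseline - new) / baseline
```

A relative reduction compares the two error rates as a ratio, so a 22% relative reduction means the new EER is 0.78 times the baseline EER, not that it dropped by 22 percentage points.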

Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low rank adaptation. For device-directed speech detection, using FLoRA, the multimodal LLM achieves a 22% relative reduction in equal error rate (EER) over the text-only approach and attains performance parity with its full fine-tuning (FFT) counterpart while needing to tune only a fraction of its parameters. Furthermore, with the newly introduced adapter dropout, FLoRA is robust to missing data, achieving a 20% lower EER and a 56% lower false accept rate than FFT. The proposed approach scales well for model sizes from 16M to 3B parameters.
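To make the parameter-efficiency claim concrete, here is a minimal sketch of generic low rank adaptation on a single linear layer, with one plausible reading of adapter dropout (randomly skipping the adapter path during training). This is not the authors' FLoRA implementation: the class `LowRankAdapter`, its parameters (`rank`, `alpha`, `drop_prob`), and the dropout behaviour are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np


class LowRankAdapter:
    """A frozen linear layer W plus a trainable low-rank update B @ A.

    Hypothetical illustration: only A and B would be tuned, W stays fixed.
    """

    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for a pre-trained weight matrix (frozen during adaptation).
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
        # Low-rank factors: A is small random, B starts at zero so the
        # adapted layer initially matches the pre-trained one.
        self.A = rng.standard_normal((rank, d_in)) * 0.01
        self.B = np.zeros((d_out, rank))
        self.scale = alpha / rank

    def forward(self, x, drop_prob=0.0, rng=None):
        base = x @ self.W.T
        # Assumed form of adapter dropout: with probability drop_prob,
        # bypass the adapter so the model also learns to work without it.
        if rng is not None and rng.random() < drop_prob:
            return base
        return base + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_fraction(self):
        """Share of this layer's parameters that would be tuned."""
        adapter = self.A.size + self.B.size
        return adapter / (adapter + self.W.size)
```

With a 64-by-64 layer and rank 4, the adapter holds 512 parameters against 4096 frozen ones, which is the sense in which only a fraction of the model needs tuning.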
