SD, LG, AS | Jan 22

Bridging the Perception Gap: A Lightweight Coarse-to-Fine Architecture for Edge Audio Systems

arXiv:2601.15676v1 | h-index: 7
Originality: Incremental advance
AI Analysis

The paper addresses efficient, accurate audio reasoning on edge infrastructure for applications that require low latency and privacy. It is a domain-specific, incremental improvement.

The paper tackles the trade-off between perception depth and computational efficiency for Audio-Language Models on edge systems by proposing CoFi-Agent, a hybrid architecture that uses local processing with conditional cloud refinement, improving accuracy from 27.20% to 53.60% on the MMAR benchmark.

Deploying Audio-Language Models (Audio-LLMs) on edge infrastructure exposes a persistent tension between perception depth and computational efficiency. Lightweight local models tend to produce passive perception (generic summaries that miss the subtle evidence required for multi-step audio reasoning), while indiscriminate cloud offloading incurs unacceptable latency, bandwidth cost, and privacy risk. We propose CoFi-Agent (Tool-Augmented Coarse-to-Fine Agent), a hybrid architecture targeting edge servers and gateways. It performs fast local perception and triggers conditional forensic refinement only when uncertainty is detected. CoFi-Agent runs an initial single pass on a local 7B Audio-LLM, then a cloud controller gates difficult cases and issues lightweight plans for on-device tools such as temporal re-listening and local ASR. On the MMAR benchmark, CoFi-Agent improves accuracy from 27.20% to 53.60%, while achieving a better accuracy-efficiency trade-off than an always-on investigation pipeline. Overall, CoFi-Agent bridges the perception gap via tool-enabled, conditional edge-cloud collaboration under practical system constraints.
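
The control flow described in the abstract (one local pass, a gate, then a plan of on-device tool calls for hard cases only) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`cofi_answer`, `CoarseResult`, `PlanStep`, the stub functions) is hypothetical, and the confidence threshold used as the uncertainty signal is an assumption, since the abstract does not say how uncertainty is detected. The sketch also assumes the controller sees only the question and the coarse text, never the raw audio.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class CoarseResult:
    answer: str
    confidence: float  # hypothetical uncertainty signal in [0, 1]


@dataclass
class PlanStep:
    tool: str  # key into the on-device tool table
    args: Dict[str, Any] = field(default_factory=dict)


def cofi_answer(audio, question, local_model, gate, plan, finalize, tools):
    """Return (answer, refined). Refinement runs only when the gate fires."""
    coarse = local_model(audio, question)      # fast single local pass
    if not gate(question, coarse):             # easy case: keep the coarse answer
        return coarse.answer, False
    # Hard case: the controller issues a plan; the tools run on-device.
    evidence = [tools[s.tool](audio, **s.args) for s in plan(question, coarse)]
    return finalize(question, coarse, evidence), True


# Stand-in components (assumptions for the demo, not real models).
def stub_local_model(audio, question):
    # Pretend the local model is unsure on longer clips.
    return CoarseResult("a person is speaking", 0.9 if len(audio) < 5 else 0.3)


def stub_gate(question, coarse, threshold=0.6):
    return coarse.confidence < threshold


def stub_plan(question, coarse):
    return [PlanStep("relisten", {"start": 2, "end": 4}), PlanStep("asr")]


def relisten(audio, start, end):
    return f"segment[{start}:{end}] has {len(audio[start:end])} samples"


def local_asr(audio):
    return "transcript: <stub>"


def stub_finalize(question, coarse, evidence):
    return f"{coarse.answer} | evidence: {'; '.join(evidence)}"


TOOLS = {"relisten": relisten, "asr": local_asr}

easy = cofi_answer([0.0] * 3, "Who speaks?", stub_local_model,
                   stub_gate, stub_plan, stub_finalize, TOOLS)
hard = cofi_answer([0.0] * 8, "Who speaks?", stub_local_model,
                   stub_gate, stub_plan, stub_finalize, TOOLS)
print(easy)  # coarse answer returned as-is, refined == False
print(hard)  # answer augmented with tool evidence, refined == True
```

Passing the gate, planner, and tools as plain callables keeps the sketch independent of any particular model or transport. In a real deployment the gate and planner would be remote calls to the cloud controller, while the tool table would stay on the edge device.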

