Small Models, Big Results: Achieving Superior Intent Extraction through Decomposition
This work addresses the need for privacy-preserving, low-cost intent understanding in intelligent agents, though it appears incremental because it builds on existing decomposition and fine-tuning techniques.
The paper tackles accurate intent extraction from UI interaction trajectories with on-device models. It introduces a decomposed approach, structured summarization of each interaction followed by fine-tuned extraction over the summaries, and reports performance that surpasses the base performance of large multi-modal language models.
Understanding user intents from UI interaction trajectories remains a challenging yet crucial frontier in intelligent agent development. Massive, datacenter-based multi-modal large language models (MLLMs) have greater capacity to handle the complexities of such sequences, whereas smaller models, which can run on-device to provide a privacy-preserving, low-cost, and low-latency user experience, struggle with accurate intent inference. We address these limitations by introducing a novel decomposed approach. First, we perform structured interaction summarization, capturing key information from each user action. Second, we perform intent extraction using a fine-tuned model operating on the aggregated summaries. This method improves intent understanding in resource-constrained models, even surpassing the base performance of large MLLMs.
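The two-stage decomposition described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Interaction` and `InteractionSummary` fields, the function names, and the prompt layout are all hypothetical, and both model calls are replaced by placeholders (a pass-through summarizer and a caller-supplied `intent_model` callable standing in for the fine-tuned extractor).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Interaction:
    """One step of a UI trajectory (hypothetical fields)."""
    screen_description: str  # stand-in for a screenshot or view hierarchy
    action: str              # e.g. "tap 'Search'"


@dataclass
class InteractionSummary:
    """Structured summary of a single step (hypothetical schema)."""
    screen_context: str
    user_action: str


def summarize_interaction(step: Interaction) -> InteractionSummary:
    # Stage 1 placeholder: in the described approach a small model would
    # produce this summary; here the fields are simply copied and trimmed.
    return InteractionSummary(
        screen_context=step.screen_description.strip(),
        user_action=step.action.strip(),
    )


def aggregate(summaries: List[InteractionSummary]) -> str:
    # Join per-step summaries into one numbered text block for stage 2.
    return "\n".join(
        f"{i}. [{s.screen_context}] {s.user_action}"
        for i, s in enumerate(summaries, start=1)
    )


def extract_intent(
    trajectory: List[Interaction],
    intent_model: Callable[[str], str],
) -> str:
    # Stage 2: the intent model (assumed fine-tuned) sees only the
    # aggregated summaries, never the raw screens.
    summaries = [summarize_interaction(step) for step in trajectory]
    return intent_model(aggregate(summaries))


if __name__ == "__main__":
    trajectory = [
        Interaction("Flight search home", "type 'SFO to JFK'"),
        Interaction("Results list", "tap cheapest flight"),
    ]
    # Dummy stand-in for the fine-tuned extractor: echoes its input.
    print(extract_intent(trajectory, intent_model=lambda text: text))
```

The point of the split is that each model call handles a small, well-scoped input: the summarizer looks at one interaction at a time, and the extractor reads a compact text summary instead of the full multi-modal trajectory.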