MARRS: Multimodal Reference Resolution System
This work addresses the challenge of context-aware dialog systems for users needing on-device processing, but it appears incremental as it builds on existing multimodal and privacy-preserving approaches.
The paper tackles the problem of handling multimodal context in dialog understanding by introducing MARRS, an on-device framework that uses machine learning models for reference resolution and query rewriting, resulting in a unified, lightweight system that preserves user privacy.
Successfully handling context is essential for any dialog understanding task. This context may be conversational (relying on previous user queries or system responses), visual (relying on what the user sees, for example, on their screen), or background (based on signals such as a ringing alarm or playing music). In this work, we present an overview of MARRS, or Multimodal Reference Resolution System, an on-device framework within a Natural Language Understanding system, responsible for handling conversational, visual, and background context. In particular, we present different machine learning models for handling contextual queries: one that performs reference resolution, and one that handles context via query rewriting. We also describe how these models complement each other to form a unified, coherent, lightweight system that can understand context while preserving user privacy.
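
To make the division of labor between the two components concrete, the following is a minimal sketch and not the MARRS implementation. It assumes a toy rule-based rewriter and a toy keyword-matching resolver in place of the learned models; the names `Entity`, `rewrite_query`, `resolve_reference`, and `understand` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entity:
    """A candidate referent drawn from context (hypothetical structure)."""
    text: str    # surface form, e.g. a number shown on screen
    kind: str    # coarse type, e.g. "phone number" or "alarm"
    source: str  # "conversational", "visual", or "background"


def rewrite_query(query: str, previous_query: Optional[str]) -> str:
    """Toy stand-in for a learned query rewriter (assumption): expand an
    elliptical follow-up of the form "what about X?" using the previous query."""
    prefix = "what about "
    if (previous_query and " in " in previous_query
            and query.lower().startswith(prefix)):
        new_tail = query[len(prefix):].rstrip("?")
        head = previous_query.rsplit(" in ", 1)[0]
        return f"{head} in {new_tail}"
    return query


def resolve_reference(query: str, context: List[Entity]) -> Optional[Entity]:
    """Toy stand-in for a learned reference resolver (assumption): return the
    first context entity whose type is mentioned in the query."""
    lowered = query.lower()
    for entity in context:
        if entity.kind in lowered:
            return entity
    return None


def understand(query: str, context: List[Entity],
               previous_query: Optional[str] = None) -> Tuple[str, Optional[Entity]]:
    """Rewrite the query first, then resolve any reference against context."""
    rewritten = rewrite_query(query, previous_query)
    return rewritten, resolve_reference(rewritten, context)


context = [
    Entity("555-0100", "phone number", "visual"),
    Entity("7:00 alarm", "alarm", "background"),
]
print(understand("call that phone number", context))
print(understand("what about London?", [], "what's the weather in Paris"))
```

In this sketch the rewriter handles conversational carry-over, while the resolver links a referring expression to an on-screen or background entity. All processing is local to the function calls, which loosely mirrors the on-device constraint described above.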