CV · RO · Mar 31, 2025

COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation

arXiv:2503.24065v1 · 9 citations · h-index: 12
Originality: Incremental advance
AI Analysis

This work addresses the computational efficiency challenge in VLN for applications like home assistants, offering a method that balances performance and cost, though it appears incremental as it builds on existing transformer-based approaches with new modules.

The paper tackles the high computational cost of Vision-and-Language Navigation (VLN) by proposing COSMO, an architecture that combines state-space and transformer modules with two VLN-customized selective state space modules. It reports competitive navigation performance on the REVERIE, R2R, and R2R-CE benchmarks while significantly reducing computational cost.

Vision-and-Language Navigation (VLN) tasks have gained prominence within artificial intelligence research due to their potential application in fields like home assistants. Many contemporary VLN approaches, while based on transformer architectures, have increasingly incorporated additional components such as external knowledge bases or map information to enhance performance. These additions, while boosting performance, also lead to larger models and increased computational costs. In this paper, to achieve both high performance and low computational costs, we propose a novel architecture with the COmbination of Selective MemOrization (COSMO). Specifically, COSMO integrates state-space modules and transformer modules, and incorporates two VLN-customized selective state space modules: the Round Selective Scan (RSS) and the Cross-modal Selective State Space Module (CS3). RSS facilitates comprehensive inter-modal interactions within a single scan, while the CS3 module adapts the selective state space module into a dual-stream architecture, thereby enhancing the acquisition of cross-modal interactions. Experimental validations on three mainstream VLN benchmarks, REVERIE, R2R, and R2R-CE, not only demonstrate competitive navigation performance of our model but also show a significant reduction in computational costs.
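For readers unfamiliar with selective state space modules, the following is a minimal sketch of a generic, Mamba-style selective scan over a single input channel. It is not the paper's RSS or CS3 module, whose details are not given in the abstract. The function name `selective_scan` and the scalar projection weights `w_delta`, `w_b`, and `w_c` are hypothetical, chosen only for illustration. The point it shows is that the step size and the input and output maps depend on the current input, and that the scan makes one pass over the sequence, so its cost grows linearly with sequence length.

```python
import math


def softplus(z):
    # Numerically stable softplus; keeps the step size positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, a_diag, w_delta, w_b, w_c):
    """Generic selective state-space recurrence (illustrative only).

    xs      : input sequence of floats (one channel)
    a_diag  : diagonal of the state matrix A (negative values give decay)
    w_delta : hypothetical scalar weight producing the step size
    w_b/w_c : hypothetical per-state weights producing B_t and C_t
    """
    state = [0.0] * len(a_diag)
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)      # input-dependent step size
        b = [w * x for w in w_b]           # input-dependent input map B_t
        c = [w * x for w in w_c]           # input-dependent output map C_t
        # h_t = exp(delta * A) * h_{t-1} + delta * B_t * x_t
        state = [
            math.exp(delta * a) * h + delta * bi * x
            for a, h, bi in zip(a_diag, state, b)
        ]
        # y_t = C_t . h_t
        ys.append(sum(ci * h for ci, h in zip(c, state)))
    return ys


outputs = selective_scan(
    xs=[0.5, -1.0, 2.0, 0.0],
    a_diag=[-1.0, -0.5],
    w_delta=1.0,
    w_b=[1.0, 0.5],
    w_c=[0.3, 0.7],
)
```

Because the state has a fixed size, each step costs the same regardless of how long the sequence is. That is the general reason state-space modules are attractive for reducing cost relative to attention, whose cost grows quadratically with sequence length.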
