CV · Dec 18, 2023

CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update

arXiv:2312.10908v3 · 52 citations · h-index: 15 · CVPR
Originality: Incremental advance
AI Analysis

This work addresses the need for adaptable visual assistants in dynamic environments, though the advance is incremental because it builds on existing tool-usage methods.

The paper tackles the lack of continual learning in visual assistants by proposing CLOVA, a closed-loop system that updates its tools based on human feedback. It reports improvements over existing tool-usage methods of 5% in visual question answering and multiple-image reasoning, 10% in knowledge tagging, and 20% in image editing.

Utilizing large language models (LLMs) to compose off-the-shelf visual tools represents a promising avenue of research for developing robust visual assistants capable of addressing diverse visual tasks. However, these methods often overlook the potential for continual learning, typically by freezing the utilized tools, thus limiting their adaptation to environments requiring new knowledge. To tackle this challenge, we propose CLOVA, a Closed-Loop Visual Assistant, which operates within a framework encompassing inference, reflection, and learning phases. During the inference phase, LLMs generate programs and execute corresponding tools to complete assigned tasks. In the reflection phase, a multimodal global-local reflection scheme analyzes human feedback to determine which tools require updating. Lastly, the learning phase employs three flexible approaches to automatically gather training data and introduces a novel prompt tuning scheme to update the tools, allowing CLOVA to efficiently acquire new knowledge. Experimental findings demonstrate that CLOVA surpasses existing tool-usage methods by 5% in visual question answering and multiple-image reasoning, by 10% in knowledge tagging, and by 20% in image editing. These results underscore the significance of the continual learning capability in general visual assistants.
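The inference, reflection, and learning phases described in the abstract can be pictured with a minimal Python sketch. Everything here is an illustrative assumption rather than CLOVA's actual code: `Tool`, `generate_program`, `infer`, `reflect`, `learn`, and `closed_loop` are hypothetical names, the planner is a fixed stub standing in for the LLM, the reflection step simply blames the last tool when the final answer disagrees with the feedback, and a dictionary lookup stands in for prompt tuning.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Step = Tuple[str, str]  # (tool name, query passed to that tool); hypothetical format


@dataclass
class Tool:
    """Hypothetical visual tool whose 'memory' stands in for tunable prompts."""
    name: str
    memory: Dict[str, str] = field(default_factory=dict)

    def run(self, query: str) -> str:
        # Answer from stored knowledge, or admit ignorance.
        return self.memory.get(query, "unknown")

    def update(self, examples: List[Tuple[str, str]]) -> None:
        # Stand-in for the learning phase: store (query, answer) pairs.
        self.memory.update(examples)


def generate_program(task: str) -> List[Step]:
    # Stand-in for the LLM planner: always emits a one-step program.
    return [("tagger", task)]


def infer(task: str, tools: Dict[str, Tool]) -> List[Tuple[str, str]]:
    # Inference phase: execute each step and record (tool name, output).
    return [(name, tools[name].run(query)) for name, query in generate_program(task)]


def reflect(trace: List[Tuple[str, str]], feedback: str) -> Optional[str]:
    # Reflection phase (simplified): if the final output disagrees with the
    # human feedback, blame the tool that produced it; otherwise blame nothing.
    final_tool, final_output = trace[-1]
    return None if final_output == feedback else final_tool


def learn(tool: Tool, task: str, feedback: str) -> None:
    # Learning phase (simplified): use the feedback as one training example.
    tool.update([(task, feedback)])


def closed_loop(task: str, tools: Dict[str, Tool], feedback: str) -> str:
    trace = infer(task, tools)
    faulty = reflect(trace, feedback)
    if faulty is not None:
        learn(tools[faulty], task, feedback)
    return trace[-1][1]  # answer produced before any update


tools = {"tagger": Tool("tagger")}
first = closed_loop("who is in the photo?", tools, "Ada Lovelace")
second = closed_loop("who is in the photo?", tools, "Ada Lovelace")
```

In this toy run the first call fails and triggers an update, so the second call on the same task returns the corrected answer. The sketch only shows the control flow of the loop; it does not model the multimodal global-local reflection or the automatic data collection that the abstract mentions.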

