CV · Nov 21, 2023

De-fine: Decomposing and Refining Visual Programs with Auto-Feedback

Minghe Gao, Juncheng Li, Hao Fei, Liang Pang, Wei Ji, Guoming Wang, Zheqi Lv, Wenqiao Zhang, Siliang Tang, Yueting Zhuang

arXiv:2311.12890v39.813 citations · h-index: 27

Originality: Incremental advance

AI Analysis

De-fine addresses a bottleneck in visual programming for complex, multi-step vision-language tasks: generated programs cannot be evaluated and optimized from feedback. It offers an incremental improvement by enabling feedback-driven optimization without task-specific data.

The paper introduces De-fine, a training-free framework that decomposes tasks into subtasks and refines the resulting programs with auto-feedback. The authors report improved logical reasoning performance and more robust programs across various visual tasks.

Visual programming is a modular, generalizable paradigm that combines different modules and Python operators to solve a variety of vision-language tasks. Unlike end-to-end models, which need task-specific data, it performs visual processing and reasoning in an unsupervised manner. Current visual programming methods, however, generate the program for each task in a single pass and lack the ability to evaluate and optimize it based on feedback, which consequently limits their effectiveness on complex, multi-step problems. Drawing inspiration from Benders decomposition, we introduce De-fine, a training-free framework that automatically decomposes complex tasks into simpler subtasks and refines programs through auto-feedback. This model-agnostic approach can improve logical reasoning performance by integrating the strengths of multiple models. Our experiments across various visual tasks show that De-fine creates more robust programs. Moreover, viewing each feedback module as an independent agent opens fresh prospects for the field of agent research.
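The decompose-then-refine idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`decompose`, `refine_with_feedback`, `solve`, `toy_generate`, `toy_evaluate`) is hypothetical, and the toy generator and evaluator are plain string functions standing in for the language models and vision modules a real system would call.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases for illustration only.
Generator = Callable[[str, str], str]               # (subtask, feedback) -> program text
Evaluator = Callable[[str, str], Tuple[bool, str]]  # (subtask, program) -> (ok, feedback)


def decompose(task: str) -> List[str]:
    """Toy decomposer (an assumption): split a task on ' then ' into subtasks."""
    return [part.strip() for part in task.split(" then ") if part.strip()]


def refine_with_feedback(
    subtask: str,
    generate: Generator,
    evaluate: Evaluator,
    max_rounds: int = 3,
) -> str:
    """Generate a program, then regenerate it from feedback until it passes
    or the round budget is exhausted."""
    program = generate(subtask, "")
    for _ in range(max_rounds):
        ok, feedback = evaluate(subtask, program)
        if ok:
            break
        program = generate(subtask, feedback)
    return program


def solve(task: str, generate: Generator, evaluate: Evaluator) -> List[str]:
    """Decompose the task and refine one program per subtask."""
    return [refine_with_feedback(s, generate, evaluate) for s in decompose(task)]


# Toy stand-ins: a real system would query a code-generating model and
# run the program on visual input to collect feedback.
def toy_generate(subtask: str, feedback: str) -> str:
    program = f"result = find({subtask!r})"
    if feedback:
        program += "\nresult = verify(result)"
    return program


def toy_evaluate(subtask: str, program: str) -> Tuple[bool, str]:
    if "verify(" in program:
        return True, ""
    return False, "add a verification step"


programs = solve("locate the dog then count its legs", toy_generate, toy_evaluate)
```

In this sketch the first draft of each subtask program fails the toy check, the feedback string is passed back to the generator, and the second draft passes. The single-pass methods the abstract contrasts against would correspond to calling `generate` once with no evaluation loop.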
