De-fine: Decomposing and Refining Visual Programs with Auto-Feedback
This work addresses a bottleneck in visual programming for complex, multi-step vision-language tasks, offering an incremental improvement by enabling feedback-driven program optimization without task-specific data.
Existing visual programming methods lack a feedback mechanism for optimizing programs on complex tasks. The paper introduces De-fine, a training-free framework that decomposes tasks and refines programs with auto-feedback, improving logical reasoning performance and yielding more robust programs across various visual tasks.
Visual programming, a modular and generalizable paradigm, integrates different modules and Python operators to solve various vision-language tasks. Unlike end-to-end models that need task-specific data, it performs visual processing and reasoning in an unsupervised manner. Current visual programming methods generate the program for each task in a single pass and cannot evaluate or optimize it based on feedback, which limits their effectiveness on complex, multi-step problems. Drawing inspiration from Benders decomposition, we introduce De-fine, a training-free framework that automatically decomposes complex tasks into simpler subtasks and refines programs through auto-feedback. This model-agnostic approach can improve logical reasoning performance by integrating the strengths of multiple models. Our experiments across various visual tasks show that De-fine creates more robust programs. Moreover, viewing each feedback module as an independent agent offers fresh prospects for the field of agent research.
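The decompose-then-refine idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `refine_loop`, `Feedback`, `toy_decompose`, `toy_generate`, and `make_toy_evaluate` are hypothetical, the "programs" are plain lists of strings, and the toy generator and evaluator stand in for whatever language models and feedback modules a real system would use. The sketch only shows the control flow: split the task into subtasks, draft a program, collect automatic feedback, and regenerate until the feedback passes or a round limit is reached.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Feedback:
    """Result of one automatic evaluation pass (hypothetical structure)."""
    ok: bool
    message: str = ""


Program = List[str]


def refine_loop(
    task: str,
    decompose: Callable[[str], List[str]],
    generate: Callable[[List[str], Optional[Feedback]], Program],
    evaluate: Callable[[Program], Feedback],
    max_rounds: int = 3,
) -> Tuple[Program, List[Feedback]]:
    """Decompose a task, draft a program, and refine it using feedback."""
    subtasks = decompose(task)
    program = generate(subtasks, None)  # first draft, no feedback yet
    history: List[Feedback] = []
    for _ in range(max_rounds):
        feedback = evaluate(program)
        history.append(feedback)
        if feedback.ok:
            break
        # Regenerate the program, conditioning on the latest feedback.
        program = generate(subtasks, feedback)
    return program, history


# --- Toy stand-ins, for illustration only ---

def toy_decompose(task: str) -> List[str]:
    """Split a task description on the word 'then'."""
    return [part.strip() for part in task.split(" then ")]


def toy_generate(subtasks: List[str], feedback: Optional[Feedback]) -> Program:
    """Emit one step per subtask; the first draft deliberately drops the last step."""
    steps = [f"step_{i}({s!r})" for i, s in enumerate(subtasks)]
    if feedback is None:
        return steps[:-1]
    return steps


def make_toy_evaluate(expected_steps: int) -> Callable[[Program], Feedback]:
    """Build an evaluator that checks the program has the expected step count."""
    def evaluate(program: Program) -> Feedback:
        missing = expected_steps - len(program)
        if missing > 0:
            return Feedback(False, f"missing {missing} step(s)")
        return Feedback(True)
    return evaluate


task = "find the dog then count its legs"
program, history = refine_loop(
    task, toy_decompose, toy_generate, make_toy_evaluate(expected_steps=2)
)
```

In this toy run the first draft fails evaluation and the second passes, so `history` holds one failing and one passing `Feedback`. Passing the decomposer, generator, and evaluator as callables keeps the loop independent of any particular model, which loosely mirrors the model-agnostic framing in the abstract.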