CL, AI, CV, LG | Mar 9, 2025

GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks

arXiv:2503.06514v2 | 14 citations | h-index: 8 | CVPR
Originality: Incremental advance
AI Analysis

This paper addresses the problem of improving multi-step reasoning in VLMs for applications such as games and embodied AI, though it appears incremental because it builds on existing fine-tuning and GFlowNet techniques.

The paper tackles limited solution diversity and weak generalization in multi-step reasoning tasks for Vision-Language Models (VLMs) by introducing GFlowVLM, a framework that fine-tunes VLMs using Generative Flow Networks (GFlowNets). The results show improved training efficiency, solution diversity, and generalization on tasks such as card games and embodied planning, outperforming prior methods such as SFT and RL.

Vision-Language Models (VLMs) have recently shown promising advancements in sequential decision-making tasks through task-specific fine-tuning. However, common fine-tuning methods, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) techniques like Proximal Policy Optimization (PPO), present notable limitations: SFT assumes independent and identically distributed (IID) data, while PPO focuses on maximizing cumulative rewards. These limitations often restrict solution diversity and hinder generalization in multi-step reasoning tasks. To address these challenges, we introduce GFlowVLM, a novel framework that fine-tunes VLMs using Generative Flow Networks (GFlowNets) to promote the generation of diverse solutions for complex reasoning tasks. GFlowVLM models the environment as a non-Markovian decision process, allowing it to capture long-term dependencies essential for real-world applications. It takes observations and task descriptions as inputs to prompt chain-of-thought (CoT) reasoning, which subsequently guides action selection. We use task-based rewards to fine-tune the VLM with GFlowNets. This approach enables VLMs to outperform prior fine-tuning methods, including SFT and RL. Empirical results demonstrate the effectiveness of GFlowVLM on complex tasks such as card games (NumberLine, BlackJack) and embodied planning tasks (ALFWorld), showing enhanced training efficiency, solution diversity, and stronger generalization capabilities across both in-distribution and out-of-distribution scenarios.
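
The contrast the abstract draws between maximizing reward and generating diverse solutions can be illustrated with a small sketch. A GFlowNet is trained so that complete solutions are sampled with probability proportional to their reward, rather than collapsing onto the single best one. The code below uses the trajectory balance objective, one commonly used GFlowNet loss; the abstract does not say which objective GFlowVLM uses, so this choice is an assumption. The function names and toy rewards are hypothetical and do not come from the paper.

```python
import math


def trajectory_balance_loss(log_z, step_log_probs, reward):
    """Squared trajectory-balance residual for one sampled trajectory.

    Assumption: the state space is a tree (every state has exactly one
    parent, as in token-by-token generation), so the backward-policy
    term is zero and drops out of the residual.
    """
    if reward <= 0:
        raise ValueError("reward must be positive to take its log")
    residual = log_z + sum(step_log_probs) - math.log(reward)
    return residual ** 2


def reward_proportional_policy(rewards):
    """Distribution a fully trained GFlowNet samples from: p(x) = R(x) / Z."""
    z = sum(rewards.values())
    return {x: r / z for x, r in rewards.items()}


# Hypothetical toy rewards for three complete solutions (not from the paper).
toy_rewards = {"plan_a": 6.0, "plan_b": 3.0, "plan_c": 1.0}

# Reward-proportional sampling keeps probability mass on every solution.
target = reward_proportional_policy(toy_rewards)

# A pure reward maximizer would instead commit to one solution.
greedy = max(toy_rewards, key=toy_rewards.get)

# At the optimum, log Z + log p(x) = log R(x), so the loss vanishes.
log_z = math.log(sum(toy_rewards.values()))
loss_at_optimum = trajectory_balance_loss(
    log_z, [math.log(target["plan_b"])], toy_rewards["plan_b"]
)
```

In this toy setting the reward-proportional target still assigns nonzero probability to the lower-reward plans, while the greedy choice keeps only the highest-reward plan. That is the diversity property the abstract attributes to GFlowNet fine-tuning.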

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
