AgentPS: Agentic Process Supervision for Content Moderation with Multimodal LLMs
This work addresses the challenge of complex multimodal classification for large-scale industrial applications, representing an incremental advancement in enhancing MLLM reasoning capabilities.
The paper tackles the problem of multimodal large language models (MLLMs) struggling with complex logical reasoning by introducing AgentPS, a framework that integrates agentic process supervision through sequential reasoning over ancillary questions during fine-tuning, achieving substantial improvements over baseline MLLMs on public benchmarks and proprietary datasets with minimal performance degradation when using MLLM-generated labels instead of human annotations.
The advanced processing and reasoning capabilities of multimodal large language models (MLLMs) have driven substantial progress in vision-language (VL) understanding tasks. However, while effective for tasks governed by straightforward logic, MLLMs often struggle with reasoning complex, detail-intensive logical structures. To address this limitation, we introduce AgentPS, a novel framework that integrates Agentic Process Supervision into MLLMs by sequentially reasoning over ancillary questions during fine-tuning. AgentPS achieves substantial improvements over baseline MLLMs on both public benchmarks and proprietary datasets. Notably, we show that using MLLM-generated ancillary labels in place of human annotations yields only minimal performance degradation, highlighting the method's scalability. These results establish AgentPS as a scalable and effective solution for complex multimodal classification in large-scale industrial applications.