CL AIDec 15, 2024

AgentPS: Agentic Process Supervision for Content Moderation with Multimodal LLMs

Mingchao Liu, Yu Sun, Ruixiao Sun, Xin Dong, Xiang Shen, Hongyu Xiong

arXiv:2412.15251v21.01 citationsh-index: 4

Originality Incremental advance

AI Analysis

This work addresses the challenge of complex multimodal classification for large-scale industrial applications, representing an incremental advancement in enhancing MLLM reasoning capabilities.

The paper tackles the problem of multimodal large language models (MLLMs) struggling with complex logical reasoning by introducing AgentPS, a framework that integrates agentic process supervision through sequential reasoning over ancillary questions during fine-tuning, achieving substantial improvements over baseline MLLMs on public benchmarks and proprietary datasets with minimal performance degradation when using MLLM-generated labels instead of human annotations.

The advanced processing and reasoning capabilities of multimodal large language models (MLLMs) have driven substantial progress in vision-language (VL) understanding tasks. However, while effective for tasks governed by straightforward logic, MLLMs often struggle with reasoning complex, detail-intensive logical structures. To address this limitation, we introduce AgentPS, a novel framework that integrates Agentic Process Supervision into MLLMs by sequentially reasoning over ancillary questions during fine-tuning. AgentPS achieves substantial improvements over baseline MLLMs on both public benchmarks and proprietary datasets. Notably, we show that using MLLM-generated ancillary labels in place of human annotations yields only minimal performance degradation, highlighting the method's scalability. These results establish AgentPS as a scalable and effective solution for complex multimodal classification in large-scale industrial applications.

View on arXiv PDF

Similar