CL · Feb 27, 2025

Multi2: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing

Juntai Cao, Xiang Zhang, Raymond Li, Chuyuan Li, Chenyu You, Shafiq Joty, Giuseppe Carenini

arXiv:2502.20592v3 · 16 citations · h-index: 61 · Proceedings of The 5th New Frontiers in Summarization Workshop

Originality: Incremental advance

AI Analysis

This work addresses a gap in natural language generation by bringing test-time scaling to multi-document summarization (MDS). It offers a scalable approach that could benefit applications requiring efficient document processing, though the contribution is incremental: it extends existing test-time scaling methods to a new domain.

The paper proposes a framework that uses a prompt ensemble to generate candidate summaries and an aggregator to merge them into a refined summary. The authors report significant quality improvements in extensive experiments.

Recent advances in test-time scaling have shown promising results in improving Large Language Model (LLM) performance through strategic computation allocation during inference. While this approach has demonstrated strong improvements in logical and mathematical reasoning tasks, its application to natural language generation (NLG), particularly summarization, remains unexplored. Multi-Document Summarization (MDS), a fundamental task in NLG, presents unique challenges by requiring models to extract and synthesize essential information across multiple lengthy documents. Unlike reasoning tasks, MDS demands a more nuanced approach to prompt design and ensemble methods, as no single "best" prompt can satisfy diverse summarization requirements. We propose a novel framework leveraging test-time scaling for MDS. Our approach employs prompt ensemble techniques to generate multiple candidate summaries using various prompts, then combines them with an aggregator to produce a refined summary. To evaluate our method effectively, we also introduce two new LLM-based metrics: the Consistency-Aware Preference (CAP) score and LLM Atom-Content-Unit (LLM-ACU) score, which assess summary quality while addressing the positional bias inherent in traditional automatic evaluation. Our extensive experiments demonstrate that this framework significantly enhances summary quality while also revealing the practical scaling boundaries to MDS tasks.
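The pipeline described in the abstract (several prompts, one candidate summary per prompt, then an aggregator) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LLM` type, the function names, and the prompt wording are all hypothetical, and any text-in, text-out model call could be plugged in. The second function shows order-swapped pairwise judging, one common way to reduce positional bias in LLM-based evaluation; it is not the paper's CAP score, whose definition is not given in the abstract.

```python
from typing import Callable, List, Optional

# Hypothetical type: any function mapping a prompt string to model output text.
LLM = Callable[[str], str]


def ensemble_summarize(
    documents: List[str],
    prompts: List[str],
    generate: LLM,
    aggregate: LLM,
) -> str:
    """Generate one candidate summary per prompt, then merge them with one aggregator call."""
    source = "\n\n".join(
        f"Document {i + 1}:\n{doc}" for i, doc in enumerate(documents)
    )
    # Test-time scaling knob: more prompts means more candidates and more compute.
    candidates = [generate(f"{prompt}\n\n{source}") for prompt in prompts]
    listing = "\n\n".join(
        f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates)
    )
    # Assumed aggregator instruction; the paper's actual prompt is not shown here.
    return aggregate(
        "Combine the candidate summaries below into one refined summary.\n\n" + listing
    )


def consistent_preference(a: str, b: str, judge: LLM) -> Optional[str]:
    """Query the judge in both orders; return 'a' or 'b' only if both runs agree."""

    def ask(first: str, second: str) -> str:
        prompt = (
            "Which summary is better? Answer 1 or 2.\n\n"
            f"1:\n{first}\n\n2:\n{second}"
        )
        return judge(prompt).strip()

    forward = ask(a, b)   # a is shown in position 1
    backward = ask(b, a)  # b is shown in position 1
    if forward == "1" and backward == "2":
        return "a"
    if forward == "2" and backward == "1":
        return "b"
    return None  # verdict changed with position, so it is treated as unreliable
```

With this structure, a judge that always answers "1" regardless of content never yields a winner, which is the kind of positional artifact the order swap is meant to filter out.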
