CL · AI · Dec 10, 2024

SpecFuse: Ensembling Large Language Models via Next-Segment Prediction

arXiv:2412.07380v2 · 7 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses the challenge of generalizing LLM ensemble methods to open-domain user queries. It is an incremental advance, since it builds on existing ensemble techniques.

The paper proposes SpecFuse, a training-free ensemble framework in which multiple large language models collaboratively generate and verify a response one segment at a time. A model exit mechanism drops poorly performing models during a query, reducing the number of model calls while maintaining overall performance.

Ensembles of generative large language models (LLMs) can integrate the strengths of different LLMs to compensate for the limitations of individual models. However, recent work has focused on training an additional fusion model to combine complete responses from multiple LLMs, failing to tap into their collaborative potential to generate higher-quality responses. Moreover, as the additional fusion model is trained on a specialized dataset, these methods struggle with generalizing to open-domain queries from online users. In this paper, we propose SpecFuse, a novel ensemble framework that outputs the fused result by iteratively producing the next segment through collaboration among LLMs. This is achieved through cyclic execution of its inference and verification components. In each round, the inference component invokes each base LLM to generate candidate segments in parallel, and the verification component calls these LLMs again to predict the ranking of the segments. The top-ranked segment is then broadcast to all LLMs, encouraging them to generate higher-quality segments in the next round. This approach also allows the base LLMs to be plug-and-play, without any training or adaptation, avoiding generalization limitations. Furthermore, to conserve computational resources, we propose a model exit mechanism that, while responding to each query, dynamically excludes models that performed poorly in previous rounds. In this way, it effectively reduces the number of model calls while maintaining overall performance.
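
The round structure described in the abstract (parallel inference, verification by ranking, broadcast of the top segment, and model exit) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `spec_fuse_sketch`, `generators`, `scorers`, and `exit_margin` are hypothetical, verification is assumed to be an average of per-model scores, and the exit rule (drop a model whose cumulative score trails the best by more than a fixed margin) is a simplification chosen for illustration. The toy "models" are plain functions standing in for real LLM calls.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces (not from the paper): a generator maps the text so far
# to the next segment ("" means stop); a scorer maps (text so far, candidate) to a float.
Generator = Callable[[str], str]
Scorer = Callable[[str, str], float]


def spec_fuse_sketch(
    prompt: str,
    generators: Dict[str, Generator],
    scorers: Dict[str, Scorer],
    max_rounds: int = 8,
    exit_margin: float = 0.5,  # assumed threshold for the simplified exit rule
) -> str:
    active: List[str] = list(generators)
    totals = {name: 0.0 for name in active}  # cumulative score of each model's candidates
    output = ""

    for _ in range(max_rounds):
        context = prompt + output

        # Inference: every active model proposes one candidate segment.
        candidates = {name: generators[name](context) for name in active}
        candidates = {name: seg for name, seg in candidates.items() if seg}
        if not candidates:
            break

        # Verification: every active model scores every candidate; scores are averaged.
        mean_score = {
            owner: sum(scorers[judge](context, seg) for judge in active) / len(active)
            for owner, seg in candidates.items()
        }

        # Broadcast: the top-ranked segment joins the context shared by all models.
        winner = max(mean_score, key=mean_score.get)
        output += candidates[winner]

        # Model exit (simplified): drop models trailing the best cumulative score.
        for name, score in mean_score.items():
            totals[name] += score
        best = max(totals[name] for name in active)
        active = [name for name in active if totals[name] >= best - exit_margin]

    return output


# Toy stand-ins for LLMs, used only to exercise the loop.
TARGET = ["The ", "cat ", "sat."]


def good_model(text: str) -> str:
    done = sum(piece in text for piece in TARGET)
    return TARGET[done] if done < len(TARGET) else ""


def bad_model(text: str) -> str:
    return "zzz "


def toy_scorer(text: str, segment: str) -> float:
    return 1.0 if segment in TARGET else 0.0


result = spec_fuse_sketch(
    "Q: ",
    generators={"good": good_model, "bad": bad_model},
    scorers={"good": toy_scorer, "bad": toy_scorer},
)
print(result)
```

In this toy run the weaker model is excluded after the first round, so later rounds call only the remaining model. That is the kind of saving in model calls the exit mechanism is meant to provide.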

