IT CL DC LG NI | Dec 30, 2024

Distributed Mixture-of-Agents for Edge Inference with Large Language Models

Purbesh Mitra, Priyanka Kaswan, Sennur Ulukus

arXiv:2412.21200v15.16 citations | h-index: 63 | Has Code | PIMRC

Originality: Incremental advance

AI Analysis

This work addresses efficient and stable LLM inference for edge computing users, but it is incremental as it extends existing MoA methods to a distributed context.

The paper studies how to deploy Mixture-of-Agents (MoA) architectures for large language models (LLMs) in distributed edge settings, where memory-limited devices collaborate through gossip algorithms. It derives queuing stability conditions theoretically, validates them experimentally, and shows that certain MoA configurations achieve higher-quality responses on the AlpacaEval 2.0 benchmark.

Mixture-of-Agents (MoA) has recently been proposed as a method to enhance the performance of large language models (LLMs) by enabling multiple individual LLMs to work together for collaborative inference. This collaborative approach yields better responses to user prompts than relying on a single LLM. In this paper, we consider such an MoA architecture in a distributed setting, where LLMs operate on individual edge devices, each uniquely associated with a user and equipped with its own computing power. These devices exchange information using decentralized gossip algorithms, allowing different device nodes to communicate without the supervision of a centralized server. In the considered setup, different users have their own LLMs to address user prompts. Additionally, the devices gossip either their own user-specific prompts or augmented prompts to generate more refined answers to certain queries. User prompts are temporarily stored in the device queues when their corresponding LLMs are busy. Given the memory limitations of edge devices, it is crucial to ensure that the average queue sizes in the system remain bounded. We address this by theoretically deriving the queuing stability conditions for the device queues under reasonable assumptions, and we validate these conditions experimentally. Further, we demonstrate through experiments, using open-source LLMs to implement distributed MoA, that certain MoA configurations produce higher-quality responses than others, as evaluated on the AlpacaEval 2.0 benchmark. The implementation is available at: https://github.com/purbeshmitra/distributed_moa.
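To make the bounded-queue requirement concrete, the following is a minimal toy simulation, not the paper's model or its derived stability condition. It assumes slotted time, Bernoulli prompt arrivals, a single LLM per device that finishes its head-of-line prompt with a fixed probability per slot, and a gossip step that copies each new prompt to a fixed number of random peers. The function name `simulate` and all of its parameters are hypothetical and chosen only for illustration.

```python
import random
from collections import deque


def simulate(num_devices, arrival_prob, service_prob, fanout, steps, seed=0):
    """Return the time-averaged queue length per device in a toy gossip network.

    Assumptions (illustrative only, not taken from the paper):
    - time is slotted; each device receives a new user prompt with
      probability `arrival_prob` per slot;
    - each new prompt is also gossiped to `fanout` distinct random peers;
    - a device with a non-empty queue completes its head-of-line prompt
      with probability `service_prob` per slot.
    """
    rng = random.Random(seed)
    queues = [deque() for _ in range(num_devices)]
    total_queued = 0

    for _ in range(steps):
        # Service phase: each busy device may finish one prompt.
        for q in queues:
            if q and rng.random() < service_prob:
                q.popleft()

        # Arrival phase: new user prompts plus gossiped copies to peers.
        for i in range(num_devices):
            if rng.random() < arrival_prob:
                queues[i].append("own")
                peers = [j for j in range(num_devices) if j != i]
                for j in rng.sample(peers, fanout):
                    queues[j].append("gossiped")

        total_queued += sum(len(q) for q in queues)

    return total_queued / (steps * num_devices)


# Light load: expected work per device per slot is 0.05 * (1 + 2) = 0.15 < 0.5.
light = simulate(num_devices=4, arrival_prob=0.05, service_prob=0.5,
                 fanout=2, steps=2000)

# Heavy load: expected work per device per slot is 0.3 * (1 + 2) = 0.9 > 0.5.
heavy = simulate(num_devices=4, arrival_prob=0.3, service_prob=0.5,
                 fanout=2, steps=2000)

print(f"light load average queue: {light:.2f}")
print(f"heavy load average queue: {heavy:.2f}")
```

In this toy setting, the light-load average stays small, while the heavy-load average keeps growing as `steps` increases. That growth is the behavior a stability condition is meant to rule out on memory-limited devices.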
