NI AR LG
Jul 20, 2025

Morphlux: Transforming Torus Fabrics for Efficient Multi-tenant ML

arXiv:2508.03674v3
1 citation
h-index: 6
Originality: Highly original
AI Analysis

This work addresses inefficiencies in multi-tenant ML data centers, with reported performance gains relevant to cloud providers and ML practitioners.

The paper tackles inefficient multi-tenant machine learning in torus-based data centers with Morphlux, a server-scale programmable photonic fabric that interconnects accelerators within servers. The authors report that it improves the bandwidth of tenant compute allocations by up to 66%, reduces compute fragmentation by up to 70%, and improves training throughput by 1.72X.

We develop Morphlux, a server-scale programmable photonic fabric to interconnect accelerators within servers. We show that augmenting state-of-the-art torus-based ML data-centers with Morphlux can improve the bandwidth of tenant compute allocations by up to 66%, reduce compute fragmentation by up to 70%, and minimize the blast radius of chip failures. We develop a novel end-to-end hardware prototype of Morphlux to demonstrate these performance benefits which translate to 1.72X improvement in training throughput of ML models. By rapidly programming the server-scale fabric in our hardware testbed, Morphlux can replace a failed accelerator chip with a healthy one in 1.2 seconds.
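As a rough reading aid for the 1.72X throughput figure, the sketch below converts a throughput speedup into the fraction of baseline wall-clock time a fixed workload would need. It assumes the speedup holds uniformly over the whole run, which the abstract does not claim; the function name `relative_time` and the 100-hour job are hypothetical and are not taken from the paper.

```python
def relative_time(speedup: float) -> float:
    """Fraction of baseline wall-clock time for a fixed workload,
    assuming the throughput speedup is constant over the run."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


# Figure reported in the abstract.
THROUGHPUT_SPEEDUP = 1.72

# Hypothetical baseline job length, for illustration only.
BASELINE_HOURS = 100.0

time_fraction = relative_time(THROUGHPUT_SPEEDUP)
time_saved = 1.0 - time_fraction
projected_hours = BASELINE_HOURS * time_fraction

print(f"fraction of baseline time: {time_fraction:.3f}")
print(f"time saved: {time_saved:.1%}")
print(f"hypothetical {BASELINE_HOURS:.0f}-hour job: {projected_hours:.1f} hours")
```

Under that assumption, a 1.72X throughput improvement corresponds to roughly 58% of the baseline training time, or about 42% less wall-clock time for the same workload.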

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
