DCMay 24Code
Efficient Distributed MLLM Training with CornstarchInsu Jang, Runyu Lu, Nikhil Bansal et al.
Multimodal large language models (MLLMs) extend the capabilities of large language models (LLMs) by combining heterogeneous model architectures to handle diverse modalities like images and audio. However, this inherent heterogeneity in MLLM model structure and data types makes makeshift extensions to existing LLM training frameworks unsuitable for efficient MLLM training. While there are a few works that have attempted to address the heterogeneity in MLLM training, their approaches are limited to only superficially considering the characteristics of MLLMs. In this paper, we present Cornstarch, an efficient distributed MLLM training framework that contemplates MLLM's unique characteristics in both model and data parallelization. Cornstarch introduces frozen-aware pipeline parallelism and token workload-balanced context parallelism to improve MLLM training throughput. Our extensive evaluation shows that Cornstarch outperforms state-of-the-art solutions by $2.26\times$ on average in terms of MLLM training throughput. Cornstarch is an open-source project available at https://github.com/cornstarch-org/Cornstarch.
DCMay 21Code
Orbax: Distributed Checkpointing with JAXColin Gaffney, Shutong Li, Daniel Ng et al.
In a landscape of high-performance distributed ML systems, JAX has emerged as a framework of choice. However, JAX's modular design philosophy leaves it without a standardized checkpointing solution. In this paper, we introduce Orbax, a modular, JAX-native checkpointing library that abstracts the complexities of distributed accelerator systems while also providing flexibility for user-friendly checkpoint manipulations throughout the ML model lifecycle. We demonstrate performance exceeding comparable PyTorch competitors by up to 3.5$\times$ for saving and 2$\times$ for loading. The library is available at https://github.com/google/orbax.
DSApr 28
Expander Decomposition with Almost Optimal OverheadNikhil Bansal, Arun Jambulapati, Thatchaphol Saranurak
We present the first polynomial-time algorithm for computing a near-optimal \emph{flow}-expander decomposition. Given a graph $G$ and a parameter $ϕ$, our algorithm removes at most a $ϕ\log^{1+o(1)}n$ fraction of edges so that every remaining connected component is a $ϕ$-\emph{flow}-expander (a stronger guarantee than being a $ϕ$-\emph{cut}-expander). This achieves overhead $\log^{1+o(1)}n$, nearly matching the $Ω(\log n)$ graph-theoretic lower bound that already holds for cut-expander decompositions, up to a $\log^{o(1)}n$ factor. Prior polynomial-time algorithms required removing $O(ϕ\log^{1.5}n)$ and $O(ϕ\log^{2}n)$ fractions of edges to guarantee $ϕ$-cut-expander and $ϕ$-flow-expander components, respectively.
QUANT-PHApr 16
Cloning is as Hard as Learning for Stabilizer StatesNikhil Bansal, Matthias C. Caro, Gaurav Mahajan
The impossibility of simultaneously cloning non-orthogonal states lies at the foundations of quantum theory. Even when allowing for approximation errors, cloning an arbitrary unknown pure state requires as many initial copies as needed to fully learn the state. Rather than arbitrary unknown states, modern quantum learning theory often considers structured classes of states and exploits such structure to develop learning algorithms that outperform general-state tomography. This raises the question: How do the sample complexities of learning and cloning relate for such structured classes? We answer this question for an important class of states. Namely, for $n$-qubit stabilizer states, we show that the optimal sample complexity of cloning is $Θ(n)$. Thus, also for this structured class of states, cloning is as hard as learning. To prove these results, we use representation-theoretic tools in the recently proposed Abelian State Hidden Subgroup framework and a new structured version of the recently introduced random purification channel to relate stabilizer state cloning to a variant of the sample amplification problem for probability distributions that was recently introduced in classical learning theory. This allows us to obtain our cloning lower bounds by proving new sample amplification lower bounds for classes of distributions with an underlying linear structure. Our results provide a more fine-grained perspective on No-Cloning theorems, opening up connections from foundations to quantum learning theory and quantum cryptography.
CVNov 10, 2022
DrawMon: A Distributed System for Detection of Atypical Sketch Content in Concurrent Pictionary GamesNikhil Bansal, Kartik Gupta, Kiruthika Kannan et al.
Pictionary, the popular sketch-based guessing game, provides an opportunity to analyze shared goal cooperative game play in restricted communication settings. However, some players occasionally draw atypical sketch content. While such content is occasionally relevant in the game context, it sometimes represents a rule violation and impairs the game experience. To address such situations in a timely and scalable manner, we introduce DrawMon, a novel distributed framework for automatic detection of atypical sketch content in concurrently occurring Pictionary game sessions. We build specialized online interfaces to collect game session data and annotate atypical sketch content, resulting in AtyPict, the first ever atypical sketch content dataset. We use AtyPict to train CanvasNet, a deep neural atypical content detection network. We utilize CanvasNet as a core component of DrawMon. Our analysis of post deployment game session data indicates DrawMon's effectiveness for scalable monitoring and atypical sketch content detection. Beyond Pictionary, our contributions also serve as a design guide for customized atypical content response systems involving shared and interactive whiteboards. Code and datasets are available at https://drawm0n.github.io.
LGDec 12, 2023
Reducing Energy Bloat in Large Model TrainingJae-Won Chung, Yile Gu, Insu Jang et al.
Training large AI models on numerous GPUs consumes a massive amount of energy, making power delivery one of the largest limiting factors in building and operating datacenters for AI workloads. However, we observe that not all energy consumed during training directly contributes to end-to-end throughput; a significant portion can be removed without slowing down training. We call this portion energy bloat. In this work, we identify two independent sources of energy bloat in large model training and propose Perseus, a training system that mitigates both. To do this, Perseus obtains the time--energy tradeoff frontier of a large model training job using an efficient graph cut-based algorithm, and schedules computation energy consumption across time to reduce both types of energy bloat. Evaluation on large models, including GPT-3 and Bloom, shows that Perseus reduces the energy consumption of large model training by up to 30% without any throughput loss or hardware modification.
DSApr 5
Online Graph Balancing and the Power of Two ChoicesNikhil Bansal, Milind Prabhu, Sahil Singla et al.
In the classic online graph balancing problem, edges arrive sequentially and must be oriented immediately upon arrival, to minimize the maximum in-degree. For adversarial arrivals, the natural greedy algorithm is $O(\log n)$-competitive, and this bound is the best possible for any algorithm, even with randomization. We study this problem in the i.i.d. model where a base graph $G$ is known in advance and each arrival is an independent uniformly random edge of $G$. This model generalizes the standard power-of-two choices setting, corresponding to $G = K_n$, where the greedy algorithm achieves an $O(\log\!\log n)$ guarantee. We ask whether a similar bound is possible for arbitrary base graphs. While the greedy algorithm is optimal for adversarial arrivals and also for i.i.d. arrivals from regular base graphs (such as $G = K_n$), we show that it can perform poorly in general: there exist mildly irregular graphs $G$ for which greedy is $\widetildeΩ(\log n)$-competitive under i.i.d. arrivals. In sharp contrast, our main result is an $O(\log\!\log n)$-competitive online algorithm for every base graph $G$; this is optimal up to constant factors, since an $Ω(\log\!\log n)$ lower bound already holds even for the complete graph $G = K_n$. The key new idea is a notion of log-skewness for graphs, which captures the irregular substructures in $G$ that force the offline optimum to be large. Moreover, we show that any base graph can be decomposed into ``skew-biregular'' pieces at only $O(\log\!\log n)$ scales of log-skewness, and use this to design a decomposition-based variant of greedy that is $O(\log\!\log n)$-competitive.
LGDec 13, 2017
Potential-Function Proofs for First-Order MethodsNikhil Bansal, Anupam Gupta
This note discusses proofs for convergence of first-order methods based on simple potential-function arguments. We cover methods like gradient descent (for both smooth and non-smooth settings), mirror descent, and some accelerated variants.
DSDec 8, 2016
Faster Space-Efficient Algorithms for Subset Sum, k-Sum and Related ProblemsNikhil Bansal, Shashwat Garg, Jesper Nederlof et al.
We present space efficient Monte Carlo algorithms that solve Subset Sum and Knapsack instances with $n$ items using $O^*(2^{0.86n})$ time and polynomial space, where the $O^*(\cdot)$ notation suppresses factors polynomial in the input size. Both algorithms assume random read-only access to random bits. Modulo this mild assumption, this resolves a long-standing open problem in exact algorithms for NP-hard problems. These results can be extended to solve Binary Linear Programming on $n$ variables with few constraints in a similar running time. We also show that for any constant $k\geq 2$, random instances of $k$-Sum can be solved using $O(n^{k-0.5}polylog(n))$ time and $O(\log n)$ space, without the assumption of random access to random bits. Underlying these results is an algorithm that determines whether two given lists of length $n$ with integers bounded by a polynomial in $n$ share a common value. Assuming random read-only access to random bits, we show that this problem can be solved using $O(\log n)$ space significantly faster than the trivial $O(n^2)$ time algorithm if no value occurs too often in the same list.