Minsik Cho

LG
h-index37
41papers
905citations
Novelty50%
AI Score57

41 Papers

CVOct 24, 2022
I see what you hear: a vision-inspired method to localize words

Mohammad Samragh, Arnav Kundu, Ting-Yao Hu et al. · apple-ml, stanford

This paper explores the possibility of using visual object detection techniques for word localization in speech data. Object detection has been thoroughly studied in the contemporary literature for visual data. Noting that an audio can be interpreted as a 1-dimensional image, object localization techniques can be fundamentally useful for word localization. Building upon this idea, we propose a lightweight solution for word detection and localization. We use bounding box regression for word localization, which enables our model to detect the occurrence, offset, and duration of keywords in a given audio stream. We experiment with LibriSpeech and train a model to localize 1000 words. Compared to existing work, our method reduces model size by 94%, and improves the F1 score by 6.5\%.

CLSep 19, 2024
Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization

Mohammad Samragh, Iman Mirzadeh, Keivan Alizadeh Vahid et al. · utoronto

The pre-training phase of language models often begins with randomly initialized parameters. With the current trends in scaling models, training their large number of parameters can be extremely slow and costly. In contrast, small language models are less expensive to train, but they often cannot achieve the accuracy of large models. In this paper, we explore an intriguing idea to connect these two different regimes: Can we develop a method to initialize large language models using smaller pre-trained models? Will such initialization bring any benefits in terms of training time and final accuracy? In this paper, we introduce HyperCloning, a method that can expand the parameters of a pre-trained language model to those of a larger model with increased hidden dimensions. Our method ensures that the larger model retains the functionality of the smaller model. As a result, the larger model already inherits the predictive power and accuracy of the smaller model before the training starts. We demonstrate that training such an initialized model results in significant savings in terms of GPU hours required for pre-training large language models.

CLMar 7
Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context

Keivan Alizadeh, Parshin Shojaee, Minsik Cho et al. · uw

Long-context handling remains a core challenge for language models: even with extended context windows, models often fail to reliably extract, reason over, and use the information across long contexts. Recent works like Recursive Language Models (RLM) have approached this challenge by agentic way of decomposing long contexts into recursive sub-calls through programmatic interaction at inference. While promising, the success of RLM critically depends on how these context-interaction programs are selected, which has remained largely unexplored. In this paper, we study this problem and introduce SRLM, a framework that augments programmatic context interaction with uncertainty-aware Self-Reflection. SRLM leverages three intrinsic signals: self consistency, reasoning length, and verbalized confidence. These serve as complementary indicators of a model's internal uncertainty, and the model uses them to evaluate and compare candidate context-interaction programs. Extensive experiments across diverse benchmark datasets, context lengths, and backbone models, show that SRLM consistently outperforms state-of-the-art baselines, yielding up to 22% improvement over RLM under the same time budget. Our findings show that recursion itself is not the primary driver of performance in RLM, and a simple self-reflective program search can match or surpass RLM without requiring self-query or explicit recursion mechanisms. We find that for context lengths within the model's window, RLMs with recursion often degrade performance relative to the base model, whereas SRLM yields consistent gains across both short and long contexts. We also find that RLM is less effective in tasks with semantically intensive nature, where heuristic program search is insufficient and broader contextual understanding is required, while self-reflection in SRLM provides a semantic signal that better steers reasoning in these scenarios.

CLJul 19, 2024
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

Qichen Fu, Minsik Cho, Thomas Merth et al.

The inference of transformer-based large language models consists of two sequential stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token, and 2) a decoding stage to generate subsequent tokens. For long prompts, the KV cache must be computed for all tokens during the prefilling stage, which can significantly increase the time needed to generate the first token. Consequently, the prefilling stage may become a bottleneck in the generation process. An open question remains whether all prompt tokens are essential for generating the first token. To answer this, we introduce a novel method, LazyLLM, that selectively computes the KV for tokens important for the next token prediction in both the prefilling and decoding stages. Contrary to static pruning approaches that prune the prompt at once, LazyLLM allows language models to dynamically select different subsets of tokens from the context in different generation steps, even though they might be pruned in previous steps. Extensive experiments on standard datasets across various tasks demonstrate that LazyLLM is a generic method that can be seamlessly integrated with existing language models to significantly accelerate the generation without fine-tuning. For instance, in the multi-document question-answering task, LazyLLM accelerates the prefilling stage of the LLama 2 7B model by 2.34x while maintaining accuracy.

ASJun 8, 2023
Matching Latent Encoding for Audio-Text based Keyword Spotting

Kumari Nishu, Minsik Cho, Devang Naik

Using audio and text embeddings jointly for Keyword Spotting (KWS) has shown high-quality results, but the key challenge of how to semantically align two embeddings for multi-word keywords of different sequence lengths remains largely unsolved. In this paper, we propose an audio-text-based end-to-end model architecture for flexible keyword spotting (KWS), which builds upon learned audio and text embeddings. Our architecture uses a novel dynamic programming-based algorithm, Dynamic Sequence Partitioning (DSP), to optimally partition the audio sequence into the same length as the word-based text sequence using the monotonic alignment of spoken content. Our proposed model consists of an encoder block to get audio and text embeddings, a projector block to project individual embeddings to a common latent space, and an audio-text aligner containing a novel DSP algorithm, which aligns the audio and text embeddings to determine if the spoken content is the same as the text. Experimental results show that our DSP is more effective than other partitioning schemes, and the proposed architecture outperformed the state-of-the-art results on the public dataset in terms of Area Under the ROC Curve (AUC) and Equal-Error-Rate (EER) by 14.4 % and 28.9%, respectively.

SDAug 12, 2023
Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding

Kumari Nishu, Minsik Cho, Paul Dixon et al.

Spotting user-defined/flexible keywords represented in text frequently uses an expensive text encoder for joint analysis with an audio encoder in an embedding space, which can suffer from heterogeneous modality representation (i.e., large mismatch) and increased complexity. In this work, we propose a novel architecture to efficiently detect arbitrary keywords based on an audio-compliant text encoder which inherently has homogeneous representation with audio embedding, and it is also much smaller than a compatible text encoder. Our text encoder converts the text to phonemes using a grapheme-to-phoneme (G2P) model, and then to an embedding using representative phoneme vectors, extracted from the paired audio encoder on rich speech datasets. We further augment our method with confusable keyword generation to develop an audio-text embedding verifier with strong discriminative power. Experimental results show that our scheme outperforms the state-of-the-art results on Libriphrase hard dataset, increasing Area Under the ROC Curve (AUC) metric from 84.21% to 92.7% and reducing Equal-Error-Rate (EER) metric from 23.36% to 14.4%.

ASOct 26, 2022
HEiMDaL: Highly Efficient Method for Detection and Localization of wake-words

Arnav Kundu, Mohammad Samragh Razlighi, Minsik Cho et al.

Streaming keyword spotting is a widely used solution for activating voice assistants. Deep Neural Networks with Hidden Markov Model (DNN-HMM) based methods have proven to be efficient and widely adopted in this space, primarily because of the ability to detect and identify the start and end of the wake-up word at low compute cost. However, such hybrid systems suffer from loss metric mismatch when the DNN and HMM are trained independently. Sequence discriminative training cannot fully mitigate the loss-metric mismatch due to the inherent Markovian style of the operation. We propose an low footprint CNN model, called HEiMDaL, to detect and localize keywords in streaming conditions. We introduce an alignment-based classification loss to detect the occurrence of the keyword along with an offset loss to predict the start of the keyword. HEiMDaL shows 73% reduction in detection metrics along with equivalent localization accuracy and with the same memory footprint as existing DNN-HMM style models for a given wake-word.

LGMar 14, 2023
R2 Loss: Range Restriction Loss for Model Compression and Quantization

Arnav Kundu, Chungkuk Yoo, Srijan Mishra et al.

Model quantization and compression is widely used techniques to reduce usage of computing resource at inference time. While state-of-the-art works have been achieved reasonable accuracy with higher bit such as 4bit or 8bit, but still it is challenging to quantize/compress a model further, e.g., 1bit or 2bit. To overcome the challenge, we focus on outliers in weights of a pre-trained model which disrupt effective lower bit quantization and compression. In this work, we propose Range Restriction Loss (R2-Loss) for building lower bit quantization and compression friendly models by removing outliers from weights during pre-training. By effectively restricting range of weights, we mold the overall distribution into a tight shape to ensure high quantization bit resolution, therefore allowing model compression and quantization techniques can to utilize their limited numeric representation powers better. We introduce three different, L-inf R2-Loss, its extension Margin R2-Loss and a new Soft-Min-MaxR2-Loss to be used as an auxiliary loss during full-precision model training. These R2-Loss can be used in different cases such as L-inf and Margin R2-Loss would be effective for symmetric quantization, while Soft-Min-Max R2-Loss shows better performance for model compression. In our experiment, R2-Loss improves lower bit quantization accuracy with state-of-the-art post-training quantization (PTQ), quantization-aware training (QAT), and model compression techniques. With R2-Loss, MobileNet-V2 2bit weight and 8bit activation PTQ, MobileNet-V1 2bit weight and activation QAT, ResNet18 1bit weight compression are improved to 59.49% from 50.66%, 59.05% from 55.96%, and 52.58% from 45.54%, respectively.

LGJan 30
MemoryLLM: Plug-n-Play Interpretable Feed-Forward Memory for Transformers

Ajay Jaiswal, Lauren Hannah, Han-Byul Kim et al.

Understanding how transformer components operate in LLMs is important, as it is at the core of recent technological advances in artificial intelligence. In this work, we revisit the challenges associated with interpretability of feed-forward modules (FFNs) and propose MemoryLLM, which aims to decouple FFNs from self-attention and enables us to study the decoupled FFNs as context-free token-wise neural retrieval memory. In detail, we investigate how input tokens access memory locations within FFN parameters and the importance of FFN memory across different downstream tasks. MemoryLLM achieves context-free FFNs by training them in isolation from self-attention directly using the token embeddings. This approach allows FFNs to be pre-computed as token-wise lookups (ToLs), enabling on-demand transfer between VRAM and storage, additionally enhancing inference efficiency. We also introduce Flex-MemoryLLM, positioning it between a conventional transformer design and MemoryLLM. This architecture bridges the performance gap caused by training FFNs with context-free token-wise embeddings.

SDApr 5, 2022
Improving Voice Trigger Detection with Metric Learning

Prateeth Nayak, Takuya Higuchi, Anmol Gupta et al.

Voice trigger detection is an important task, which enables activating a voice assistant when a target user speaks a keyword phrase. A detector is typically trained on speech data independent of speaker information and used for the voice trigger detection task. However, such a speaker independent voice trigger detector typically suffers from performance degradation on speech from underrepresented groups, such as accented speakers. In this work, we propose a novel voice trigger detector that can use a small number of utterances from a target speaker to improve detection accuracy. Our proposed model employs an encoder-decoder architecture. While the encoder performs speaker independent voice trigger detection, similar to the conventional detector, the decoder predicts a personalized embedding for each utterance. A personalized voice trigger score is then obtained as a similarity score between the embeddings of enrollment utterances and a test utterance. The personalized embedding allows adapting to target speaker's speech when computing the voice trigger score, hence improving voice trigger detection accuracy. Experimental results show that the proposed approach achieves a 38% relative reduction in a false rejection rate (FRR) compared to a baseline speaker independent voice trigger model.

AIJul 29, 2024
Apple Intelligence Foundation Language Models

Tom Gunter, Zirui Wang, Chong Wang et al.

We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference, and the evaluation results. We highlight our focus on Responsible AI and how the principles are applied throughout the model development.

CLOct 2, 2023
Do Compressed LLMs Forget Knowledge? An Experimental Study with Practical Implications

Duc N. M Hoang, Minsik Cho, Thomas Merth et al.

Compressing Large Language Models (LLMs) often leads to reduced performance, especially for knowledge-intensive tasks. In this work, we dive into how compression damages LLMs' inherent knowledge and the possible remedies. We start by proposing two conjectures on the nature of the damage: one is certain knowledge being forgotten (or erased) after LLM compression, hence necessitating the compressed model to (re)learn from data with additional parameters; the other presumes that knowledge is internally displaced and hence one requires merely "inference re-direction" with input-side augmentation such as prompting, to recover the knowledge-related performance. Extensive experiments are then designed to (in)validate the two conjectures. We observe the promise of prompting in comparison to model tuning; we further unlock prompting's potential by introducing a variant called Inference-time Dynamic Prompting (IDP), that can effectively increase prompt diversity without incurring any inference overhead. Our experiments consistently suggest that compared to the classical re-training alternatives such as LoRA, prompting with IDP leads to better or comparable post-compression performance recovery, while saving the extra parameter size by 21x and reducing inference latency by 60%. Our experiments hence strongly endorse the conjecture of "knowledge displaced" over "knowledge forgotten", and shed light on a new efficient mechanism to restore compressed LLM performance. We additionally visualize and analyze the different attention and activation patterns between prompted and re-trained models, demonstrating they achieve performance recovery in two different regimes.

LGOct 9, 2023
Streaming Anchor Loss: Augmenting Supervision with Temporal Significance

Utkarsh Oggy Sarawgi, John Berkowitz, Vineet Garg et al.

Streaming neural network models for fast frame-wise responses to various speech and sensory signals are widely adopted on resource-constrained platforms. Hence, increasing the learning capacity of such streaming models (i.e., by adding more parameters) to improve the predictive power may not be viable for real-world tasks. In this work, we propose a new loss, Streaming Anchor Loss (SAL), to better utilize the given learning capacity by encouraging the model to learn more from essential frames. More specifically, our SAL and its focal variations dynamically modulate the frame-wise cross entropy loss based on the importance of the corresponding frames so that a higher loss penalty is assigned for frames within the temporal proximity of semantically critical events. Therefore, our loss ensures that the model training focuses on predicting the relatively rare but task-relevant frames. Experimental results with standard lightweight convolutional and recurrent streaming networks on three different speech based detection tasks demonstrate that SAL enables the model to learn the overall task more effectively with improved accuracy and latency, without any additional data, model parameters, or architectural changes.

ASSep 6, 2024
SLiCK: Exploiting Subsequences for Length-Constrained Keyword Spotting

Kumari Nishu, Minsik Cho, Devang Naik

User-defined keyword spotting on a resource-constrained edge device is challenging. However, keywords are often bounded by a maximum keyword length, which has been largely under-leveraged in prior works. Our analysis of keyword-length distribution shows that user-defined keyword spotting can be treated as a length-constrained problem, eliminating the need for aggregation over variable text length. This leads to our proposed method for efficient keyword spotting, SLiCK (exploiting Subsequences for Length-Constrained Keyword spotting). We further introduce a subsequence-level matching scheme to learn audio-text relations at a finer granularity, thus distinguishing similar-sounding keywords more effectively through enhanced context. In SLiCK, the model is trained with a multi-task learning approach using two modules: Matcher (utterance-level matching task, novel subsequence-level matching task) and Encoder (phoneme recognition task). The proposed method improves the baseline results on Libriphrase hard dataset, increasing AUC from $88.52$ to $94.9$ and reducing EER from $18.82$ to $11.1$.

LGMay 11
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

Mohammadreza Armandpour, Fatih Ilhan, David Harrison et al.

On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, and in the case of self-distillation, which specific context should serve as the supervisory signal? Does the optimal choice vary from one token to the next? At present, addressing these questions typically requires costly training runs whose aggregate performance metrics obscure the dynamics at the level of individual tokens. We introduce a training-free diagnostic framework that operates at the highest resolution: per token, per question, and per teacher. We derive an ideal per-node gradient defined as the parameter update that maximally increases the student's probability of success. We then develop a scalable targeted-rollout algorithm to estimate this gradient efficiently, even for long chains of intermediate thoughts. The gradient alignment score, defined as the cosine similarity between this ideal gradient and any given distillation gradient, quantifies the extent to which a particular configuration approximates the ideal signal. Across a range of self-distillation settings and external teacher models, we observe that distillation guidance exhibits substantially higher alignment with the ideal on incorrect rollouts than on correct ones, where the student already performs well and the teacher's signal tends to become noisy. Furthermore, we find that the optimal distillation context depends jointly on the student model's capacity and the target task, and that no single universally effective configuration emerges. These findings motivate the use of per-task, per-token diagnostic analyses for distillation.

CLDec 12, 2023
LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko et al. · uw

Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this hardware-informed framework, we introduce two principal techniques. First, "windowing" strategically reduces data transfer by reusing previously activated neurons, and second, "row-column bundling", tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches in CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.

CLSep 22, 2025Code
EpiCache: Episodic KV Cache Management for Long Conversational Question Answering

Minsoo Kim, Arnav Kundu, Han-Byul Kim et al.

Modern large language models (LLMs) extend context lengths to millions of tokens, enabling coherent, personalized responses grounded in long conversational histories. This ability, however, hinges on Key-Value (KV) caching, whose memory grows linearly with dialogue length and quickly becomes the bottleneck in resource-constrained environments. An active line of research for reducing memory bottleneck is KV cache compression, which seeks to limit cache size while preserving accuracy. Yet existing methods face two major limitations: (i) evicting the KV cache after full-context prefill causes unbounded peak memory, and (ii) query-dependent eviction narrows the cache to a single query, leading to failure cases in multi-turn conversations. We introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and applies episode-specific KV cache eviction. We further design an adaptive layer-wise budget allocation strategy that measures each layer's sensitivity to eviction and distributes the memory budget across layers accordingly. Across three LongConvQA benchmarks, EpiCache improves accuracy by up to 40%, maintains near-full KV accuracy under 4-6x compression, and reduces latency/memory by up to 2.4x/3.5x, enabling efficient multi-turn interaction under strict resource limits. Our code is available at https://github.com/apple/ml-epicache.

DCNov 29, 2018Code
Data-parallel distributed training of very large models beyond GPU capacity

Samuel Matzek, Max Grossman, Minsik Cho et al.

GPUs have limited memory and it is difficult to train wide and/or deep models that cause the training process to go out of memory. It is shown in this paper how an open source tool called Large Model Support (LMS) can utilize a high bandwidth NVLink connection between CPUs and GPUs to accomplish training of deep convolutional networks. LMS performs tensor swapping between CPU memory and GPU memory such that only a minimal number of tensors required in a training step are kept in the GPU memory. It is also shown how LMS can be combined with an MPI based distributed deep learning module to train models in a data-parallel fashion across multiple GPUs, such that each GPU is utilizing the CPU memory for tensor swapping. The hardware architecture that enables the high bandwidth GPU link with the CPU is discussed as well as the associated set of software tools that are available as the PowerAI package.

CLMay 7
TIDE: Every Layer Knows the Token Beneath the Context

Ajay Jaiswal, Lauren Hannah, Han-Byul Kim et al.

We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where a Zipf-type distribution of vocabulary causes rare-token embeddings are chronically under-trained due to receiving a fraction of the cumulative gradient signal compared to common tokens; and (ii) the Contextual Collapse Problem, where limited parameters models map distributionally similar tokens to indistinguishable hidden states. As an attempt to address both, we propose TIDE, which augments the standard transformer with EmbeddingMemory: an ensemble of K independent MemoryBlocks that map token indices to context-free semantic vectors, computed once and injected into every layer through a depth-conditioned softmax router with a learnable null bank. We theoretically and empirically establish the benefits of TIDE in addressing the issues associated with single-token identity injection as well as improve performance across multiple language modeling and downstream tasks.

DCMay 8, 2024
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation

Minsik Cho, Mohammad Rastegari, Devang Naik

Large Language Model or LLM inference has two phases, the prompt (or prefill) phase to output the first token and the extension (or decoding) phase to the generate subsequent tokens. In this work, we propose an efficient parallelization scheme, KV-Runahead to accelerate the prompt phase. The key observation is that the extension phase generates tokens faster than the prompt phase because of key-value cache (KV-cache). Hence, KV-Runahead parallelizes the prompt phase by orchestrating multiple processes to populate the KV-cache and minimizes the time-to-first-token (TTFT). Dual-purposing the KV-cache scheme has two main benefits. First, since KV-cache is designed to leverage the causal attention map, we minimize computation and computation automatically. Second, since it already exists for the extension phase, KV-Runahead is easy to implement. We further propose context-level load-balancing to handle uneven KV-cache generation (due to the causal attention) and to optimize TTFT. Compared with an existing parallelization scheme such as tensor or sequential parallelization where keys and values are locally generated and exchanged via all-gather collectives, our experimental results demonstrate that KV-Runahead can offer over 1.4x and 1.6x speedups for Llama 7B and Falcon 7B respectively.

LGJul 17, 2025
Apple Intelligence Foundation Language Models: Tech Report 2025

Ethan Li, Anders Boesen Lindbo Larsen, Chen Zhang et al. · apple-ml, cmu

We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: i a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and ii a scalable server model built on a novel Parallel-Track Mixture-of-Experts PT-MoE transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple's Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users' privacy with innovations like Private Cloud Compute.

CLJul 16, 2025
Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential

Mohammad Samragh, Arnav Kundu, David Harrison et al.

Autoregressive language models are constrained by their inherently sequential nature, generating one token at a time. This paradigm limits inference speed and parallelism, especially during later stages of generation when the direction and semantics of text are relatively certain. In this work, we propose a novel framework that leverages the inherent knowledge of vanilla autoregressive language models about future tokens, combining techniques to realize this potential and enable simultaneous prediction of multiple subsequent tokens. Our approach introduces several key innovations: (1) a masked-input formulation where multiple future tokens are jointly predicted from a common prefix; (2) a gated LoRA formulation that preserves the original LLM's functionality, while equipping it for multi-token prediction; (3) a lightweight, learnable sampler module that generates coherent sequences from the predicted future tokens; (4) a set of auxiliary training losses, including a consistency loss, to enhance the coherence and accuracy of jointly generated tokens; and (5) a speculative generation strategy that expands tokens quadratically in the future while maintaining high fidelity. Our method achieves significant speedups through supervised fine-tuning on pretrained models. For example, it generates code and math nearly 5x faster, and improves general chat and knowledge tasks by almost 2.5x. These gains come without any loss in quality.

DCFeb 28, 2025
SPD: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models

Han-Byul Kim, Duc Hoang, Arnav Kundu et al.

With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieve scalability and low latency. Therefore, we introduce a novel optimization technique, Sync-Point Drop (SPD), to reduce communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. In detail, we first propose a block design that allows execution to proceed without communication through SPD. Second, we apply different SPD strategies to attention blocks based on their sensitivity to the model accuracy. The proposed methods effectively alleviate communication bottlenecks while minimizing accuracy degradation during LLM inference, offering a scalable solution for diverse distributed environments: SPD offered about 20% overall inference latency reduction with < 1% accuracy regression for LLaMA2-70B inference over 8 GPUs.

AINov 12, 2024
Towards Low-bit Communication for Tensor Parallel LLM Inference

Harry Dong, Tyler Johnson, Minsik Cho et al.

Tensor parallelism provides an effective way to increase server large language model (LLM) inference efficiency despite adding an additional communication cost. However, as server LLMs continue to scale in size, they will need to be distributed across more devices, magnifying the communication cost. One way to approach this problem is with quantization, but current methods for LLMs tend to avoid quantizing the features that tensor parallelism needs to communicate. Taking advantage of consistent outliers in communicated features, we introduce a quantization method that reduces communicated values on average from 16 bits to 4.2 bits while preserving nearly all of the original performance. For instance, our method maintains around 98.0% and 99.5% of Gemma 2 27B's and Llama 2 13B's original performance, respectively, averaged across all tasks we evaluated on.

CLFeb 17, 2025
From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs

Kumari Nishu, Sachin Mehta, Samira Abnar et al.

Training large language models (LLMs) for different inference constraints is computationally expensive, limiting control over efficiency-accuracy trade-offs. Moreover, once trained, these models typically process tokens uniformly, regardless of their complexity, leading to static and inflexible behavior. In this paper, we introduce a post-training optimization framework, DynaMoE, that adapts a pre-trained dense LLM to a token-difficulty-driven Mixture-of-Experts model with minimal fine-tuning cost. This adaptation makes the model dynamic, with sensitivity control to customize the balance between efficiency and accuracy. DynaMoE features a token-difficulty-aware router that predicts the difficulty of tokens and directs them to the appropriate sub-networks or experts, enabling larger experts to handle more complex tokens and smaller experts to process simpler ones. Our experiments demonstrate that DynaMoE can generate a range of adaptive model variants of the existing trained LLM with a single fine-tuning step, utilizing only $10B$ tokens, a minimal cost compared to the base model's training. Each variant offers distinct trade-offs between accuracy and performance. Compared to the baseline post-training optimization framework, Flextron, our method achieves similar aggregated accuracy across downstream tasks, despite using only $\frac{1}{9}\text{th}$ of their fine-tuning cost.

LGFeb 3
SpecMD: A Comprehensive Study On Speculative Expert Prefetching

Duc Hoang, Ajay Jaiswal, Mohammad Samragh et al.

Mixture-of-Experts (MoE) models enable sparse expert activation, meaning that only a subset of the model's parameters is used during each inference. However, to translate this sparsity into practical performance, an expert caching mechanism is required. Previous works have proposed hardware-centric caching policies, but how these various caching policies interact with each other and different hardware specification remains poorly understood. To address this gap, we develop \textbf{SpecMD}, a standardized framework for benchmarking ad-hoc cache policies on various hardware configurations. Using SpecMD, we perform an exhaustive benchmarking of several MoE caching strategies, reproducing and extending prior approaches in controlled settings with realistic constraints. Our experiments reveal that MoE expert access is not consistent with temporal locality assumptions (e.g LRU, LFU). Motivated by this observation, we propose \textbf{Least-Stale}, a novel eviction policy that exploits MoE's predictable expert access patterns to reduce collision misses by up to $85\times$ over LRU. With such gains, we achieve over $88\%$ hit rates with up to $34.7\%$ Time-to-first-token (TTFT) reduction on OLMoE at only $5\%$ or $0.6GB$ of VRAM cache capacity.

CLOct 15, 2025
Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference

Nikhil Bhendawade, Kumari Nishu, Arnav Kundu et al.

Speculative decoding accelerates LLM inference by using a draft model to look ahead, but gains are capped by the cost of autoregressive draft generation: increasing draft size elevates acceptance rates but introduces additional latency overhead exacerbating the speed-accuracy tradeoff. Prior methods (Medusa, Hydra, EAGLE) partially reduce draft cost but either degrade acceptance or introduce overheads that limit scaling. We present Mirror Speculative Decoding (Mirror-SD), an inference algorithm that breaks the latency-acceptance tradeoff. Mirror-SD launches branch-complete rollouts from early-exit signals in parallel with the target model's suffix and explicitly maps computation across heterogeneous accelerators (GPU and NPU) to exploit cross-device parallelism. The draft speculates forward continuations for the target to verify, while the target simultaneously speculates correction paths for the draft, converting speculation into two complementary execution pipelines. To further cut draft latency without weakening acceptance semantics, we add speculative streaming so the draft emits multiple tokens per step. This dual strategy of parallel heterogeneous execution plus multi-token speculative streaming pushes speculative decoding toward its ideal regime of high acceptance with low overhead. On SpecBench with server-scale models from 14B to 66B parameters, Mirror-SD delivers consistent end-to-end gains, achieving 2.8x-5.8x wall-time speedups across diverse tasks and a 30% average relative improvement over the strongest baseline, EAGLE3.

LGSep 27, 2025
MoE-PHDS: One MoE checkpoint for flexible runtime sparsity

Lauren. A Hannah, Soheil Zibakhsh, Kumari Nishu et al.

Sparse Mixtures of Experts (MoEs) are typically trained to operate at a fixed sparsity level, e.g. $k$ in a top-$k$ gating function. This global sparsity level determines an operating point on the accuracy/latency curve; currently, meeting multiple efficiency targets means training and maintaining multiple models. This practice complicates serving, increases training and maintenance costs, and limits flexibility in meeting diverse latency, efficiency, and energy requirements. We show that pretrained MoEs are more robust to runtime sparsity shifts than commonly assumed, and introduce MoE-PHDS ({\bf P}ost {\bf H}oc {\bf D}eclared {\bf S}parsity), a lightweight SFT method that turns a single checkpoint into a global sparsity control surface. PHDS mixes training across sparsity levels and anchors with a short curriculum at high sparsity, requiring no architectural changes. The result is predictable accuracy/latency tradeoffs from one model: practitioners can ``dial $k$'' at inference time without swapping checkpoints, changing architecture, or relying on token-level heuristics. Experiments on OLMoE-1B-7B-0125, Qwen1.5-MoE-A2.7B, and proprietary models fit on multiple operating points show that PHDS matches or exceeds well-specified oracle models, improves cross-sparsity agreement by up to 22\% vs. well-specified oracle models, and enables simplified, flexible runtime MoE deployment by making global sparsity a first-class serving primitive.

AISep 21, 2025
MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE

Soheil Zibakhsh, Mohammad Samragh, Kumari Nishu et al.

The generation quality of large language models (LLMs) is often improved by utilizing inference-time sequence-level scaling methods (e.g., Chain-of-Thought). We introduce hyper-parallel scaling, a complementary framework that improves prediction quality at the token level. Hyper-parallel scaling computes and aggregates multiple output proposals for a single token from the model. We implement this concept in Mixture-of-Experts (MoE) models, which we refer to as Roster of Experts (RoE). RoE is a training-free inference algorithm that turns a single MoE into a dynamic ensemble of MoEs. RoE injects controlled stochasticity into the expert routing mechanism, enabling it to sample multiple diverse experts for each token and aggregate their outputs for a more accurate final prediction. To overcome the computational cost, we introduce an efficient batching strategy and a specialized KV-caching mechanism that minimizes compute and memory overhead. For example, RoE enables a 7B MoE model to match the performance of a 10.5B MoE model while using 30% less compute for inference. These gains are achieved without any fine-tuning of model parameters.

LGSep 2, 2023
eDKM: An Efficient and Accurate Train-time Weight Clustering for Large Language Models

Minsik Cho, Keivan A. Vahid, Qichen Fu et al.

Since Large Language Models or LLMs have demonstrated high-quality performance on many complex language tasks, there is a great interest in bringing these LLMs to mobile devices for faster responses and better privacy protection. However, the size of LLMs (i.e., billions of parameters) requires highly effective compression to fit into storage-limited devices. Among many compression techniques, weight-clustering, a form of non-linear quantization, is one of the leading candidates for LLM compression, and supported by modern smartphones. Yet, its training overhead is prohibitively significant for LLM fine-tuning. Especially, Differentiable KMeans Clustering, or DKM, has shown the state-of-the-art trade-off between compression ratio and accuracy regression, but its large memory complexity makes it nearly impossible to apply to train-time LLM compression. In this paper, we propose a memory-efficient DKM implementation, eDKM powered by novel techniques to reduce the memory footprint of DKM by orders of magnitudes. For a given tensor to be saved on CPU for the backward pass of DKM, we compressed the tensor by applying uniquification and sharding after checking if there is no duplicated tensor previously copied to CPU. Our experimental results demonstrate that \prjname can fine-tune and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB (3bit/weight) with the Alpaca dataset by reducing the train-time memory footprint of a decoder layer by 130$\times$, while delivering good accuracy on broader LLM benchmarks (i.e., 77.7% for PIQA, 66.1% for Winograde, and so on).

LGMay 18, 2023
PDP: Parameter-free Differentiable Pruning is All You Need

Minsik Cho, Saurabh Adya, Devang Naik

DNN pruning is a popular way to reduce the size of a model, improve the inference latency, and minimize the power consumption on DNN accelerators. However, existing approaches might be too complex, expensive or ineffective to apply to a variety of vision/language tasks, DNN architectures and to honor structured pruning constraints. In this paper, we propose an efficient yet effective train-time pruning scheme, Parameter-free Differentiable Pruning (PDP), which offers state-of-the-art qualities in model size, accuracy, and training cost. PDP uses a dynamic function of weights during training to generate soft pruning masks for the weights in a parameter-free manner for a given pruning target. While differentiable, the simplicity and efficiency of PDP make it universal enough to deliver state-of-the-art random/structured/channel pruning results on various vision and natural language tasks. For example, for MobileNet-v1, PDP can achieve 68.2% top-1 ImageNet1k accuracy at 86.6% sparsity, which is 1.7% higher accuracy than those from the state-of-the-art algorithms. Also, PDP yields over 83.1% accuracy on Multi-Genre Natural Language Inference with 90% sparsity for BERT, while the next best from the existing techniques shows 81.5% accuracy. In addition, PDP can be applied to structured pruning, such as N:M pruning and channel pruning. For 1:4 structured pruning of ResNet18, PDP improved the top-1 ImageNet1k accuracy by over 3.6% over the state-of-the-art. For channel pruning of ResNet50, PDP reduced the top-1 ImageNet1k accuracy by 0.6% from the state-of-the-art.

LGAug 28, 2021
DKM: Differentiable K-Means Clustering Layer for Neural Network Compression

Minsik Cho, Keivan A. Vahid, Saurabh Adya et al.

Deep neural network (DNN) model compression for efficient on-device inference is becoming increasingly important to reduce memory requirements and keep user data on-device. To this end, we propose a novel differentiable k-means clustering layer (DKM) and its application to train-time weight clustering-based DNN model compression. DKM casts k-means clustering as an attention problem and enables joint optimization of the DNN parameters and clustering centroids. Unlike prior works that rely on additional regularizers and parameters, DKM-based compression keeps the original loss function and model architecture fixed. We evaluated DKM-based compression on various DNN models for computer vision and natural language processing (NLP) tasks. Our results demonstrate that DKM delivers superior compression and accuracy trade-off on ImageNet1k and GLUE benchmarks. For example, DKM-based compression can offer 74.5% top-1 ImageNet1k accuracy on ResNet50 DNN model with 3.3MB model size (29.4x model compression factor). For MobileNet-v1, which is a challenging DNN to compress, DKM delivers 63.9% top-1 ImageNet1k accuracy with 0.72 MB model size (22.4x model compression factor). This result is 6.8% higher top-1accuracy and 33% relatively smaller model size than the current state-of-the-art DNN compression algorithms. Additionally, DKM enables compression of DistilBERT model by 11.8x with minimal (1.1%) accuracy loss on GLUE NLP benchmarks.

CVNov 20, 2020
Large Scale Neural Architecture Search with Polyharmonic Splines

Ulrich Finkler, Michele Merler, Rameswar Panda et al.

Neural Architecture Search (NAS) is a powerful tool to automatically design deep neural networks for many tasks, including image classification. Due to the significant computational burden of the search phase, most NAS methods have focused so far on small, balanced datasets. All attempts at conducting NAS at large scale have employed small proxy sets, and then transferred the learned architectures to larger datasets by replicating or stacking the searched cells. We propose a NAS method based on polyharmonic splines that can perform search directly on large scale, imbalanced target datasets. We demonstrate the effectiveness of our method on the ImageNet22K benchmark[16], which contains 14 million images distributed in a highly imbalanced manner over 21,841 categories. By exploring the search space of the ResNet [23] and Big-Little Net ResNext [11] architectures directly on ImageNet22K, our polyharmonic splines NAS method designed a model which achieved a top-1 accuracy of 40.03% on ImageNet22K, an absolute improvement of 3.13% over the state of the art with similar global batch size [15].

CVJun 23, 2020
NASTransfer: Analyzing Architecture Transferability in Large Scale Neural Architecture Search

Rameswar Panda, Michele Merler, Mayoore Jaiswal et al.

Neural Architecture Search (NAS) is an open and challenging problem in machine learning. While NAS offers great promise, the prohibitive computational demand of most of the existing NAS methods makes it difficult to directly search the architectures on large-scale tasks. The typical way of conducting large scale NAS is to search for an architectural building block on a small dataset (either using a proxy set from the large dataset or a completely different small scale dataset) and then transfer the block to a larger dataset. Despite a number of recent results that show the promise of transfer from proxy datasets, a comprehensive evaluation of different NAS methods studying the impact of different source datasets has not yet been addressed. In this work, we propose to analyze the architecture transferability of different NAS methods by performing a series of experiments on large scale benchmarks such as ImageNet1K and ImageNet22K. We find that: (i) The size and domain of the proxy set does not seem to influence architecture performance on the target dataset. On average, transfer performance of architectures searched using completely different small datasets (e.g., CIFAR10) perform similarly to the architectures searched directly on proxy target datasets. However, design of proxy sets has considerable impact on rankings of different NAS methods. (ii) While different NAS methods show similar performance on a source dataset (e.g., CIFAR10), they significantly differ on the transfer performance to a large dataset (e.g., ImageNet1K). (iii) Even on large datasets, random sampling baseline is very competitive, but the choice of the appropriate combination of proxy set and search strategy can provide significant improvement over it. We believe that our extensive empirical analysis will prove useful for future design of NAS algorithms.

LGJan 14, 2020
SimEx: Express Prediction of Inter-dataset Similarity by a Fleet of Autoencoders

Inseok Hwang, Jinho Lee, Frank Liu et al.

Knowing the similarity between sets of data has a number of positive implications in training an effective model, such as assisting an informed selection out of known datasets favorable to model transfer or data augmentation problems with an unknown dataset. Common practices to estimate the similarity between data include comparing in the original sample space, comparing in the embedding space from a model performing a certain task, or fine-tuning a pretrained model with different datasets and evaluating the performance changes therefrom. However, these practices would suffer from shallow comparisons, task-specific biases, or extensive time and computations required to perform comparisons. We present SimEx, a new method for early prediction of inter-dataset similarity using a set of pretrained autoencoders each of which is dedicated to reconstructing a specific part of known data. Specifically, our method takes unknown data samples as input to those pretrained autoencoders, and evaluate the difference between the reconstructed output samples against their original input samples. Our intuition is that, the more similarity exists between the unknown data samples and the part of known data that an autoencoder was trained with, the better chances there could be that this autoencoder makes use of its trained knowledge, reconstructing output samples closer to the originals. We demonstrate that our method achieves more than 10x speed-up in predicting inter-dataset similarity compared to common similarity-estimating practices. We also demonstrate that the inter-dataset similarity estimated by our method is well-correlated with common practices and outperforms the baselines approaches of comparing at sample- or embedding-spaces, without newly training anything at the comparison time.

GR-QCNov 26, 2019
Enabling real-time multi-messenger astrophysics discoveries with deep learning

E. A. Huerta, Gabrielle Allen, Igor Andreoni et al.

Multi-messenger astrophysics is a fast-growing, interdisciplinary field that combines data, which vary in volume and speed of data processing, from many different instruments that probe the Universe using different cosmic messengers: electromagnetic waves, cosmic rays, gravitational waves and neutrinos. In this Expert Recommendation, we review the key challenges of real-time observations of gravitational wave sources and their electromagnetic and astroparticle counterparts, and make a number of recommendations to maximize their potential for scientific discovery. These recommendations refer to the design of scalable and computationally efficient machine learning algorithms; the cyber-infrastructure to numerically simulate astrophysical sources, and to process and interpret multi-messenger astrophysics data; the management of gravitational wave detections to trigger real-time alerts for electromagnetic and astroparticle follow-ups; a vision to harness future developments of machine learning and cyber-infrastructure resources to cope with the big-data requirements; and the need to build a community of experts to realize the goals of multi-messenger astrophysics.

LGOct 15, 2019
MUTE: Data-Similarity Driven Multi-hot Target Encoding for Neural Network Design

Mayoore S. Jaiswal, Bumsoo Kang, Jinho Lee et al.

Target encoding is an effective technique to deliver better performance for conventional machine learning methods, and recently, for deep neural networks as well. However, the existing target encoding approaches require significant increase in the learning capacity, thus demand higher computation power and more training data. In this paper, we present a novel and efficient target encoding scheme, MUTE to improve both generalizability and robustness of a target model by understanding the inter-class characteristics of a target dataset. By extracting the confusion level between the target classes in a dataset, MUTE strategically optimizes the Hamming distances among target encoding. Such optimized target encoding offers higher classification strength for neural network models with negligible computation overhead and without increasing the model size. When MUTE is applied to the popular image classification networks and datasets, our experimental results show that MUTE offers better generalization and defense against the noises and adversarial attacks over the existing solutions.

IMFeb 1, 2019
Deep Learning for Multi-Messenger Astrophysics: A Gateway for Discovery in the Big Data Era

Gabrielle Allen, Igor Andreoni, Etienne Bachelet et al.

This report provides an overview of recent work that harnesses the Big Data Revolution and Large Scale Computing to address grand computational challenges in Multi-Messenger Astrophysics, with a particular emphasis on real-time discovery campaigns. Acknowledging the transdisciplinary nature of Multi-Messenger Astrophysics, this document has been prepared by members of the physics, astronomy, computer science, data science, software and cyberinfrastructure communities who attended the NSF-, DOE- and NVIDIA-funded "Deep Learning for Multi-Messenger Astrophysics: Real-time Discovery at Scale" workshop, hosted at the National Center for Supercomputing Applications, October 17-19, 2018. Highlights of this report include unanimous agreement that it is critical to accelerate the development and deployment of novel, signal-processing algorithms that use the synergy between artificial intelligence (AI) and high performance computing to maximize the potential for scientific discovery with Multi-Messenger Astrophysics. We discuss key aspects to realize this endeavor, namely (i) the design and exploitation of scalable and computationally efficient AI algorithms for Multi-Messenger Astrophysics; (ii) cyberinfrastructure requirements to numerically simulate astrophysical sources, and to process and interpret Multi-Messenger Astrophysics data; (iii) management of gravitational wave detections and triggers to enable electromagnetic and astro-particle follow-ups; (iv) a vision to harness future developments of machine and deep learning and cyberinfrastructure resources to cope with the scale of discovery in the Big Data Era; (v) and the need to build a community that brings domain experts together with data scientists on equal footing to maximize and accelerate discovery in the nascent field of Multi-Messenger Astrophysics.

LGJul 26, 2018
A Unified Approximation Framework for Compressing and Accelerating Deep Neural Networks

Yuzhe Ma, Ran Chen, Wei Li et al.

Deep neural networks (DNNs) have achieved significant success in a variety of real world applications, i.e., image classification. However, tons of parameters in the networks restrict the efficiency of neural networks due to the large model size and the intensive computation. To address this issue, various approximation techniques have been investigated, which seek for a light weighted network with little performance degradation in exchange of smaller model size or faster inference. Both low-rankness and sparsity are appealing properties for the network approximation. In this paper we propose a unified framework to compress the convolutional neural networks (CNNs) by combining these two properties, while taking the nonlinear activation into consideration. Each layer in the network is approximated by the sum of a structured sparse component and a low-rank component, which is formulated as an optimization problem. Then, an extended version of alternating direction method of multipliers (ADMM) with guaranteed convergence is presented to solve the relaxed optimization problem. Experiments are carried out on VGG-16, AlexNet and GoogLeNet with large image classification datasets. The results outperform previous work in terms of accuracy degradation, compression rate and speedup ratio. The proposed method is able to remarkably compress the model (with up to 4.9x reduction of parameters) at a cost of little loss or without loss on accuracy.

DCAug 7, 2017
PowerAI DDL

Minsik Cho, Ulrich Finkler, Sameer Kumar et al.

As deep neural networks become more complex and input datasets grow larger, it can take days or even weeks to train a deep neural network to the desired accuracy. Therefore, distributed Deep Learning at a massive scale is a critical capability, since it offers the potential to reduce the training time from weeks to hours. In this paper, we present a software-hardware co-optimized distributed Deep Learning system that can achieve near-linear scaling up to hundreds of GPUs. The core algorithm is a multi-ring communication pattern that provides a good tradeoff between latency and bandwidth and adapts to a variety of system configurations. The communication algorithm is implemented as a library for easy use. This library has been integrated into Tensorflow, Caffe, and Torch. We train Resnet-101 on Imagenet 22K with 64 IBM Power8 S822LC servers (256 GPUs) in about 7 hours to an accuracy of 33.8 % validation accuracy. Microsoft's ADAM and Google's DistBelief results did not reach 30 % validation accuracy for Imagenet 22K. Compared to Facebook AI Research's recent paper on 256 GPU training, we use a different communication algorithm, and our combined software and hardware system offers better communication overhead for Resnet-50. A PowerAI DDL enabled version of Torch completed 90 epochs of training on Resnet 50 for 1K classes in 50 minutes using 64 IBM Power8 S822LC servers (256 GPUs).

LGJun 21, 2017
MEC: Memory-efficient Convolution for Deep Neural Network

Minsik Cho, Daniel Brand

Convolution is a critical component in modern deep neural networks, thus several algorithms for convolution have been developed. Direct convolution is simple but suffers from poor performance. As an alternative, multiple indirect methods have been proposed including im2col-based convolution, FFT-based convolution, or Winograd-based algorithm. However, all these indirect methods have high memory-overhead, which creates performance degradation and offers a poor trade-off between performance and memory consumption. In this work, we propose a memory-efficient convolution or MEC with compact lowering, which reduces memory-overhead substantially and accelerates convolution process. MEC lowers the input matrix in a simple yet efficient/compact way (i.e., much less memory-overhead), and then executes multiple small matrix multiplications in parallel to get convolution completed. Additionally, the reduced memory footprint improves memory sub-system efficiency, improving performance. Our experimental results show that MEC reduces memory consumption significantly with good speedup on both mobile and server platforms, compared with other indirect convolution algorithms.