CL · Apr 8

DiffuMask: Diffusion Language Model for Token-level Prompt Pruning

Caleb Zheng, Jyotika Singh, Fang Tu, Weiyi Sun, Sujeeth Bharadwaj, Yassine Benajiba, Sujith Ravi, Eli Shlizerman, Dan Roth

arXiv:2604.06627 · 37.01 citations · h-index: 22

Predicted impact: top 10% in CL · last 90 days

Originality: Incremental advance

AI Analysis

This work addresses the computational cost of prompt compression for users of large language models, offering a faster and more controllable method. The advance is incremental, since it builds on existing pruning techniques.

The paper tackles the problem of long, expensive prompts in large language models by introducing DiffuMask, a diffusion-based framework for prompt pruning. It achieves up to 80% prompt length reduction while maintaining or improving accuracy across in-domain, out-of-domain, and cross-model settings.

In-Context Learning and Chain-of-Thought prompting improve reasoning in large language models (LLMs). These gains typically come at the cost of longer, more expensive prompts that may contain redundant information. Prompt compression based on pruning offers a practical solution, yet existing methods rely on sequential token removal, which is computationally intensive. We present DiffuMask, a diffusion-based framework that integrates hierarchical shot-level and token-level pruning signals and enables rapid, parallel prompt pruning via iterative mask prediction. DiffuMask substantially accelerates the compression process by masking multiple tokens in each denoising step. It offers tunable control over retained content, preserving essential reasoning context and achieving up to 80% prompt length reduction. Meanwhile, it maintains or improves accuracy across in-domain, out-of-domain, and cross-model settings. Our results show that DiffuMask provides a generalizable and controllable framework for prompt compression, facilitating faster and more reliable in-context reasoning in LLMs.
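The idea of masking several tokens per step, rather than removing them one at a time, can be illustrated with a toy sketch. This is not the authors' implementation: `prune_prompt`, `toy_importance`, the linear per-step quota, and the `keep_ratio` and `steps` parameters are all hypothetical names and choices made for illustration. In particular, `toy_importance` is a crude placeholder (token length) standing in for whatever a trained diffusion language model would predict.

```python
import math
from typing import Callable, List, Sequence


def toy_importance(tokens: Sequence[str]) -> List[float]:
    """Hypothetical placeholder scorer: longer tokens count as more important.

    A real system would use a learned mask predictor here instead.
    """
    return [float(len(t)) for t in tokens]


def prune_prompt(
    tokens: Sequence[str],
    keep_ratio: float,
    steps: int,
    score_fn: Callable[[Sequence[str]], List[float]] = toy_importance,
) -> List[str]:
    """Drop low-scoring tokens in a few parallel steps until keep_ratio is met."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")

    n = len(tokens)
    target_keep = max(1, math.ceil(keep_ratio * n)) if n else 0
    to_remove = n - target_keep
    kept = list(range(n))  # indices of tokens that are still unmasked

    for step in range(steps):
        already_removed = n - len(kept)
        # Assumed linear schedule: spread the remaining removals over remaining steps.
        quota = math.ceil((to_remove - already_removed) / (steps - step))
        if quota <= 0:
            break
        current = [tokens[i] for i in kept]
        scores = score_fn(current)  # re-score the partially pruned prompt
        # Mask the `quota` lowest-scoring positions together in this step.
        order = sorted(range(len(current)), key=lambda j: scores[j])
        drop = set(order[:quota])
        kept = [idx for j, idx in enumerate(kept) if j not in drop]

    return [tokens[i] for i in kept]


prompt = "Q : what is 2 + 2 ? A : the answer is 4".split()
pruned = prune_prompt(prompt, keep_ratio=0.2, steps=3)
print(len(prompt), "->", len(pruned), pruned)
```

With `keep_ratio=0.2`, the 14-token toy prompt is reduced to 3 tokens in three steps, which corresponds to roughly an 80% length reduction. A sequential pruner would need 11 separate removal passes for the same result; the `keep_ratio` argument plays the role of the tunable control over retained content.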
