LGFeb 5
Compressing LLMs with MoP: Mixture of PrunersBruno Lopes Yamamoto, Lucas Lauton de Alcantara, Victor Zacarias et al.
The high computational demands of Large Language Models (LLMs) motivate methods that reduce parameter count and accelerate inference. In response, model pruning emerges as an effective strategy, yet current methods typically focus on a single dimension-depth or width. We introduce MoP (Mixture of Pruners), an iterative framework that unifies these dimensions. At each iteration, MoP generates two branches-pruning in depth versus pruning in width-and selects a candidate to advance the path. On LLaMA-2 and LLaMA-3, MoP advances the frontier of structured pruning, exceeding the accuracy of competing methods across a broad set of compression regimes. It also consistently outperforms depth-only and width-only pruning. Furthermore, MoP translates structural pruning into real speedup, reducing end-to-end latency by 39% at 40% compression. Finally, extending MoP to the vision-language model LLaVA-1.5, we notably improve computational efficiency and demonstrate that text-only recovery fine-tuning can restore performance even on visual tasks.
LGDec 3, 2025
Technical Report on Text Dataset DistillationKeith Ando Ogawa, Bruno Lopes Yamamoto, Lucas Lauton de Alcantara et al.
In the vision domain, dataset distillation arises as a technique to condense a large dataset into a smaller synthetic one that exhibits a similar result in the training process. While image data presents an extensive literature of distillation methods, text dataset distillation has fewer works in comparison. Text dataset distillation initially grew as an adaptation of efforts from the vision universe, as the particularities of the modality became clear obstacles, it rose into a separate branch of research. Several milestones mark the development of this area, such as the introduction of methods that use transformer models, the generation of discrete synthetic text, and the scaling to decoder-only models with over 1B parameters. Despite major advances in modern approaches, the field remains in a maturing phase, with room for improvement on benchmarking standardization, approaches to overcome the discrete nature of text, handling complex tasks, and providing explicit examples of real-world applications. In this report, we review past and recent advances in dataset distillation for text, highlighting different distillation strategies, key contributions, and general challenges.
LGApr 29, 2025
Efficient LLMs with AMP: Attention Heads and MLP PruningLeandro Giusti Mugnaini, Bruno Lopes Yamamoto, Lucas Lauton de Alcantara et al.
Deep learning drives a new wave in computing systems and triggers the automation of increasingly complex problems. In particular, Large Language Models (LLMs) have significantly advanced cognitive tasks, often matching or even surpassing human-level performance. However, their extensive parameters result in high computational costs and slow inference, posing challenges for deployment in resource-limited settings. Among the strategies to overcome the aforementioned challenges, pruning emerges as a successful mechanism since it reduces model size while maintaining predictive ability. In this paper, we introduce AMP: Attention Heads and MLP Pruning, a novel structured pruning method that efficiently compresses LLMs by removing less critical structures within Multi-Head Attention (MHA) and Multilayer Perceptron (MLP). By projecting the input data onto weights, AMP assesses structural importance and overcomes the limitations of existing techniques, which often fall short in flexibility or efficiency. In particular, AMP surpasses the current state-of-the-art on commonsense reasoning tasks by up to 1.49 percentage points, achieving a 30% pruning ratio with minimal impact on zero-shot task performance. Moreover, AMP also improves inference speeds, making it well-suited for deployment in resource-constrained environments. We confirm the flexibility of AMP on different families of LLMs, including LLaMA and Phi.