CV | May 4, 2022
Generalized Knowledge Distillation via Relationship Matching
Han-Jia Ye, Su Lu, De-Chuan Zhan
The knowledge of a well-trained deep neural network (a.k.a. the "teacher") is valuable for learning similar tasks. Knowledge distillation extracts knowledge from the teacher and integrates it with the target model (a.k.a. the "student"), which expands the student's knowledge and improves its learning efficacy. Instead of requiring the teacher to work on the same task as the student, we borrow the knowledge from a teacher trained on a general label space -- in this "Generalized Knowledge Distillation (GKD)", the classes of the teacher and the student may be the same, completely different, or partially overlapping. We claim that the ability to compare instances is an essential factor threading knowledge across tasks, and propose the RElationship FacIlitated Local cLassifiEr Distillation (REFILLED) approach, which decouples the GKD flow of the embedding and the top-layer classifier. In particular, rather than reconciling the instance-label confidence between models, REFILLED requires the teacher to reweight the hard tuples pushed forward by the student and then matches the similarity comparison levels between instances. An embedding-induced classifier based on the teacher model supervises the student's classification confidence and adaptively emphasizes the most related supervision from the teacher. REFILLED demonstrates strong discriminative ability when the classes of the teacher vary from the same to a fully non-overlapped set w.r.t. the student. It also achieves state-of-the-art performance on standard knowledge distillation, one-step incremental learning, and few-shot learning tasks.
26.3 | LG | May 28
Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging
Yuanyi Wang, Yanggan Gu, Su Lu et al.
Weight-space model merging is usually formulated as an algebraic operation on checkpoints, yet at LLM scale the limiting resource is often the set of expert weights that must be read. We introduce MergePipe, a budget-aware execution layer that casts LLM merging as an expert access-set problem: given a merge operator and a checkpoint family in a shared weight coordinate system, choose which expert delta blocks to access under an explicit I/O budget. MergePipe indexes parameter blocks, builds deterministic access plans, and executes the induced budgeted merge with replayable manifests. The plan is budget-sound by construction and recovers the full-read merge at full budget; for fixed-coefficient additive operators, the omitted-update error is bounded by the norm of omitted deltas. Across Qwen and Llama merging workloads, MergePipe reduces expert-read I/O by up to an order of magnitude and achieves up to $11\times$ speedups. Representative budget sweeps show $O(10^{-3})$ parameter deviation from full-read merges and no monotonic degradation on downstream benchmarks.
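The budgeted-merge idea in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not MergePipe's implementation: the block layout (`start`, `delta`, `coef`, `cost`), the greedy norm-per-cost ordering, and the helper names are all hypothetical. The sketch only shows the two properties the abstract states: a plan never exceeds the budget, and for a fixed-coefficient additive merge the deviation from the full-read merge is bounded (by the triangle inequality) by the weighted norms of the omitted deltas.

```python
import math

def l2(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def plan_access(blocks, budget):
    """Hypothetical deterministic plan: largest delta-norm per unit cost first.
    In a real system the norms would come from a precomputed index, not a read."""
    order = sorted(range(len(blocks)),
                   key=lambda i: (-l2(blocks[i]["delta"]) / blocks[i]["cost"], i))
    chosen, spent = [], 0
    for i in order:
        if spent + blocks[i]["cost"] <= budget:  # never exceed the I/O budget
            chosen.append(i)
            spent += blocks[i]["cost"]
    return sorted(chosen), spent

def merge(base, blocks, selected):
    """Additive merge: base + sum of coef * delta over the selected blocks."""
    out = list(base)
    for i in selected:
        b = blocks[i]
        for k, d in enumerate(b["delta"]):
            out[b["start"] + k] += b["coef"] * d
    return out

def omitted_bound(blocks, selected):
    """Triangle-inequality bound on the error caused by skipped blocks."""
    keep = set(selected)
    return sum(abs(b["coef"]) * l2(b["delta"])
               for i, b in enumerate(blocks) if i not in keep)

base = [0.0, 0.0, 0.0, 0.0]
blocks = [  # toy expert delta blocks (illustrative values)
    {"start": 0, "delta": [3.0, 4.0], "coef": 0.5, "cost": 2},
    {"start": 2, "delta": [0.1, 0.0], "coef": 0.5, "cost": 2},
    {"start": 0, "delta": [0.0, 1.0], "coef": 0.5, "cost": 2},
]

full = merge(base, blocks, range(len(blocks)))
chosen, spent = plan_access(blocks, budget=4)
partial = merge(base, blocks, chosen)
error = l2([f - p for f, p in zip(full, partial)])
bound = omitted_bound(blocks, chosen)
```

With a budget of 4 cost units the plan skips the smallest block, and `error` stays within `bound`; with a budget of 6 every block is read and the result equals the full-read merge.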
LG | Apr 25, 2022
Selective Cross-Task Distillation
Su Lu, Han-Jia Ye, De-Chuan Zhan
The outpouring of various pre-trained models empowers knowledge distillation by providing abundant teacher resources, but a mature mechanism for using these teachers adequately is still lacking. With a massive model repository composed of teachers pre-trained on diverse tasks, we must surmount two obstacles when using knowledge distillation to learn a new task. First, given a fixed computing budget, it is unaffordable to try each teacher and train the student repeatedly, so the most helpful teacher must be identified precisely and efficiently. Second, semantic gaps exist between the teachers and the target student since they are trained on different tasks. Thus, we need to extract knowledge from a general label space that may be different from the student's. Faced with these two challenges, we study a new setting named selective cross-task distillation that includes teacher assessment and generalized knowledge reuse. We bridge the teacher's label space and the student's label space through optimal transport. The transportation cost from the teacher's prediction to the student's prediction measures the relatedness between two tasks and acts as an objective for distillation. Our method reuses cross-task knowledge from a distinct label space and efficiently assesses teachers without enumerating the model repository. Experiments demonstrate the effectiveness of our proposed method.
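As a rough illustration of using a transport cost between predictions over two different label spaces, here is a minimal sketch based on entropic optimal transport (Sinkhorn iterations). It is not the paper's method: the ground cost matrix, the regularization strength, the toy predictions, and the function name `sinkhorn` are assumptions chosen for the example, and how the paper obtains its cost matrix is not stated in the abstract.

```python
import math

def sinkhorn(a, b, cost, reg=0.5, iters=200):
    """Entropic OT between histograms a (teacher) and b (student).
    Returns the transport plan and its total transport cost."""
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, total

# Assumed ground cost between 3 teacher classes (rows) and 2 student classes (columns).
cost = [[0.0, 1.0],
        [1.0, 0.0],
        [0.5, 0.5]]

student = [0.9, 0.1]            # student's prediction on one instance
teacher_a = [0.85, 0.05, 0.10]  # teacher whose prediction aligns with the student
teacher_b = [0.05, 0.85, 0.10]  # teacher whose prediction conflicts with the student

plan_a, cost_a = sinkhorn(teacher_a, student, cost)
plan_b, cost_b = sinkhorn(teacher_b, student, cost)
```

In this toy setup `cost_a` is smaller than `cost_b`, so the first teacher would be scored as more related; the same scalar could in principle also serve as a training signal, which is the dual role the abstract describes.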
24.5 | LG | May 26
Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation
Yuanyi Wang, Su Lu, Yanggan Gu et al.
On-policy distillation (OPD) trains a student on its own rollouts with token-level teacher supervision. Recent selective OPD methods exploit the non-uniformity of OPD signals by prioritizing high-entropy or high-disagreement tokens. We revisit this principle and ask: which token-level teacher signals are actually learnable? Using a fixed-context diagnostic that measures same-context teacher-student KL reduction, we show that raw KL disagreement is a coarse proxy for learning value. It conflates learnable disagreement, where the teacher assigns corrective mass to the student's top-K candidates, with incompatible disagreement, where the teacher places mass mostly off the student's current support. We formalize this local compatibility as token teachability and show that it better predicts fixed-context improvement than raw KL alone. Motivated by this finding, we propose Teachability-Aware OPD (TA-OPD), a lightweight token-position selection method that applies OPD loss to high-teachability positions without reward models or verifiers. Across Qwen2.5 and Qwen 3 teacher-student settings, TA-OPD often surpasses full-token OPD with only 5% retained tokens and improves over entropy- and divergence-based baselines. Our results reframe selective OPD as selecting learnable teacher signals rather than merely salient tokens.
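The distinction between learnable and incompatible disagreement can be made concrete with a toy calculation. The sketch below is an assumption-laden illustration, not the paper's teachability score: it uses KL(teacher || student) as the disagreement measure and the teacher's probability mass on the student's top-K tokens as a stand-in for "on-support" mass. The function names and the toy distributions are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def support_mass(teacher, student, k):
    """Teacher probability mass that falls on the student's top-k tokens."""
    top_k = sorted(range(len(student)), key=lambda i: -student[i])[:k]
    return sum(teacher[i] for i in top_k)

student = [0.5, 0.3, 0.1, 0.1]        # student's next-token distribution
teacher_a = [0.3, 0.5, 0.1, 0.1]      # reorders the student's own top candidates
teacher_b = [0.05, 0.05, 0.45, 0.45]  # puts most mass off the student's support

kl_a, kl_b = kl(teacher_a, student), kl(teacher_b, student)
mass_a = support_mass(teacher_a, student, k=2)
mass_b = support_mass(teacher_b, student, k=2)
```

Here position B has the larger KL, so a purely disagreement-based selector would prioritize it, yet only 10% of the teacher's mass lies on the student's top-2 tokens, versus 80% for position A. This is the kind of case where raw KL and on-support mass point in different directions.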
16.1 | LG | May 13
FeatCal: Feature Calibration for Post-Merging Models
Yanggan Gu, Shuo Cai, Zihao Wang et al.
Model merging combines task experts into one model and avoids joint training, retraining, or deploying many expert models, but the merged model often still underperforms task experts. We study this performance gap through feature drift, the difference between features produced by the merged model and by the expert on the same input. Our theory decomposes this drift into upstream propagation and local mismatch, tracks how it propagates and combines through later layers in forward order, and links final feature drift to output drift. This view motivates FeatCal, which uses a small calibration set to calibrate the merged model weights layer by layer in forward order, reducing feature drift while staying close to merged weights and preserving the benefits of model merging. FeatCal uses an efficient closed-form solution to update model weights, with no gradient descent, iterative optimization, or extra modules. On the main CLIP and GLUE benchmarks, FeatCal beats Surgery and ProbSurgery, the closest post-merging calibration baselines: 85.5% vs. 77.0%/78.8% on CLIP-ViT-B/32 Task Arithmetic (TA) and 85.2% vs. 83.7%/82.2% on FLAN-T5-base GLUE. On CLIP-ViT-B/32, 8 examples per task reach 82.9%, and 256 examples per task take 53 seconds, about 4x faster than both baselines, showing better sample efficiency and lower calibration cost.
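To see why a layer-wise calibration of this kind can admit a closed form, consider a generic ridge-style objective for a single linear layer. This is only an illustrative sketch and not necessarily FeatCal's objective: $X$ (calibration inputs to the layer, one row per example), $Y$ (the expert's features on the same inputs), $W_m$ (the merged weights), and $\lambda > 0$ (how strongly to stay close to the merged weights) are all assumed notation.

$$
\min_{W}\ \lVert XW - Y \rVert_F^2 + \lambda \lVert W - W_m \rVert_F^2
\quad\Longrightarrow\quad
W^\star = \left(X^\top X + \lambda I\right)^{-1}\left(X^\top Y + \lambda W_m\right)
$$

Setting the gradient $2X^\top(XW - Y) + 2\lambda(W - W_m)$ to zero gives the solution directly, so no gradient descent is needed. As $\lambda \to \infty$ the solution returns to $W_m$, and as $\lambda \to 0$ it approaches the least-squares fit to the expert features.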
AI | Aug 7, 2025
InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities
Shuo Cai, Su Lu, Qi Zhou et al.
Large language models (LLMs) have exhibited impressive reasoning abilities on a wide range of complex tasks. However, enhancing these capabilities through post-training remains resource intensive, particularly in terms of data and computational cost. Although recent efforts have sought to improve sample efficiency through selective data curation, existing methods often rely on heuristic or task-specific strategies that hinder scalability. In this work, we introduce InfiAlign, a scalable and sample-efficient post-training framework that integrates supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) to align LLMs for enhanced reasoning. At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements and remains extensible to new data sources. When applied to the Qwen2.5-Math-7B-Base model, our SFT model achieves performance on par with DeepSeek-R1-Distill-Qwen-7B, while using only approximately 12% of the training data, and demonstrates strong generalization across diverse reasoning tasks. Additional improvements are obtained through the application of DPO, with particularly notable gains in mathematical reasoning tasks. The model achieves an average improvement of 3.89% on AIME 24/25 benchmarks. Our results highlight the effectiveness of combining principled data selection with full-stage post-training, offering a practical solution for aligning large reasoning models in a scalable and data-efficient manner. The model checkpoints are available at https://huggingface.co/InfiX-ai/InfiAlign-Qwen-7B-SFT.
43.1 | LG | May 10
Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training
Yuanyi Wang, Yifan Yang, Su Lu et al.
Continual post-training aims to extend large language models (LLMs) with new knowledge, skills, and behaviors, yet it remains unclear when sequential updates enable capability transfer and when they cause catastrophic forgetting. Existing methods mitigate forgetting through sequential fine-tuning, replay, regularization, or model merging, but offer limited criteria for determining when incorporating new updates is beneficial or harmful. In this work, we study LLM continual post-training through three questions: What drives forgetting? When do sequentially acquired capabilities transfer or interfere? How can compatibility be used to control update integration? We address these questions through task geometry: we represent each post-training task by its parameter update and study the covariance geometry induced by the update. Our central finding is that forgetting can be viewed as a state-relative update-integration failure: it arises when the covariance geometries induced by tasks misalign with the geometry of the evolving model state. Sequential updates transfer when they remain compatible with the model state shaped by previous updates, and interfere when state-relative geometry conflict becomes high. Motivated by this finding, we propose Geometry-Conflict Wasserstein Merging (GCWM), a data-free update-integration method that constructs a shared Wasserstein metric via Gaussian Wasserstein barycenters and uses geometry conflict to gate geometry-aware correction. Across Qwen3 0.6B--14B on domain-continual and capability-continual settings, GCWM consistently outperforms data-free baselines, improving retention and final performance without replay data. These results identify geometry conflict as both an explanatory signal for forgetting and a practical control signal for LLM continual post-training.
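For readers unfamiliar with the Gaussian Wasserstein machinery mentioned here, the standard closed forms are sketched below. These are textbook facts about zero-mean Gaussians, not a statement of how GCWM builds its metric; the symbols $\Sigma_k$ (covariances, assumed positive definite) and $w_k$ (nonnegative weights summing to one) are assumed notation.

$$
W_2^2\big(\mathcal{N}(0,\Sigma_1),\,\mathcal{N}(0,\Sigma_2)\big)
= \operatorname{tr}\!\Big(\Sigma_1 + \Sigma_2 - 2\big(\Sigma_1^{1/2}\,\Sigma_2\,\Sigma_1^{1/2}\big)^{1/2}\Big)
$$

$$
\bar{\Sigma} = \sum_{k} w_k \big(\bar{\Sigma}^{1/2}\,\Sigma_k\,\bar{\Sigma}^{1/2}\big)^{1/2}
$$

The first expression is the squared 2-Wasserstein distance between two zero-mean Gaussians. The second is the fixed-point equation that characterizes the covariance $\bar{\Sigma}$ of their Wasserstein barycenter, which is typically solved by iteration.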
CL | Feb 17, 2025
InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
Congkai Xie, Shuo Cai, Wenjun Wang et al.
Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have made significant advancements in reasoning capabilities. However, they still face challenges such as high computational demands and privacy concerns. This paper focuses on developing efficient Small Language Models (SLMs) and Multimodal Small Language Models (MSLMs) that retain competitive reasoning abilities. We introduce a novel training pipeline that enhances reasoning capabilities and facilitates deployment on edge devices, achieving state-of-the-art performance while minimizing development costs. InfiR aims to advance AI systems by improving reasoning, reducing adoption barriers, and addressing privacy concerns through smaller model sizes. Resources are available at https://github.com/Reallm-Labs/InfiR.
CV | Apr 8, 2021
Few-Shot Action Recognition with Compromised Metric via Optimal Transport
Su Lu, Han-Jia Ye, De-Chuan Zhan
Although vital to computer vision systems, few-shot action recognition is still not mature despite extensive research on few-shot image classification. Popular few-shot learning algorithms extract a transferable embedding from seen classes and reuse it on unseen classes by constructing a metric-based classifier. One main obstacle to applying these algorithms in action recognition is the complex structure of videos. Some existing solutions sample frames from a video and aggregate their embeddings to form a video-level representation, neglecting important temporal relations. Others perform an explicit sequence matching between two videos and define their distance as the matching cost, imposing overly strong restrictions on sequence ordering. In this paper, we propose Compromised Metric via Optimal Transport (CMOT) to combine the advantages of these two solutions. CMOT simultaneously considers semantic and temporal information in videos under the Optimal Transport framework, and is discriminative for both content-sensitive and ordering-sensitive tasks. In detail, given two videos, we sample segments from them and cast the calculation of their distance as an optimal transport problem between two segment sequences. To preserve the inherent temporal ordering information, we additionally amend the ground cost matrix by penalizing it with the positional distance between a pair of segments. Empirical results on benchmark datasets demonstrate the superiority of CMOT.
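One way to write down the amended cost described in this abstract is sketched below. The additive form, the normalized positional distance, and the symbols are assumptions made for illustration; the abstract does not specify the exact penalty. Here $C_{ij}$ is the semantic cost between segment $i$ of one video (with $n$ segments) and segment $j$ of the other (with $m$ segments), $\lambda \ge 0$ trades off content against ordering, and $a$, $b$ are the segment weights of the two videos.

$$
\tilde{C}_{ij} = C_{ij} + \lambda \left| \frac{i}{n} - \frac{j}{m} \right|,
\qquad
d(U, V) = \min_{T \in \Pi(a, b)} \sum_{i,j} T_{ij}\,\tilde{C}_{ij},
\qquad
\Pi(a, b) = \{\, T \ge 0 : T\mathbf{1} = a,\ T^\top \mathbf{1} = b \,\}
$$

With $\lambda = 0$ the distance ignores ordering and depends only on segment content. Larger $\lambda$ makes it increasingly costly to match segments that occupy distant temporal positions, which moves the metric toward strict sequence matching.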
LG | Apr 8, 2021
Towards Enabling Meta-Learning from Target Models
Su Lu, Han-Jia Ye, Le Gan et al.
Meta-learning can extract an inductive bias from previous learning experience and assist the training of new tasks. It is often realized through optimizing a meta-model with the evaluation loss of task-specific solvers. For simplicity, most existing algorithms sample non-overlapping $\mathit{support}$ sets and $\mathit{query}$ sets to train and evaluate the solvers respectively ($\mathcal{S}$/$\mathcal{Q}$ protocol). Unlike the $\mathcal{S}$/$\mathcal{Q}$ protocol, we can also evaluate a task-specific solver by comparing it to a target model $\mathcal{T}$, which is the optimal model for this task or a model that behaves well enough on this task ($\mathcal{S}$/$\mathcal{T}$ protocol). Although under-researched, the $\mathcal{S}$/$\mathcal{T}$ protocol has unique advantages such as offering more informative supervision, but it is computationally expensive. This paper looks into this special evaluation method and takes a step towards putting it into practice. We find that when only a small ratio of tasks is equipped with target models, classic meta-learning algorithms can be improved substantially without consuming many resources. We empirically verify the effectiveness of the $\mathcal{S}$/$\mathcal{T}$ protocol in a typical application of meta-learning, $\mathit{i.e.}$, few-shot learning. In detail, after constructing target models by fine-tuning the pre-trained network on those hard tasks, we match the task-specific solvers and target models via knowledge distillation.