Boyu Shi

LG
h-index15
7papers
13citations
Novelty53%
AI Score50

7 Papers

LGMar 21, 2023
Data Augmentation For Label Enhancement

Zhiqiang Kou, Yuheng Jia, Jing Wang et al.

Label distribution (LD) uses the description degree to describe instances, which provides more fine-grained supervision information when learning with label ambiguity. Nevertheless, LD is unavailable in many real-world applications. To obtain LD, label enhancement (LE) has emerged to recover LD from logical label. Existing LE approach have the following problems: (\textbf{i}) They use logical label to train mappings to LD, but the supervision information is too loose, which can lead to inaccurate model prediction; (\textbf{ii}) They ignore feature redundancy and use the collected features directly. To solve (\textbf{i}), we use the topology of the feature space to generate more accurate label-confidence. To solve (\textbf{ii}), we proposed a novel supervised LE dimensionality reduction approach, which projects the original data into a lower dimensional feature space. Combining the above two, we obtain the augmented data for LE. Further, we proposed a novel nonlinear LE model based on the label-confidence and reduced features. Extensive experiments on 12 real-world datasets are conducted and the results show that our method consistently outperforms the other five comparing approaches.

71.5CLMay 9
XPERT: Expert Knowledge Transfer for Effective Training of Language Models

Chang Liu, Boyu Shi, Xu Yang et al.

Mixture-of-Experts (MoE) language models organize knowledge into explicitly routed expert modules, making expert-level representations traceable and analyzable. By analyzing expert activation patterns in MoE large language models (LLMs), we find that a subset of experts is consistently activated across diverse knowledge domains. These common experts encode cross-domain, generalizable knowledge that is closely related to model generalization, naturally raising the question of how such identifiable expert knowledge can be practically reused. Motivated by this observation, we propose XPERT, a framework that extracts, consolidates, and reuses expert knowledge from pre-trained MoE LLMs to support more effective training of language models across different model scales. XPERT identifies cross-domain experts via inference-only analysis, refines their representations through tensor decomposition, and adapts the extracted knowledge to reuse in downstream models. Experiments on language understanding and dialogue generation benchmarks show that models benefiting from reused expert knowledge achieve consistently stronger performance and faster convergence compared to strong baselines. These results highlight MoE LLMs as structured and reusable knowledge sources, and demonstrate the value of expert-level knowledge reuse for improving model training.

44.9CLMay 8
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models

Boyu Shi, YiCheng Jiang, Chang Liu et al.

Large language models (LLMs) achieve strong performance but remain costly to deploy in resource-constrained settings. Training small language models (SLMs) from scratch is computationally expensive, while conventional knowledge distillation requires repeated access to large teachers for different target sizes, leading to poor scalability. To solve these problems, we propose \textbf{Chain-based Distillation (CBD)}, a scalable paradigm for efficiently initializing variable-sized language models. A sparse and limited sequence of intermediate models (called anchors) is constructed via stepwise distillation, forming a distillation chain that progressively transfers knowledge from the source LLMs. To support heterogeneous settings, we introduce \emph{bridge distillation} for cross-architecture and cross-vocabulary transfer. Models of variable sizes are initialized via parameter interpolation between adjacent anchors, eliminating repeated large teacher inference. Experiments show that the proposed method substantially improves efficiency and downstream performance. A 138M-parameter SLM without recovery pre-training, outperforms scratch-trained models on a 10B-token corpus on the specific task. CBD also demonstrates versatility in heterogeneous settings for initialize models with different architectures and vocabularies.

31.4CLMay 8
Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions

Boyu Shi, Chang Liu, ChuanBao Gao et al.

Layer pruning efficiently reduces Large Language Model (LLM) computational costs but often triggers sudden performance collapse. Existing representation-based analyses struggle to explain this mechanism. We propose studying pruning through decision representation. Focusing on multiple-choice tasks, we introduce two metrics, Decision Margin and Option Frequency, and an Iterative Pruning method to analyze layer-wise decision dynamics. Our findings reveal a sharp decision transition that partitions the network into two stages: a Silent Phase, where the model cannot yet predict the correct answer, and a Decisive Phase, where the correct prediction emerges. We also find that pruning the Decisive Phase has minimal impact, whereas pruning the Silent Phase triggers immediate performance collapse, highlighting its extreme sensitivity to structural changes. Therefore, we conclude that pruning-induced collapse stems from disrupting the Silent Phase, which prevents the critical decision transition from occurring.

64.1LGMay 6
Learngene Search Across Multiple Datasets for Building Variable-Sized Models

Boyu Shi, Junbo Zhou, Chang Liu et al.

Deep learning methods are widely used under diverse resource constraints, resulting in models of varying sizes, such as the Vision Transformer (ViT) series. Deploying these models typically requires costly pretraining and finetuning. The Learngene paradigm addresses this issue by extracting transferable components, called learngenes, from a pretrained ancestry model (Ans-Net) to initialize variable-sized descendant models (Des-Nets).Existing learngene extraction methods rely on a single dataset, limiting downstream performance. To address this limitation, we propose Learngene Search Across Multiple Datasets for Building Variable-Sized Models (LSAMD). LSAMD expands the Ans-Net into a searchable super Ans-Net with dataset-specific blocks and dataset adapters (DADs). During training, LSAMD searches for an optimal architecture path for each dataset. The base blocks most frequently selected across datasets are extracted as learngenes for initializing Des-Nets.Experiments on multiple datasets show that LSAMD achieves performance comparable to pretrain-finetune methods while significantly reducing storage and training costs.

LGDec 11, 2023
RankMatch: A Novel Approach to Semi-Supervised Label Distribution Learning Leveraging Inter-label Correlations

Kouzhiqiang Yucheng Xie, Jing Wang, Yuheng Jia et al.

This paper introduces RankMatch, an innovative approach for Semi-Supervised Label Distribution Learning (SSLDL). Addressing the challenge of limited labeled data, RankMatch effectively utilizes a small number of labeled examples in conjunction with a larger quantity of unlabeled data, reducing the need for extensive manual labeling in Deep Neural Network (DNN) applications. Specifically, RankMatch introduces an ensemble learning-inspired averaging strategy that creates a pseudo-label distribution from multiple weakly augmented images. This not only stabilizes predictions but also enhances the model's robustness. Beyond this, RankMatch integrates a pairwise relevance ranking (PRR) loss, capturing the complex inter-label correlations and ensuring that the predicted label distributions align with the ground truth. We establish a theoretical generalization bound for RankMatch, and through extensive experiments, demonstrate its superiority in performance against existing SSLDL methods.

LGDec 10, 2023
Building Variable-sized Models via Learngene Pool

Boyu Shi, Shiyu Xia, Xu Yang et al.

Recently, Stitchable Neural Networks (SN-Net) is proposed to stitch some pre-trained networks for quickly building numerous networks with different complexity and performance trade-offs. In this way, the burdens of designing or training the variable-sized networks, which can be used in application scenarios with diverse resource constraints, are alleviated. However, SN-Net still faces a few challenges. 1) Stitching from multiple independently pre-trained anchors introduces high storage resource consumption. 2) SN-Net faces challenges to build smaller models for low resource constraints. 3). SN-Net uses an unlearned initialization method for stitch layers, limiting the final performance. To overcome these challenges, motivated by the recently proposed Learngene framework, we propose a novel method called Learngene Pool. Briefly, Learngene distills the critical knowledge from a large pre-trained model into a small part (termed as learngene) and then expands this small part into a few variable-sized models. In our proposed method, we distill one pretrained large model into multiple small models whose network blocks are used as learngene instances to construct the learngene pool. Since only one large model is used, we do not need to store more large models as SN-Net and after distilling, smaller learngene instances can be created to build small models to satisfy low resource constraints. We also insert learnable transformation matrices between the instances to stitch them into variable-sized models to improve the performance of these models. Exhaustive experiments have been implemented and the results validate the effectiveness of the proposed Learngene Pool compared with SN-Net.