Yinqi Yang

h-index: 45

Papers: 3

Citations: 42

Novelty: 45%

AI Score: 41

Ranked #93,886 of 201,326 authors (top 47%); #17,385 in CL (top 54%)

3 Papers

CL · Feb 4

ERNIE 5.0 Technical Report

Haifeng Wang, Hua Wu, Tian Wu et al.

In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby guaranteeing efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.

77.3 · CL · Mar 25

Sparse Growing Transformer: Training-Time Sparse Depth Allocation via Progressive Attention Looping

Yao Chen, Yilong Chen, Yinqi Yang et al.

Existing approaches to increasing the effective depth of Transformers predominantly rely on parameter reuse, extending computation through recursive execution. Under this paradigm, the network structure remains static along the training timeline, and additional computational depth is uniformly assigned to entire blocks at the parameter level. This rigidity across training time and parameter space leads to substantial computational redundancy during training. In contrast, we argue that depth allocation during training should not be a static preset, but rather a progressively growing structural process. Our systematic analysis reveals a deep-to-shallow maturation trajectory across layers, where high-entropy attention heads play a crucial role in semantic integration. Motivated by this observation, we introduce the Sparse Growing Transformer (SGT). SGT is a training-time sparse depth allocation framework that progressively extends recurrence from deeper to shallower layers via targeted attention looping on informative heads. This mechanism induces structural sparsity by selectively increasing depth only for a small subset of parameters as training evolves. Extensive experiments across multiple parameter scales demonstrate that SGT consistently outperforms training-time static block-level looping baselines under comparable settings, while reducing the additional training FLOPs overhead from approximately 16–20% to only 1–3% relative to a standard Transformer backbone.

CR · Aug 6, 2018

Intrusion Prediction with System-call Sequence-to-Sequence Model

ShaoHua Lv, Jian Wang, YinQi Yang et al.

The advanced development of the Internet facilitates efficient information exchange, but it is also exploited by adversaries. Intrusion detection systems (IDS), an important defensive component of network security, have long been widely studied in security research. However, research on intrusion prediction, which is more critical for network security, has received less attention. We argue that anticipating intrusions in advance and impeding them in time is more vital than simply raising alarms in security defenses. Existing prediction methods generally analyze short-term system-call sequences to predict forthcoming abnormal behaviors. In this paper, we take advantage of the remarkable performance of recurrent neural networks (RNNs) on long sequential problems and introduce the sequence-to-sequence model into intrusion prediction. By semantically modeling system calls, we build a robust system-call sequence-to-sequence prediction model. Taking the system-call traces invoked while a program runs as the known prerequisite, our model predicts the sequence of system calls most likely to be executed in the near future, which makes it possible to monitor system status and predict intrusion behaviors. Our experiments show that the proposed prediction method achieves good prediction performance on the ADFA-LD intrusion detection test data set. Moreover, the predicted sequence, combined with the known invoked system-call traces, significantly improves intrusion detection performance, as verified on various classifiers.
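
The abstract above describes predicting upcoming system calls from the calls already observed. As a minimal sketch of the data preparation such a sequence-to-sequence setup might need, the snippet below slices one trace into (observed, upcoming) pairs with a sliding window. The function name `make_seq2seq_pairs`, the window lengths, the stride, and the trace values are all hypothetical illustrations and are not taken from the paper.

```python
from typing import List, Tuple


def make_seq2seq_pairs(
    trace: List[int], src_len: int, tgt_len: int, stride: int = 1
) -> List[Tuple[List[int], List[int]]]:
    """Slice one system-call trace into (observed, upcoming) pairs.

    Each pair holds `src_len` consecutive calls as the encoder input and
    the next `tgt_len` calls as the decoder target. This windowing scheme
    is an assumption for illustration, not the paper's documented method.
    """
    if src_len <= 0 or tgt_len <= 0 or stride <= 0:
        raise ValueError("src_len, tgt_len and stride must be positive")

    pairs = []
    # Last start index that still leaves room for a full source + target.
    last_start = len(trace) - (src_len + tgt_len)
    for start in range(0, last_start + 1, stride):
        split = start + src_len
        pairs.append((trace[start:split], trace[split:split + tgt_len]))
    return pairs


# Made-up trace of system-call numbers, for illustration only.
trace = [6, 6, 63, 6, 42, 120, 6, 195]
pairs = make_seq2seq_pairs(trace, src_len=4, tgt_len=2, stride=2)
for src, tgt in pairs:
    print(src, "->", tgt)
```

A trace shorter than `src_len + tgt_len` yields no pairs, so very short traces would need padding or separate handling before training.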