Erpeng Xue

IR
h-index: 5
Papers: 5
Citations: 5
Novelty: 53%
AI Score: 52

5 Papers

55.5 | AI | May 27
SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment

Hongxiang Lin, Zhirui Kuai, Erpeng Xue et al.

Structured skill prompts improve exploration in long-horizon agentic reinforcement learning (RL). Skill-augmented RL methods retain external skills at inference, while skill-internalization RL methods withdraw them during training to enable autonomous performance. However, existing internalization approaches use the skill-helpfulness contrast only for curriculum control, leaving the policy update unchanged and unable to distinguish skill-dependent from autonomous success. We propose SkillC, a framework based on Contrastive Skill Credit Assignment (CSCA) that converts this contrast into a direct learning signal for internalization. SkillC samples paired skill-injected and skill-free rollouts for tasks from active skill types within the same policy update, and injects their task-level contrast into optimization via a dual-stream advantage estimator that preserves global ranking while applying a one-sided correction toward skill-free success. A smoothed validation-level signal further drives an adaptive curriculum over attribution strength, rollout allocation, and monotonic active-set pruning. Experiments on ALFWorld and WebShop show that, without runtime skill access, SkillC surpasses the strongest prior skill-internalization RL baseline by 5.5% and 4.4%, respectively, while remaining competitive with skill-augmented RL methods.
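As a rough illustration of the task-level contrast mentioned in the abstract, the sketch below compares the mean success of skill-injected and skill-free rollouts for one task. The function name `skill_contrast` and the use of a simple difference of means are assumptions made for illustration; the abstract does not specify how SkillC defines the contrast or how the dual-stream advantage estimator consumes it.

```python
def skill_contrast(skill_returns, free_returns):
    """Difference in mean return between paired rollout groups for one task.

    Hypothetical definition: a positive value means the injected skill
    helped, a value near zero suggests the policy already succeeds
    without it.
    """
    mean_skill = sum(skill_returns) / len(skill_returns)
    mean_free = sum(free_returns) / len(free_returns)
    return mean_skill - mean_free


# Binary success indicators for four rollouts of each kind (made-up data).
with_skill = [1, 1, 0, 1]
without_skill = [0, 1, 0, 0]
contrast = skill_contrast(with_skill, without_skill)
print(contrast)
```

A contrast like this could serve either as a curriculum signal or, as the abstract describes for SkillC, as an input to the policy update itself.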

86.7 | LG | May 19 | Code
When the Majority Votes Wrong, the Intervention Timing for Test-Time Reinforcement Learning Hides in the Extinction Window

Hongxiang Lin, Zhirui Kuai, Erpeng Xue et al.

Test-time reinforcement learning (TTRL) reports substantial accuracy gains on mathematical reasoning benchmarks using majority vote as a pseudo-label signal. We argue these gains are systematically misinterpreted: most reflect sharpening of already-solvable problems rather than genuine learning, while problems corrupted from correct to incorrect outnumber truly learned ones, and this damage is irreversible once majority vote locks onto a wrong answer. Per-problem tracking reveals that correct-answer signals in low-ability problems are briefly active before being permanently suppressed, a phenomenon we term the Correct-Answer Extinction Window, with Flip Rate (FR) as its leading indicator. We thus propose TTRL-Guard, a lightweight framework with three mechanisms targeting the extinction window: Flip-Rate-Aware Reward Scaling (FRS) down-weights at-risk updates as FR declines, Minority-Preserving Sampling (MPS) retains gradient signal from minority correct answers, and Risk-Conditioned Sparse Updatings (RCSU) suspends updates on polarized problems. Experiments across three models and four benchmarks show that TTRL-Guard achieves the best average pass@1 on Qwen2.5-7B-Instruct and Qwen3-4B, and improves over TTRL by a relative 54% on AIME 2025. Code and implementation details are available at https://github.com/linhxkkkk/TTRL-Guard.
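The failure mode described in this abstract can be seen in a minimal sketch of majority-vote pseudo-labeling. The function name `majority_vote_rewards` and the binary 0/1 reward are assumptions for illustration, not the TTRL or TTRL-Guard implementation.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Use the most frequent sampled answer as a pseudo-label.

    Each sample gets reward 1.0 if it matches the pseudo-label and 0.0
    otherwise; no ground-truth answer is consulted.
    """
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Six sampled answers for one problem (made-up data). Suppose the true
# answer is "15": the majority is wrong, so the two correct samples get
# zero reward and the three wrong ones are reinforced.
samples = ["12", "15", "12", "12", "15", "7"]
label, rewards = majority_vote_rewards(samples)
print(label, rewards)
```

In this toy case every update pushes probability mass toward "12", which is the lock-in behavior the abstract says becomes irreversible.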

IR | May 22, 2025 | Code
Action is All You Need: Dual-Flow Generative Ranking Network for Recommendation

Hao Guo, Erpeng Xue, Lei Huang et al.

Deep Learning Recommendation Models (DLRMs) often rely on extensive manual feature engineering to improve accuracy and user experience, which increases system complexity and limits how well model performance scales with computational resources. Recently, Meta introduced a generative ranking paradigm based on the HSTU block that enables end-to-end learning from raw user behavior sequences, demonstrates scaling-law behavior on large datasets, and can be regarded as the state of the art (SOTA). However, splitting user behaviors into interleaved item and action information significantly increases the input sequence length, which adversely affects both training and inference efficiency. To address this issue, we propose the Dual-Flow Generative Ranking Network (DFGR), which employs a dual-flow mechanism to optimize interaction modeling, ensuring efficient training and inference through end-to-end token processing. DFGR duplicates the original user behavior sequence into a real flow and a fake flow based on the authenticity of the action information, and then defines a novel interaction method between the real flow and the fake flow within the QKV module of the self-attention mechanism. This design reduces computational overhead and improves both training efficiency and inference performance compared to Meta's HSTU-based model. Experiments on both open-source and real industrial datasets show that DFGR outperforms DLRM, which serves as the industrial online baseline with extensive feature engineering, as well as Meta's HSTU and other common recommendation models such as DIN, DCN, DIEN, and DeepFM. Furthermore, we investigate optimal parameter allocation strategies under computational constraints, establishing DFGR as an efficient and effective next-generation generative ranking paradigm.

IR | Aug 4, 2025
Dynamic Forgetting and Spatio-Temporal Periodic Interest Modeling for Local-Life Service Recommendation

Zhaoyu Hu, Jianyang Wang, Hao Guo et al.

In the context of the booming digital economy, recommendation systems, as a key link connecting users and numerous services, face challenges in modeling user behavior sequences on local-life service platforms, including the sparsity of long sequences and strong spatio-temporal dependence. Such challenges can be addressed by drawing an analogy to the forgetting process in human memory, because users' responses to recommended content follow the recency effect and the cyclicality of memory. Building on this, this paper introduces the forgetting curve and proposes Spatio-Temporal periodic Interest Modeling (STIM) with long sequences for local-life service recommendation. STIM integrates three key components: a dynamic masking module based on the forgetting curve, which is used to extract both recent and periodic spatio-temporal features; a query-based mixture of experts (MoE) approach that can adaptively activate expert networks under different dynamic masks, enabling the collaborative modeling of time, location, and items; and a hierarchical multi-interest network unit, which captures multi-interest representations by modeling the hierarchical interactions between the shallow and deep semantics of users' recent behaviors. In online A/B tests, STIM achieved a 1.54% improvement in gross transaction volume (GTV). Extended offline experiments also showed improvements. STIM has been deployed in a large-scale local-life service recommendation system, serving hundreds of millions of daily active users in core application scenarios.
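To make the idea of forgetting-curve masking concrete, the sketch below builds a "recent" mask from an exponential retention curve and a "periodic" mask from behaviors that recur near multiples of a period. Everything here is an assumption for illustration: the exponential form exp(-age / strength), the thresholds, the 7-day period, and the function names are hypothetical, and the abstract does not describe STIM's actual masking rule.

```python
import math


def retention(age_days, strength_days=3.0):
    """Exponential forgetting curve (assumed form): exp(-age / strength)."""
    return math.exp(-age_days / strength_days)


def recent_mask(ages, threshold=0.3, strength_days=3.0):
    """Keep behaviors whose retention is still above the threshold."""
    return [retention(a, strength_days) >= threshold for a in ages]


def periodic_mask(ages, period_days=7.0, tolerance_days=0.5):
    """Keep behaviors whose age is close to a positive multiple of the period."""
    mask = []
    for a in ages:
        offset = a % period_days
        distance = min(offset, period_days - offset)
        at_least_one_period = a >= period_days - tolerance_days
        mask.append(at_least_one_period and distance <= tolerance_days)
    return mask


# Ages (in days) of six past behaviors, newest first (made-up data).
ages = [0.5, 2.0, 6.8, 7.1, 10.0, 14.2]
recent = recent_mask(ages)
periodic = periodic_mask(ages)
print(recent)
print(periodic)
```

Under these assumed settings the two masks select disjoint subsets of the sequence, which is one way different masks could route behaviors to different experts.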

IR | Aug 1, 2025
When Relevance Meets Novelty: Dual-Stable Periodic Optimization for Exploratory Recommendation

Hongxiang Lin, Hao Guo, Zeshun Li et al.

Traditional recommendation systems tend to trap users in strong feedback loops by excessively pushing content aligned with their historical preferences, thereby limiting exploration opportunities and causing content fatigue. Although large language models (LLMs) demonstrate potential with their diverse content generation capabilities, existing LLM-enhanced dual-model frameworks face two major limitations: first, they overlook long-term preferences driven by group identity, leading to biased interest modeling; second, they suffer from static optimization flaws, as a one-time alignment process fails to leverage incremental user data for closed-loop optimization. To address these challenges, we propose the Co-Evolutionary Alignment (CoEA) method. For interest modeling bias, we introduce the Dual-Stable Interest Exploration (DSIE) module, which jointly models long-term group identity and short-term individual interests through parallel processing of behavioral sequences. For static optimization limitations, we design a Periodic Collaborative Optimization (PCO) mechanism. This mechanism regularly conducts preference verification on incremental data using the Relevance LLM, then guides the Novelty LLM to perform fine-tuning based on the verification results, and subsequently feeds the output of the incrementally fine-tuned Novelty LLM back to the Relevance LLM for re-evaluation, thereby achieving dynamic closed-loop optimization. Extensive online and offline experiments verify the effectiveness of the CoEA model in exploratory recommendation.