Weiyu Cheng

CL
h-index40
9papers
686citations
Novelty49%
AI Score55

9 Papers

IROct 28, 2022
RESUS: Warm-Up Cold Users via Meta-Learning Residual User Preferences in CTR Prediction

Yanyan Shen, Lifan Zhao, Weiyu Cheng et al.

Click-Through Rate (CTR) prediction on cold users is a challenging task in recommender systems. Recent researches have resorted to meta-learning to tackle the cold-user challenge, which either perform few-shot user representation learning or adopt optimization-based meta-learning. However, existing methods suffer from information loss or inefficient optimization process, and they fail to explicitly model global user preference knowledge which is crucial to complement the sparse and insufficient preference information of cold users. In this paper, we propose a novel and efficient approach named RESUS, which decouples the learning of global preference knowledge contributed by collective users from the learning of residual preferences for individual users. Specifically, we employ a shared predictor to infer basis user preferences, which acquires global preference knowledge from the interactions of different users. Meanwhile, we develop two efficient algorithms based on the nearest neighbor and ridge regression predictors, which infer residual user preferences via learning quickly from a few user-specific interactions. Extensive experiments on three public datasets demonstrate that our RESUS approach is efficient and effective in improving CTR prediction accuracy on cold users, compared with various state-of-the-art methods.

CLJan 15
OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding

Deming Ding, Shichun Liu, Enhui Yang et al.

Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this gap, we introduce OctoBench, which benchmarks scaffold-aware instruction following in repository-grounded agentic coding. OctoBench includes 34 environments and 217 tasks instantiated under three scaffold types, and is paired with 7,098 objective checklist items. To disentangle solving the task from following the rules, we provide an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained checks. Experiments on eight representative models reveal a systematic gap between task-solving and scaffold-aware compliance, underscoring the need for training and evaluation that explicitly targets heterogeneous instruction following. We release the benchmark to support reproducible benchmarking and to accelerate the development of more scaffold-aware coding agents.

89.5AIMay 26
The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

MiniMax, Aili Chen, Aonian Li et al.

We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unleash maximum real-world intelligence. The flagship M2 contains 229.9B total parameters with only 9.8B activated per token. Designed end-to-end for agentic deployment, the M2 series rests on three components: (i) agent-driven data pipelines producing large-scale, verifiable trajectories across agentic coding and agentic cowork, each grounded in an executable workspace and an artifact-aligned reward; (ii) Forge, a scalable agent-native RL system that adapts to long-horizon agent trajectories, paired with windowed-FIFO scheduling, prefix-tree merging, inference optimization, and a clean training-inference-agent decoupling that supports both white-box and black-box agents; (iii) the latest M2.7 checkpoint takes an early step toward self-evolution -- autonomously debugging training runs and modifying its own scaffold. Across M2 through M2.7, this combination translates a mini-activation footprint into frontier-tier performance on agentic coding, deep search, office-task, and reasoning benchmarks.

CLJun 16, 2025Code
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

MiniMax, Aili Chen, Aonian Li et al.

We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition to M1's inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates, outperforming other competitive RL variants. Combining hybrid-attention and CISPO enables MiniMax-M1's full RL training on 512 H800 GPUs to complete in only three weeks, with a rental cost of just $534,700. We release two versions of MiniMax-M1 models with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.

CLSep 8, 2025Code
WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents

Junteng Liu, Yunji Li, Chi Zhang et al.

The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open-source web agents either demonstrate limited information-seeking abilities on complex tasks or lack transparent implementations. In this work, we identify that the key challenge lies in the scarcity of challenging data for information seeking. To address this limitation, we introduce WebExplorer: a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. This method creates challenging query-answer pairs that require multi-step reasoning and complex web navigation. By leveraging our curated high-quality dataset, we successfully develop advanced web agent WebExplorer-8B through supervised fine-tuning followed by reinforcement learning. Our model supports 128K context length and up to 100 tool calling turns, enabling long-horizon problem solving. Across diverse information-seeking benchmarks, WebExplorer-8B achieves the state-of-the-art performance at its scale. Notably, as an 8B-sized model, WebExplorer-8B is able to effectively search over an average of 16 turns after RL training, achieving higher accuracy than WebSailor-72B on BrowseComp-en/zh and attaining the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information-seeking tasks, our model also achieves strong generalization on the HLE benchmark even though it is only trained on knowledge-intensive QA data. These results highlight our approach as a practical path toward long-horizon web agents.

CLJan 14, 2025
MiniMax-01: Scaling Foundation Models with Lightning Attention

MiniMax, Aonian Li, Bangwei Gong et al.

We introduce MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01 is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.

LGJun 8, 2020
Differentiable Neural Input Search for Recommender Systems

Weiyu Cheng, Yanyan Shen, Linpeng Huang

Latent factor models are the driving forces of the state-of-the-art recommender systems, with an important insight of vectorizing raw input features into dense embeddings. The dimensions of different feature embeddings are often set to a same value empirically, which limits the predictive performance of latent factor models. Existing works have proposed heuristic or reinforcement learning-based methods to search for mixed feature embedding dimensions. For efficiency concern, these methods typically choose embedding dimensions from a restricted set of candidate dimensions. However, this restriction will hurt the flexibility of dimension selection, leading to suboptimal performance of search results. In this paper, we propose Differentiable Neural Input Search (DNIS), a method that searches for mixed feature embedding dimensions in a more flexible space through continuous relaxation and differentiable optimization. The key idea is to introduce a soft selection layer that controls the significance of each embedding dimension, and optimize this layer according to model's validation performance. DNIS is model-agnostic and thus can be seamlessly incorporated with existing latent factor models for recommendation. We conduct experiments with various architectures of latent factor models on three public real-world datasets for rating prediction, Click-Through-Rate (CTR) prediction, and top-k item recommendation. The results demonstrate that our method achieves the best predictive performance compared with existing neural input search approaches with fewer embedding parameters and less time cost.

LGSep 7, 2019
Adaptive Factorization Network: Learning Adaptive-Order Feature Interactions

Weiyu Cheng, Yanyan Shen, Linpeng Huang

Various factorization-based methods have been proposed to leverage second-order, or higher-order cross features for boosting the performance of predictive models. They generally enumerate all the cross features under a predefined maximum order, and then identify useful feature interactions through model training, which suffer from two drawbacks. First, they have to make a trade-off between the expressiveness of higher-order cross features and the computational cost, resulting in suboptimal predictions. Second, enumerating all the cross features, including irrelevant ones, may introduce noisy feature combinations that degrade model performance. In this work, we propose the Adaptive Factorization Network (AFN), a new model that learns arbitrary-order cross features adaptively from data. The core of AFN is a logarithmic transformation layer to convert the power of each feature in a feature combination into the coefficient to be learned. The experimental results on four real datasets demonstrate the superior predictive performance of AFN against the start-of-the-arts.

LGNov 20, 2018
Explaining Latent Factor Models for Recommendation with Influence Functions

Weiyu Cheng, Yanyan Shen, Yanmin Zhu et al.

Latent factor models (LFMs) such as matrix factorization achieve the state-of-the-art performance among various Collaborative Filtering (CF) approaches for recommendation. Despite the high recommendation accuracy of LFMs, a critical issue to be resolved is the lack of explainability. Extensive efforts have been made in the literature to incorporate explainability into LFMs. However, they either rely on auxiliary information which may not be available in practice, or fail to provide easy-to-understand explanations. In this paper, we propose a fast influence analysis method named FIA, which successfully enforces explicit neighbor-style explanations to LFMs with the technique of influence functions stemmed from robust statistics. We first describe how to employ influence functions to LFMs to deliver neighbor-style explanations. Then we develop a novel influence computation algorithm for matrix factorization with high efficiency. We further extend it to the more general neural collaborative filtering and introduce an approximation algorithm to accelerate influence analysis over neural network models. Experimental results on real datasets demonstrate the correctness, efficiency and usefulness of our proposed method.