h-index58
179papers
7,745citations
Novelty54%
AI Score63

179 Papers

CVMar 15, 2023Code
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge

Wei Lin, Leonid Karlinsky, Nina Shvetsova et al. · ibm-research, mit

Large scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation & editing, and many other exciting tasks. However, VL models tend to over-represent objects while paying much less attention to verbs, and require additional tuning on video data for best zero-shot action recognition performance. While previous work relied on large-scale, fully-annotated data, in this work we propose an unsupervised approach. We adapt a VL model for zero-shot and few-shot action recognition using a collection of unlabeled videos and an unpaired action dictionary. Based on that, we leverage Large Language Models and VL models to build a text bag for each unlabeled video via matching, text expansion and captioning. We use those bags in a Multiple Instance Learning setup to adapt an image-text backbone to video data. Although finetuned on unlabeled video data, our resulting models demonstrate high transferability to numerous unseen zero-shot downstream tasks, improving the base VL model performance by up to 14\%, and even comparing favorably to fully-supervised baselines in both zero-shot and few-shot video recognition transfer. The code will be released later at \url{https://github.com/wlin-at/MAXI}.

CVNov 24, 2022Code
Video Test-Time Adaptation for Action Recognition

Wei Lin, Muhammad Jehanzeb Mirza, Mateusz Kozinski et al. · ibm-research, mit

Although action recognition systems can achieve top performance when evaluated on in-distribution test points, they are vulnerable to unanticipated distribution shifts in test data. However, test-time adaptation of video action recognition models against common distribution shifts has so far not been demonstrated. We propose to address this problem with an approach tailored to spatio-temporal models that is capable of adaptation on a single video sample at a step. It consists in a feature distribution alignment technique that aligns online estimates of test set statistics towards the training statistics. We further enforce prediction consistency over temporally augmented views of the same test video sample. Evaluations on three benchmark action recognition datasets show that our proposed technique is architecture-agnostic and able to significantly boost the performance on both, the state of the art convolutional architecture TANet and the Video Swin Transformer. Our proposed method demonstrates a substantial performance gain over existing test-time adaptation approaches in both evaluations of a single distribution shift and the challenging case of random distribution shifts. Code will be available at \url{https://github.com/wlin-at/ViTTA}.

LGMay 30, 2022
CGMN: A Contrastive Graph Matching Network for Self-Supervised Graph Similarity Learning

Di Jin, Luzhi Wang, Yizhen Zheng et al. · mit

Graph similarity learning refers to calculating the similarity score between two graphs, which is required in many realistic applications, such as visual tracking, graph classification, and collaborative filtering. As most of the existing graph neural networks yield effective graph representations of a single graph, little effort has been made for jointly learning two graph representations and calculating their similarity score. In addition, existing unsupervised graph similarity learning methods are mainly clustering-based, which ignores the valuable information embodied in graph pairs. To this end, we propose a contrastive graph matching network (CGMN) for self-supervised graph similarity learning in order to calculate the similarity between any two input graph objects. Specifically, we generate two augmented views for each graph in a pair respectively. Then, we employ two strategies, namely cross-view interaction and cross-graph interaction, for effective node representation learning. The former is resorted to strengthen the consistency of node representations in two views. The latter is utilized to identify node differences between different graphs. Finally, we transform node representations into graph-level representations via pooling operations for graph similarity computation. We have evaluated CGMN on eight real-world datasets, and the experiment results show that the proposed new approach is superior to the state-of-the-art methods in graph similarity learning downstream tasks.

CLApr 30, 2022Code
EasyNLP: A Comprehensive and Easy-to-use Toolkit for Natural Language Processing

Chengyu Wang, Minghui Qiu, Chen Shi et al.

The success of Pre-Trained Models (PTMs) has reshaped the development of Natural Language Processing (NLP). Yet, it is not easy to obtain high-performing models and deploy them online for industrial practitioners. To bridge this gap, EasyNLP is designed to make it easy to build NLP applications, which supports a comprehensive suite of NLP algorithms. It further features knowledge-enhanced pre-training, knowledge distillation and few-shot learning functionalities for large-scale PTMs, and provides a unified framework of model training, inference and deployment for real-world applications. Currently, EasyNLP has powered over ten business units within Alibaba Group and is seamlessly integrated to the Platform of AI (PAI) products on Alibaba Cloud. The source code of our EasyNLP toolkit is released at GitHub (https://github.com/alibaba/EasyNLP).

LGJun 28, 2022
RAW-GNN: RAndom Walk Aggregation based Graph Neural Network

Di Jin, Rui Wang, Meng Ge et al. · mit

Graph-Convolution-based methods have been successfully applied to representation learning on homophily graphs where nodes with the same label or similar attributes tend to connect with one another. Due to the homophily assumption of Graph Convolutional Networks (GCNs) that these methods use, they are not suitable for heterophily graphs where nodes with different labels or dissimilar attributes tend to be adjacent. Several methods have attempted to address this heterophily problem, but they do not change the fundamental aggregation mechanism of GCNs because they rely on summation operators to aggregate information from neighboring nodes, which is implicitly subject to the homophily assumption. Here, we introduce a novel aggregation mechanism and develop a RAndom Walk Aggregation-based Graph Neural Network (called RAW-GNN) method. The proposed approach integrates the random walk strategy with graph neural networks. The new method utilizes breadth-first random walk search to capture homophily information and depth-first search to collect heterophily information. It replaces the conventional neighborhoods with path-based neighborhoods and introduces a new path-based aggregator based on Recurrent Neural Networks. These designs make RAW-GNN suitable for both homophily and heterophily graphs. Extensive experimental results showed that the new method achieved state-of-the-art performance on a variety of homophily and heterophily graphs.

CVMar 9, 2023
TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering

Wei Lin, Anna Kukleva, Horst Possegger et al. · ibm-research, mit

Temporal action segmentation in untrimmed videos has gained increased attention recently. However, annotating action classes and frame-wise boundaries is extremely time consuming and cost intensive, especially on large-scale datasets. To address this issue, we propose an unsupervised approach for learning action classes from untrimmed video sequences. In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning, to preserve the spatial layout and sequential nature of the video features. A two-step clustering pipeline on these embedded feature representations then allows us to enforce temporal consistency within, as well as across videos. Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes. Our evaluation on three challenging datasets shows the impact of each component and, furthermore, demonstrates our state-of-the-art unsupervised action segmentation results.

74.5AIMay 27
ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay

Zhexin Hu, Li Wang, Xiaohan Wang et al.

Adaptive context compression is vital for scaling Large Language Models (LLMs) to complex, multi-turn agent tasks. However, rule-based compression methods may discard task-critical nuances, while Reinforcement Learning (RL) approaches usually struggle to balance information retention and token efficiency under the sparse rewards inherent to long-horizon workflows. To bridge this gap, we propose ZipRL, a novel adaptive compression framework tailored for Reinforcement Learning from Verifiable Rewards (RLVR). ZipRL features a multi-granularity compression mechanism for active, non-uniform information reduction, coupled with Hindsight Response Replay (HRR), a technique designed to densify training signals during RLVR optimization. Theoretically, we prove ZipRL's superior task-relevant utility over uniform methods. Concretely, ZipRL utilizes coarse-to-fine prompts for macro-compression and incorporates HRR into GRPO via generalized advantage reshaping. Multiple models of varying versions and parameter scales validate the effectiveness of our approach. Benchmarks on five agent tasks show ZipRL outperforms state-of-the-art approaches by 27.9% and 34.7% across Qwen3-4B and Qwen3-8B models, while maintaining exceptional token efficiency and robustness under extreme 256-turn extrapolation stress tests.

CVNov 23, 2022
ActMAD: Activation Matching to Align Distributions for Test-Time-Training

Muhammad Jehanzeb Mirza, Pol Jané Soneira, Wei Lin et al.

Test-Time-Training (TTT) is an approach to cope with out-of-distribution (OOD) data by adapting a trained model to distribution shifts occurring at test-time. We propose to perform this adaptation via Activation Matching (ActMAD): We analyze activations of the model and align activation statistics of the OOD test data to those of the training data. In contrast to existing methods, which model the distribution of entire channels in the ultimate layer of the feature extractor, we model the distribution of each feature in multiple layers across the network. This results in a more fine-grained supervision and makes ActMAD attain state of the art performance on CIFAR-100C and Imagenet-C. ActMAD is also architecture- and task-agnostic, which lets us go beyond image classification, and score 15.4% improvement over previous approaches when evaluating a KITTI-trained object detector on KITTI-Fog. Our experiments highlight that ActMAD can be applied to online adaptation in realistic scenarios, requiring little data to attain its full performance.

CVMay 3, 2022
Cross-View Cross-Scene Multi-View Crowd Counting

Qi Zhang, Wei Lin, Antoni B. Chan

Multi-view crowd counting has been previously proposed to utilize multi-cameras to extend the field-of-view of a single camera, capturing more people in the scene, and improve counting performance for occluded people or those in low resolution. However, the current multi-view paradigm trains and tests on the same single scene and camera-views, which limits its practical application. In this paper, we propose a cross-view cross-scene (CVCS) multi-view crowd counting paradigm, where the training and testing occur on different scenes with arbitrary camera layouts. To dynamically handle the challenge of optimal view fusion under scene and camera layout change and non-correspondence noise due to camera calibration errors or erroneous features, we propose a CVCS model that attentively selects and fuses multiple views together using camera layout geometry, and a noise view regularization method to train the model to handle non-correspondence errors. We also generate a large synthetic multi-camera crowd counting dataset with a large number of scenes and camera views to capture many possible variations, which avoids the difficulty of collecting and annotating such a large real dataset. We then test our trained CVCS model on real multi-view counting datasets, by using unsupervised domain transfer. The proposed CVCS model trained on synthetic data outperforms the same model trained only on real data, and achieves promising performance compared to fully supervised methods that train and test on the same single scene.

CVNov 21, 2022
MATE: Masked Autoencoders are Online 3D Test-Time Learners

M. Jehanzeb Mirza, Inkyu Shin, Wei Lin et al.

Our MATE is the first Test-Time-Training (TTT) method designed for 3D data, which makes deep networks trained for point cloud classification robust to distribution shifts occurring in test data. Like existing TTT methods from the 2D image domain, MATE also leverages test data for adaptation. Its test-time objective is that of a Masked Autoencoder: a large portion of each test point cloud is removed before it is fed to the network, tasked with reconstructing the full point cloud. Once the network is updated, it is used to classify the point cloud. We test MATE on several 3D object classification datasets and show that it significantly improves robustness of deep networks to several types of corruptions commonly occurring in 3D point clouds. We show that MATE is very efficient in terms of the fraction of points it needs for the adaptation. It can effectively adapt given as few as 5% of tokens of each test sample, making it extremely lightweight. Our experiments show that MATE also achieves competitive performance by adapting sparsely on the test data, which further reduces its computational overhead, making it ideal for real-time applications.

91.3LGMay 25Code
When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards

Li Wang, Xiaodong Lu, Xiaohan Wang et al.

Large Language Models (LLMs) have achieved remarkable advancements in reasoning capabilities empowered by Reinforcement Learning with Verifiable Rewards (RLVR). Nonetheless, RLVR intrinsically relies on ground-truth labels for reward computation, the acquisition of which is often prohibitively expensive in real-world scenarios. While unsupervised RLVR paradigms attempt to circumvent this by training on pseudo-labels, they are notoriously susceptible to training collapse. Moreover, different samples often exhibit varying annotation values. In this paper, we propose Reinforcement Learning with Active Verifiable Rewards (RLAVR), which actively acquires ground-truth labels for a small set of selected samples and integrates them with pseudo-labels, thereby stabilizing training dynamics and improving performance under limited annotation budgets. To identify valuable samples, we propose the Corrective Advantage Gap (CAG) metric and analyze the sample-level supervision value. Building on this, we introduce Correction-Aware Reliability Estimation for RLAVR (CARE), which translates the oracle CAG criterion into a practical pre-query acquisition policy to substantially improve training stability. Extensive experiments across diverse domains, model families, and model scales demonstrate the effectiveness and generality of our approach. Our code is available at https://github.com/Lumina04/CARE.

CVApr 2, 2023
DropMAE: Learning Representations via Masked Autoencoders with Spatial-Attention Dropout for Temporal Matching Tasks

Qiangqiang Wu, Tianyu Yang, Ziquan Liu et al.

This paper studies masked autoencoder (MAE) video pre-training for various temporal matching-based downstream tasks, i.e., object-level tracking tasks including video object tracking (VOT) and video object segmentation (VOS), self-supervised visual correspondence learning, dense tracking tasks including optical flow estimation and long-term point tracking, and 3D point cloud tracking. Specifically, our work explores to provide a general representation to boost the temporal matching ability in various downstream tracking tasks. To achieve this, we firstly find that a simple extension of MAE, which randomly masks out frame patches in videos and reconstruct the frame pixels, heavily relies on spatial cues while ignoring temporal relations for frame reconstruction, thus leading to sub-optimal temporal matching representations. To alleviate this, we propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos. We obtain several important findings with DropMAE: 1) DropMAE is a strong and efficient temporal matching learner, which achieves better fine-tuning results on matching-based tasks than the ImageNet-based MAE with 2x faster pre-training speed. 2) DropMAE is effective for different tracking tasks, i.e., object-level matching tasks including VOT and VOS, dense tracking tasks including optical flow estimation and tracking any point (TAP), and even 3D tracking in the different modality of point cloud data. Since none exists, we build ViT-based trackers for different downstream tracking tasks, and our pre-trained DropMAE model can be directly loaded in these ViT-based trackers for fine-tuning without further modifications. Experiments on 6 downstream tracking tasks demonstrate the effectiveness of DropMAE as a general pre-trained representation for diverse tracking tasks.

99.4CLApr 13Code
Mem$^2$Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation

Zihao Cheng, Zeming Liu, Yingyu Shan et al.

While large language model--powered agents can self-evolve by accumulating experience or by dynamically creating new assets (i.e., tools or expert agents), existing frameworks typically treat these two evolutionary processes in isolation. This separation overlooks their intrinsic interdependence: the former is inherently bounded by a manually predefined static toolset, while the latter generates new assets from scratch without experiential guidance, leading to limited capability growth and unstable evolution. To address this limitation, we introduce a novel paradigm of co-evolutionary Capability Expansion and Experience Distillation. Guided by this paradigm, we propose the \textbf{Mem$^{\textbf{2}}$Evolve}, which integrates two core components: \textbf{Experience Memory} and \textbf{Asset Memory}. Specifically, Mem$^{2}$Evolve leverages accumulated experience to guide the dynamic creation of assets, thereby expanding the agent's capability space while simultaneously acquiring new experience to achieve co-evolution. Extensive experiments across 6 task categories and 8 benchmarks demonstrate that Mem$^{2}$Evolve achieves improvement of 18.53\% over standard LLMs, 11.80\% over agents evolving solely through experience, and 6.46\% over those evolving solely through asset creation, establishing it as a substantially more effective and stable self-evolving agent framework. Code is available at: https://buaa-irip-llm.github.io/Mem2Evolve.

LGFeb 1, 2023
TAPAS: Fast and Automatic Derivation of Tensor Parallel Strategies for Large Neural Networks

Ziji Shi, Le Jiang, Ang Wang et al.

Tensor parallelism is an essential technique for distributed training of large neural networks. However, automatically determining an optimal tensor parallel strategy is challenging due to the gigantic search space, which grows exponentially with model size and tensor dimension. This prohibits the adoption of auto-parallel systems on larger models. We observe that neural networks usually contain repeated substructures, and build an automatic parallelism framework named TAPAS that eliminates redundant search efforts. TAPAS employs a divide-and-conquer approach that efficiently folds the search space by identifying those unique substructures. As a result, it runs at sub-linear complexity concerning the model size, making it a scalable solution for training large-scale networks. Our evaluations demonstrate that TAPAS outperforms the state-of-the-art automatic parallelism frameworks by up to $160\times$ in search speed on a wide range of models, and the performance of derived strategies is competitive or even better compared with the expert-engineered Megatron-LM library.

LGApr 11, 2023
Neural Delay Differential Equations: System Reconstruction and Image Classification

Qunxi Zhu, Yao Guo, Wei Lin

Neural Ordinary Differential Equations (NODEs), a framework of continuous-depth neural networks, have been widely applied, showing exceptional efficacy in coping with representative datasets. Recently, an augmented framework has been developed to overcome some limitations that emerged in the application of the original framework. In this paper, we propose a new class of continuous-depth neural networks with delay, named Neural Delay Differential Equations (NDDEs). To compute the corresponding gradients, we use the adjoint sensitivity method to obtain the delayed dynamics of the adjoint. Differential equations with delays are typically seen as dynamical systems of infinite dimension that possess more fruitful dynamics. Compared to NODEs, NDDEs have a stronger capacity of nonlinear representations. We use several illustrative examples to demonstrate this outstanding capacity. Firstly, we successfully model the delayed dynamics where the trajectories in the lower-dimensional phase space could be mutually intersected and even chaotic in a model-free or model-based manner. Traditional NODEs, without any argumentation, are not directly applicable for such modeling. Secondly, we achieve lower loss and higher accuracy not only for the data produced synthetically by complex models but also for the CIFAR10, a well-known image dataset. Our results on the NDDEs demonstrate that appropriately articulating the elements of dynamical systems into the network design is truly beneficial in promoting network performance.

CLOct 13, 2023
ChatKBQA: A Generate-then-Retrieve Framework for Knowledge Base Question Answering with Fine-tuned Large Language Models

Haoran Luo, Haihong E, Zichen Tang et al.

Knowledge Base Question Answering (KBQA) aims to answer natural language questions over large-scale knowledge bases (KBs), which can be summarized into two crucial steps: knowledge retrieval and semantic parsing. However, three core challenges remain: inefficient knowledge retrieval, mistakes of retrieval adversely impacting semantic parsing, and the complexity of previous KBQA methods. To tackle these challenges, we introduce ChatKBQA, a novel and simple generate-then-retrieve KBQA framework, which proposes first generating the logical form with fine-tuned LLMs, then retrieving and replacing entities and relations with an unsupervised retrieval method, to improve both generation and retrieval more directly. Experimental results show that ChatKBQA achieves new state-of-the-art performance on standard KBQA datasets, WebQSP, and CWQ. This work can also be regarded as a new paradigm for combining LLMs with knowledge graphs (KGs) for interpretable and knowledge-required question answering. Our code is publicly available.

CVMar 27, 2023
AIR-DA: Adversarial Image Reconstruction for Unsupervised Domain Adaptive Object Detection

Kunyang Sun, Wei Lin, Haoqin Shi et al.

Unsupervised domain adaptive object detection is a challenging vision task where object detectors are adapted from a label-rich source domain to an unlabeled target domain. Recent advances prove the efficacy of the adversarial based domain alignment where the adversarial training between the feature extractor and domain discriminator results in domain-invariance in the feature space. However, due to the domain shift, domain discrimination, especially on low-level features, is an easy task. This results in an imbalance of the adversarial training between the domain discriminator and the feature extractor. In this work, we achieve a better domain alignment by introducing an auxiliary regularization task to improve the training balance. Specifically, we propose Adversarial Image Reconstruction (AIR) as the regularizer to facilitate the adversarial training of the feature extractor. We further design a multi-level feature alignment module to enhance the adaptation performance. Our evaluations across several datasets of challenging domain shifts demonstrate that the proposed method outperforms all previous methods, of both one- and two-stage, in most settings.

58.5CVApr 12Code
DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain

Song Jin, Juntian Zhang, Xun Zhang et al.

Recent advancements in Vision-Language Models (VLMs) have revolutionized general visual understanding. However, their application in the food domain remains constrained by benchmarks that rely on coarse-grained categories, single-view imagery, and inaccurate metadata. To bridge this gap, we introduce DiningBench, a hierarchical, multi-view benchmark designed to evaluate VLMs across three levels of cognitive complexity: Fine-Grained Classification, Nutrition Estimation, and Visual Question Answering. Unlike previous datasets, DiningBench comprises 3,021 distinct dishes with an average of 5.27 images per entry, incorporating fine-grained "hard" negatives from identical menus and rigorous, verification-based nutritional data. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary models. Our experiments reveal that while current VLMs excel at general reasoning, they struggle significantly with fine-grained visual discrimination and precise nutritional reasoning. Furthermore, we systematically investigate the impact of multi-view inputs and Chain-of-Thought reasoning, identifying five primary failure modes. DiningBench serves as a challenging testbed to drive the next generation of food-centric VLM research. All codes are released in https://github.com/meituan/DiningBench.

LGMar 16, 2023
Embedding Theory of Reservoir Computing and Reducing Reservoir Network Using Time Delays

Xing-Yue Duan, Xiong Ying, Si-Yang Leng et al.

Reservoir computing (RC), a particular form of recurrent neural network, is under explosive development due to its exceptional efficacy and high performance in reconstruction or/and prediction of complex physical systems. However, the mechanism triggering such effective applications of RC is still unclear, awaiting deep and systematic exploration. Here, combining the delayed embedding theory with the generalized embedding theory, we rigorously prove that RC is essentially a high dimensional embedding of the original input nonlinear dynamical system. Thus, using this embedding property, we unify into a universal framework the standard RC and the time-delayed RC where we novelly introduce time delays only into the network's output layer, and we further find a trade-off relation between the time delays and the number of neurons in RC. Based on this finding, we significantly reduce the network size of RC for reconstructing and predicting some representative physical systems, and, more surprisingly, only using a single neuron reservoir with time delays is sometimes sufficient for achieving those tasks.

AIMar 3Code
SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training

Qi Zhang, Yifei Wang, Xiaohan Wang et al.

In recent years, pre-trained large language models have achieved remarkable success across diverse tasks. Besides the pivotal role of self-supervised pre-training, their effectiveness in downstream applications also depends critically on the post-training process, which adapts models to task-specific data and objectives. However, this process inevitably introduces model shifts that can influence performance in different domains, and how such shifts transfer remains poorly understood. To open up the black box, we propose the SAE-based Transferability Score (STS), a new metric that leverages sparse autoencoders (SAEs) to forecast post-training transferability. Taking supervised fine-tuning as an example, STS identifies shifted dimensions in SAE representations and calculates their correlations with downstream domains, enabling reliable estimation of transferability \textit{before} fine-tuning. Extensive experiments across multiple models and domains show that STS accurately predicts the transferability of supervised fine-tuning, achieving Pearson correlation coefficients above 0.7 with actual performance changes. Beyond this, we take an initial step toward extending STS to reinforcement learning. We believe that STS can serve as an {\color{black} interpretable} tool for guiding post-training strategies in LLMs. Code is available at https://github.com/PKU-ML/STS.

43.6CLMay 29
Are Full Rollouts Necessary for On-Policy Distillation?

Yaocheng Zhang, Jiajun Chai, Songjun Tu et al.

On-policy distillation (OPD) provides dense teacher feedback along rollouts generated by the student and has emerged as a promising post-training paradigm for long-horizon reasoning. However, standard OPD typically generates full rollouts during training, which is computationally expensive and may expose the student to unreliable teacher feedback at late rollout positions, especially during early training. We identify the rollout horizon as a key bottleneck in OPD that substantially impacts training efficiency. Unlike Reinforcement Learning with Verifiable Rewards (RLVR), OPD does not require a complete trajectory or a final answer reward to provide learning signals. This observation suggests that full rollouts may not always be necessary for effective OPD. Motivated by this insight, we propose two simple horizon-control strategies: Progressive OPD (POPD), which gradually expands the rollout horizon during training, and Truncated OPD (TOPD), which permanently performs distillation on reliable truncated rollouts. Experiments on mathematical reasoning show that POPD improves the training efficiency of OPD by up to 3$\times$, while TOPD matches OPD performance using only 10\% of the rollout horizon, leading to substantial wall-clock and memory reductions. These results demonstrate that controlling the rollout horizon offers a simple and practical path to more efficient OPD.

IRAug 7, 2023
Heterogeneous Knowledge Fusion: A Novel Approach for Personalized Recommendation via LLM

Bin Yin, Junjie Xie, Yu Qin et al.

The analysis and mining of user heterogeneous behavior are of paramount importance in recommendation systems. However, the conventional approach of incorporating various types of heterogeneous behavior into recommendation models leads to feature sparsity and knowledge fragmentation issues. To address this challenge, we propose a novel approach for personalized recommendation via Large Language Model (LLM), by extracting and fusing heterogeneous knowledge from user heterogeneous behavior information. In addition, by combining heterogeneous knowledge and recommendation tasks, instruction tuning is performed on LLM for personalized recommendations. The experimental results demonstrate that our method can effectively integrate user heterogeneous behavior and significantly improve recommendation performance.

97.6CLMay 18Code
Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning

Li Wang, Xiaohan Wang, Xiaodong Lu et al.

Large language models (LLMs) have increasingly leveraged tool invocation to enhance their reasoning capabilities. However, existing approaches typically tightly couple tool invocation with immediate execution. Such immediate tool interaction may disrupt the reasoning coherence of LLMs and constrain their expressivity, ultimately degrading reasoning performance. To this end, for the first time, we propose and formalize the problem of decoupling tool invocation from execution during reasoning, and introduce delayed execution with explicit control to enhance tool-integrated reasoning (TIR). Furthermore, we propose a hierarchical control framework and theoretically derive a surrogate loss that enables an implicitly hierarchical policy to learn behavior equivalent to that of an explicit hierarchical policy, leading to the proposed IH-GRPO algorithm. Extensive experiments on IH-GRPO achieve absolute improvements of 1.87\%, 2.16\%, and 2.53\% on Qwen3-1.7B, Qwen3-4B, and Qwen3-8B across six out-of-domain mathematical reasoning benchmarks over the strongest baseline method, while also yielding consistent performance gains in other domains. Our code is available at https://github.com/Lumina04/IH-GRPO-01.

99.5LGApr 15
$π$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

Yaocheng Zhang, Yuanheng Zhu, Wenyue Chong et al.

Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information for self-distillation: self-play can itself provide high-quality privileged context for the teacher model in a low-cost and scalable manner, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self-Play ($π$-Play), a multi-agent self-evolution framework. In $π$-Play, an examiner generates tasks together with their QCPs, and a teacher model leverages QCP as privileged context to densely supervise a student via self-distillation. This design transforms conventional sparse-reward self-play into a dense-feedback self-evolution loop. Extensive experiments show that data-free $π$-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2-3$\times$ over conventional self-play.

CVMar 26, 2025Code
Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai et al.

This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features: Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms the existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority. Comprehensiveness: Wan offers two capable models, i.e., 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks. Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range of consumer-grade GPUs. Openness: We open-source the entire series of Wan, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models. All the code and models are available at https://github.com/Wan-Video/Wan2.1.

42.5LGMay 22
AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning

Wei Lin, Yining Jiang, Qingyu Song et al.

Zeroth-Order (ZO) optimization has emerged as a promising solution for fine-tuning LLMs under strict memory constraints, as it avoids the prohibitive memory cost of storing activations for backpropagation. However, existing ZO methods typically employ isotropic perturbations, neglecting the rich structural information available during the forward pass. In this paper, we identify a crucial link between gradient formation and activation structure: the gradient of a linear layer is confined to the subspace spanned by its input activations. Leveraging this insight, we propose Activation-Guided Zeroth-Order optimization (AGZO). Unlike prior methods, AGZO extracts a compact, activation-informed subspace on the fly during the forward pass and restricts perturbations to this low-rank subspace. We provide a theoretical framework showing that AGZO optimizes a subspace-smoothed objective and provably yields update directions with higher cosine similarity to the true gradient than isotropic baselines. Empirically, we evaluate AGZO on Qwen3 and Pangu models across various benchmarks. AGZO consistently outperforms state-of-the-art ZO baselines and significantly narrows the performance gap with first-order fine-tuning, while maintaining almost the same peak memory footprint as other ZO methods.

CVSep 13, 2023
TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification

M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin et al.

Vision and Language Models (VLMs), such as CLIP, have enabled visual recognition of a potentially unlimited set of categories described by text prompts. However, for the best visual recognition performance, these models still require tuning to better fit the data distributions of the downstream tasks, in order to overcome the domain shift from the web-based pre-training data. Recently, it has been shown that it is possible to effectively tune VLMs without any paired data, and in particular to effectively improve VLMs visual recognition performance using text-only training data generated by Large Language Models (LLMs). In this paper, we dive deeper into this exciting text-only VLM training approach and explore ways it can be significantly further improved taking the specifics of the downstream task into account when sampling text data from LLMs. In particular, compared to the SOTA text-only VLM training approach, we demonstrate up to 8.4% performance improvement in (cross) domain-specific adaptation, up to 8.7% improvement in fine-grained recognition, and 3.1% overall average improvement in zero-shot classification compared to strong baselines.

LGFeb 13, 2023
Expediting Distributed DNN Training with Device Topology-Aware Graph Deployment

Shiwei Zhang, Xiaodong Yi, Lansong Diao et al.

This paper presents TAG, an automatic system to derive optimized DNN training graph and its deployment onto any device topology, for expedited training in device- and topology- heterogeneous ML clusters. We novelly combine both the DNN computation graph and the device topology graph as input to a graph neural network (GNN), and join the GNN with a search-based method to quickly identify optimized distributed training strategies. To reduce communication in a heterogeneous cluster, we further explore a lossless gradient compression technique and solve a combinatorial optimization problem to automatically apply the technique for training time minimization. We evaluate TAG with various representative DNN models and device topologies, showing that it can achieve up to 4.56x training speed-up as compared to existing schemes. TAG can produce efficient deployment strategies for both unseen DNN models and unseen device topologies, without heavy fine-tuning.

AIDec 18, 2025Code
ToolForge: A Data Synthesis Pipeline for Multi-Hop Search without Real-World APIs

Hao Chen, Zhexin Hu, Jiajun Chai et al.

Training LLMs to invoke tools and leverage retrieved information necessitates high-quality, diverse data. However, existing pipelines for synthetic data generation often rely on tens of thousands of real API calls to enhance generalization, incurring prohibitive costs while lacking multi-hop reasoning and self-reflection. To address these limitations, we introduce ToolForge, an automated synthesis framework that achieves strong real-world tool-calling performance by constructing only a small number of virtual tools, eliminating the need for real API calls. ToolForge leverages a (question, golden context, answer) triple to synthesize large-scale tool-learning data specifically designed for multi-hop search scenarios, further enriching the generated data through multi-hop reasoning and self-reflection mechanisms. To ensure data fidelity, we employ a Multi-Layer Validation Framework that integrates both rule-based and model-based assessments. Empirical results show that a model with only 8B parameters, when trained on our synthesized data, outperforms GPT-4o on multiple benchmarks. Our code and dataset are publicly available at https://github.com/Buycar-arb/ToolForge .

DCSep 19, 2023
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Haojun Xia, Zhen Zheng, Yuchao Li et al.

With the fast growth of parameter size, it becomes increasingly challenging to deploy large generative models as they typically require large GPU memory consumption and massive computation. Unstructured model pruning has been a common approach to reduce both GPU memory footprint and the overall computation while retaining good model accuracy. However, the existing solutions do not provide a highly-efficient support for handling unstructured sparsity on modern GPUs, especially on the highly-structured Tensor Core hardware. Therefore, we propose Flash-LLM for enabling low-cost and highly-efficient large generative model inference with the sophisticated support of unstructured sparsity on high-performance but highly restrictive Tensor Cores. Based on our key observation that the main bottleneck of generative model inference is the several skinny matrix multiplications for which Tensor Cores would be significantly under-utilized due to low computational intensity, we propose a general Load-as-Sparse and Compute-as-Dense methodology for unstructured sparse matrix multiplication. The basic insight is to address the significant memory bandwidth bottleneck while tolerating redundant computations that are not critical for end-to-end performance on Tensor Cores. Based on this, we design an effective software framework for Tensor Core based unstructured SpMM, leveraging on-chip resources for efficient sparse data extraction and computation/memory-access overlapping. At SpMM kernel level, Flash-LLM significantly outperforms the state-of-the-art library, i.e., Sputnik and SparTA by an average of 2.9x and 1.5x, respectively. At end-to-end framework level on OPT-30B/66B/175B models, for tokens per GPU-second, Flash-LLM achieves up to 3.8x and 3.6x improvement over DeepSpeed and FasterTransformer, respectively, with significantly lower inference cost.

93.1AIApr 19
AutoSearch: Adaptive Search Depth for Efficient Agentic RAG via Reinforcement Learning

Jingbo Sun, Wenyue Chong, Songjun Tu et al.

Agentic retrieval-augmented generation (RAG) systems enable large language models (LLMs) to solve complex tasks through multi-step interaction with external retrieval tools. However, such multi-step interaction often involves redundant search steps, incurring substantial computational cost and latency. Prior work limits search depth (i.e., the number of search steps) to reduce cost, but this often leads to underexploration of complex questions. To address this, we first investigate how search depth affects accuracy and find a minimal sufficient search depth that defines an accuracy-efficiency trade-off, jointly determined by question complexity and the agent's capability. Furthermore, we propose AutoSearch, a reinforcement learning (RL) framework that evaluates each search step via self-generated intermediate answers. By a self-answering mechanism, AutoSearch identifies the minimal sufficient search depth and promotes efficient search by rewarding its attainment while penalizing over-searching. In addition, reward mechanisms are introduced to stabilize search behavior and improve answer quality on complex questions. Extensive experiments on multiple benchmarks show that AutoSearch achieves a superior accuracy-efficiency trade-off, alleviating over-searching while preserving search quality.

DCFeb 16, 2023
Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform

Shiwei Zhang, Lansong Diao, Siyu Wang et al.

We present Rhino, a system for accelerating tensor programs with automatic parallelization on AI platform for real production environment. It transforms a tensor program written for a single device into an equivalent distributed program that is capable of scaling up to thousands of devices with no user configuration. Rhino firstly works on a semantically independent intermediate representation of tensor programs, which facilitates its generalization to unprecedented applications. Additionally, it implements a task-oriented controller and a distributed runtime for optimal performance. Rhino explores on a complete and systematic parallelization strategy space that comprises all the paradigms commonly employed in deep learning (DL), in addition to strided partitioning and pipeline parallelism on non-linear models. Aiming to efficiently search for a near-optimal parallel execution plan, our analysis of production clusters reveals general heuristics to speed up the strategy search. On top of it, two optimization levels are designed to offer users flexible trade-offs between the search time and strategy quality. Our experiments demonstrate that Rhino can not only re-discover the expert-crafted strategies of classic, research and production DL models, but also identify novel parallelization strategies which surpass existing systems for novel models.

CVOct 7, 2022
Multi-Frequency-Aware Patch Adversarial Learning for Neural Point Cloud Rendering

Jay Karhade, Haiyue Zhu, Ka-Shing Chung et al.

We present a neural point cloud rendering pipeline through a novel multi-frequency-aware patch adversarial learning framework. The proposed approach aims to improve the rendering realness by minimizing the spectrum discrepancy between real and synthesized images, especially on the high-frequency localized sharpness information which causes image blur visually. Specifically, a patch multi-discriminator scheme is proposed for the adversarial learning, which combines both spectral domain (Fourier Transform and Discrete Wavelet Transform) discriminators as well as the spatial (RGB) domain discriminator to force the generator to capture global and local spectral distributions of the real images. The proposed multi-discriminator scheme not only helps to improve rendering realness, but also enhance the convergence speed and stability of adversarial learning. Moreover, we introduce a noise-resistant voxelisation approach by utilizing both the appearance distance and spatial distance to exclude the spatial outlier points caused by depth noise. Our entire architecture is fully differentiable and can be learned in an end-to-end fashion. Extensive experiments show that our method produces state-of-the-art results for neural point cloud rendering by a significant margin. Our source code will be made public at a later date.

82.9IRMay 12Code
RecRM-Bench: Benchmarking Multidimensional Reward Modeling for Agentic Recommender Systems

Wenwen Zeng, Jinhui Zhang, Hao Chen et al.

The integration of Large Language Model (LLM) agents is transforming recommender systems from simple query-item matching towards deeply personalized and interactive recommendations. Reinforcement Learning (RL) provides an essential framework for the optimization of these agents in recommendation tasks. However, current methodologies remain limited by a reliance on single dimensional outcome-based rewards that focus exclusively on final user interactions, overlooking critical intermediate capabilities, such as instruction following and complex intent understanding. Despite the necessity for designing multi-dimensional reward, the field lacks a standardized benchmark to facilitate this development. To bridge this gap, we introduce RecRM-Bench, the largest and most comprehensive benchmark to date for agentic recommender systems. It comprises over 1 million structured entries across four core evaluation dimensions: instruction following, factual consistency, query-item relevance, and fine-grained user behavior prediction. By supporting comprehensive assessment from syntactic compliance to complex intent grounding and preference modeling, RecRM-Bench provides a foundational dataset for training sophisticated reward models. Furthermore, we propose a systematic framework for the construction of multi-dimensional reward models and the integration of a hybrid reward function, establishing a robust foundation for developing reliable and highly capable agentic recommender systems. The complete RecRM-Bench dataset is publicly available at https://huggingface.co/datasets/wwzeng/RecRM-Bench.

40.5CVMar 14Code
GenLie: A Global-Enhanced Lie Detection Network under Sparsity and Semantic Interference

Zongshun Zhang, Yao Liu, Qiao Liu et al.

Video-based lie detection aims to identify deceptive behaviors from visual cues. Despite recent progress, its core challenge lies in learning sparse yet discriminative representations. Deceptive signals are typically subtle and short-lived, easily overwhelmed by redundant information, while individual and contextual variations introduce strong identity-related noise. To address this issue, we propose GenLie, a Global-Enhanced Lie Detection Network that performs local feature modeling under global supervision. Specifically, sparse and subtle deceptive cues are captured at the local level, while global supervision and optimization ensure robust and discriminative representations by suppressing identity-related noise. Experiments on three public datasets, covering both high- and low-stakes scenarios, show that GenLie consistently outperforms state-of-the-art methods. Source code is available at https://github.com/AliasDictusZ1/GenLie.

AIJan 12
Beyond Dialogue Time: Temporal Semantic Memory for Personalized LLM Agents

Miao Su, Yucan Guo, Zhongni Hou et al.

Memory enables Large Language Model (LLM) agents to perceive, store, and use information from past dialogues, which is essential for personalization. However, existing methods fail to properly model the temporal dimension of memory in two aspects: 1) Temporal inaccuracy: memories are organized by dialogue time rather than their actual occurrence time; 2) Temporal fragmentation: existing methods focus on point-wise memory, losing durative information that captures persistent states and evolving patterns. To address these limitations, we propose Temporal Semantic Memory (TSM), a memory framework that models semantic time for point-wise memory and supports the construction and utilization of durative memory. During memory construction, it first builds a semantic timeline rather than a dialogue one. Then, it consolidates temporally continuous and semantically related information into a durative memory. During memory utilization, it incorporates the query's temporal intent on the semantic timeline, enabling the retrieval of temporally appropriate durative memories and providing time-valid, duration-consistent context to support response generation. Experiments on LongMemEval and LoCoMo show that TSM consistently outperforms existing methods and achieves up to 12.2% absolute improvement in accuracy, demonstrating the effectiveness of the proposed method.

RODec 4, 2025
From Generated Human Videos to Physically Plausible Robot Trajectories

James Ni, Zekai Wang, Wei Lin et al.

Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts, holding the potential to serve as high-level planners for contextual robot control. To realize this potential, a key research question remains open: how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. To address this, we introduce a two-stage pipeline. First, we lift video pixels into a 4D human representation and then retarget to the humanoid morphology. Second, we propose GenMimic-a physics-aware reinforcement learning policy conditioned on 3D keypoints, and trained with symmetry regularization and keypoint-weighted tracking rewards. As a result, GenMimic can mimic human actions from noisy, generated videos. We curate GenMimicBench, a synthetic human-motion dataset generated using two video generation models across a spectrum of actions and contexts, establishing a benchmark for assessing zero-shot generalization and policy robustness. Extensive experiments demonstrate improvements over strong baselines in simulation and confirm coherent, physically stable motion tracking on a Unitree G1 humanoid robot without fine-tuning. This work offers a promising path to realizing the potential of video generation models as high-level policies for robot control.

LGSep 24, 2023
Accelerating Large Batch Training via Gradient Signal to Noise Ratio (GSNR)

Guo-qing Jiang, Jinlong Liu, Zixiang Ding et al.

As models for nature language processing (NLP), computer vision (CV) and recommendation systems (RS) require surging computation, a large number of GPUs/TPUs are paralleled as a large batch (LB) to improve training throughput. However, training such LB tasks often meets large generalization gap and downgrades final precision, which limits enlarging the batch size. In this work, we develop the variance reduced gradient descent technique (VRGD) based on the gradient signal to noise ratio (GSNR) and apply it onto popular optimizers such as SGD/Adam/LARS/LAMB. We carry out a theoretical analysis of convergence rate to explain its fast training dynamics, and a generalization analysis to demonstrate its smaller generalization gap on LB training. Comprehensive experiments demonstrate that VRGD can accelerate training ($1\sim 2 \times$), narrow generalization gap and improve final accuracy. We push the batch size limit of BERT pretraining up to 128k/64k and DLRM to 512k without noticeable accuracy loss. We improve ImageNet Top-1 accuracy at 96k by $0.52pp$ than LARS. The generalization gap of BERT and ImageNet training is significantly reduce by over $65\%$.

AIOct 8, 2023
Text2NKG: Fine-Grained N-ary Relation Extraction for N-ary relational Knowledge Graph Construction

Haoran Luo, Haihong E, Yuhao Yang et al.

Beyond traditional binary relational facts, n-ary relational knowledge graphs (NKGs) are comprised of n-ary relational facts containing more than two entities, which are closer to real-world facts with broader applications. However, the construction of NKGs remains at a coarse-grained level, which is always in a single schema, ignoring the order and variable arity of entities. To address these restrictions, we propose Text2NKG, a novel fine-grained n-ary relation extraction framework for n-ary relational knowledge graph construction. We introduce a span-tuple classification approach with hetero-ordered merging and output merging to accomplish fine-grained n-ary relation extraction in different arity. Furthermore, Text2NKG supports four typical NKG schemas: hyper-relational schema, event-based schema, role-based schema, and hypergraph-based schema, with high flexibility and practicality. The experimental results demonstrate that Text2NKG achieves state-of-the-art performance in F1 scores on the fine-grained n-ary relation extraction benchmark. Our code and datasets are publicly available.

LGJul 29, 2024
Adaptive Self-supervised Robust Clustering for Unstructured Data with Unknown Cluster Number

Chen-Lu Ding, Jiancan Wu, Wei Lin et al.

We introduce a novel self-supervised deep clustering approach tailored for unstructured data without requiring prior knowledge of the number of clusters, termed Adaptive Self-supervised Robust Clustering (ASRC). In particular, ASRC adaptively learns the graph structure and edge weights to capture both local and global structural information. The obtained graph enables us to learn clustering-friendly feature representations by an enhanced graph auto-encoder with contrastive learning technique. It further leverages the clustering results adaptively obtained by robust continuous clustering (RCC) to generate prototypes for negative sampling, which can further contribute to promoting consistency among positive pairs and enlarging the gap between positive and negative samples. ASRC obtains the final clustering results by applying RCC to the learned feature representations with their consistent graph structure and edge weights. Extensive experiments conducted on seven benchmark datasets demonstrate the efficacy of ASRC, demonstrating its superior performance over other popular clustering models. Notably, ASRC even outperforms methods that rely on prior knowledge of the number of clusters, highlighting its effectiveness in addressing the challenges of clustering unstructured data.

LGJan 13
Your Group-Relative Advantage Is Biased

Fengkai Yang, Zherui Chen, Xiaohan Wang et al.

Reinforcement Learning from Verifier Rewards (RLVR) has emerged as a widely used approach for post-training large language models on reasoning tasks, with group-based methods such as GRPO and its variants gaining broad adoption. These methods rely on group-relative advantage estimation to avoid learned critics, yet its theoretical properties remain poorly understood. In this work, we uncover a fundamental issue of group-based RL: the group-relative advantage estimator is inherently biased relative to the true (expected) advantage. We provide the first theoretical analysis showing that it systematically underestimates advantages for hard prompts and overestimates them for easy prompts, leading to imbalanced exploration and exploitation. To address this issue, we propose History-Aware Adaptive Difficulty Weighting (HA-DW), an adaptive reweighting scheme that adjusts advantage estimates based on an evolving difficulty anchor and training dynamics. Both theoretical analysis and experiments on five mathematical reasoning benchmarks demonstrate that HA-DW consistently improves performance when integrated into GRPO and its variants. Our results suggest that correcting biased advantage estimation is critical for robust and efficient RLVR training.

98.1LGMay 1Code
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

Zihan Lin, Xiaohan Wang, Jie Cao et al.

Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning of Large Language Models (LLMs) but usually exhibits limited generation diversity due to the over-incentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes negative sample projection Residual Reinforcement Learning (ResRL) that decouples similar semantic distributions among positive and negative responses. We theoretically link Lazy Likelihood Displacement (LLD) to negative-positive head-gradient interference and derive a single-forward proxy that upper-bounds representation alignment to guide conservative advantage reweighting. ResRL then projects negative-token hidden representations onto an SVD-based low-rank positive subspace and uses projection residuals to modulate negative gradients, improving reasoning while preserving diversity and outperforming strong baselines on average across twelve benchmarks spanning Mathematics, Code, Agent Tasks, and Function Calling. Notably, ResRL surpasses NSR on mathematical reasoning by 9.4\% in Avg@16 and 7.0\% in Pass@128. Code is available at https://github.com/1229095296/ResRL.git.

IRAug 21, 2024
LARR: Large Language Model Aided Real-time Scene Recommendation with Semantic Understanding

Zhizhong Wan, Bin Yin, Junjie Xie et al.

Click-Through Rate (CTR) prediction is crucial for Recommendation System(RS), aiming to provide personalized recommendation services for users in many aspects such as food delivery, e-commerce and so on. However, traditional RS relies on collaborative signals, which lacks semantic understanding to real-time scenes. We also noticed that a major challenge in utilizing Large Language Models (LLMs) for practical recommendation purposes is their efficiency in dealing with long text input. To break through the problems above, we propose Large Language Model Aided Real-time Scene Recommendation(LARR), adopt LLMs for semantic understanding, utilizing real-time scene information in RS without requiring LLM to process the entire real-time scene text directly, thereby enhancing the efficiency of LLM-based CTR modeling. Specifically, recommendation domain-specific knowledge is injected into LLM and then RS employs an aggregation encoder to build real-time scene information from separate LLM's outputs. Firstly, a LLM is continual pretrained on corpus built from recommendation data with the aid of special tokens. Subsequently, the LLM is fine-tuned via contrastive learning on three kinds of sample construction strategies. Through this step, LLM is transformed into a text embedding model. Finally, LLM's separate outputs for different scene features are aggregated by an encoder, aligning to collaborative signals in RS, enhancing the performance of recommendation model.

58.6CLMay 20
Terminal-World: Scaling Terminal-Agent Environments via Agent Skills

Zihao Cheng, Hongru Wang, Zeming Liu et al.

Terminal agents extend Large Language Models with the ability to execute tasks directly in command-line environments, but their progress is bottlenecked by the scarcity of high-quality training data. Existing approaches bootstrap from partial sources such as human-defined seeds or GitHub repositories to instantiate one component and then complete the rest, producing tasks confined to narrow seed distributions, environments misaligned with task semantics, and inefficient trajectories from unguided exploration. To address these limitations, we introduce Terminal-World, a fully automated pipeline that uses agent skills as the central synthesis primitive, which jointly encode what to accomplish, when to apply (preconditions and environment state), and how to execute, enabling task instructions, environments, and teacher trajectories to be co-derived. To further broaden the synthesis space, Terminal-World composes skills into skill teams and skill graphs for multi-role and cross-domain task synthesis. Using this pipeline, we construct 5,723 training environments and train Terminal-World-8B/14B/32B, evaluated across 6 benchmarks where the Terminal-World series consistently outperforms terminal-agent baselines. Notably, using the same teacher model and only 1.2% of the training data, Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 by +4.5 Pass@1 (31.5) and achieves 43.8 Pass@3.

LGFeb 4, 2024Code
Arithmetic Feature Interaction Is Necessary for Deep Tabular Learning

Yi Cheng, Renjun Hu, Haochao Ying et al.

Until recently, the question of the effective inductive bias of deep models on tabular data has remained unanswered. This paper investigates the hypothesis that arithmetic feature interaction is necessary for deep tabular learning. To test this point, we create a synthetic tabular dataset with a mild feature interaction assumption and examine a modified transformer architecture enabling arithmetical feature interactions, referred to as AMFormer. Results show that AMFormer outperforms strong counterparts in fine-grained tabular data modeling, data efficiency in training, and generalization. This is attributed to its parallel additive and multiplicative attention operators and prompt-based optimization, which facilitate the separation of tabular samples in an extended space with arithmetically-engineered features. Our extensive experiments on real-world data also validate the consistent effectiveness, efficiency, and rationale of AMFormer, suggesting it has established a strong inductive bias for deep learning on tabular data. Code is available at https://github.com/aigc-apps/AMFormer.

28.0LGMar 16
Ultra-Early Prediction of Tipping Points: Integrating Dynamical Measures with Reservoir Computing

Xin Li, Qunxi Zhu, Chengli Zhao et al.

Complex dynamical systems-such as climate, ecosystems, and economics-can undergo catastrophic and potentially irreversible regime changes, often triggered by environmental parameter drift and stochastic disturbances. These critical thresholds, known as tipping points, pose a prediction problem of both theoretical and practical significance, yet remain largely unresolved. To address this, we articulate a model-free framework that integrates the measures characterizing the stability and sensitivity of dynamical systems with the reservoir computing (RC), a lightweight machine learning technique, using only observational time series data. The framework consists of two stages. The first stage involves using RC to robustly learn local complex dynamics from observational data segmented into windows. The second stage focuses on accurately detecting early warning signals of tipping points by analyzing the learned autonomous RC dynamics through dynamical measures, including the dominant eigenvalue of the Jacobian matrix, the maximum Floquet multiplier, and the maximum Lyapunov exponent. Furthermore, when these dynamical measures exhibit trend-like patterns, their extrapolation enables ultra-early prediction of tipping points significantly prior to the occurrence of critical transitions. We conduct a rigorous theoretical analysis of the proposed method and perform extensive numerical evaluations on a series of representative synthetic systems and eight real-world datasets, as well as quantitatively predict the tipping time of the Atlantic Meridional Overturning Circulation system. Experimental results demonstrate that our framework exhibits advantages over the baselines in comprehensive evaluations, particularly in terms of dynamical interpretability, prediction stability and robustness, and ultra-early prediction capability.

CVOct 14, 2024Code
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content

Nimrod Shabtay, Felipe Maia Polo, Sivan Doveh et al.

The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the required world knowledge to perform effectively on multiple downstream tasks. However, one downside of scraping data from the web can be the potential sacrifice of the benchmarks on which the abilities of these models are often evaluated. To safeguard against test data contamination and to truly test the abilities of these foundation models we propose LiveXiv: A scalable evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs (VQA). This is done without any human-in-the-loop, using the multi-modal content in the manuscripts, like graphs, charts, and tables. Moreover, we introduce an efficient evaluation approach that estimates the performance of all models on the evolving benchmark using evaluations of only a subset of models. This significantly reduces the overall evaluation cost. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models true abilities, avoiding contamination. Lastly, in our commitment to high quality, we have collected and evaluated a manually verified subset. By comparing its overall results to our automatic annotations, we have found that the performance variance is indeed minimal (<2.5%). Our dataset is available online on HuggingFace, and our code will be available here.

32.4CVMar 17
Exclusivity-Guided Mask Learning for Semi-Supervised Crowd Instance Segmentation and Counting

Jiyang Huang, Hongru Cheng, Wei Lin et al.

Semi-supervised crowd analysis is a prominent area of research, as unlabeled data are typically abundant and inexpensive to obtain. However, traditional point-based annotations constrain performance because individual regions are inherently ambiguous, and consequently, learning fine-grained structural semantics from sparse anno tations remains an unresolved challenge. In this paper, we first propose an Exclusion-Constrained Dual-Prompt SAM (EDP-SAM), based on our Nearest Neighbor Exclusion Circle (NNEC) constraint, to generate mask supervision for current datasets. With the aim of segmenting individuals in dense scenes, we then propose Exclusivity-Guided Mask Learning (XMask), which enforces spatial separation through a discriminative mask objective. Gaussian smoothing and a differentiable center sampling strategy are utilized to improve feature continuity and training stability. Building on XMask, we present a semi-supervised crowd counting framework that uses instance mask priors as pseudo-labels, which contain richer shape information than traditional point cues. Extensive experiments on the ShanghaiTech A, UCF-QNRF, and JHU++ datasets (using 5%, 10%, and 40% labeled data) verify that our end-to-end model achieves state-of-the-art semi-supervised segmentation and counting performance, effectively bridging the gap between counting and instance segmentation within a unified framework.

LGFeb 9
Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards

Xiaodong Lu, Xiaohan Wang, Jiajun Chai et al.

Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. However, existing RLVR methods utilize rollouts in an indiscriminate and short-horizon manner: responses of heterogeneous quality within each prompt are treated uniformly, and historical rollouts are discarded after a single use. This leads to noisy supervision, poor sample efficiency, and suboptimal policy updates. We address these issues by formulating rollout scheduling in RLVR as a contextual bandit problem and proposing a unified neural scheduling framework that adaptively selects high-value rollouts throughout training. Each rollout is treated as an arm whose reward is defined by the induced performance gain between consecutive optimization steps. The resulting scheduler supports both noise-aware intra-group selection and adaptive global reuse of historical rollouts within a single principled framework. We provide theoretical justification by deriving sublinear regret bounds and showing that enlarging the rollout buffer improves the achievable performance upper bound. Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.

96.3AIMay 18
AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment

Zhenlin Wei, Pu Jian, Yingzhuo Deng et al.

The alignment of Large Language Models (LLMs) for complex reasoning heavily relies on Reinforcement Learning with Verifiable Rewards (RLVR). However, standard algorithms like GRPO apply sequence-level rewards uniformly to all tokens, creating a severe credit-assignment bottleneck. While on-policy self-distillation attempts to resolve this by conditioning a self-teacher on privileged contexts, direct exposure to raw oracle solutions often induces over-conditioned teacher distributions, implicit answer leakage, and late-stage training collapse. To overcome these limitations, we propose Asymmetric Meta-Reflective Self-Distillation (AMR-SD). Instead of conditioning directly on raw reference traces, AMR-SD inserts a reflection bottleneck: it compresses diagnostic signals -- from verifier outcomes, peer rollouts, or reference feedback -- into concise, self-generated Socratic hints and critiques. Furthermore, we introduce Causal Information Gain (CIG) with an asymmetric, ReLU-gated threshold to translate these reflections into sparse, highly precise token-level advantage modulations. Combined with temporal annealing, this mechanism preserves the base environmental reward while filtering out distributional noise. Experiments across scientific, mathematical, and tool-use benchmarks demonstrate that AMR-SD significantly outperforms existing baselines, achieving robust long-horizon stability and successfully preventing late-stage collapse.