CLApr 27
BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent BenchmarksXinming Tu, Tianze Wang, Yingzhou et al.
As benchmarks grow in complexity, many apparent agent failures are not failures of the agent at all - they are failures of the benchmark itself: broken specifications, implicit assumptions, and rigid evaluation scripts that penalize valid alternative approaches. We propose employing frontier LLMs as systematic auditors of evaluation infrastructure, and realize this vision through BenchGuard, the first automated auditing framework for task-oriented, execution-based agent benchmarks. BenchGuard cross-verifies all benchmark artifacts via structured LLM protocols, optionally incorporating agent solutions or execution traces as additional diagnostic evidence. Deployed on two prominent scientific benchmarks, BenchGuard identified 12 author-confirmed issues in ScienceAgentBench - including fatal errors rendering tasks unsolvable - and exactly matched 83.3% of expert-identified issues on the BIXBench Verified-50 subset, catching defects that prior human review missed entirely. A full audit of 50 complex bioinformatics tasks costs under USD 15, making automated benchmark auditing a practical and valuable complement to human review. These findings point toward AI-assisted benchmark development, where frontier models serve not only as subjects of evaluation but as active participants in validating the evaluation infrastructure itself.
LGOct 16, 2024Code
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language ModelsPeng Xia, Kangyu Zhu, Haoran Li et al.
Artificial Intelligence (AI) has demonstrated significant potential in healthcare, particularly in disease diagnosis and treatment planning. Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools. However, these models often suffer from factual hallucination, which can lead to incorrect diagnoses. Fine-tuning and retrieval-augmented generation (RAG) have emerged as methods to address these issues. However, the amount of high-quality data and distribution shifts between training data and deployment data limit the application of fine-tuning methods. Although RAG is lightweight and effective, existing RAG-based approaches are not sufficiently general to different medical domains and can potentially cause misalignment issues, both between modalities and between the model and the ground truth. In this paper, we propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs. Our approach introduces a domain-aware retrieval mechanism, an adaptive retrieved contexts selection method, and a provable RAG-based preference fine-tuning strategy. These innovations make the RAG process sufficiently general and reliable, significantly improving alignment when introducing retrieved contexts. Experimental results across five medical datasets (involving radiology, ophthalmology, pathology) on medical VQA and report generation demonstrate that MMed-RAG can achieve an average improvement of 43.8% in the factual accuracy of Med-LVLMs. Our data and code are available in https://github.com/richard-peng-xia/MMed-RAG.
LGFeb 19
Be Wary of Your Time Series PreprocessingSofiane Ennadir, Tianze Wang, Oleg Smirnov et al.
Normalization and scaling are fundamental preprocessing steps in time series modeling, yet their role in Transformer-based models remains underexplored from a theoretical perspective. In this work, we present the first formal analysis of how different normalization strategies, specifically instance-based and global scaling, impact the expressivity of Transformer-based architectures for time series representation learning. We propose a novel expressivity framework tailored to time series, which quantifies a model's ability to distinguish between similar and dissimilar inputs in the representation space. Using this framework, we derive theoretical bounds for two widely used normalization methods: Standard and Min-Max scaling. Our analysis reveals that the choice of normalization strategy can significantly influence the model's representational capacity, depending on the task and data characteristics. We complement our theory with empirical validation on classification and forecasting benchmarks using multiple Transformer-based models. Our results show that no single normalization method consistently outperforms others, and in some cases, omitting normalization entirely leads to superior performance. These findings highlight the critical role of preprocessing in time series learning and motivate the need for more principled normalization strategies tailored to specific tasks and datasets.
LGNov 6, 2025
Frequency Matters: When Time Series Foundation Models Fail Under Spectral ShiftTianze Wang, Sofiane Ennadir, John Pertoft et al.
Time series foundation models (TSFMs) have shown strong results on public benchmarks, prompting comparisons to a "BERT moment" for time series. Their effectiveness in industrial settings, however, remains uncertain. We examine why TSFMs often struggle to generalize and highlight spectral shift (a mismatch between the dominant frequency components in downstream tasks and those represented during pretraining) as a key factor. We present evidence from an industrial-scale player engagement prediction task in mobile gaming, where TSFMs underperform domain-adapted baselines. To isolate the mechanism, we design controlled synthetic experiments contrasting signals with seen versus unseen frequency bands, observing systematic degradation under spectral mismatch. These findings position frequency awareness as critical for robust TSFM deployment and motivate new pretraining and evaluation protocols that explicitly account for spectral diversity.
CVNov 13, 2025
HeatV2X: Scalable Heterogeneous Collaborative Perception via Efficient Alignment and InteractionYueran Zhao, Zhang Zhang, Chao Sun et al.
Vehicle-to-Everything (V2X) collaborative perception extends sensing beyond single vehicle limits through transmission. However, as more agents participate, existing frameworks face two key challenges: (1) the participating agents are inherently multi-modal and heterogeneous, and (2) the collaborative framework must be scalable to accommodate new agents. The former requires effective cross-agent feature alignment to mitigate heterogeneity loss, while the latter renders full-parameter training impractical, highlighting the importance of scalable adaptation. To address these issues, we propose Heterogeneous Adaptation (HeatV2X), a scalable collaborative framework. We first train a high-performance agent based on heterogeneous graph attention as the foundation for collaborative learning. Then, we design Local Heterogeneous Fine-Tuning and Global Collaborative Fine-Tuning to achieve effective alignment and interaction among heterogeneous agents. The former efficiently extracts modality-specific differences using Hetero-Aware Adapters, while the latter employs the Multi-Cognitive Adapter to enhance cross-agent collaboration and fully exploit the fusion potential. These designs enable substantial performance improvement of the collaborative framework with minimal training cost. We evaluate our approach on the OPV2V-H and DAIR-V2X datasets. Experimental results demonstrate that our method achieves superior perception performance with significantly reduced training overhead, outperforming existing state-of-the-art approaches. Our implementation will be released soon.
CLFeb 16, 2025
RoseRAG: Robust Retrieval-augmented Generation with Small-scale LLMs via Margin-aware Preference OptimizationTianci Liu, Haoxiang Jiang, Tianze Wang et al.
Large language models (LLMs) have achieved impressive performance but face high computational costs and latency, limiting their deployment in resource-constrained settings. In contrast, small-scale LLMs (SLMs) are more efficient yet struggle to capture evolving real-world knowledge. Retrieval-augmented generation (RAG) helps by integrating external knowledge, but imperfect retrieval can introduce distracting noise that misleads SLMs. We propose RoseRAG, a robust RAG framework for SLMs via Margin-aware Preference Optimization. RoseRAG employs multi-turn prompting for detailed reasoning, rejection sampling for high-quality explanations, and contrastive preference selection to refine responses by maximizing the likelihood gap between preferred and non-preferred outputs. By integrating these components into a margin-aware optimization process, RoseRAG robustly enhances the accuracy and reliability of SLMs for RAG applications. Extensive experiments on three open-domain question answering benchmarks indicate that our innovative RoseRAG surpasses state-of-the-art baselines significantly.
LGApr 5, 2024
player2vec: A Language Modeling Approach to Understand Player Behavior in GamesTianze Wang, Maryam Honari-Jahromi, Styliani Katsarou et al.
Methods for learning latent user representations from historical behavior logs have gained traction for recommendation tasks in e-commerce, content streaming, and other settings. However, this area still remains relatively underexplored in video and mobile gaming contexts. In this work, we present a novel method for overcoming this limitation by extending a long-range Transformer model from the natural language processing domain to player behavior data. We discuss specifics of behavior tracking in games and propose preprocessing and tokenization approaches by viewing in-game events in an analogous way to words in sentences, thus enabling learning player representations in a self-supervised manner in the absence of ground-truth annotations. We experimentally demonstrate the efficacy of the proposed approach in fitting the distribution of behavior events by evaluating intrinsic language modeling metrics. Furthermore, we qualitatively analyze the emerging structure of the learned embedding space and show its value for generating insights into behavior patterns to inform downstream applications.
CLFeb 25, 2025
MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference AlignmentTianze Wang, Dongnan Gui, Yifan Hu et al.
Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning large language models (LLMs). Yet its reliance on a singular reward model often overlooks the diversity of human preferences. Recent approaches address this limitation by leveraging multi-dimensional feedback to fine-tune corresponding reward models and train LLMs using reinforcement learning. However, the process is costly and unstable, especially given the competing and heterogeneous nature of human preferences. In this paper, we propose Mixing Preference Optimization (MPO), a post-processing framework for aggregating single-objective policies as an alternative to both multi-objective RLHF (MORLHF) and MaxMin-RLHF. MPO avoids alignment from scratch. Instead, it log-linearly combines existing policies into a unified one with the weight of each policy computed via a batch stochastic mirror descent. Empirical results demonstrate that MPO achieves balanced performance across diverse preferences, outperforming or matching existing models with significantly reduced computational costs.
LGDec 5, 2024
Expressivity of Representation Learning on Continuous-Time Dynamic Graphs: An Information-Flow Centric ReviewSofiane Ennadir, Gabriela Zarzar Gandler, Filip Cornell et al.
Graphs are ubiquitous in real-world applications, ranging from social networks to biological systems, and have inspired the development of Graph Neural Networks (GNNs) for learning expressive representations. While most research has centered on static graphs, many real-world scenarios involve dynamic, temporally evolving graphs, motivating the need for Continuous-Time Dynamic Graph (CTDG) models. This paper provides a comprehensive review of Graph Representation Learning (GRL) on CTDGs with a focus on Self-Supervised Representation Learning (SSRL). We introduce a novel theoretical framework that analyzes the expressivity of CTDG models through an Information-Flow (IF) lens, quantifying their ability to propagate and encode temporal and structural information. Leveraging this framework, we categorize existing CTDG methods based on their suitability for different graph types and application scenarios. Within the same scope, we examine the design of SSRL methods tailored to CTDGs, such as predictive and contrastive approaches, highlighting their potential to mitigate the reliance on labeled data. Empirical evaluations on synthetic and real-world datasets validate our theoretical insights, demonstrating the strengths and limitations of various methods across long-range, bi-partite and community-based graphs. This work offers both a theoretical foundation and practical guidance for selecting and developing CTDG models, advancing the understanding of GRL in dynamic settings.
CVMar 13, 2025
HeightFormer: Learning Height Prediction in Voxel Features for Roadside Vision Centric 3D Object Detection via TransformerZhang Zhang, Chao Sun, Chao Yue et al.
Roadside vision centric 3D object detection has received increasing attention in recent years. It expands the perception range of autonomous vehicles, enhances the road safety. Previous methods focused on predicting per-pixel height rather than depth, making significant gains in roadside visual perception. While it is limited by the perspective property of near-large and far-small on image features, making it difficult for network to understand real dimension of objects in the 3D world. BEV features and voxel features present the real distribution of objects in 3D world compared to the image features. However, BEV features tend to lose details due to the lack of explicit height information, and voxel features are computationally expensive. Inspired by this insight, an efficient framework learning height prediction in voxel features via transformer is proposed, dubbed HeightFormer. It groups the voxel features into local height sequences, and utilize attention mechanism to obtain height distribution prediction. Subsequently, the local height sequences are reassembled to generate accurate 3D features. The proposed method is applied to two large-scale roadside benchmarks, DAIR-V2X-I and Rope3D. Extensive experiments are performed and the HeightFormer outperforms the state-of-the-art methods in roadside vision centric 3D object detection task.
CVJun 3, 2025
RoadFormer : Local-Global Feature Fusion for Road Surface Classification in Autonomous DrivingTianze Wang, Zhang Zhang, Chao Sun
The classification of the type of road surface (RSC) aims to utilize pavement features to identify the roughness, wet and dry conditions, and material information of the road surface. Due to its ability to effectively enhance road safety and traffic management, it has received widespread attention in recent years. In autonomous driving, accurate RSC allows vehicles to better understand the road environment, adjust driving strategies, and ensure a safer and more efficient driving experience. For a long time, vision-based RSC has been favored. However, existing visual classification methods have overlooked the exploration of fine-grained classification of pavement types (such as similar pavement textures). In this work, we propose a pure vision-based fine-grained RSC method for autonomous driving scenarios, which fuses local and global feature information through the stacking of convolutional and transformer modules. We further explore the stacking strategies of local and global feature extraction modules to find the optimal feature extraction strategy. In addition, since fine-grained tasks also face the challenge of relatively large intra-class differences and relatively small inter-class differences, we propose a Foreground-Background Module (FBM) that effectively extracts fine-grained context features of the pavement, enhancing the classification ability for complex pavements. Experiments conducted on a large-scale pavement dataset containing one million samples and a simplified dataset reorganized from this dataset achieved Top-1 classification accuracies of 92.52% and 96.50%, respectively, improving by 5.69% to 12.84% compared to SOTA methods. These results demonstrate that RoadFormer outperforms existing methods in RSC tasks, providing significant progress in improving the reliability of pavement perception in autonomous driving systems.
CVMay 8, 2025
PillarMamba: Learning Local-Global Context for Roadside Point Cloud via Hybrid State Space ModelZhang Zhang, Chao Sun, Chao Yue et al.
Serving the Intelligent Transport System (ITS) and Vehicle-to-Everything (V2X) tasks, roadside perception has received increasing attention in recent years, as it can extend the perception range of connected vehicles and improve traffic safety. However, roadside point cloud oriented 3D object detection has not been effectively explored. To some extent, the key to the performance of a point cloud detector lies in the receptive field of the network and the ability to effectively utilize the scene context. The recent emergence of Mamba, based on State Space Model (SSM), has shaken up the traditional convolution and transformers that have long been the foundational building blocks, due to its efficient global receptive field. In this work, we introduce Mamba to pillar-based roadside point cloud perception and propose a framework based on Cross-stage State-space Group (CSG), called PillarMamba. It enhances the expressiveness of the network and achieves efficient computation through cross-stage feature fusion. However, due to the limitations of scan directions, state space model faces local connection disrupted and historical relationship forgotten. To address this, we propose the Hybrid State-space Block (HSB) to obtain the local-global context of roadside point cloud. Specifically, it enhances neighborhood connections through local convolution and preserves historical memory through residual attention. The proposed method outperforms the state-of-the-art methods on the popular large scale roadside benchmark: DAIR-V2X-I. The code will be released soon.
CROct 13, 2025
Secret-Protected Evolution for Differentially Private Synthetic Text GenerationTianze Wang, Zhaoyu Chen, Jian Du et al.
Text data has become extremely valuable on large language models (LLMs) and even lead to general artificial intelligence (AGI). A lot of high-quality text in the real world is private and cannot be freely used due to privacy concerns. Therefore, differentially private (DP) synthetic text generation has been proposed, aiming to produce high-utility synthetic data while protecting sensitive information. However, existing DP synthetic text generation imposes uniform guarantees that often overprotect non-sensitive content, resulting in substantial utility loss and computational overhead. Therefore, we propose Secret-Protected Evolution (SecPE), a novel framework that extends private evolution with secret-aware protection. Theoretically, we show that SecPE satisfies $(\mathrm{p}, \mathrm{r})$-secret protection, constituting a relaxation of Gaussian DP that enables tighter utility-privacy trade-offs, while also substantially reducing computational complexity relative to baseline methods. Empirically, across the OpenReview, PubMed, and Yelp benchmarks, SecPE consistently achieves lower Fréchet Inception Distance (FID) and higher downstream task accuracy than GDP-based Aug-PE baselines, while requiring less noise to attain the same level of protection. Our results highlight that secret-aware guarantees can unlock more practical and effective privacy-preserving synthetic text generation.
LGOct 2, 2025
Pool Me Wisely: On the Effect of Pooling in Transformer-Based ModelsSofiane Ennadir, Levente Zólyomi, Oleg Smirnov et al.
Transformer models have become the dominant backbone for sequence modeling, leveraging self-attention to produce contextualized token representations. These are typically aggregated into fixed-size vectors via pooling operations for downstream tasks. While much of the literature has focused on attention mechanisms, the role of pooling remains underexplored despite its critical impact on model behavior. In this paper, we introduce a theoretical framework that rigorously characterizes the expressivity of Transformer-based models equipped with widely used pooling methods by deriving closed-form bounds on their representational capacity and the ability to distinguish similar inputs. Our analysis extends to different variations of attention formulations, demonstrating that these bounds hold across diverse architectural variants. We empirically evaluate pooling strategies across tasks requiring both global and local contextual understanding, spanning three major modalities: computer vision, natural language processing, and time-series analysis. Results reveal consistent trends in how pooling choices affect accuracy, sensitivity, and optimization behavior. Our findings unify theoretical and empirical perspectives, providing practical guidance for selecting or designing pooling mechanisms suited to specific tasks. This work positions pooling as a key architectural component in Transformer models and lays the foundation for more principled model design beyond attention alone.
CVAug 2, 2025
RoadMamba: A Dual Branch Visual State Space Model for Road Surface ClassificationTianze Wang, Zhang Zhang, Chao Yue et al.
Acquiring the road surface conditions in advance based on visual technologies provides effective information for the planning and control system of autonomous vehicles, thus improving the safety and driving comfort of the vehicles. Recently, the Mamba architecture based on state-space models has shown remarkable performance in visual processing tasks, benefiting from the efficient global receptive field. However, existing Mamba architectures struggle to achieve state-of-the-art visual road surface classification due to their lack of effective extraction of the local texture of the road surface. In this paper, we explore for the first time the potential of visual Mamba architectures for road surface classification task and propose a method that effectively combines local and global perception, called RoadMamba. Specifically, we utilize the Dual State Space Model (DualSSM) to effectively extract the global semantics and local texture of the road surface and decode and fuse the dual features through the Dual Attention Fusion (DAF). In addition, we propose a dual auxiliary loss to explicitly constrain dual branches, preventing the network from relying only on global semantic information from the deep large receptive field and ignoring the local texture. The proposed RoadMamba achieves the state-of-the-art performance in experiments on a large-scale road surface classification dataset containing 1 million samples.
DCJun 14, 2025
Efficient Unified Caching for Accelerating Heterogeneous AI WorkloadsTianze Wang, Yifei Liu, Chen Chen et al.
Modern AI clusters, which host diverse workloads like data pre-processing, training and inference, often store the large-volume data in cloud storage and employ caching frameworks to facilitate remote data access. To avoid code-intrusion complexity and minimize cache space wastage, it is desirable to maintain a unified cache shared by all the workloads. However, existing cache management strategies, designed for specific workloads, struggle to handle the heterogeneous AI workloads in a cluster -- which usually exhibit heterogeneous access patterns and item storage granularities. In this paper, we propose IGTCache, a unified, high-efficacy cache for modern AI clusters. IGTCache leverages a hierarchical access abstraction, AccessStreamTree, to organize the recent data accesses in a tree structure, facilitating access pattern detection at various granularities. Using this abstraction, IGTCache applies hypothesis testing to categorize data access patterns as sequential, random, or skewed. Based on these detected access patterns and granularities, IGTCache tailors optimal cache management strategies including prefetching, eviction, and space allocation accordingly. Experimental results show that IGTCache increases the cache hit ratio by 55.6% over state-of-the-art caching frameworks, reducing the overall job completion time by 52.2%.
LGFeb 7, 2025
Unified Approaches in Self-Supervised Event Stream Modeling: Progress and ProspectsLevente Zólyomi, Tianze Wang, Sofiane Ennadir et al.
The proliferation of digital interactions across diverse domains, such as healthcare, e-commerce, gaming, and finance, has resulted in the generation of vast volumes of event stream (ES) data. ES data comprises continuous sequences of timestamped events that encapsulate detailed contextual information relevant to each domain. While ES data holds significant potential for extracting actionable insights and enhancing decision-making, its effective utilization is hindered by challenges such as the scarcity of labeled data and the fragmented nature of existing research efforts. Self-Supervised Learning (SSL) has emerged as a promising paradigm to address these challenges by enabling the extraction of meaningful representations from unlabeled ES data. In this survey, we systematically review and synthesize SSL methodologies tailored for ES modeling across multiple domains, bridging the gaps between domain-specific approaches that have traditionally operated in isolation. We present a comprehensive taxonomy of SSL techniques, encompassing both predictive and contrastive paradigms, and analyze their applicability and effectiveness within different application contexts. Furthermore, we identify critical gaps in current research and propose a future research agenda aimed at developing scalable, domain-agnostic SSL frameworks for ES modeling. By unifying disparate research efforts and highlighting cross-domain synergies, this survey aims to accelerate innovation, improve reproducibility, and expand the applicability of SSL to diverse real-world ES challenges.
LGOct 24, 2024
Understanding Players as if They Are Talking to the Game in a Customized Language: A Pilot StudyTianze Wang, Maryam Honari-Jahromi, Styliani Katsarou et al.
This pilot study explores the application of language models (LMs) to model game event sequences, treating them as a customized natural language. We investigate a popular mobile game, transforming raw event data into textual sequences and pretraining a Longformer model on this data. Our approach captures the rich and nuanced interactions within game sessions, effectively identifying meaningful player segments. The results demonstrate the potential of self-supervised LMs in enhancing game design and personalization without relying on ground-truth labels.
LGJan 21, 2022
Accelerate Model Parallel Training by Using Efficient Graph Traversal Order in Device PlacementTianze Wang, Amir H. Payberah, Desta Haileselassie Hagos et al.
Modern neural networks require long training to reach decent performance on massive datasets. One common approach to speed up training is model parallelization, where large neural networks are split across multiple devices. However, different device placements of the same neural network lead to different training times. Most of the existing device placement solutions treat the problem as sequential decision-making by traversing neural network graphs and assigning their neurons to different devices. This work studies the impact of graph traversal order on device placement. In particular, we empirically study how different graph traversal order leads to different device placement, which in turn affects the training execution time. Our experiment results show that the best graph traversal order depends on the type of neural networks and their computation graphs features. In this work, we also provide recommendations on choosing graph traversal order in device placement for various neural network families to improve the training time in model parallelization.