Xu Liu

CV
h-index98
143papers
4,700citations
Novelty44%
AI Score59

143 Papers

LGJun 14, 2023Code
LargeST: A Benchmark Dataset for Large-Scale Traffic Forecasting

Xu Liu, Yutong Xia, Yuxuan Liang et al.

Road traffic forecasting plays a critical role in smart city initiatives and has experienced significant advancements thanks to the power of deep learning in capturing non-linear patterns of traffic data. However, the promising results achieved on current public datasets may not be applicable to practical scenarios due to limitations within these datasets. First, the limited sizes of them may not reflect the real-world scale of traffic networks. Second, the temporal coverage of these datasets is typically short, posing hurdles in studying long-term patterns and acquiring sufficient samples for training deep models. Third, these datasets often lack adequate metadata for sensors, which compromises the reliability and interpretability of the data. To mitigate these limitations, we introduce the LargeST benchmark dataset. It encompasses a total number of 8,600 sensors in California with a 5-year time coverage and includes comprehensive metadata. Using LargeST, we perform in-depth data analysis to extract data insights, benchmark well-known baselines in terms of their performance and efficiency, and identify challenges as well as opportunities for future research. We release the datasets and baseline implementations at: https://github.com/liuxu77/LargeST.

CVDec 6, 2022Code
Event-based Monocular Dense Depth Estimation with Recurrent Transformers

Xu Liu, Jianing Li, Xiaopeng Fan et al.

Event cameras, offering high temporal resolutions and high dynamic ranges, have brought a new perspective to address common challenges (e.g., motion blur and low light) in monocular depth estimation. However, how to effectively exploit the sparse spatial information and rich temporal cues from asynchronous events remains a challenging endeavor. To this end, we propose a novel event-based monocular depth estimator with recurrent transformers, namely EReFormer, which is the first pure transformer with a recursive mechanism to process continuous event streams. Technically, for spatial modeling, a novel transformer-based encoder-decoder with a spatial transformer fusion module is presented, having better global context information modeling capabilities than CNN-based methods. For temporal modeling, we design a gate recurrent vision transformer unit that introduces a recursive mechanism into transformers, improving temporal modeling capabilities while alleviating the expensive GPU memory cost. The experimental results show that our EReFormer outperforms state-of-the-art methods by a margin on both synthetic and real-world datasets. We hope that our work will attract further research to develop stunning transformers in the event-based vision community. Our open-source code can be found in the supplemental material.

LGApr 27, 2023Code
TorchBench: Benchmarking PyTorch with High API Surface Coverage

Yueming Hao, Xu Zhao, Bin Bao et al.

Deep learning (DL) has been a revolutionary technique in various domains. To facilitate the model development and deployment, many deep learning frameworks are proposed, among which PyTorch is one of the most popular solutions. The performance of ecosystem around PyTorch is critically important, which saves the costs of training models and reduces the response time of model inferences. In this paper, we propose TorchBench, a novel benchmark suite to study the performance of PyTorch software stack. Unlike existing benchmark suites, TorchBench encloses many representative models, covering a large PyTorch API surface. TorchBench is able to comprehensively characterize the performance of the PyTorch software stack, guiding the performance optimization across models, PyTorch framework, and GPU libraries. We show two practical use cases of TorchBench. (1) We profile TorchBench to identify GPU performance inefficiencies in PyTorch. We are able to optimize many performance bugs and upstream patches to the official PyTorch repository. (2) We integrate TorchBench into PyTorch continuous integration system. We are able to identify performance regression in multiple daily code checkins to prevent PyTorch repository from introducing performance bugs. TorchBench is open source and keeps evolving.

99.8ROApr 13Code
RoboCOIN: An Open-Sourced Bimanual Robotic Data Collection for Integrated Manipulation

Shihan Wu, Xuecheng Liu, Shaoxuan Xie et al.

Despite the critical role of bimanual manipulation in endowing robots with human-like dexterity, large-scale and diverse datasets remain scarce due to the significant hardware heterogeneity across bimanual robotic platforms. To bridge this gap, we introduce RoboCOIN, a large-scale multi-embodiment bimanual manipulation dataset comprising over 180,000 demonstrations collected from 15 distinct robotic platforms. Spanning 16 diverse environments-including residential, commercial, and industrial settings-the dataset features 421 bimanual tasks systematically categorized by 39 bimanual collaboration actions and 432 objects. A key innovation of our work is the hierarchical capability pyramid, which provides granular annotations ranging from trajectory-level concepts to segment-level subtasks and frame-level kinematics. Furthermore, we present CoRobot, an efficient data processing pipeline powered by the Robot Trajectory Markup Language (RTML), designed to facilitate quality assessment, automated annotation, and unified multi-embodiment and data management. Extensive experiments demonstrate the effectiveness of RoboCOIN in enhancing the performance of various bimanual manipulation models across a wide spectrum of robotic embodiments. The entire dataset and codebase are fully open-sourced, providing a valuable resource for advancing research in bimanual and multi-embodiment manipulation.

CLJul 25, 2023
Evaluating Large Language Models for Radiology Natural Language Processing

Zhengliang Liu, Tianyang Zhong, Yiwei Li et al.

The rise of large language models (LLMs) has marked a pivotal shift in the field of natural language processing (NLP). LLMs have revolutionized a multitude of domains, and they have made a significant impact in the medical field. Large language models are now more abundant than ever, and many of these models exhibit bilingual capabilities, proficient in both English and Chinese. However, a comprehensive evaluation of these models remains to be conducted. This lack of assessment is especially apparent within the context of radiology NLP. This study seeks to bridge this gap by critically evaluating thirty two LLMs in interpreting radiology reports, a crucial component of radiology NLP. Specifically, the ability to derive impressions from radiologic findings is assessed. The outcomes of this evaluation provide key insights into the performance, strengths, and weaknesses of these LLMs, informing their practical applications within the medical domain.

50.4CVMay 28
Turbulence-Robust Dynamic Object Segmentation with Multi-Signal Priors and SAM2 Refinement

Bolian Peng, Ying Tang, Xu Liu et al.

This technical report presents our solution for the CVPR 2026 UG2+ Challenge Track 3: Dynamic Object Segmentation in Turbulence (DOST). We design a training-free multi-signal segmentation pipeline that combines pretrained motion estimation, self-supervised semantic priors, background anomaly modeling, manually calibrated proposal fusion, and SAM2-based mask refinement. The method uses RAFT for dense motion responses, DINOv2 for semantic objectness priors, ViBe for training-free background modeling, and pretrained SAM2 for box-prompt mask refinement. Instead of optimizing an end-to-end segmentation network, our system operates entirely in inference mode. This design is suitable for the DOST setting, where severe atmospheric turbulence produces pseudo-motion, blur, and intermittent target visibility, making a single motion cue unreliable. The final submitted masks are evaluated by the official leaderboard, which reports 0.425041 mIoU and 0.457206 mDice. Since no task-specific model training or fine-tuning is performed, stronger learned temporal association, adaptive proposal selection, or task-specific adaptation may further improve the system.

CVMar 24, 2022
Transformers Meet Visual Learning Understanding: A Comprehensive Review

Yuting Yang, Licheng Jiao, Xu Liu et al.

Dynamic attention mechanism and global modeling ability make Transformer show strong feature learning ability. In recent years, Transformer has become comparable to CNNs methods in computer vision. This review mainly investigates the current research progress of Transformer in image and video applications, which makes a comprehensive overview of Transformer in visual learning understanding. First, the attention mechanism is reviewed, which plays an essential part in Transformer. And then, the visual Transformer model and the principle of each module are introduced. Thirdly, the existing Transformer-based models are investigated, and their performance is compared in visual learning understanding applications. Three image tasks and two video tasks of computer vision are investigated. The former mainly includes image classification, object detection, and image segmentation. The latter contains object tracking and video classification. It is significant for comparing different models' performance in various tasks on several public benchmark data sets. Finally, ten general problems are summarized, and the developing prospects of the visual Transformer are given in this review.

CVDec 30, 2022
Delving into Semantic Scale Imbalance

Yanbiao Ma, Licheng Jiao, Fang Liu et al.

Model bias triggered by long-tailed data has been widely studied. However, measure based on the number of samples cannot explicate three phenomena simultaneously: (1) Given enough data, the classification performance gain is marginal with additional samples. (2) Classification performance decays precipitously as the number of training samples decreases when there is insufficient data. (3) Model trained on sample-balanced datasets still has different biases for different classes. In this work, we define and quantify the semantic scale of classes, which is used to measure the feature diversity of classes. It is exciting to find experimentally that there is a marginal effect of semantic scale, which perfectly describes the first two phenomena. Further, the quantitative measurement of semantic scale imbalance is proposed, which can accurately reflect model bias on multiple datasets, even on sample-balanced data, revealing a novel perspective for the study of class imbalance. Due to the prevalence of semantic scale imbalance, we propose semantic-scale-balanced learning, including a general loss improvement scheme and a dynamic re-weighting training framework that overcomes the challenge of calculating semantic scales in real-time during iterations. Comprehensive experiments show that dynamic semantic-scale-balanced learning consistently enables the model to perform superiorly on large-scale long-tailed and non-long-tailed natural and medical datasets, which is a good starting point for mitigating the prevalent but unnoticed model bias.

LGSep 23, 2023
Deciphering Spatio-Temporal Graph Forecasting: A Causal Lens and Treatment

Yutong Xia, Yuxuan Liang, Haomin Wen et al.

Spatio-Temporal Graph (STG) forecasting is a fundamental task in many real-world applications. Spatio-Temporal Graph Neural Networks have emerged as the most popular method for STG forecasting, but they often struggle with temporal out-of-distribution (OoD) issues and dynamic spatial causation. In this paper, we propose a novel framework called CaST to tackle these two challenges via causal treatments. Concretely, leveraging a causal lens, we first build a structural causal model to decipher the data generation process of STGs. To handle the temporal OoD issue, we employ the back-door adjustment by a novel disentanglement block to separate invariant parts and temporal environments from input data. Moreover, we utilize the front-door adjustment and adopt the Hodge-Laplacian operator for edge-level convolution to model the ripple effect of causation. Experiments results on three real-world datasets demonstrate the effectiveness and practicality of CaST, which consistently outperforms existing methods with good interpretability.

LGMay 14, 2022
Bayesian Physics-Informed Extreme Learning Machine for Forward and Inverse PDE Problems with Noisy Data

Xu Liu, Wen Yao, Wei Peng et al.

Physics-informed extreme learning machine (PIELM) has recently received significant attention as a rapid version of physics-informed neural network (PINN) for solving partial differential equations (PDEs). The key characteristic is to fix the input layer weights with random values and use Moore-Penrose generalized inverse for the output layer weights. The framework is effective, but it easily suffers from overfitting noisy data and lacks uncertainty quantification for the solution under noise scenarios.To this end, we develop the Bayesian physics-informed extreme learning machine (BPIELM) to solve both forward and inverse linear PDE problems with noisy data in a unified framework. In our framework, a prior probability distribution is introduced in the output layer for extreme learning machine with physic laws and the Bayesian method is used to estimate the posterior of parameters. Besides, for inverse PDE problems, problem parameters considered as new output layer weights are unified in a framework with forward PDE problems. Finally, we demonstrate BPIELM considering both forward problems, including Poisson, advection, and diffusion equations, as well as inverse problems, where unknown problem parameters are estimated. The results show that, compared with PIELM, BPIELM quantifies uncertainty arising from noisy data and provides more accurate predictions. In addition, BPIELM is considerably cheaper than PINN in terms of the computational cost.

78.1AIMay 27
MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation

Haitian Li, Yanghao Zhou, Heyan Huang et al.

In recent years, Multi-Talker Audio-Video Generation (MTAVG) models have shown promising performance on fundamental metrics such as lip-sync and audio-visual alignment. However, these metrics remain insufficient for assessing cinematic expressiveness in scene-level generation. In multi-character scenes, generation models must go beyond audio-visual realism to convey coherent character performance and other higher-level cinematic qualities. To fill this gap, we introduce MTAVG-Bench 2.0, a benchmark for diagnosing failure modes of cinematic expressiveness in multi-talker audio-video generation. Unlike prior settings that mainly focus on the quality of basic multi-turn dialogue, MTAVG-Bench 2.0 targets short-drama and scene-level generation, and establishes a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language. Based on this taxonomy, we construct more than 10,000 question-answering evaluation instances, together with subsets for short-drama-level assessment and temporal localization of failure modes, to systematically evaluate the ability of omni large language models to diagnose high-level audio-visual failures. Experimental results show that commercial omni models such as Gemini substantially outperform other evaluators, yet even the strongest models continue to struggle with complex failures in our benchmark. These results demonstrate that MTAVG-Bench 2.0 provides a systematic benchmark for failure diagnosis in cinematic multi-talker audio-video generation.

CVSep 23, 2024
AIM 2024 Sparse Neural Rendering Challenge: Methods and Results

Michal Nazarczuk, Sibi Catley-Chandar, Thomas Tanay et al.

This paper reviews the challenge on Sparse Neural Rendering that was part of the Advances in Image Manipulation (AIM) workshop, held in conjunction with ECCV 2024. This manuscript focuses on the competition set-up, the proposed methods and their respective results. The challenge aims at producing novel camera view synthesis of diverse scenes from sparse image observations. It is composed of two tracks, with differing levels of sparsity; 3 views in Track 1 (very sparse) and 9 views in Track 2 (sparse). Participants are asked to optimise objective fidelity to the ground-truth images as measured via the Peak Signal-to-Noise Ratio (PSNR) metric. For both tracks, we use the newly introduced Sparse Rendering (SpaRe) dataset and the popular DTU MVS dataset. In this challenge, 5 teams submitted final results to Track 1 and 4 teams submitted final results to Track 2. The submitted models are varied and push the boundaries of the current state-of-the-art in sparse neural rendering. A detailed description of all models developed in the challenge is provided in this paper.

CLMar 27, 2023
Coupling Artificial Neurons in BERT and Biological Neurons in the Human Brain

Xu Liu, Mengyue Zhou, Gaosheng Shi et al.

Linking computational natural language processing (NLP) models and neural responses to language in the human brain on the one hand facilitates the effort towards disentangling the neural representations underpinning language perception, on the other hand provides neurolinguistics evidence to evaluate and improve NLP models. Mappings of an NLP model's representations of and the brain activities evoked by linguistic input are typically deployed to reveal this symbiosis. However, two critical problems limit its advancement: 1) The model's representations (artificial neurons, ANs) rely on layer-level embeddings and thus lack fine-granularity; 2) The brain activities (biological neurons, BNs) are limited to neural recordings of isolated cortical unit (i.e., voxel/region) and thus lack integrations and interactions among brain functions. To address those problems, in this study, we 1) define ANs with fine-granularity in transformer-based NLP models (BERT in this study) and measure their temporal activations to input text sequences; 2) define BNs as functional brain networks (FBNs) extracted from functional magnetic resonance imaging (fMRI) data to capture functional interactions in the brain; 3) couple ANs and BNs by maximizing the synchronization of their temporal activations. Our experimental results demonstrate 1) The activations of ANs and BNs are significantly synchronized; 2) the ANs carry meaningful linguistic/semantic information and anchor to their BN signatures; 3) the anchored BNs are interpretable in a neurolinguistic context. Overall, our study introduces a novel, general, and effective framework to link transformer-based NLP models and neural activities in response to language and may provide novel insights for future studies such as brain-inspired evaluation and development of NLP models.

LGNov 27, 2022
Deep Active Learning for Computer Vision: Past and Future

Rinyoichi Takezoe, Xu Liu, Shunan Mao et al.

As an important data selection schema, active learning emerges as the essential component when iterating an Artificial Intelligence (AI) model. It becomes even more critical given the dominance of deep neural network based models, which are composed of a large number of parameters and data hungry, in application. Despite its indispensable role for developing AI models, research on active learning is not as intensive as other research directions. In this paper, we present a review of active learning through deep active learning approaches from the following perspectives: 1) technical advancements in active learning, 2) applications of active learning in computer vision, 3) industrial systems leveraging or with potential to leverage active learning for data iteration, 4) current limitations and future research directions. We expect this paper to clarify the significance of active learning in a modern AI model manufacturing process and to bring additional research attention to active learning. By addressing data automation challenges and coping with automated machine learning systems, active learning will facilitate democratization of AI technologies by boosting model production at scale.

CVDec 2, 2022
DWRSeg: Rethinking Efficient Acquisition of Multi-scale Contextual Information for Real-time Semantic Segmentation

Haoran Wei, Xu Liu, Shouchun Xu et al.

Many current works directly adopt multi-rate depth-wise dilated convolutions to capture multi-scale contextual information simultaneously from one input feature map, thus improving the feature extraction efficiency for real-time semantic segmentation. However, this design may lead to difficult access to multi-scale contextual information because of the unreasonable structure and hyperparameters. To lower the difficulty of drawing multi-scale contextual information, we propose a highly efficient multi-scale feature extraction method, which decomposes the original single-step method into two steps, Region Residualization-Semantic Residualization. In this method, the multi-rate depth-wise dilated convolutions take a simpler role in feature extraction: performing simple semantic-based morphological filtering with one desired receptive field in the second step based on each concise feature map of region form provided by the first step, to improve their efficiency. Moreover, the dilation rates and the capacity of dilated convolutions for each network stage are elaborated to fully utilize all the feature maps of region form that can be achieved.Accordingly, we design a novel Dilation-wise Residual (DWR) module and a Simple Inverted Residual (SIR) module for the high and low level network, respectively, and form a powerful DWR Segmentation (DWRSeg) network. Extensive experiments on the Cityscapes and CamVid datasets demonstrate the effectiveness of our method by achieving a state-of-the-art trade-off between accuracy and inference speed, in addition to being lighter weight. Without pretraining or resorting to any training trick, we achieve an mIoU of 72.7% on the Cityscapes test set at a speed of 319.5 FPS on one NVIDIA GeForce GTX 1080 Ti card, which exceeds the latest methods of a speed of 69.5 FPS and 0.8% mIoU. The code and trained models are publicly available.

LGOct 15, 2023
UniTime: A Language-Empowered Unified Model for Cross-Domain Time Series Forecasting

Xu Liu, Junfeng Hu, Yuan Li et al.

Multivariate time series forecasting plays a pivotal role in contemporary web technologies. In contrast to conventional methods that involve creating dedicated models for specific time series application domains, this research advocates for a unified model paradigm that transcends domain boundaries. However, learning an effective cross-domain model presents the following challenges. First, various domains exhibit disparities in data characteristics, e.g., the number of variables, posing hurdles for existing models that impose inflexible constraints on these factors. Second, the model may encounter difficulties in distinguishing data from various domains, leading to suboptimal performance in our assessments. Third, the diverse convergence rates of time series domains can also result in compromised empirical performance. To address these issues, we propose UniTime for effective cross-domain time series learning. Concretely, UniTime can flexibly adapt to data with varying characteristics. It also uses domain instructions and a Language-TS Transformer to offer identification information and align two modalities. In addition, UniTime employs masking to alleviate domain convergence speed imbalance issues. Our extensive experiments demonstrate the effectiveness of UniTime in advancing state-of-the-art forecasting performance and zero-shot transferability.

CVMar 22, 2023
Predicting and Enhancing the Fairness of DNNs with the Curvature of Perceptual Manifolds

Yanbiao Ma, Licheng Jiao, Fang Liu et al.

To address the challenges of long-tailed classification, researchers have proposed several approaches to reduce model bias, most of which assume that classes with few samples are weak classes. However, recent studies have shown that tail classes are not always hard to learn, and model bias has been observed on sample-balanced datasets, suggesting the existence of other factors that affect model bias. In this work, we first establish a geometric perspective for analyzing model fairness and then systematically propose a series of geometric measurements for perceptual manifolds in deep neural networks. Subsequently, we comprehensively explore the effect of the geometric characteristics of perceptual manifolds on classification difficulty and how learning shapes the geometric characteristics of perceptual manifolds. An unanticipated finding is that the correlation between the class accuracy and the separation degree of perceptual manifolds gradually decreases during training, while the negative correlation with the curvature gradually increases, implying that curvature imbalance leads to model bias.Building upon these observations, we propose curvature regularization to facilitate the model to learn curvature-balanced and flatter perceptual manifolds. Evaluations on multiple long-tailed and non-long-tailed datasets show the excellent performance and exciting generality of our approach, especially in achieving significant performance improvements based on current state-of-the-art techniques. Our work opens up a geometric analysis perspective on model bias and reminds researchers to pay attention to model bias on non-long-tailed and even sample-balanced datasets.

CVFeb 26Code
Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache

Bowen Cui, Yuanbin Wang, Huajiang Xu et al.

Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or predicting features across timesteps. However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. DPCache constructs a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps conditioned on the preceding key timestep. Leveraging this tensor, DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity. During inference, the model performs full computations only at these key timesteps, while intermediate outputs are efficiently predicted using cached features. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss, outperforming prior acceleration methods by $+$0.031 ImageReward at 4.87$\times$ speedup and even surpassing the full-step baseline by $+$0.028 ImageReward at 3.54$\times$ speedup on FLUX, validating the effectiveness of our path-aware global scheduling framework. Code will be released at https://github.com/argsss/DPCache.

CVJul 29, 2022
Deep Learning-based Occluded Person Re-identification: A Survey

Yunjie Peng, Saihui Hou, Chunshui Cao et al.

Occluded person re-identification (Re-ID) aims at addressing the occlusion problem when retrieving the person of interest across multiple cameras. With the promotion of deep learning technology and the increasing demand for intelligent video surveillance, the frequent occlusion in real-world applications has made occluded person Re-ID draw considerable interest from researchers. A large number of occluded person Re-ID methods have been proposed while there are few surveys that focus on occlusion. To fill this gap and help boost future research, this paper provides a systematic survey of occluded person Re-ID. Through an in-depth analysis of the occlusion in person Re-ID, most existing methods are found to only consider part of the problems brought by occlusion. Therefore, we review occlusion-related person Re-ID methods from the perspective of issues and solutions. We summarize four issues caused by occlusion in person Re-ID, i.e., position misalignment, scale misalignment, noisy information, and missing information. The occlusion-related methods addressing different issues are then categorized and introduced accordingly. After that, we summarize and compare the performance of recent occluded person Re-ID methods on four popular datasets: Partial-ReID, Partial-iLIDS, Occluded-ReID, and Occluded-DukeMTMC. Finally, we provide insights on promising future research directions.

LGJan 30, 2023
Do We Really Need Graph Neural Networks for Traffic Forecasting?

Xu Liu, Yuxuan Liang, Chao Huang et al.

Spatio-temporal graph neural networks (STGNN) have become the most popular solution to traffic forecasting. While successful, they rely on the message passing scheme of GNNs to establish spatial dependencies between nodes, and thus inevitably inherit GNNs' notorious inefficiency. Given these facts, in this paper, we propose an embarrassingly simple yet remarkably effective spatio-temporal learning approach, entitled SimST. Specifically, SimST approximates the efficacies of GNNs by two spatial learning techniques, which respectively model local and global spatial correlations. Moreover, SimST can be used alongside various temporal models and involves a tailored training strategy. We conduct experiments on five traffic benchmarks to assess the capability of SimST in terms of efficiency and effectiveness. Empirical results show that SimST improves the prediction throughput by up to 39 times compared to more sophisticated STGNNs while attaining comparable performance, which indicates that GNNs are not the only option for spatial modeling in traffic forecasting.

NEFeb 6, 2023
Bi-level Multi-objective Evolutionary Learning: A Case Study on Multi-task Graph Neural Topology Search

Chao Wang, Licheng Jiao, Jiaxuan Zhao et al.

The construction of machine learning models involves many bi-level multi-objective optimization problems (BL-MOPs), where upper level (UL) candidate solutions must be evaluated via training weights of a model in the lower level (LL). Due to the Pareto optimality of sub-problems and the complex dependency across UL solutions and LL weights, an UL solution is feasible if and only if the LL weight is Pareto optimal. It is computationally expensive to determine which LL Pareto weight in the LL Pareto weight set is the most appropriate for each UL solution. This paper proposes a bi-level multi-objective learning framework (BLMOL), coupling the above decision-making process with the optimization process of the UL-MOP by introducing LL preference $r$. Specifically, the UL variable and $r$ are simultaneously searched to minimize multiple UL objectives by evolutionary multi-objective algorithms. The LL weight with respect to $r$ is trained to minimize multiple LL objectives via gradient-based preference multi-objective algorithms. In addition, the preference surrogate model is constructed to replace the expensive evaluation process of the UL-MOP. We consider a novel case study on multi-task graph neural topology search. It aims to find a set of Pareto topologies and their Pareto weights, representing different trade-offs across tasks at UL and LL, respectively. The found graph neural network is employed to solve multiple tasks simultaneously, including graph classification, node classification, and link prediction. Experimental results demonstrate that BLMOL can outperform some state-of-the-art algorithms and generate well-representative UL solutions and LL weights.

50.0CVMay 30
CASTLE2026 Team WDL Technical Report

Zhengyang Li, Zhenglin Du, Yi Wen et al.

The CASTLE Challenge @ EgoVis 2026 evaluates long-form egocentric video question answering over 600+ hours of multi-perspective recordings. Each four-choice question requires evidence from videos, transcripts, auxiliary photos, people, days, rooms, and temporal context. We propose an evidence-aware multimodal reasoning pipeline based on Qwen. Our system parses question hints, retrieves ASR chunks, attaches auxiliary images, samples candidate video frames, and routes questions into static visual, speech/text, temporal, and mixed types with specialized prompts. Multiple inference passes are aggregated by confidence-weighted voting and converted into the official Codabench format. In ablation, LoRA improves the score from 0.21 to 0.50, and more sampled frames further raise it to 0.58. Our final system ranks first in the CASTLE Challenge @ EgoVis 2026.

LGOct 16, 2023
Orthogonal Uncertainty Representation of Data Manifold for Robust Long-Tailed Learning

Yanbiao Ma, Licheng Jiao, Fang Liu et al.

In scenarios with long-tailed distributions, the model's ability to identify tail classes is limited due to the under-representation of tail samples. Class rebalancing, information augmentation, and other techniques have been proposed to facilitate models to learn the potential distribution of tail classes. The disadvantage is that these methods generally pursue models with balanced class accuracy on the data manifold, while ignoring the ability of the model to resist interference. By constructing noisy data manifold, we found that the robustness of models trained on unbalanced data has a long-tail phenomenon. That is, even if the class accuracy is balanced on the data domain, it still has bias on the noisy data manifold. However, existing methods cannot effectively mitigate the above phenomenon, which makes the model vulnerable in long-tailed scenarios. In this work, we propose an Orthogonal Uncertainty Representation (OUR) of feature embedding and an end-to-end training strategy to improve the long-tail phenomenon of model robustness. As a general enhancement tool, OUR has excellent compatibility with other methods and does not require additional data generation, ensuring fast and efficient training. Comprehensive evaluations on long-tailed datasets show that our method significantly improves the long-tail phenomenon of robustness, bringing consistent performance gains to other long-tailed learning methods.

CVAug 22, 2023
Free Lunch for Gait Recognition: A Novel Relation Descriptor

Jilong Wang, Saihui Hou, Yan Huang et al.

Gait recognition is to seek correct matches for query individuals by their unique walking patterns. However, current methods focus solely on extracting individual-specific features, overlooking ``interpersonal" relationships. In this paper, we propose a novel $\textbf{Relation Descriptor}$ that captures not only individual features but also relations between test gaits and pre-selected gait anchors. Specifically, we reinterpret classifier weights as gait anchors and compute similarity scores between test features and these anchors, which re-expresses individual gait features into a similarity relation distribution. In essence, the relation descriptor offers a holistic perspective that leverages the collective knowledge stored within the classifier's weights, emphasizing meaningful patterns and enhancing robustness. Despite its potential, relation descriptor poses dimensionality challenges since its dimension depends on the training set's identity count. To address this, we propose Farthest gait-Anchor Selection to identify the most discriminative gait anchors and an Orthogonal Regularization Loss to increase diversity within gait anchors. Compared to individual-specific features extracted from the backbone, our relation descriptor can boost the performance nearly without any extra costs. We evaluate the effectiveness of our method on the popular GREW, Gait3D, OU-MVLP, CASIA-B, and CCPG, showing that our method consistently outperforms the baselines and achieves state-of-the-art performance.

SYJun 15, 2022
Reliability Analysis of Complex Multi-State System Based on Universal Generating Function and Bayesian Network

Xu Liu, Wen Yao, Xiaohu Zheng et al.

In the complex multi-state system (MSS), reliability analysis is a significant research content, both for equipment design, manufacturing, usage and maintenance. Universal Generating Function (UGF) is an important method in the reliability analysis, which efficiently obtains the system reliability by a fast algebraic procedure. However, when structural relationships between subsystems or components are not clear or without explicit expressions, the UGF method is difficult to use or not applicable at all. Bayesian Network (BN) has a natural advantage in terms of uncertainty inference for the relationship without explicit expressions. For the number of components is extremely large, though, it has the defects of low efficiency. To overcome the respective defects of UGF and BN, a novel reliability analysis method called UGF-BN is proposed for the complex MSS. In the UGF-BN framework, the UGF method is firstly used to analyze the bottom components with a large number. Then probability distributions obtained are taken as the input of BN. Finally, the reliability of the complex MSS is modeled by the BN method. This proposed method improves the computational efficiency, especially for the MSS with the large number of bottom components. Besides, the aircraft reliability-based design optimization based on the UGF-BN method is further studied with budget constraints on mass, power, and cost. Finally, two cases are used to demonstrate and verify the proposed method.

LGNov 10, 2025Code
A Closer Look at Knowledge Distillation in Spiking Neural Network Training

Xu Liu, Na Xia, Jinxing Zhou et al.

Spiking Neural Networks (SNNs) become popular due to excellent energy efficiency, yet facing challenges for effective model training. Recent works improve this by introducing knowledge distillation (KD) techniques, with the pre-trained artificial neural networks (ANNs) used as teachers and the target SNNs as students. This is commonly accomplished through a straightforward element-wise alignment of intermediate features and prediction logits from ANNs and SNNs, often neglecting the intrinsic differences between their architectures. Specifically, ANN's outputs exhibit a continuous distribution, whereas SNN's outputs are characterized by sparsity and discreteness. To mitigate this issue, we introduce two innovative KD strategies. Firstly, we propose the Saliency-scaled Activation Map Distillation (SAMD), which aligns the spike activation map of the student SNN with the class-aware activation map of the teacher ANN. Rather than performing KD directly on the raw %and distinct features of ANN and SNN, our SAMD directs the student to learn from saliency activation maps that exhibit greater semantic and distribution consistency. Additionally, we propose a Noise-smoothed Logits Distillation (NLD), which utilizes Gaussian noise to smooth the sparse logits of student SNN, facilitating the alignment with continuous logits from teacher ANN. Extensive experiments on multiple datasets demonstrate the effectiveness of our methods. Code is available~\footnote{https://github.com/SinoLeu/CKDSNN.git}.

94.9CVApr 11Code
SinkTrack: Attention Sink based Context Anchoring for Large Language Models

Xu Liu, Guikun Chen, Wenguan Wang

Large language models (LLMs) suffer from hallucination and context forgetting. Prior studies suggest that attention drift is a primary cause of these problems, where LLMs' focus shifts towards newly generated tokens and away from the initial input context. To counteract this, we make use of a related, intrinsic characteristic of LLMs: attention sink -- the tendency to consistently allocate high attention to the very first token (i.e., <BOS>) of a sequence. Concretely, we propose an advanced context anchoring method, SinkTrack, which treats <BOS> as an information anchor and injects key contextual features (such as those derived from the input image or instruction) into its representation. As such, LLM remains anchored to the initial input context throughout the entire generation process. SinkTrack is training-free, plug-and-play, and introduces negligible inference overhead. Experiments demonstrate that SinkTrack mitigates hallucination and context forgetting across both textual (e.g., +21.6% on SQuAD2.0 with Llama3.1-8B-Instruct) and multi-modal (e.g., +22.8% on M3CoT with Qwen2.5-VL-7B-Instruct) tasks. Its consistent gains across different architectures and scales underscore the robustness and generalizability. We also analyze its underlying working mechanism from the perspective of information delivery. Our source code is available at https://github.com/67L1/SinkTrack.

CVMar 19, 2023
Unsupervised Gait Recognition with Selective Fusion

Xuqian Ren, Shaopeng Yang, Saihui Hou et al.

Previous gait recognition methods primarily trained on labeled datasets, which require painful labeling effort. However, using a pre-trained model on a new dataset without fine-tuning can lead to significant performance degradation. So to make the pre-trained gait recognition model able to be fine-tuned on unlabeled datasets, we propose a new task: Unsupervised Gait Recognition (UGR). We introduce a new cluster-based baseline to solve UGR with cluster-level contrastive learning. But we further find more challenges this task meets. First, sequences of the same person in different clothes tend to cluster separately due to the significant appearance changes. Second, sequences taken from 0° and 180° views lack walking postures and do not cluster with sequences taken from other views. To address these challenges, we propose a Selective Fusion method, which includes Selective Cluster Fusion (SCF) and Selective Sample Fusion (SSF). With SCF, we merge matched clusters of the same person wearing different clothes by updating the cluster-level memory bank with a multi-cluster update strategy. And in SSF, we merge sequences taken from front/back views gradually with curriculum learning. Extensive experiments show the effectiveness of our method in improving the rank-1 accuracy in walking with different coats condition and front/back views conditions.

CVJul 24, 2022
Progressive Feature Learning for Realistic Cloth-Changing Gait Recognition

Xuqian Ren, Saihui Hou, Chunshui Cao et al.

Gait recognition is instrumental in crime prevention and social security, for it can be conducted at a long distance to figure out the identity of persons. However, existing datasets and methods cannot satisfactorily deal with the most challenging cloth-changing problem in practice. Specifically, the practical gait models are usually trained on automatically labeled data, in which the sequences' views and cloth conditions of each person have some restrictions. To be concrete, the cross-view sub-dataset only has normal walking condition without cloth-changing, while the cross-cloth sub-dataset has cloth-changing sequences but only in front views. As a result, the cloth-changing accuracy cannot meet practical requirements. In this work, we formulate the problem as Realistic Cloth-Changing Gait Recognition (abbreviated as RCC-GR) and we construct two benchmarks: CASIA-BN-RCC and OUMVLP-RCC, to simulate the above setting. Furthermore, we propose a new framework called Progressive Feature Learning that can be applied with off-the-shelf backbones to improve their performance in RCC-GR. Specifically, in our framework, we design Progressive Mapping and Progressive Uncertainty to extract cross-view features and then extract cross-cloth features on the basis. In this way, the feature from the cross-view sub-dataset can first dominate the feature space and relieve the uneven distribution caused by the adverse effect from the cross-cloth sub-dataset. The experiments on our benchmarks show that our framework can effectively improve recognition performance, especially in the cloth-changing conditions.

CVJul 1, 2024
Fast and Efficient: Mask Neural Fields for 3D Scene Segmentation

Zihan Gao, Lingling Li, Licheng Jiao et al.

Understanding 3D scenes is a crucial challenge in computer vision research with applications spanning multiple domains. Recent advancements in distilling 2D vision-language foundation models into neural fields, like NeRF and 3DGS, enable open-vocabulary segmentation of 3D scenes from 2D multi-view images without the need for precise 3D annotations. However, while effective, these methods typically rely on the per-pixel distillation of high-dimensional CLIP features, introducing ambiguity and necessitating complex regularization strategies, which adds inefficiency during training. This paper presents MaskField, which enables efficient 3D open-vocabulary segmentation with neural fields from a novel perspective. Unlike previous methods, MaskField decomposes the distillation of mask and semantic features from foundation models by formulating a mask feature field and queries. MaskField overcomes ambiguous object boundaries by naturally introducing SAM segmented object shapes without extra regularization during training. By circumventing the direct handling of dense high-dimensional CLIP features during training, MaskField is particularly compatible with explicit scene representations like 3DGS. Our extensive experiments show that MaskField not only surpasses prior state-of-the-art methods but also achieves remarkably fast convergence. We hope that MaskField will inspire further exploration into how neural fields can be trained to comprehend 3D scenes from 2D models.

19.7CVApr 14
4th Workshop on Maritime Computer Vision (MaCVi): Challenge Overview

Benjamin Kiefer, Jan Lukas Augustin, Jon Muhovič et al.

The 4th Workshop on Maritime Computer Vision (MaCVi) is organized as part of CVPR 2026. This edition features five benchmark challenges with emphasis on both predictive accuracy and embedded real-time feasibility. This report summarizes the MaCVi 2026 challenge setup, evaluation protocols, datasets, and benchmark tracks, and presents quantitative results, qualitative comparisons, and cross-challenge analyses of emerging method trends. We also include technical reports from top-performing teams to highlight practical design choices and lessons learned across the benchmark suite. Datasets, leaderboards, and challenge resources are available at https://macvi.org/workshop/cvpr26.

CVNov 3, 2023
Data-Centric Long-Tailed Image Recognition

Yanbiao Ma, Licheng Jiao, Fang Liu et al.

In the context of the long-tail scenario, models exhibit a strong demand for high-quality data. Data-centric approaches aim to enhance both the quantity and quality of data to improve model performance. Among these approaches, information augmentation has been progressively introduced as a crucial category. It achieves a balance in model performance by augmenting the richness and quantity of samples in the tail classes. However, there is currently a lack of research into the underlying mechanisms explaining the effectiveness of information augmentation methods. Consequently, the utilization of information augmentation in long-tail recognition tasks relies heavily on empirical and intricate fine-tuning. This work makes two primary contributions. Firstly, we approach the problem from the perspectives of feature diversity and distribution shift, introducing the concept of Feature Diversity Gain (FDG) to elucidate why information augmentation is effective. We find that the performance of information augmentation can be explained by FDG, and its performance peaks when FDG achieves an appropriate balance. Experimental results demonstrate that by using FDG to select augmented data, we can further enhance model performance without the need for any modifications to the model's architecture. Thus, data-centric approaches hold significant potential in the field of long-tail recognition, beyond the development of new model structures. Furthermore, we systematically introduce the core components and fundamental tasks of a data-centric long-tail learning framework for the first time. These core components guide the implementation and deployment of the system, while the corresponding fundamental tasks refine and expand the research area.

LGSep 27, 2024
A physics-driven sensor placement optimization methodology for temperature field reconstruction

Xu Liu, Wen Yao, Wei Peng et al.

Perceiving the global field from sparse sensors has been a grand challenge in the monitoring, analysis, and design of physical systems. In this context, sensor placement optimization is a crucial issue. Most existing works require large and sufficient data to construct data-based criteria, which are intractable in data-free scenarios without numerical and experimental data. To this end, we propose a novel physics-driven sensor placement optimization (PSPO) method for temperature field reconstruction using a physics-based criterion to optimize sensor locations. In our methodological framework, we firstly derive the theoretical upper and lower bounds of the reconstruction error under noise scenarios by analyzing the optimal solution, proving that error bounds correlate with the condition number determined by sensor locations. Furthermore, the condition number, as the physics-based criterion, is used to optimize sensor locations by the genetic algorithm. Finally, the best sensors are validated by reconstruction models, including non-invasive end-to-end models, non-invasive reduced-order models, and physics-informed models. Experimental results, both on a numerical and an application case, demonstrate that the PSPO method significantly outperforms random and uniform selection methods, improving the reconstruction accuracy by nearly an order of magnitude. Moreover, the PSPO method can achieve comparable reconstruction accuracy to the existing data-driven placement optimization methods.

CVSep 9, 2024
Renormalized Connection for Scale-preferred Object Detection in Satellite Imagery

Fan Zhang, Lingling Li, Licheng Jiao et al.

Satellite imagery, due to its long-range imaging, brings with it a variety of scale-preferred tasks, such as the detection of tiny/small objects, making the precise localization and detection of small objects of interest a challenging task. In this article, we design a Knowledge Discovery Network (KDN) to implement the renormalization group theory in terms of efficient feature extraction. Renormalized connection (RC) on the KDN enables ``synergistic focusing'' of multi-scale features. Based on our observations of KDN, we abstract a class of RCs with different connection strengths, called n21C, and generalize it to FPN-based multi-branch detectors. In a series of FPN experiments on the scale-preferred tasks, we found that the ``divide-and-conquer'' idea of FPN severely hampers the detector's learning in the right direction due to the large number of large-scale negative samples and interference from background noise. Moreover, these negative samples cannot be eliminated by the focal loss function. The RCs extends the multi-level feature's ``divide-and-conquer'' mechanism of the FPN-based detectors to a wide range of scale-preferred tasks, and enables synergistic effects of multi-level features on the specific learning goal. In addition, interference activations in two aspects are greatly reduced and the detector learns in a more correct direction. Extensive experiments of 17 well-designed detection architectures embedded with n21s on five different levels of scale-preferred tasks validate the effectiveness and efficiency of the RCs. Especially the simplest linear form of RC, E421C performs well in all tasks and it satisfies the scaling property of RGT. We hope that our approach will transfer a large number of well-designed detectors from the computer vision community to the remote sensing community.

AISep 9, 2022
Task-Agnostic Learning to Accomplish New Tasks

Xianqi Zhang, Xingtao Wang, Xu Liu et al.

Reinforcement Learning (RL) and Imitation Learning (IL) have made great progress in robotic decision-making in recent years. However, these methods show obvious deterioration for new tasks that need to be completed through new combinations of actions. RL methods suffer from reward functions and distribution shifts, while IL methods are limited by expert demonstrations which do not cover new tasks. In contrast, humans can easily complete these tasks with the fragmented knowledge learned from task-agnostic experience. Inspired by this observation, this paper proposes a task-agnostic learning method (TAL for short) that can learn fragmented knowledge only from task-agnostic data to accomplish new tasks. TAL consists of four stages. First, the task-agnostic exploration is performed to collect data from interactions with the environment. The collected data is organized via a knowledge graph. Second, an action feature extractor is proposed and trained using the collected knowledge graph data for task-agnostic fragmented knowledge learning. Third, a candidate action generator is designed, which applies the action feature extractor on a new task to generate multiple candidate action sets. Finally, an action proposal network is designed to produce the probabilities for actions in a new task according to the environmental information. The probabilities are then used to generate order information for selecting actions to be executed from multiple candidate action sets to form the plan. Experiments on a virtual indoor scene show that the proposed method outperforms the state-of-the-art offline RL methods and IL methods by more than 20%.

LGSep 9, 2024
A general reduced-order neural operator for spatio-temporal predictive learning on complex spatial domains

Qinglu Meng, Yingguang Li, Zhiliang Deng et al.

Predictive learning for spatio-temporal processes (PL-STP) on complex spatial domains plays a critical role in various scientific and engineering fields, with its essence being the construction of operators between infinite-dimensional function spaces. This paper focuses on the unequal-domain mappings in PL-STP and categorising them into increase-domain and decrease-domain mapping. Recent advances in deep learning have revealed the great potential of neural operators (NOs) to learn operators directly from observational data. However, existing NOs require input space and output space to be the same domain, which pose challenges in ensuring predictive accuracy and stability for unequal-domain mappings. To this end, this study presents a general reduced-order neural operator named Reduced-Order Neural Operator on Riemannian Manifolds (RO-NORM), which consists of two parts: the unequal-domain encoder/decoder and the same-domain approximator. Motivated by the variable separation in classical modal decomposition, the unequal-domain encoder/decoder uses the pre-computed bases to reformulate the spatio-temporal function as a sum of products between spatial (or temporal) bases and corresponding temporally (or spatially) distributed weight functions, thus the original unequal-domain mapping can be converted into a same-domain mapping. Consequently, the same-domain approximator NORM is applied to model the transformed mapping. The performance of our proposed method has been evaluated on six benchmark cases, including parametric PDEs, engineering and biomedical applications, and compared with four baseline algorithms: DeepONet, POD-DeepONet, PCA-Net, and vanilla NORM. The experimental results demonstrate the superiority of RO-NORM in prediction accuracy and training efficiency for PL-STP.

LGOct 26, 2023
Towards Unifying Diffusion Models for Probabilistic Spatio-Temporal Graph Learning

Junfeng Hu, Xu Liu, Zhencheng Fan et al.

Spatio-temporal graph learning is a fundamental problem in modern urban systems. Existing approaches tackle different tasks independently, tailoring their models to unique task characteristics. These methods, however, fall short of modeling intrinsic uncertainties in the spatio-temporal data. Meanwhile, their specialized designs misalign with the current research efforts toward unifying spatio-temporal graph learning solutions. In this paper, we propose to model these tasks in a unified probabilistic perspective, viewing them as predictions based on conditional information with shared dependencies. Based on this proposal, we introduce Unified Spatio-Temporal Diffusion Models (USTD) to address the tasks uniformly under the uncertainty-aware diffusion framework. USTD is holistically designed, comprising a shared spatio-temporal encoder and attention-based denoising decoders that are task-specific. The encoder, optimized by pre-training strategies, effectively captures conditional spatio-temporal patterns. The decoders, utilizing attention mechanisms, generate predictions by leveraging learned patterns. Opting for forecasting and kriging, the decoders are designed as Spatial Gated Attention (SGA) and Temporal Gated Attention (TGA) for each task, with different emphases on the spatial and temporal dimensions. Combining the advantages of deterministic encoders and probabilistic decoders, USTD achieves state-of-the-art performances compared to both deterministic and probabilistic baselines, while also providing valuable uncertainty estimates.

NEJul 1, 2023
AutoST: Training-free Neural Architecture Search for Spiking Transformers

Ziqing Wang, Qidong Zhao, Jinku Cui et al.

Spiking Transformers have gained considerable attention because they achieve both the energy efficiency of Spiking Neural Networks (SNNs) and the high capacity of Transformers. However, the existing Spiking Transformer architectures, derived from Artificial Neural Networks (ANNs), exhibit a notable architectural gap, resulting in suboptimal performance compared to their ANN counterparts. Manually discovering optimal architectures is time-consuming. To address these limitations, we introduce AutoST, a training-free NAS method for Spiking Transformers, to rapidly identify high-performance Spiking Transformer architectures. Unlike existing training-free NAS methods, which struggle with the non-differentiability and high sparsity inherent in SNNs, we propose to utilize Floating-Point Operations (FLOPs) as a performance metric, which is independent of model computations and training dynamics, leading to a stronger correlation with performance. Our extensive experiments show that AutoST models outperform state-of-the-art manually or automatically designed SNN architectures on static and neuromorphic datasets. Full code, model, and data are released for reproduction.

59.4AIApr 13
Delving Aleatoric Uncertainty in Medical Image Segmentation via Vision Foundation Models

Ruiyang Li, Fang Liu, Licheng Jiao et al.

Medical image segmentation supports clinical workflows by precisely delineating anatomical structures and lesions. However, medical image datasets medical image datasets suffer from acquisition noise and annotation ambiguity, causing pervasive data uncertainty that substantially undermines model robustness. Existing research focuses primarily on model architectural improvements and predictive reliability estimation, while systematic exploration of the intrinsic data uncertainty remains insufficient. To address this gap, this work proposes leveraging the universal representation capabilities of visual foundation models to estimate inherent data uncertainty. Specifically, we analyze the feature diversity of the model's decoded representations and quantify their singular value energy to define the semantic perception scale for each class, thereby measuring sample difficulty and aleatoric uncertainty. Based on this foundation, we design two uncertainty-driven application strategies: (1) the aleatoric uncertainty-aware data filtering mechanism to eliminate potentially noisy samples and enhance model learning quality; (2) the dynamic uncertainty-aware optimization strategy that adaptively adjusts class-specific loss weights during training based on the semantic perception scale, combined with a label denoising mechanism to improve training stability. Experimental results on five public datasets encompassing CT and MRI modalities and involving multi-organ and tumor segmentation tasks demonstrate that our method achieves significant and robust performance improvements across various mainstream network architectures, revealing the broad application potential of aleatoric uncertainty in medical image understanding and segmentation tasks.

LGNov 12, 2025
Moirai 2.0: When Less Is More for Time Series Forecasting

Chenghao Liu, Taha Aksu, Juncheng Liu et al.

We introduce Moirai 2.0, a decoder-only time-series foundation model trained on a new corpus of 36M series. The model adopts quantile forecasting and multi-token prediction, improving both probabilistic accuracy and inference efficiency. On the Gift-Eval benchmark, it ranks among the top pretrained models while achieving a strong trade-off between accuracy, speed, and model size. Compared to Moirai 1.0, Moirai 2.0 replaces masked-encoder training, multi-patch inputs, and mixture-distribution outputs with a simpler decoder-only architecture, single patch, and quantile loss. Ablation studies isolate these changes -- showing that the decoder-only backbone along with recursive multi-quantile decoding contribute most to the gains. Additional experiments show that Moirai 2.0 outperforms larger models from the same family and exhibits robust domain-level results. In terms of efficiency and model size, Moirai 2.0 is twice as fast and thirty times smaller than its prior best version, Moirai 1.0-Large, while also performing better. Model performance plateaus with increasing parameter count and declines at longer horizons, motivating future work on data scaling and long-horizon modeling. We release code and evaluation details to support further research.

CVSep 9, 2024
LSVOS Challenge Report: Large-scale Complex and Long Video Object Segmentation

Henghui Ding, Lingyi Hong, Chang Liu et al.

Despite the promising performance of current video segmentation models on existing benchmarks, these models still struggle with complex scenes. In this paper, we introduce the 6th Large-scale Video Object Segmentation (LSVOS) challenge in conjunction with ECCV 2024 workshop. This year's challenge includes two tasks: Video Object Segmentation (VOS) and Referring Video Object Segmentation (RVOS). In this year, we replace the classic YouTube-VOS and YouTube-RVOS benchmark with latest datasets MOSE, LVOS, and MeViS to assess VOS under more challenging complex environments. This year's challenge attracted 129 registered teams from more than 20 institutes across over 8 countries. This report include the challenge and dataset introduction, and the methods used by top 7 teams in two tracks. More details can be found in our homepage https://lsvos.github.io/.

LGOct 14, 2024Code
GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation

Taha Aksu, Gerald Woo, Juncheng Liu et al.

Time series foundation models excel in zero-shot forecasting, handling diverse tasks without explicit training. However, the advancement of these models has been hindered by the lack of comprehensive benchmarks. To address this gap, we introduce the General Time Series Forecasting Model Evaluation, GIFT-Eval, a pioneering benchmark aimed at promoting evaluation across diverse datasets. GIFT-Eval encompasses 23 datasets over 144,000 time series and 177 million data points, spanning seven domains, 10 frequencies, multivariate inputs, and prediction lengths ranging from short to long-term forecasts. To facilitate the effective pretraining and evaluation of foundation models, we also provide a non-leaking pretraining dataset containing approximately 230 billion data points. Additionally, we provide a comprehensive analysis of 17 baselines, which includes statistical models, deep learning models, and foundation models. We discuss each model in the context of various benchmark characteristics and offer a qualitative analysis that spans both deep learning and foundation models. We believe the insights from this analysis, along with access to this new standard zero-shot time series forecasting benchmark, will guide future developments in time series foundation models. Code, data, and the leaderboard can be found at https://github.com/SalesforceAIResearch/gift-eval .

69.4CVMar 23
Let's Think with Images Efficiently! An Interleaved-Modal Chain-of-Thought Reasoning Framework with Dynamic and Precise Visual Thoughts

Xu Liu, Yongheng Zhang, Qiguang Chen et al.

Recently, Interleaved-modal Chain-of-Thought (ICoT) reasoning has achieved remarkable success by leveraging both multimodal inputs and outputs, attracting increasing attention. While achieving promising performance, current ICoT methods still suffer from two major limitations: (1) Static Visual Thought Positioning, which statically inserts visual information at fixed steps, resulting in inefficient and inflexible reasoning; and (2) Broken Visual Thought Representation, which involves discontinuous and semantically incoherent visual tokens. To address these limitations, we introduce Interleaved-modal Chain-of-Thought reasoning with Dynamic and Precise Visual Thoughts (DaP-ICoT), which incorporates two key components: (1) Dynamic Visual Thought Integration adaptively introduces visual inputs based on reasoning needs, reducing redundancy and improving efficiency. (2) Precise Visual Thought Guidance ensures visual semantically coherent and contextually aligned representations. Experiments across multiple benchmarks and models demonstrate that DaP-ICoT achieves state-of-the-art performance. In addition, DaP-ICoT significantly reduces the number of inserted images, leading to a 72.6% decrease in token consumption, enabling more efficient ICoT reasoning.

CVJan 9
Learning Geometric Invariance for Gait Recognition

Zengbin Wang, Junjie Li, Saihui Hou et al.

The goal of gait recognition is to extract identity-invariant features of an individual under various gait conditions, e.g., cross-view and cross-clothing. Most gait models strive to implicitly learn the common traits across different gait conditions in a data-driven manner to pull different gait conditions closer for recognition. However, relatively few studies have explicitly explored the inherent relations between different gait conditions. For this purpose, we attempt to establish connections among different gait conditions and propose a new perspective to achieve gait recognition: variations in different gait conditions can be approximately viewed as a combination of geometric transformations. In this case, all we need is to determine the types of geometric transformations and achieve geometric invariance, then identity invariance naturally follows. As an initial attempt, we explore three common geometric transformations (i.e., Reflect, Rotate, and Scale) and design a $\mathcal{R}$eflect-$\mathcal{R}$otate-$\mathcal{S}$cale invariance learning framework, named ${\mathcal{RRS}}$-Gait. Specifically, it first flexibly adjusts the convolution kernel based on the specific geometric transformations to achieve approximate feature equivariance. Then these three equivariant-aware features are respectively fed into a global pooling operation for final invariance-aware learning. Extensive experiments on four popular gait datasets (Gait3D, GREW, CCPG, SUSTech1K) show superior performance across various gait conditions.

CVNov 7, 2025
Dynamic Residual Encoding with Slide-Level Contrastive Learning for End-to-End Whole Slide Image Representation

Jing Jin, Xu Liu, Te Gao et al.

Whole Slide Image (WSI) representation is critical for cancer subtyping, cancer recognition and mutation prediction.Training an end-to-end WSI representation model poses significant challenges, as a standard gigapixel slide can contain tens of thousands of image tiles, making it difficult to compute gradients of all tiles in a single mini-batch due to current GPU limitations. To address this challenge, we propose a method of dynamic residual encoding with slide-level contrastive learning (DRE-SLCL) for end-to-end WSI representation. Our approach utilizes a memory bank to store the features of tiles across all WSIs in the dataset. During training, a mini-batch usually contains multiple WSIs. For each WSI in the batch, a subset of tiles is randomly sampled and their features are computed using a tile encoder. Then, additional tile features from the same WSI are selected from the memory bank. The representation of each individual WSI is generated using a residual encoding technique that incorporates both the sampled features and those retrieved from the memory bank. Finally, the slide-level contrastive loss is computed based on the representations and histopathology reports ofthe WSIs within the mini-batch. Experiments conducted over cancer subtyping, cancer recognition, and mutation prediction tasks proved the effectiveness of the proposed DRE-SLCL method.

52.3AIMar 11
A Hybrid Knowledge-Grounded Framework for Safety and Traceability in Prescription Verification

Yichi Zhu, Kan Ling, Xu Liu et al.

Medication errors pose a significant threat to patient safety, making pharmacist verification (PV) a critical, yet heavily burdened, final safeguard. The direct application of Large Language Models (LLMs) to this zero-tolerance domain is untenable due to their inherent factual unreliability, lack of traceability, and weakness in complex reasoning. To address these challenges, we introduce PharmGraph-Auditor, a novel system designed for safe and evidence-grounded prescription auditing. The core of our system is a trustworthy Hybrid Pharmaceutical Knowledge Base (HPKB), implemented under the Virtual Knowledge Graph (VKG) paradigm. This architecture strategically unifies a relational component for set constraint satisfaction and a graph component for topological reasoning via a rigorous mapping layer. To construct this HPKB, we propose the Iterative Schema Refinement (ISR) algorithm, a framework that enables the co-evolution of both graph and relational schemas from medical texts. For auditing, we introduce the KB-grounded Chain of Verification (CoV), a new reasoning paradigm that transforms the LLM from an unreliable generator into a transparent reasoning engine. CoV decomposes the audit task into a sequence of verifiable queries against the HPKB, generating hybrid query plans to retrieve evidence from the most appropriate data store. Experimental results demonstrate robust knowledge extraction capabilities and show promises of using PharmGraph-Auditor to enable pharmacists to achieve safer and faster prescription verification.

51.2NIMay 21
Astragalus: Automatic Configuration Repair for Production Networks

Zhenrong Gu, Peng Zhang, Xing Feng et al.

Network configurations are prone to errors, which can lead to catastrophic service outages. A tool that can achieve automatic configuration repair (ACR) is highly desired by operators. Existing tools for ACR follow a semantic-driven approach: they model network semantics as a set of SMT constraints, and solve them for a location or fix of the error. Due to the complex semantics of networks, constructing and solving these constraints can be prohibitively expensive, making these tools neither general nor scalable. Inspired by automatic program repair (APR), we explore another direction, i.e., a syntax-driven approach, which tries to repair program bugs by ``grafting'' some existing code in the same repository, without modeling program semantics. Following this direction, we propose Astragalus, a syntax-driven method for ACR. It uses multiple iterations of a ``localize-fix-validate'' pipeline to search for repairs, and proves quite effective on configurations of our production network. Specifically, we show that Astragalus can repair every incident in multiple sizes of a synthesized network, and 97.5\% of the incidents on a real network, both with 15 types of errors injected, within an average time of 7.36 seconds. It has also provided valid repair options in under 6 minutes for 4 recent network incidents or undesired changes, in a real production network with O(1,000)Õ(10,000) devices.

IVApr 17, 2025Code
NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results

Xin Li, Kun Yuan, Bingchen Li et al.

This paper presents a review for the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating reliance on model ensembles, redundant weights, and other computationally expensive components in the previous IQA/VQA competitions. Track 2 introduces a new short-form UGC dataset tailored for single image super-resolution, i.e., the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, which are split into training, validation, and test sets using a ratio of 8:1:1. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. This challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, significantly contributing to the progress of short-form UGC VQA and image superresolution. The project is publicly available at https://github.com/lixinustc/KVQE- ChallengeCVPR-NTIRE2025.

CRNov 4, 2025
An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks

Xu Liu, Yan Chen, Kan Ling et al.

The widespread deployment of Large Language Models (LLMs) as public-facing web services and APIs has made their security a core concern for the web ecosystem. Jailbreak attacks, as one of the significant threats to LLMs, have recently attracted extensive research. In this paper, we reveal a jailbreak strategy which can effectively evade current defense strategies. It can extract valuable information from failed or partially successful attack attempts and contains self-evolution from attack interactions, resulting in sufficient strategy diversity and adaptability. Inspired by continuous learning and modular design principles, we propose ASTRA, a jailbreak framework that autonomously discovers, retrieves, and evolves attack strategies to achieve more efficient and adaptive attacks. To enable this autonomous evolution, we design a closed-loop "attack-evaluate-distill-reuse" core mechanism that not only generates attack prompts but also automatically distills and generalizes reusable attack strategies from every interaction. To systematically accumulate and apply this attack knowledge, we introduce a three-tier strategy library that categorizes strategies into Effective, Promising, and Ineffective based on their performance scores. The strategy library not only provides precise guidance for attack generation but also possesses exceptional extensibility and transferability. We conduct extensive experiments under a black-box setting, and the results show that ASTRA achieves an average Attack Success Rate (ASR) of 82.7%, significantly outperforming baselines.

SDSep 7, 2023
Spiking Structured State Space Model for Monaural Speech Enhancement

Yu Du, Xu Liu, Yansong Chua

Speech enhancement seeks to extract clean speech from noisy signals. Traditional deep learning methods face two challenges: efficiently using information in long speech sequences and high computational costs. To address these, we introduce the Spiking Structured State Space Model (Spiking-S4). This approach merges the energy efficiency of Spiking Neural Networks (SNN) with the long-range sequence modeling capabilities of Structured State Space Models (S4), offering a compelling solution. Evaluation on the DNS Challenge and VoiceBank+Demand Datasets confirms that Spiking-S4 rivals existing Artificial Neural Network (ANN) methods but with fewer computational resources, as evidenced by reduced parameters and Floating Point Operations (FLOPs).