CVJul 19, 2022Code
ParticleSfM: Exploiting Dense Point Trajectories for Localizing Moving Cameras in the WildWang Zhao, Shaohui Liu, Hengkai Guo et al.
Estimating the pose of a moving camera from monocular video is a challenging problem, especially due to the presence of moving objects in dynamic environments, where the performance of existing camera pose estimation methods are susceptible to pixels that are not geometrically consistent. To tackle this challenge, we present a robust dense indirect structure-from-motion method for videos that is based on dense correspondence initialized from pairwise optical flow. Our key idea is to optimize long-range video correspondence as dense point trajectories and use it to learn robust estimation of motion segmentation. A novel neural network architecture is proposed for processing irregular point trajectory data. Camera poses are then estimated and optimized with global bundle adjustment over the portion of long-range point trajectories that are classified as static. Experiments on MPI Sintel dataset show that our system produces significantly more accurate camera trajectories compared to existing state-of-the-art methods. In addition, our method is able to retain reasonable accuracy of camera poses on fully static scenes, which consistently outperforms strong state-of-the-art dense correspondence based methods with end-to-end deep learning, demonstrating the potential of dense indirect methods based on optical flow and point trajectories. As the point trajectory representation is general, we further present results and comparisons on in-the-wild monocular videos with complex motion of dynamic objects. Code is available at https://github.com/bytedance/particle-sfm.
CVMar 30, 2023Code
3D Line Mapping RevisitedShaohui Liu, Yifan Yu, Rémi Pautrat et al.
In contrast to sparse keypoints, a handful of line segments can concisely encode the high-level scene layout, as they often delineate the main structural elements. In addition to offering strong geometric cues, they are also omnipresent in urban landscapes and indoor scenes. Despite their apparent advantages, current line-based reconstruction methods are far behind their point-based counterparts. In this paper we aim to close the gap by introducing LIMAP, a library for 3D line mapping that robustly and efficiently creates 3D line maps from multi-view imagery. This is achieved through revisiting the degeneracy problem of line triangulation, carefully crafted scoring and track building, and exploiting structural priors such as line coincidence, parallelism, and orthogonality. Our code integrates seamlessly with existing point-based Structure-from-Motion methods and can leverage their 3D points to further improve the line reconstruction. Furthermore, as a byproduct, the method is able to recover 3D association graphs between lines and points / vanishing points (VPs). In thorough experiments, we show that LIMAP significantly outperforms existing approaches for 3D line mapping. Our robust 3D line maps also open up new research directions. We show two example applications: visual localization and bundle adjustment, where integrating lines alongside points yields the best results. Code is available at https://github.com/cvg/limap.
CVAug 21, 2023Code
Vanishing Point Estimation in Uncalibrated Images with Prior Gravity DirectionRémi Pautrat, Shaohui Liu, Petr Hruby et al.
We tackle the problem of estimating a Manhattan frame, i.e. three orthogonal vanishing points, and the unknown focal length of the camera, leveraging a prior vertical direction. The direction can come from an Inertial Measurement Unit that is a standard component of recent consumer devices, e.g., smartphones. We provide an exhaustive analysis of minimal line configurations and derive two new 2-line solvers, one of which does not suffer from singularities affecting existing solvers. Additionally, we design a new non-minimal method, running on an arbitrary number of lines, to boost the performance in local optimization. Combining all solvers in a hybrid robust estimator, our method achieves increased accuracy even with a rough prior. Experiments on synthetic and real-world datasets demonstrate the superior accuracy of our method compared to the state of the art, while having comparable runtimes. We further demonstrate the applicability of our solvers for relative rotation estimation. The code is available at https://github.com/cvg/VP-Estimation-with-Prior-Gravity.
CVSep 29, 2024Code
Robust Incremental Structure-from-Motion with Hybrid FeaturesShaohui Liu, Yidan Gao, Tianyi Zhang et al.
Structure-from-Motion (SfM) has become a ubiquitous tool for camera calibration and scene reconstruction with many downstream applications in computer vision and beyond. While the state-of-the-art SfM pipelines have reached a high level of maturity in well-textured and well-configured scenes over the last decades, they still fall short of robustly solving the SfM problem in challenging scenarios. In particular, weakly textured scenes and poorly constrained configurations oftentimes cause catastrophic failures or large errors for the primarily keypoint-based pipelines. In these scenarios, line segments are often abundant and can offer complementary geometric constraints. Their large spatial extent and typically structured configurations lead to stronger geometric constraints as compared to traditional keypoint-based methods. In this work, we introduce an incremental SfM system that, in addition to points, leverages lines and their structured geometric relations. Our technical contributions span the entire pipeline (mapping, triangulation, registration) and we integrate these into a comprehensive end-to-end SfM system that we share as an open-source software with the community. We also present the first analytical method to propagate uncertainties for 3D optimized lines via sensitivity analysis. Experiments show that our system is consistently more robust and accurate compared to the widely used point-based state of the art in SfM -- achieving richer maps and more precise camera registrations, especially under challenging conditions. In addition, our uncertainty-aware localization module alone is able to consistently improve over the state of the art under both point-alone and hybrid setups.
CVSep 27, 2023
Handbook on Leveraging Lines for Two-View Relative Pose EstimationPetr Hruby, Shaohui Liu, Rémi Pautrat et al.
We propose an approach for estimating the relative pose between calibrated image pairs by jointly exploiting points, lines, and their coincidences in a hybrid manner. We investigate all possible configurations where these data modalities can be used together and review the minimal solvers available in the literature. Our hybrid framework combines the advantages of all configurations, enabling robust and accurate estimation in challenging environments. In addition, we design a method for jointly estimating multiple vanishing point correspondences in two images, and a bundle adjustment that considers all relevant data modalities. Experiments on various indoor and outdoor datasets show that our approach outperforms point-based methods, improving AUC@10$^\circ$ by 1-7 points while running at comparable speeds. The source code of the solvers and hybrid framework will be made public.
CVJul 19, 2023
Lazy Visual Localization via Motion AveragingSiyan Dong, Shaohui Liu, Hengkai Guo et al.
Visual (re)localization is critical for various applications in computer vision and robotics. Its goal is to estimate the 6 degrees of freedom (DoF) camera pose for each query image, based on a set of posed database images. Currently, all leading solutions are structure-based that either explicitly construct 3D metric maps from the database with structure-from-motion, or implicitly encode the 3D information with scene coordinate regression models. On the contrary, visual localization without reconstructing the scene in 3D offers clear benefits. It makes deployment more convenient by reducing database pre-processing time, releasing storage requirements, and remaining unaffected by imperfect reconstruction, etc. In this technical report, we demonstrate that it is possible to achieve high localization accuracy without reconstructing the scene from the database. The key to achieving this owes to a tailored motion averaging over database-query pairs. Experiments show that our visual localization proposal, LazyLoc, achieves comparable performance against state-of-the-art structure-based methods. Furthermore, we showcase the versatility of LazyLoc, which can be easily extended to handle complex configurations such as multi-query co-localization and camera rigs.
CLDec 29, 2025Code
MiMo-Audio: Audio Language Models are Few-Shot LearnersXiaomi LLM-Core Team, Dong Zhang, Gang Wang et al.
Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million of hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, capable of generating highly realistic talk shows, recitations, livestreaming and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio) and instruct-TTS evaluations, approaching or surpassing closed-source models. Model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-Audio.
CVAug 3, 2022
Fast Hierarchical Deep Unfolding Network for Image Compressed SensingWenxue Cui, Shaohui Liu, Debin Zhao
By integrating certain optimization solvers with deep neural network, deep unfolding network (DUN) has attracted much attention in recent years for image compressed sensing (CS). However, there still exist several issues in existing DUNs: 1) For each iteration, a simple stacked convolutional network is usually adopted, which apparently limits the expressiveness of these models. 2) Once the training is completed, most hyperparameters of existing DUNs are fixed for any input content, which significantly weakens their adaptability. In this paper, by unfolding the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA), a novel fast hierarchical DUN, dubbed FHDUN, is proposed for image compressed sensing, in which a well-designed hierarchical unfolding architecture is developed to cooperatively explore richer contextual prior information in multi-scale spaces. To further enhance the adaptability, series of hyperparametric generation networks are developed in our framework to dynamically produce the corresponding optimal hyperparameters according to the input content. Furthermore, due to the accelerated policy in FISTA, the newly embedded acceleration module makes the proposed FHDUN save more than 50% of the iterative loops against recent DUNs. Extensive CS experiments manifest that the proposed FHDUN outperforms existing state-of-the-art CS methods, while maintaining fewer iterations.
SYMay 16, 2022
Topology-aware Graph Neural Networks for Learning Feasible and Adaptive ac-OPF SolutionsShaohui Liu, Chengyang Wu, Hao Zhu
Solving the optimal power flow (OPF) problem is a fundamental task to ensure the system efficiency and reliability in real-time electricity grid operations. We develop a new topology-informed graph neural network (GNN) approach for predicting the optimal solutions of real-time ac-OPF problem. To incorporate grid topology to the NN model, the proposed GNN-for-OPF framework innovatively exploits the locality property of locational marginal prices and voltage magnitude. Furthermore, we develop a physics-aware (ac-)flow feasibility regularization approach for general OPF learning. The advantages of our proposed designs include reduced model complexity, improved generalizability and feasibility guarantees. By providing the analytical understanding on the graph subspace stability under grid topology contingency, we show the proposed GNN can quickly adapt to varying grid topology by an efficient re-training strategy. Numerical tests on various test systems of different sizes have validated the prediction accuracy, improved flow feasibility, and topology adaptivity capability of our proposed GNN-based learning framework.
LGDec 19, 2022
Wind Power Scenario Generation Using Graph Convolutional Generative Adversarial NetworkYoung-ho Cho, Shaohui Liu, Duehee Lee et al.
Generating wind power scenarios is very important for studying the impacts of multiple wind farms that are interconnected to the grid. We develop a graph convolutional generative adversarial network (GCGAN) approach by leveraging GAN's capability in generating large number of realistic scenarios without using statistical modeling. Unlike existing GAN-based wind power data generation approaches, we design GAN's hidden layers to match the underlying spatial and temporal characteristics. We advocate the use of graph filters to embed the spatial correlation among multiple wind farms, and a one-dimensional (1D) convolutional layer to represent the temporal feature filters. The proposed graph and feature filter design significantly reduce the GAN model complexity, leading to improvements in training efficiency and computation complexity. Numerical results using real wind power data from Australia demonstrate that the scenarios generated by the proposed GCGAN exhibit more realistic spatial and temporal statistics than other GAN-based outputs.
CVNov 13, 2023
CLiF-VQA: Enhancing Video Quality Assessment by Incorporating High-Level Semantic Information related to Human FeelingsYachun Mi, Yu Li, Yan Shu et al.
Video Quality Assessment (VQA) aims to simulate the process of perceiving video quality by the human visual system (HVS). The judgments made by HVS are always influenced by human subjective feelings. However, most of the current VQA research focuses on capturing various distortions in the spatial and temporal domains of videos, while ignoring the impact of human feelings. In this paper, we propose CLiF-VQA, which considers both features related to human feelings and spatial features of videos. In order to effectively extract features related to human feelings from videos, we explore the consistency between CLIP and human feelings in video perception for the first time. Specifically, we design multiple objective and subjective descriptions closely related to human feelings as prompts. Further we propose a novel CLIP-based semantic feature extractor (SFE) which extracts features related to human feelings by sliding over multiple regions of the video frame. In addition, we further capture the low-level-aware features of the video through a spatial feature extraction module. The two different features are then aggregated thereby obtaining the quality score of the video. Extensive experiments show that the proposed CLiF-VQA exhibits excellent performance on several VQA datasets.
CVFeb 28, 2023
Read Pointer Meters in complex environments based on a Human-like Alignment and Recognition AlgorithmYan Shu, Shaohui Liu, Honglei Xu et al.
Recently, developing an automatic reading system for analog measuring instruments has gained increased attention, as it enables the collection of numerous state of equipment. Nonetheless, two major obstacles still obstruct its deployment to real-world applications. The first issue is that they rarely take the entire pipeline's speed into account. The second is that they are incapable of dealing with some low-quality images (i.e., meter breakage, blur, and uneven scale). In this paper, we propose a human-like alignment and recognition algorithm to overcome these problems. More specifically, a Spatial Transformed Module(STM) is proposed to obtain the front view of images in a self-autonomous way based on an improved Spatial Transformer Networks(STN). Meanwhile, a Value Acquisition Module(VAM) is proposed to infer accurate meter values by an end-to-end trained framework. In contrast to previous research, our model aligns and recognizes meters totally implemented by learnable processing, which mimics human's behaviours and thus achieves higher performances. Extensive results verify the good robustness of the proposed model in terms of the accuracy and efficiency.
CLMar 28, 2023
How can Deep Learning Retrieve the Write-Missing Additional Diagnosis from Chinese Electronic Medical Record For DRGShaohui Liu, Xien Liu, Ji Wu · tsinghua
The purpose of write-missing diagnosis detection is to find diseases that have been clearly diagnosed from medical records but are missed in the discharge diagnosis. Unlike the definition of missed diagnosis, the write-missing diagnosis is clearly manifested in the medical record without further reasoning. The write-missing diagnosis is a common problem, often caused by physician negligence. The write-missing diagnosis will result in an incomplete diagnosis of medical records. While under DRG grouping, the write-missing diagnoses will miss important additional diagnoses (CC, MCC), thus affecting the correct rate of DRG enrollment. Under the circumstance that countries generally start to adopt DRG enrollment and payment, the problem of write-missing diagnosis is a common and serious problem. The current manual-based method is expensive due to the complex content of the full medical record. We think this problem is suitable to be solved as natural language processing. But to the best of our knowledge, no researchers have conducted research on this problem based on natural language processing methods. We propose a framework for solving the problem of write-missing diagnosis, which mainly includes three modules: disease recall module, disease context logic judgment module, and disease relationship comparison module. Through this framework, we verify that the problem of write-missing diagnosis can be solved well, and the results are interpretable. At the same time, we propose advanced solutions for the disease context logic judgment module and disease relationship comparison module, which have obvious advantages compared with the mainstream methods of the same type of problems. Finally, we verified the value of our proposed framework under DRG medical insurance payment in a tertiary hospital.
CRDec 3, 2025Code
Context-Aware Hierarchical Learning: A Two-Step Paradigm towards Safer LLMsTengyun Ma, Jiaqi Yao, Daojing He et al.
Large Language Models (LLMs) have emerged as powerful tools for diverse applications. However, their uniform token processing paradigm introduces critical vulnerabilities in instruction handling, particularly when exposed to adversarial scenarios. In this work, we identify and propose a novel class of vulnerabilities, termed Tool-Completion Attack (TCA), which exploits function-calling mechanisms to subvert model behavior. To evaluate LLM robustness against such threats, we introduce the Tool-Completion benchmark, a comprehensive security assessment framework, which reveals that even state-of-the-art models remain susceptible to TCA, with surprisingly high attack success rates. To address these vulnerabilities, we introduce Context-Aware Hierarchical Learning (CAHL), a sophisticated mechanism that dynamically balances semantic comprehension with role-specific instruction constraints. CAHL leverages the contextual correlations between different instruction segments to establish a robust, context-aware instruction hierarchy. Extensive experiments demonstrate that CAHL significantly enhances LLM robustness against both conventional attacks and the proposed TCA, exhibiting strong generalization capabilities in zero-shot evaluations while still preserving model performance on generic tasks. Our code is available at https://github.com/S2AILab/CAHL.
CVDec 11, 2024Code
Reloc3r: Large-Scale Training of Relative Camera Pose Regression for Generalizable, Fast, and Accurate Visual LocalizationSiyan Dong, Shuzhe Wang, Shaohui Liu et al.
Visual localization aims to determine the camera pose of a query image relative to a database of posed images. In recent years, deep neural networks that directly regress camera poses have gained popularity due to their fast inference capabilities. However, existing methods struggle to either generalize well to new scenes or provide accurate camera pose estimates. To address these issues, we present Reloc3r, a simple yet effective visual localization framework. It consists of an elegantly designed relative pose regression network, and a minimalist motion averaging module for absolute pose estimation. Trained on approximately eight million posed image pairs, Reloc3r achieves surprisingly good performance and generalization ability. We conduct extensive experiments on six public datasets, consistently demonstrating the effectiveness and efficiency of the proposed method. It provides high-quality camera pose estimates in real time and generalizes to novel scenes. Code: https://github.com/ffrivera0/reloc3r.
CLMay 12, 2025Code
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to PosttrainingLLM-Core Xiaomi, Bingquan Xia, Bowen Shen et al. · pku
We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.
MSNov 4, 2016
A Functional Package for Automatic Solution of Ordinary Differential Equations with Spectral MethodsShaohui Liu, Tianshi Wang, Youran Zhang
We present a Python module named PyCheb, to solve the ordinary differential equations by using spectral collocation method. PyCheb incorporates discretization using Chebyshev points, barycentric interpolation and iterate methods. With this Python module, users can initialize the ODEsolver class by passing attributes, including the both sides of a given differential equation, boundary conditions, and the number of Chebyshev points, which can also be generated automatically by the ideal precision, to the constructor of ODEsolver class. Then, the instance of the ODEsolver class can be used to automatically determine the resolution of the differential equation as well as generate the graph of the high-precision approximate solution. (If you have any questions, please send me an email and I will reply ASAP. e-mail:shaohui_liu@qq.com/2013141482143@stu.scu.edu.cn)
CVMay 18
Unleashing Vision Transformer Potential In Image Quality Assessment via Global-Local Adaptive InteractionYu Li, Puchao Zhou, Yachun Mi et al.
In the field of Blind Image Quality Assessment (BIQA), accurately predicting the perceptual quality of authentically distorted images remains highly challenging due to the diverse and complex distortions present in natural environments. Although existing methods have achieved notable accuracy, their scalability is often constrained by the high cost of subjective annotation and the limited size of available datasets. Recent advances in large-scale pre-trained vision models have introduced powerful semantic and representational capabilities, yet their application to IQA tasks is hindered by substantial computational demands and suboptimal fine-tuning efficiency. To overcome these limitations, we introduce the Global-Local Interaction Adapter (GLIA), a novel framework that effectively harnesses pre-trained Vision Transformers through a dual-stream feature extraction mechanism coupled with interactive global-local fusion. By jointly retaining global semantic information and fine-grained local details, our approach delivers superior prediction accuracy and robustness while requiring significantly fewer trainable parameters. Extensive experiments on multiple benchmarks validate the effectiveness and superiority of our approach.
CLJun 4, 2025Code
MiMo-VL Technical ReportXiaomi LLM-Core Team, Zihao Yue, Zhenru Lin et al. · pku
We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-VL.
CVJan 9, 2025Code
Relative Pose Estimation through Affine Corrections of Monocular Depth PriorsYifan Yu, Shaohui Liu, Rémi Pautrat et al.
Monocular depth estimation (MDE) models have undergone significant advancements over recent years. Many MDE models aim to predict affine-invariant relative depth from monocular images, while recent developments in large-scale training and vision foundation models enable reasonable estimation of metric (absolute) depth. However, effectively leveraging these predictions for geometric vision tasks, in particular relative pose estimation, remains relatively under explored. While depths provide rich constraints for cross-view image alignment, the intrinsic noise and ambiguity from the monocular depth priors present practical challenges to improving upon classic keypoint-based solutions. In this paper, we develop three solvers for relative pose estimation that explicitly account for independent affine (scale and shift) ambiguities, covering both calibrated and uncalibrated conditions. We further propose a hybrid estimation pipeline that combines our proposed solvers with classic point-based solvers and epipolar constraints. We find that the affine correction modeling is beneficial to not only the relative depth priors but also, surprisingly, the "metric" ones. Results across multiple datasets demonstrate large improvements of our approach over classic keypoint-based baselines and PnP-based solutions, under both calibrated and uncalibrated setups. We also show that our method improves consistently with different feature matchers and MDE models, and can further benefit from very recent advances on both modules. Code is available at https://github.com/MarkYu98/madpose.
SYSep 22, 2024
A Unified Approach for Learning the Dynamics of Power System Generators and Inverter-based ResourcesShaohui Liu, Weiqian Cai, Hao Zhu et al.
The growing prevalence of inverter-based resources (IBRs) for renewable energy integration and electrification greatly challenges power system dynamic analysis. To account for both synchronous generators (SGs) and IBRs, this work presents an approach for learning the model of an individual dynamic component. The recurrent neural network (RNN) model is used to match the recursive structure in predicting the key dynamical states of a component from its terminal bus voltage and set-point input. To deal with the fast transients especially due to IBRs, we develop a Stable Integral (SI-)RNN to mimic high-order integral methods that can enhance the stability and accuracy for the dynamic learning task. We demonstrate that the proposed SI-RNN model not only can successfully predict the component's dynamic behaviors, but also offers the possibility of efficiently computing the dynamic sensitivity relative to a set-point change. These capabilities have been numerically validated based on full-order Electromagnetic Transient (EMT) simulations on a small test system with both SGs and IBRs, particularly for predicting the dynamics of grid-forming inverters.
CLJun 2, 2023
Simple Data Augmentation Techniques for Chinese Disease NormalizationWenqian Cui, Xiangling Fu, Shaohui Liu et al.
Disease name normalization is an important task in the medical domain. It classifies disease names written in various formats into standardized names, serving as a fundamental component in smart healthcare systems for various disease-related functions. Nevertheless, the most significant obstacle to existing disease name normalization systems is the severe shortage of training data. Consequently, we present a novel data augmentation approach that includes a series of data augmentation techniques and some supporting modules to help mitigate the problem. Our proposed methods rely on the Structural Invariance property of disease names and the Hierarchy property of the disease classification system. The goal is to equip the models with extensive understanding of the disease names and the hierarchical structure of the disease name classification system. Through extensive experimentation, we illustrate that our proposed approach exhibits significant performance improvements across various baseline models and training objectives, particularly in scenarios with limited training data.
SYMay 15
Watts vs. Bytes: Turning Data Centers into Grid Assets via Storage Compute Co-OptimizationShaohui Liu, Sungho Shin, Deepjyoti Deka
Enabling continued data-center growth under increasing grid stress motivates closer coordination between flexible computing demand and co-located battery energy storage systems (BESS) to improve site operations and provide grid services. This paper develops a robust co-optimization framework for day-ahead operation of data centers with co-located BESS under utility-imposed interconnection limits on peak load and ramping. The model jointly considers deadline-constrained computing workloads, managed through workload scheduling and dynamic voltage and frequency scaling (DVFS), together with degradation-aware BESS dispatch to enable cost optimization and participation in ancillary-service markets. Case studies based on real-world market and workload data show that the proposed framework yields feasible day-ahead schedules across a range of operating conditions, with substantially larger benefits when interconnection constraints become binding. Under baseline conditions, BESS value is derived from both ancillary-service participation and improved workload and energy management. Under stressed peak-load and ramping limits, however, the daily value of BESS increases by a factor of two or more, driven primarily \revise{by BESS actions to reduce the potential incompletion in the schedulable workload while complying with interconnection constraints}. Under tight peak-load caps, workload composition also matters where a higher share of non-schedulable jobs can increase operating cost by more than 25\% relative to more flexible workload mixes. \revise{Additionally, DVFS studies further show that processor-level control is a material flexibility lever under tight load limits.} These results demonstrate that coordinated compute-storage flexibility can materially expand the operational headroom and grid value of data centers, especially under increasingly scarce grid capacity.
CVDec 3, 2025
GAOT: Generating Articulated Objects Through Text-Guided Diffusion ModelsHao Sun, Lei Fan, Donglin Di et al.
Articulated object generation has seen increasing advancements, yet existing models often lack the ability to be conditioned on text prompts. To address the significant gap between textual descriptions and 3D articulated object representations, we propose GAOT, a three-phase framework that generates articulated objects from text prompts, leveraging diffusion models and hypergraph learning in a three-step process. First, we fine-tune a point cloud generation model to produce a coarse representation of objects from text prompts. Given the inherent connection between articulated objects and graph structures, we design a hypergraph-based learning method to refine these coarse representations, representing object parts as graph vertices. Finally, leveraging a diffusion model, the joints of articulated objects-represented as graph edges-are generated based on the object parts. Extensive qualitative and quantitative experiments on the PartNet-Mobility dataset demonstrate the effectiveness of our approach, achieving superior performance over previous methods.
IVJun 22, 2025Code
LVPNet: A Latent-variable-based Prediction-driven End-to-end Framework for Lossless Compression of Medical ImagesChenyue Song, Chen Hui, Qing Lin et al.
Autoregressive Initial Bits is a framework that integrates sub-image autoregression and latent variable modeling, demonstrating its advantages in lossless medical image compression. However, in existing methods, the image segmentation process leads to an even distribution of latent variable information across each sub-image, which in turn causes posterior collapse and inefficient utilization of latent variables. To deal with these issues, we propose a prediction-based end-to-end lossless medical image compression method named LVPNet, leveraging global latent variables to predict pixel values and encoding predicted probabilities for lossless compression. Specifically, we introduce the Global Multi-scale Sensing Module (GMSM), which extracts compact and informative latent representations from the entire image, effectively capturing spatial dependencies within the latent space. Furthermore, to mitigate the information loss introduced during quantization, we propose the Quantization Compensation Module (QCM), which learns the distribution of quantization errors and refines the quantized features to compensate for quantization loss. Extensive experiments on challenging benchmarks demonstrate that our method achieves superior compression efficiency compared to state-of-the-art lossless image compression approaches, while maintaining competitive inference speed. The code is at https://github.com/scy-Jackel/LVPNet.
CVMay 10
Dual-Path Hyperprior Informed Deep Unfolding Network for Image Compressive SensingTianyi Lu, Wenxue Cui, Shaohui Liu
Recent Deep Unfolding Networks (DUNs) have significantly advanced Compressive Sensing (CS) by integrating iterative optimization with deep networks. However, existing DUNs still suffer from two challenges: 1) Reliance on a single measurement stream, which limits effective information interaction across distinct measurement subsets. 2) Uniform processing of all image regions, which overlooks varying reconstruction difficulties induced by diverse textures. To address these limitations, a novel Dual-Path Hyperprior Informed Deep Unfolding Network (DPH-DUN) is proposed, which partitions measurements into double subsets to enable hyperprior-guided reconstruction via a dual-path architecture. In the Deep Hyperprior Learning branch, a series of lightweight neural modules are designed to efficiently generate hyperprior knowledge of different domains, enabling collaborative guidance for the CS reconstruction. In the Hyperprior Informed Reconstruction branch, a deep unfolding framework with hyperprior guidance is constructed to iteratively refine reconstruction. Specifically, i) in the gradient descent step, a Hyperprior Informed Step Size Generation network is designed to dynamically generate spatially varying step maps, enabling adaptive fine-grained gradient updates. ii) In the proximal mapping step, two well-designed hyperprior informed attention mechanisms are introduced to dynamically focus on challenging regions via gradient-based hard and soft attentions, facilitating CS reconstruction accuracy. Extensive experiments demonstrate that the proposed DPH-DUN outperforms existing CS methods.
CVOct 18, 2025Code
LightGlueStick: a Fast and Robust Glue for Joint Point-Line MatchingAidyn Ubingazhibov, Rémi Pautrat, Iago Suárez et al.
Lines and points are complementary local features, whose combination has proven effective for applications such as SLAM and Structure-from-Motion. The backbone of these pipelines are the local feature matchers, establishing correspondences across images. Traditionally, point and line matching have been treated as independent tasks. Recently, GlueStick proposed a GNN-based network that simultaneously operates on points and lines to establish matches. While running a single joint matching reduced the overall computational complexity, the heavy architecture prevented real-time applications or deployment to edge devices. Inspired by recent progress in point matching, we propose LightGlueStick, a lightweight matcher for points and line segments. The key novel component in our architecture is the Attentional Line Message Passing (ALMP), which explicitly exposes the connectivity of the lines to the network, allowing for efficient communication between nodes. In thorough experiments we show that LightGlueStick establishes a new state-of-the-art across different benchmarks. The code is available at https://github.com/aubingazhib/LightGlueStick.
CVJan 19, 2022Code
A Confidence-based Iterative Solver of Depths and Surface Normals for Deep Multi-view StereoWang Zhao, Shaohui Liu, Yi Wei et al.
In this paper, we introduce a deep multi-view stereo (MVS) system that jointly predicts depths, surface normals and per-view confidence maps. The key to our approach is a novel solver that iteratively solves for per-view depth map and normal map by optimizing an energy potential based on the locally planar assumption. Specifically, the algorithm updates depth map by propagating from neighboring pixels with slanted planes, and updates normal map with local probabilistic plane fitting. Both two steps are monitored by a customized confidence map. This solver is not only effective as a post-processing tool for plane-based depth refinement and completion, but also differentiable such that it can be efficiently integrated into deep learning pipelines. Our multi-view stereo system employs multiple optimization steps of the solver over the initial prediction of depths and surface normals. The whole system can be trained end-to-end, decoupling the challenging problem of matching pixels within poorly textured regions from the cost-volume based neural network. Experimental results on ScanNet and RGB-D Scenes V2 demonstrate state-of-the-art performance of the proposed deep MVS system on multi-view depth estimation, with our proposed solver consistently improving the depth quality over both conventional and deep learning based MVS pipelines. Code is available at https://github.com/thuzhaowang/idn-solver.
CVSep 2, 2021Code
NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view StereoYi Wei, Shaohui Liu, Yongming Rao et al.
In this work, we present a new multi-view depth estimation method that utilizes both conventional reconstruction and learning-based priors over the recently proposed neural radiance fields (NeRF). Unlike existing neural network based optimization method that relies on estimated correspondences, our method directly optimizes over implicit volumes, eliminating the challenging step of matching pixels in indoor scenes. The key to our approach is to utilize the learning-based priors to guide the optimization process of NeRF. Our system firstly adapts a monocular depth network over the target scene by finetuning on its sparse SfM+MVS reconstruction from COLMAP. Then, we show that the shape-radiance ambiguity of NeRF still exists in indoor environments and propose to address the issue by employing the adapted depth priors to monitor the sampling process of volume rendering. Finally, a per-pixel confidence map acquired by error computation on the rendered image can be used to further improve the depth quality. Experiments show that our proposed framework significantly outperforms state-of-the-art methods on indoor scenes, with surprising findings presented on the effectiveness of correspondence-based optimization and NeRF-based optimization over the adapted depth priors. In addition, we show that the guided optimization scheme does not sacrifice the original synthesis capability of neural radiance fields, improving the rendering quality on both seen and novel views. Code is available at https://github.com/weiyithu/NerfingMVS.
CVApr 3, 2020Code
Towards Better Generalization: Joint Depth-Pose Learning without PoseNetWang Zhao, Shaohui Liu, Yezhi Shu et al.
In this work, we tackle the essential problem of scale inconsistency for self-supervised joint depth-pose learning. Most existing methods assume that a consistent scale of depth and pose can be learned across all input samples, which makes the learning problem harder, resulting in degraded performance and limited generalization in indoor environments and long-sequence visual odometry application. To address this issue, we propose a novel system that explicitly disentangles scale from the network estimation. Instead of relying on PoseNet architecture, our method recovers relative pose by directly solving fundamental matrix from dense optical flow correspondence and makes use of a two-view triangulation module to recover an up-to-scale 3D structure. Then, we align the scale of the depth prediction with the triangulated point cloud and use the transformed depth map for depth error computation and dense reprojection check. Our whole system can be jointly trained end-to-end. Extensive experiments show that our system not only reaches state-of-the-art performance on KITTI depth and flow estimation, but also significantly improves the generalization ability of existing self-supervised depth-pose learning methods under a variety of challenging scenarios, and achieves state-of-the-art results among self-supervised learning-based methods on KITTI Odometry and NYUv2 dataset. Furthermore, we present some interesting findings on the limitation of PoseNet-based relative pose estimation methods in terms of generalization ability. Code is available at https://github.com/B1ueber2y/TrianFlow.
CVApr 25, 2019Code
RepPoints: Point Set Representation for Object DetectionZe Yang, Shaohui Liu, Han Hu et al.
Modern object detectors rely heavily on rectangular bounding boxes, such as anchors, proposals and the final predictions, to represent objects at various recognition stages. The bounding box is convenient to use but provides only a coarse localization of objects and leads to a correspondingly coarse extraction of object features. In this paper, we present \textbf{RepPoints} (representative points), a new finer representation of objects as a set of sample points useful for both localization and recognition. Given ground truth localization and recognition targets for training, RepPoints learn to automatically arrange themselves in a manner that bounds the spatial extent of an object and indicates semantically significant local areas. They furthermore do not require the use of anchors to sample a space of bounding boxes. We show that an anchor-free object detector based on RepPoints can be as effective as the state-of-the-art anchor-based detection methods, with 46.5 AP and 67.4 $AP_{50}$ on the COCO test-dev detection benchmark, using ResNet-101 model. Code is available at https://github.com/microsoft/RepPoints.
NIMay 9
Locational Pricing for Generative-AI Services via Token-Flow Market ClearingShaohui Liu
GenAI services are in an early yet fast expanding phase. Providers compete on model capability and service quality, while the underlying infrastructure remains expensive and heterogeneous across regions, workloads, and compute assets. If these services diffuse into routine daily use, the relevant engineering problem becomes not only better models but also efficient dispatch on a geographically distributed AI service infrastructure. To address this, we formulate a network-constrained token-flow market that clears AI workloads across compute nodes and communication links. The baseline model is a linear program that co-optimizes routing and processing subject to compute-capacity and bandwidth constraints; its dual variables define location- and workload-specific marginal service prices. We further introduce a transfer-aware extension that prices data movement in physical units and isolates bandwidth congestion rents. In a 5-node U.S. case study, the transfer-aware model uncovers four saturated backbone links and raises total operating cost by 2.7\% relative to the token-equivalent baseline, while tightening the chatbot latency limit from 100~ms to 15~ms increases one locational price by 117\%. A 20-node scale-up exhibits the same merit-order dispatch logic and becomes infeasible once demand exceeds aggregate capacity. These results suggest that locational pricing is a useful organizing principle for operating an emerging AI service infrastructure and, over time, for designing competitive markets around it.
ROMay 2, 2024
NeRFs in Robotics: A SurveyGuangming Wang, Lei Pan, Songyou Peng et al.
Detailed and realistic 3D environment representations have been a long-standing goal in the fields of computer vision and robotics. The recent emergence of neural implicit representations has introduced significant advances to these domains, enabling numerous novel capabilities. Among these, Neural Radiance Fields (NeRFs) have gained considerable attention because of their considerable representational advantages, such as simplified mathematical models, low memory footprint, and continuous scene representations. In addition to computer vision, NeRFs have demonstrated significant potential in robotics. Thus, we present this survey to provide a comprehensive understanding of NeRFs in the field of robotics. By exploring the advantages and limitations of NeRF as well as its current applications and future potential, we aim to provide an overview of this promising area of research. Our survey is divided into two main sections: \textit{Applications of NeRFs in Robotics} and \textit{Advances for NeRFs in Robotics}, from the perspective of how NeRF enters the field of robotics. In the first section, we introduce and analyze some works that have been or could be used in robotics for perception and interaction tasks. In the second section, we show some works related to improving NeRF's own properties, which are essential for deploying NeRFs in robotics. In the discussion section of the review, we summarize the existing challenges and provide valuable future research directions.
CVMay 28, 2025
VidText: Towards Comprehensive Evaluation for Video Text UnderstandingZhoufaran Yang, Yan Shu, Jing Wang et al. · pku
Visual texts embedded in videos carry rich semantic information, which is crucial for both holistic video understanding and fine-grained reasoning about local human actions. However, existing video understanding benchmarks largely overlook textual information, while OCR-specific benchmarks are constrained to static images, limiting their ability to capture the interaction between text and dynamic visual contexts. To address this gap, we propose VidText, a new benchmark designed for comprehensive and in-depth evaluation of video text understanding. VidText offers the following key features: 1) It covers a wide range of real-world scenarios and supports multilingual content, encompassing diverse settings where video text naturally appears. 2) It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks, enabling assessment of both global summarization and local retrieval capabilities. 3) The benchmark also introduces a set of paired perception reasoning tasks, ranging from visual text perception to cross-modal reasoning between textual and visual information. Extensive experiments on 18 state-of-the-art Large Multimodal Models (LMMs) reveal that current models struggle across most tasks, with significant room for improvement. Further analysis highlights the impact of both model-intrinsic factors, such as input resolution and OCR capability, and external factors, including the use of auxiliary information and Chain-of-Thought reasoning strategies. We hope VidText will fill the current gap in video understanding benchmarks and serve as a foundation for future research on multimodal reasoning with video text in dynamic environments.
CVNov 29, 2024
AlphaTablets: A Generic Plane Representation for 3D Planar Reconstruction from Monocular VideosYuze He, Wang Zhao, Shaohui Liu et al.
We introduce AlphaTablets, a novel and generic representation of 3D planes that features continuous 3D surface and precise boundary delineation. By representing 3D planes as rectangles with alpha channels, AlphaTablets combine the advantages of current 2D and 3D plane representations, enabling accurate, consistent and flexible modeling of 3D planes. We derive differentiable rasterization on top of AlphaTablets to efficiently render 3D planes into images, and propose a novel bottom-up pipeline for 3D planar reconstruction from monocular videos. Starting with 2D superpixels and geometric cues from pre-trained models, we initialize 3D planes as AlphaTablets and optimize them via differentiable rendering. An effective merging scheme is introduced to facilitate the growth and refinement of AlphaTablets. Through iterative optimization and merging, we reconstruct complete and accurate 3D planes with solid surfaces and clear boundaries. Extensive experiments on the ScanNet dataset demonstrate state-of-the-art performance in 3D planar reconstruction, underscoring the great potential of AlphaTablets as a generic 3D plane representation for various applications. Project page is available at: https://hyzcluster.github.io/alphatablets
IVFeb 29, 2024
Deep Network for Image Compressed Sensing Coding Using Local Structural SamplingWenxue Cui, Xingtao Wang, Xiaopeng Fan et al.
Existing image compressed sensing (CS) coding frameworks usually solve an inverse problem based on measurement coding and optimization-based image reconstruction, which still exist the following two challenges: 1) The widely used random sampling matrix, such as the Gaussian Random Matrix (GRM), usually leads to low measurement coding efficiency. 2) The optimization-based reconstruction methods generally maintain a much higher computational complexity. In this paper, we propose a new CNN based image CS coding framework using local structural sampling (dubbed CSCNet) that includes three functional modules: local structural sampling, measurement coding and Laplacian pyramid reconstruction. In the proposed framework, instead of GRM, a new local structural sampling matrix is first developed, which is able to enhance the correlation between the measurements through a local perceptual sampling strategy. Besides, the designed local structural sampling matrix can be jointly optimized with the other functional modules during training process. After sampling, the measurements with high correlations are produced, which are then coded into final bitstreams by the third-party image codec. At last, a Laplacian pyramid reconstruction network is proposed to efficiently recover the target image from the measurement domain to the image domain. Extensive experimental results demonstrate that the proposed scheme outperforms the existing state-of-the-art CS coding methods, while maintaining fast computational speed.
CVJun 22, 2025
BPCLIP: A Bottom-up Image Quality Assessment from Distortion to Semantics Based on CLIPChenyue Song, Chen Hui, Wei Zhang et al.
Image Quality Assessment (IQA) aims to evaluate the perceptual quality of images based on human subjective perception. Existing methods generally combine multiscale features to achieve high performance, but most rely on straightforward linear fusion of these features, which may not adequately capture the impact of distortions on semantic content. To address this, we propose a bottom-up image quality assessment approach based on the Contrastive Language-Image Pre-training (CLIP, a recently proposed model that aligns images and text in a shared feature space), named BPCLIP, which progressively extracts the impact of low-level distortions on high-level semantics. Specifically, we utilize an encoder to extract multiscale features from the input image and introduce a bottom-up multiscale cross attention module designed to capture the relationships between shallow and deep features. In addition, by incorporating 40 image quality adjectives across six distinct dimensions, we enable the pre-trained CLIP text encoder to generate representations of the intrinsic quality of the image, thereby strengthening the connection between image quality perception and human language. Our method achieves superior results on most public Full-Reference (FR) and No-Reference (NR) IQA benchmarks, while demonstrating greater robustness.
CVAug 11, 2025
Segmenting and Understanding: Region-aware Semantic Attention for Fine-grained Image Quality Assessment with Large Language ModelsChenyue Song, Chen Hui, Haiqi Zhu et al.
No-reference image quality assessment (NR-IQA) aims to simulate the process of perceiving image quality aligned with subjective human perception. However, existing NR-IQA methods either focus on global representations that leads to limited insights into the semantically salient regions or employ a uniform weighting for region features that weakens the sensitivity to local quality variations. In this paper, we propose a fine-grained image quality assessment model, named RSFIQA, which integrates region-level distortion information to perceive multi-dimensional quality discrepancies. To enhance regional quality awareness, we first utilize the Segment Anything Model (SAM) to dynamically partition the input image into non-overlapping semantic regions. For each region, we teach a powerful Multi-modal Large Language Model (MLLM) to extract descriptive content and perceive multi-dimensional distortions, enabling a comprehensive understanding of both local semantics and quality degradations. To effectively leverage this information, we introduce Region-Aware Semantic Attention (RSA) mechanism, which generates a global attention map by aggregating fine-grained representations from local regions. In addition, RSFIQA is backbone-agnostic and can be seamlessly integrated into various deep neural network architectures. Extensive experiments demonstrate the robustness and effectiveness of the proposed method, which achieves competitive quality prediction performance across multiple benchmark datasets.
CVApr 22, 2025
MVQA: Mamba with Unified Sampling for Efficient Video Quality AssessmentYachun Mi, Yu Li, Weicheng Meng et al.
The rapid growth of long-duration, high-definition videos has made efficient video quality assessment (VQA) a critical challenge. Existing research typically tackles this problem through two main strategies: reducing model parameters and resampling inputs. However, light-weight Convolution Neural Networks (CNN) and Transformers often struggle to balance efficiency with high performance due to the requirement of long-range modeling capabilities. Recently, the state-space model, particularly Mamba, has emerged as a promising alternative, offering linear complexity with respect to sequence length. Meanwhile, efficient VQA heavily depends on resampling long sequences to minimize computational costs, yet current resampling methods are often weak in preserving essential semantic information. In this work, we present MVQA, a Mamba-based model designed for efficient VQA along with a novel Unified Semantic and Distortion Sampling (USDS) approach. USDS combines semantic patch sampling from low-resolution videos and distortion patch sampling from original-resolution videos. The former captures semantically dense regions, while the latter retains critical distortion details. To prevent computation increase from dual inputs, we propose a fusion mechanism using pre-defined masks, enabling a unified sampling strategy that captures both semantic and quality information without additional computational burden. Experiments show that the proposed MVQA, equipped with USDS, achieve comparable performance to state-of-the-art methods while being $2\times$ as fast and requiring only $1/5$ GPU memory.
CVApr 23, 2024
SC-HVPPNet: Spatial and Channel Hybrid-Attention Video Post-Processing Network with CNN and TransformerTong Zhang, Wenxue Cui, Shaohui Liu et al.
Convolutional Neural Network (CNN) and Transformer have attracted much attention recently for video post-processing (VPP). However, the interaction between CNN and Transformer in existing VPP methods is not fully explored, leading to inefficient communication between the local and global extracted features. In this paper, we explore the interaction between CNN and Transformer in the task of VPP, and propose a novel Spatial and Channel Hybrid-Attention Video Post-Processing Network (SC-HVPPNet), which can cooperatively exploit the image priors in both spatial and channel domains. Specifically, in the spatial domain, a novel spatial attention fusion module is designed, in which two attention weights are generated to fuse the local and global representations collaboratively. In the channel domain, a novel channel attention fusion module is developed, which can blend the deep representations at the channel dimension dynamically. Extensive experiments show that SC-HVPPNet notably boosts video restoration quality, with average bitrate savings of 5.29%, 12.42%, and 13.09% for Y, U, and V components in the VTM-11.0-NNVC RA configuration.
CVJan 14, 2024
Depth-agnostic Single Image DehazingHonglei Xu, Yan Shu, Shaohui Liu
Single image dehazing is a challenging ill-posed problem. Existing datasets for training deep learning-based methods can be generated by hand-crafted or synthetic schemes. However, the former often suffers from small scales, while the latter forces models to learn scene depth instead of haze distribution, decreasing their dehazing ability. To overcome the problem, we propose a simple yet novel synthetic method to decouple the relationship between haze density and scene depth, by which a depth-agnostic dataset (DA-HAZE) is generated. Meanwhile, a Global Shuffle Strategy (GSS) is proposed for generating differently scaled datasets, thereby enhancing the generalization ability of the model. Extensive experiments indicate that models trained on DA-HAZE achieve significant improvements on real-world benchmarks, with less discrepancy between SOTS and DA-SOTS (the test set of DA-HAZE). Additionally, Depth-agnostic dehazing is a more complicated task because of the lack of depth prior. Therefore, an efficient architecture with stronger feature modeling ability and fewer computational costs is necessary. We revisit the U-Net-based architectures for dehazing, in which dedicatedly designed blocks are incorporated. However, the performances of blocks are constrained by limited feature fusion methods. To this end, we propose a Convolutional Skip Connection (CSC) module, allowing vanilla feature fusion methods to achieve promising results with minimal costs. Extensive experimental results demonstrate that current state-of-the-art methods. equipped with CSC can achieve better performance and reasonable computational expense, whether the haze distribution is relevant to the scene depth.
CVSep 30, 2025
Benchmarking Egocentric Visual-Inertial SLAM at City ScaleAnusha Krishnan, Shaohui Liu, Paul-Edouard Sarlin et al.
Precise 6-DoF simultaneous localization and mapping (SLAM) from onboard sensors is critical for wearable devices capturing egocentric data, which exhibits specific challenges, such as a wider diversity of motions and viewpoints, prevalent dynamic visual content, or long sessions affected by time-varying sensor calibration. While recent progress on SLAM has been swift, academic research is still driven by benchmarks that do not reflect these challenges or do not offer sufficiently accurate ground truth poses. In this paper, we introduce a new dataset and benchmark for visual-inertial SLAM with egocentric, multi-modal data. We record hours and kilometers of trajectories through a city center with glasses-like devices equipped with various sensors. We leverage surveying tools to obtain control points as indirect pose annotations that are metric, centimeter-accurate, and available at city scale. This makes it possible to evaluate extreme trajectories that involve walking at night or traveling in a vehicle. We show that state-of-the-art systems developed by academia are not robust to these challenges and we identify components that are responsible for this. In addition, we design tracks with different levels of difficulty to ease in-depth analysis and evaluation of less mature approaches. The dataset and benchmark are available at https://www.lamaria.ethz.ch.
CVAug 8, 2025
UGD-IML: A Unified Generative Diffusion-based Framework for Constrained and Unconstrained Image Manipulation LocalizationYachun Mi, Xingyang He, Shixin Sun et al.
In the digital age, advanced image editing tools pose a serious threat to the integrity of visual content, making image forgery detection and localization a key research focus. Most existing Image Manipulation Localization (IML) methods rely on discriminative learning and require large, high-quality annotated datasets. However, current datasets lack sufficient scale and diversity, limiting model performance in real-world scenarios. To overcome this, recent studies have explored Constrained IML (CIML), which generates pixel-level annotations through algorithmic supervision. However, existing CIML approaches often depend on complex multi-stage pipelines, making the annotation process inefficient. In this work, we propose a novel generative framework based on diffusion models, named UGD-IML, which for the first time unifies both IML and CIML tasks within a single framework. By learning the underlying data distribution, generative diffusion models inherently reduce the reliance on large-scale labeled datasets, allowing our approach to perform effectively even under limited data conditions. In addition, by leveraging a class embedding mechanism and a parameter-sharing design, our model seamlessly switches between IML and CIML modes without extra components or training overhead. Furthermore, the end-to-end design enables our model to avoid cumbersome steps in the data annotation process. Extensive experimental results on multiple datasets demonstrate that UGD-IML outperforms the SOTA methods by an average of 9.66 and 4.36 in terms of F1 metrics for IML and CIML tasks, respectively. Moreover, the proposed method also excels in uncertainty estimation, visualization and robustness.
CVAug 8, 2025
Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal AdaptationYachun Mi, Yu Li, Yanting Li et al.
Accurate and efficient Video Quality Assessment (VQA) has long been a key research challenge. Current mainstream VQA methods typically improve performance by pretraining on large-scale classification datasets (e.g., ImageNet, Kinetics-400), followed by fine-tuning on VQA datasets. However, this strategy presents two significant challenges: (1) merely transferring semantic knowledge learned from pretraining is insufficient for VQA, as video quality depends on multiple factors (e.g., semantics, distortion, motion, aesthetics); (2) pretraining on large-scale datasets demands enormous computational resources, often dozens or even hundreds of times greater than training directly on VQA datasets. Recently, Vision-Language Models (VLMs) have shown remarkable generalization capabilities across a wide range of visual tasks, and have begun to demonstrate promising potential in quality assessment. In this work, we propose Q-CLIP, the first fully VLMs-based framework for VQA. Q-CLIP enhances both visual and textual representations through a Shared Cross-Modal Adapter (SCMA), which contains only a minimal number of trainable parameters and is the only component that requires training. This design significantly reduces computational cost. In addition, we introduce a set of five learnable quality-level prompts to guide the VLMs in perceiving subtle quality variations, thereby further enhancing the model's sensitivity to video quality. Furthermore, we investigate the impact of different frame sampling strategies on VQA performance, and find that frame-difference-based sampling leads to better generalization performance across datasets. Extensive experiments demonstrate that Q-CLIP exhibits excellent performance on several VQA datasets.
AIJul 29, 2025
GDAIP: A Graph-Based Domain Adaptive Framework for Individual Brain ParcellationJianfei Zhu, Haiqi Zhu, Shaohui Liu et al.
Recent deep learning approaches have shown promise in learning such individual brain parcellations from functional magnetic resonance imaging (fMRI). However, most existing methods assume consistent data distributions across domains and struggle with domain shifts inherent to real-world cross-dataset scenarios. To address this challenge, we proposed Graph Domain Adaptation for Individual Parcellation (GDAIP), a novel framework that integrates Graph Attention Networks (GAT) with Minimax Entropy (MME)-based domain adaptation. We construct cross-dataset brain graphs at both the group and individual levels. By leveraging semi-supervised training and adversarial optimization of the prediction entropy on unlabeled vertices from target brain graph, the reference atlas is adapted from the group-level brain graph to the individual brain graph, enabling individual parcellation under cross-dataset settings. We evaluated our method using parcellation visualization, Dice coefficient, and functional homogeneity. Experimental results demonstrate that GDAIP produces individual parcellations with topologically plausible boundaries, strong cross-session consistency, and ability of reflecting functional organization.
AIFeb 13, 2025
MIH-TCCT: Mitigating Inconsistent Hallucinations in LLMs via Event-Driven Text-Code Cyclic TrainingXinxin You, Xien Liu, Qixin Sun et al.
Recent methodologies utilizing synthetic datasets have aimed to address inconsistent hallucinations in large language models (LLMs); however,these approaches are primarily tailored to specific tasks, limiting their generalizability. Inspired by the strong performance of code-trained models in logic-intensive domains, we propose a novel framework that leverages event-based text to generate corresponding code and employs cyclic training to transfer the logical consistency of code to natural language effectively. Our method significantly reduces inconsistent hallucinations across three leading LLMs and two categories of natural language tasks while maintaining overall performance. This framework effectively alleviates hallucinations without necessitating adaptation to downstream tasks, demonstrating generality and providing new perspectives to tackle the challenge of inconsistent hallucinations.
CLJan 2, 2025
Data Augmentation Techniques for Chinese Disease Name NormalizationWenqian Cui, Xiangling Fu, Shaohui Liu et al.
Disease name normalization is an important task in the medical domain. It classifies disease names written in various formats into standardized names, serving as a fundamental component in smart healthcare systems for various disease-related functions. Nevertheless, the most significant obstacle to existing disease name normalization systems is the severe shortage of training data. Consequently, we present a novel data augmentation approach that includes a series of data augmentation techniques and some supporting modules to help mitigate the problem. Through extensive experimentation, we illustrate that our proposed approach exhibits significant performance improvements across various baseline models and training objectives, particularly in scenarios with limited training data
IVDec 7, 2021
Image Compressed Sensing Using Non-local Neural NetworkWenxue Cui, Shaohui Liu, Feng Jiang et al.
Deep network-based image Compressed Sensing (CS) has attracted much attention in recent years. However, the existing deep network-based CS schemes either reconstruct the target image in a block-by-block manner that leads to serious block artifacts or train the deep network as a black box that brings about limited insights of image prior knowledge. In this paper, a novel image CS framework using non-local neural network (NL-CSNet) is proposed, which utilizes the non-local self-similarity priors with deep network to improve the reconstruction quality. In the proposed NL-CSNet, two non-local subnetworks are constructed for utilizing the non-local self-similarity priors in the measurement domain and the multi-scale feature domain respectively. Specifically, in the subnetwork of measurement domain, the long-distance dependencies between the measurements of different image blocks are established for better initial reconstruction. Analogically, in the subnetwork of multi-scale feature domain, the affinities between the dense feature representations are explored in the multi-scale space for deep reconstruction. Furthermore, a novel loss function is developed to enhance the coupling between the non-local representations, which also enables an end-to-end training of NL-CSNet. Extensive experiments manifest that NL-CSNet outperforms existing state-of-the-art CS methods, while maintaining fast computational speed.
LGOct 4, 2021
Risk-Aware Learning for Scalable Voltage Optimization in Distribution GridsShanny Lin, Shaohui Liu, Hao Zhu
Real-time coordination of distributed energy resources (DERs) is crucial for regulating the voltage profile in distribution grids. By capitalizing on a scalable neural network (NN) architecture, one can attain decentralized DER decisions to address the lack of real-time communications. This paper develops an advanced learning-enabled DER coordination scheme by accounting for the potential risks associated with reactive power prediction and voltage deviation. Such risks are quantified by the conditional value-at-risk (CVaR) using the worst-case samples only, and we propose a mini-batch selection algorithm to address the training speed issue in minimizing the CVaR-regularized loss. Numerical tests using real-world data on the IEEE 123-bus test case have demonstrated the computation and safety improvements of the proposed risk-aware learning algorithm for decentralized DER decision making, especially in terms of reducing feeder voltage violations.
LGJun 19, 2021
Graph Neural Networks for Learning Real-Time Prices in Electricity MarketShaohui Liu, Chengyang Wu, Hao Zhu
Solving the optimal power flow (OPF) problem in real-time electricity market improves the efficiency and reliability in the integration of low-carbon energy resources into the power grids. To address the scalability and adaptivity issues of existing end-to-end OPF learning solutions, we propose a new graph neural network (GNN) framework for predicting the electricity market prices from solving OPFs. The proposed GNN-for-OPF framework innovatively exploits the locality property of prices and introduces physics-aware regularization, while attaining reduced model complexity and fast adaptivity to varying grid topology. Numerical tests have validated the learning efficiency and adaptivity improvements of our proposed method over existing approaches.