CVMar 1, 2022Code
Adversarial samples for deep monocular 6D object pose estimationJinlai Zhang, Weiming Li, Shuang Liang et al.
Estimating 6D object pose from an RGB image is important for many real-world applications such as autonomous driving and robotic grasping. Recent deep learning models have achieved significant progress on this task but their robustness received little research attention. In this work, for the first time, we study adversarial samples that can fool deep learning models with imperceptible perturbations to input image. In particular, we propose a Unified 6D pose estimation Attack, namely U6DA, which can successfully attack several state-of-the-art (SOTA) deep learning models for 6D pose estimation. The key idea of our U6DA is to fool the models to predict wrong results for object instance localization and shape that are essential for correct 6D pose estimation. Specifically, we explore a transfer-based black-box attack to 6D pose estimation. We design the U6DA loss to guide the generation of adversarial examples, the loss aims to shift the segmentation attention map away from its original position. We show that the generated adversarial samples are not only effective for direct 6D pose estimation models, but also are able to attack two-stage models regardless of their robust RANSAC modules. Extensive experiments were conducted to demonstrate the effectiveness, transferability, and anti-defense capability of our U6DA on large-scale public benchmarks. We also introduce a new U6DA-Linemod dataset for robustness study of the 6D pose estimation task. Our codes and dataset will be available at \url{https://github.com/cuge1995/U6DA}.
CVSep 25, 2023
DVI-SLAM: A Dual Visual Inertial SLAM NetworkXiongfeng Peng, Zhihua Liu, Weiming Li et al.
Recent deep learning based visual simultaneous localization and mapping (SLAM) methods have made significant progress. However, how to make full use of visual information as well as better integrate with inertial measurement unit (IMU) in visual SLAM has potential research value. This paper proposes a novel deep SLAM network with dual visual factors. The basic idea is to integrate both photometric factor and re-projection factor into the end-to-end differentiable structure through multi-factor data association module. We show that the proposed network dynamically learns and adjusts the confidence maps of both visual factors and it can be further extended to include the IMU factors as well. Extensive experiments validate that our proposed method significantly outperforms the state-of-the-art methods on several public datasets, including TartanAir, EuRoC and ETH3D-SLAM. Specifically, when dynamically fusing the three factors together, the absolute trajectory error for both monocular and stereo configurations on EuRoC dataset has reduced by 45.3% and 36.2% respectively.
CVMar 30, 2023
NN-Copula-CD: A Copula-Guided Interpretable Neural Network for Change Detection in Heterogeneous Remote Sensing ImagesWeiming Li, Xueqian Wang, Gang Li et al.
Change detection (CD) in heterogeneous remote sensing images has been widely used for disaster monitoring and land-use management. In the past decade, the heterogeneous CD problem has significantly benefited from the development of deep neural networks (DNNs). However, the purely data-driven DNNs perform like a black box where the lack of interpretability limits the trustworthiness and controllability of DNNs in most practical CD applications. As a powerful knowledge-driven tool, copula theory performs well in modeling relationships among random variables. To enhance the interpretability of existing neural networks for CD, we propose a knowledge-data-driven heterogeneous CD method based on a copula-guided neural network, named NN-Copula-CD. In our NN-Copula-CD, the mathematical characteristics of copula are employed as the loss functions to supervise a neural network to learn the dependence between bi-temporal heterogeneous superpixel pairs, and then the changed regions are identified via binary classification based on the degrees of dependence of all the superpixel pairs in the bi-temporal images. We conduct in-depth experiments on three datasets with heterogeneous images, where both quantitative and visual results demonstrate the effectiveness of our proposed NN-Copula-CD method.
CVOct 14, 2022
MCTNet: A Multi-Scale CNN-Transformer Network for Change Detection in Optical Remote Sensing ImagesWeiming Li, Lihui Xue, Xueqian Wang et al.
For the task of change detection (CD) in remote sensing images, deep convolution neural networks (CNNs)-based methods have recently aggregated transformer modules to improve the capability of global feature extraction. However, they suffer degraded CD performance on small changed areas due to the simple single-scale integration of deep CNNs and transformer modules. To address this issue, we propose a hybrid network based on multi-scale CNN-transformer structure, termed MCTNet, where the multi-scale global and local information are exploited to enhance the robustness of the CD performance on changed areas with different sizes. Especially, we design the ConvTrans block to adaptively aggregate global features from transformer modules and local features from CNN layers, which provides abundant global-local features with different scales. Experimental results demonstrate that our MCTNet achieves better detection performance than existing state-of-the-art CD methods.
CLJul 8, 2024
Large Language Models Understand LayoutWeiming Li, Manni Duan, Dong An et al.
Large language models (LLMs) demonstrate extraordinary abilities in a wide range of natural language processing (NLP) tasks. In this paper, we show that, beyond text understanding capability, LLMs are capable of processing text layouts that are denoted by spatial markers. They are able to answer questions that require explicit spatial perceiving and reasoning, while a drastic performance drop is observed when the spatial markers from the original data are excluded. We perform a series of experiments with the GPT-3.5, Baichuan2, Llama2 and ChatGLM3 models on various types of layout-sensitive datasets for further analysis. The experimental results reveal that the layout understanding ability of LLMs is mainly introduced by the coding data for pretraining, which is further enhanced at the instruction-tuning stage. In addition, layout understanding can be enhanced by integrating low-cost, auto-generated data approached by a novel text game. Finally, we show that layout understanding ability is beneficial for building efficient visual question-answering (VQA) systems.
LGOct 13, 2025Code
Query-Specific GNN: A Comprehensive Graph Representation Learning Method for Retrieval Augmented GenerationYuchen Yan, Zhihua Liu, Hao Wang et al.
Retrieval-augmented generation (RAG) has demonstrated its ability to enhance Large Language Models (LLMs) by integrating external knowledge sources. However, multi-hop questions, which require the identification of multiple knowledge targets to form a synthesized answer, raise new challenges for RAG systems. Under the multi-hop settings, existing methods often struggle to fully understand the questions with complex semantic structures and are susceptible to irrelevant noise during the retrieval of multiple information targets. To address these limitations, we propose a novel graph representation learning framework for multi-hop question retrieval. We first introduce a Multi-information Level Knowledge Graph (Multi-L KG) to model various information levels for a more comprehensive understanding of multi-hop questions. Based on this, we design a Query-Specific Graph Neural Network (QSGNN) for representation learning on the Multi-L KG. QSGNN employs intra/inter-level message passing mechanisms, and in each message passing the information aggregation is guided by the query, which not only facilitates multi-granular information aggregation but also significantly reduces the impact of noise. To enhance its ability to learn robust representations, we further propose two synthesized data generation strategies for pre-training the QSGNN. Extensive experimental results demonstrate the effectiveness of our framework in multi-hop scenarios, especially in high-hop questions the improvement can reach 33.8\%. The code is available at: https://github.com/Jerry2398/QSGNN.
CVSep 15, 2025Code
How Auxiliary Reasoning Unleashes GUI Grounding in VLMsWeiming Li, Yan Shao, Jing Yang et al.
Graphical user interface (GUI) grounding is a fundamental task for building GUI agents. However, general vision-language models (VLMs) struggle with this task due to a lack of specific optimization. We identify a key gap in this paper: while VLMs exhibit significant latent grounding potential, as demonstrated by their performance measured by Pointing Game, they underperform when tasked with outputting explicit coordinates. To address this discrepancy, and bypass the high data and annotation costs of current fine-tuning approaches, we propose three zero-shot auxiliary reasoning methods. By providing explicit spatial cues such as axes, grids and labeled intersections as part of the input image, these methods enable VLMs to articulate their implicit spatial understanding capabilities. We evaluate these methods on four GUI grounding benchmarks across seven open-source and proprietary VLMs. The evaluation results demonstrate that the proposed methods substantially improve the performance of GUI grounding.
CVMar 25, 2024Code
DOCTR: Disentangled Object-Centric Transformer for Point Scene UnderstandingXiaoxuan Yu, Hao Wang, Weiming Li et al.
Point scene understanding is a challenging task to process real-world scene point cloud, which aims at segmenting each object, estimating its pose, and reconstructing its mesh simultaneously. Recent state-of-the-art method first segments each object and then processes them independently with multiple stages for the different sub-tasks. This leads to a complex pipeline to optimize and makes it hard to leverage the relationship constraints between multiple objects. In this work, we propose a novel Disentangled Object-Centric TRansformer (DOCTR) that explores object-centric representation to facilitate learning with multiple objects for the multiple sub-tasks in a unified manner. Each object is represented as a query, and a Transformer decoder is adapted to iteratively optimize all the queries involving their relationship. In particular, we introduce a semantic-geometry disentangled query (SGDQ) design that enables the query features to attend separately to semantic information and geometric information relevant to the corresponding sub-tasks. A hybrid bipartite matching module is employed to well use the supervisions from all the sub-tasks during training. Qualitative and quantitative experimental results demonstrate that our method achieves state-of-the-art performance on the challenging ScanNet dataset. Code is available at https://github.com/SAITPublic/DOCTR.
CVApr 25, 2021Code
3D Adversarial Attacks Beyond Point CloudJinlai Zhang, Lyujie Chen, Binbin Liu et al.
Recently, 3D deep learning models have been shown to be susceptible to adversarial attacks like their 2D counterparts. Most of the state-of-the-art (SOTA) 3D adversarial attacks perform perturbation to 3D point clouds. To reproduce these attacks in the physical scenario, a generated adversarial 3D point cloud need to be reconstructed to mesh, which leads to a significant drop in its adversarial effect. In this paper, we propose a strong 3D adversarial attack named Mesh Attack to address this problem by directly performing perturbation on mesh of a 3D object. In order to take advantage of the most effective gradient-based attack, a differentiable sample module that back-propagate the gradient of point cloud to mesh is introduced. To further ensure the adversarial mesh examples without outlier and 3D printable, three mesh losses are adopted. Extensive experiments demonstrate that the proposed scheme outperforms SOTA 3D attacks by a significant margin. We also achieved SOTA performance under various defenses. Our code is available at: https://github.com/cuge1995/Mesh-Attack.
CVFeb 5, 2025
MapFusion: A Novel BEV Feature Fusion Network for Multi-modal Map ConstructionXiaoshuai Hao, Yunfeng Diao, Mengchuan Wei et al.
Map construction task plays a vital role in providing precise and comprehensive static environmental information essential for autonomous driving systems. Primary sensors include cameras and LiDAR, with configurations varying between camera-only, LiDAR-only, or camera-LiDAR fusion, based on cost-performance considerations. While fusion-based methods typically perform best, existing approaches often neglect modality interaction and rely on simple fusion strategies, which suffer from the problems of misalignment and information loss. To address these issues, we propose MapFusion, a novel multi-modal Bird's-Eye View (BEV) feature fusion method for map construction. Specifically, to solve the semantic misalignment problem between camera and LiDAR BEV features, we introduce the Cross-modal Interaction Transform (CIT) module, enabling interaction between two BEV feature spaces and enhancing feature representation through a self-attention mechanism. Additionally, we propose an effective Dual Dynamic Fusion (DDF) module to adaptively select valuable information from different modalities, which can take full advantage of the inherent information between different modalities. Moreover, MapFusion is designed to be simple and plug-and-play, easily integrated into existing pipelines. We evaluate MapFusion on two map construction tasks, including High-definition (HD) map and BEV map segmentation, to show its versatility and effectiveness. Compared with the state-of-the-art methods, MapFusion achieves 3.6% and 6.2% absolute improvements on the HD map construction and BEV map segmentation tasks on the nuScenes dataset, respectively, demonstrating the superiority of our approach.
CVMay 3, 2024
Lightweight Change Detection in Heterogeneous Remote Sensing Images with Online All-Integer Pruning TrainingChengyang Zhang, Weiming Li, Gang Li et al.
Detection of changes in heterogeneous remote sensing images is vital, especially in response to emergencies like earthquakes and floods. Current homogenous transformation-based change detection (CD) methods often suffer from high computation and memory costs, which are not friendly to edge-computation devices like onboard CD devices at satellites. To address this issue, this paper proposes a new lightweight CD method for heterogeneous remote sensing images that employs the online all-integer pruning (OAIP) training strategy to efficiently fine-tune the CD network using the current test data. The proposed CD network consists of two visual geometry group (VGG) subnetworks as the backbone architecture. In the OAIP-based training process, all the weights, gradients, and intermediate data are quantized to integers to speed up training and reduce memory usage, where the per-layer block exponentiation scaling scheme is utilized to reduce the computation errors of network parameters caused by quantization. Second, an adaptive filter-level pruning method based on the L1-norm criterion is employed to further lighten the fine-tuning process of the CD network. Experimental results show that the proposed OAIP-based method attains similar detection performance (but with significantly reduced computation complexity and memory usage) in comparison with state-of-the-art CD methods.
AIJul 10, 2025
MoSE: Skill-by-Skill Mixture-of-Experts Learning for Embodied Autonomous MachinesLu Xu, Jiaqian Yu, Xiongfeng Peng et al.
To meet the growing demand for smarter, faster, and more efficient embodied AI solutions, we introduce a novel Mixture-of-Expert (MoE) method that significantly boosts reasoning and learning efficiency for embodied autonomous systems. General MoE models demand extensive training data and complex optimization, which limits their applicability in embodied AI such as autonomous driving (AD) and robotic manipulation. In this work, we propose a skill-oriented MoE called MoSE, which mimics the human learning and reasoning process skill-by-skill, step-by-step. We introduce a skill-oriented routing mechanism that begins with defining and annotating specific skills, enabling experts to identify the necessary competencies for various scenarios and reasoning tasks, thereby facilitating skill-by-skill learning. To better align with multi-step planning in human reasoning and in end-to-end driving models, we build a hierarchical skill dataset and pretrain the router to encourage the model to think step-by-step. Unlike other multi-round dialogues, MoSE integrates valuable auxiliary tasks (e.g. perception-prediction-planning for AD, and high-level and low-level planning for robots) in one single forward process without introducing any extra computational cost. With less than 3B sparsely activated parameters, our model effectively grows more diverse expertise and outperforms models on both AD corner-case reasoning tasks and robot reasoning tasks with less than 40% of the parameters.
CLMar 25, 2025
DomainCQA: Crafting Knowledge-Intensive QA from Domain-Specific ChartsYujing Lu, Ling Zhong, Jing Yang et al.
Chart Question Answering (CQA) evaluates Multimodal Large Language Models (MLLMs) on visual understanding and reasoning over chart data. However, existing benchmarks mostly test surface-level parsing, such as reading labels and legends, while overlooking deeper scientific reasoning. We propose DomainCQA, a framework for constructing domain-specific CQA benchmarks that emphasize both visual comprehension and knowledge-intensive reasoning. It integrates complexity-aware chart selection, multitier QA generation, and expert validation. Applied to astronomy, DomainCQA yields AstroChart, a benchmark of 1,690 QA pairs over 482 charts, exposing persistent weaknesses in fine-grained perception, numerical reasoning, and domain knowledge integration across 21 MLLMs. Fine-tuning on AstroChart improves performance across fundamental and advanced tasks. Pilot QA sets in biochemistry, economics, medicine, and social science further demonstrate DomainCQA's generality. Together, our results establish DomainCQA as a unified pipeline for constructing and augmenting domain-specific chart reasoning benchmarks.
ROJun 18, 2024
Is Your HD Map Constructor Reliable under Sensor Corruptions?Xiaoshuai Hao, Mengchuan Wei, Yifan Yang et al.
Driving systems often rely on high-definition (HD) maps for precise environmental information, which is crucial for planning and navigation. While current HD map constructors perform well under ideal conditions, their resilience to real-world challenges, \eg, adverse weather and sensor failures, is not well understood, raising safety concerns. This work introduces MapBench, the first comprehensive benchmark designed to evaluate the robustness of HD map construction methods against various sensor corruptions. Our benchmark encompasses a total of 29 types of corruptions that occur from cameras and LiDAR sensors. Extensive evaluations across 31 HD map constructors reveal significant performance degradation of existing methods under adverse weather conditions and sensor failures, underscoring critical safety concerns. We identify effective strategies for enhancing robustness, including innovative approaches that leverage multi-modal fusion, advanced data augmentation, and architectural techniques. These insights provide a pathway for developing more reliable HD map construction methods, which are essential for the advancement of autonomous driving technology. The benchmark toolkit and affiliated code and model checkpoints have been made publicly accessible.
LGMar 21, 2024
Rethinking Adversarial Inverse Reinforcement Learning: Policy Imitation, Transferable Reward Recovery and Algebraic Equilibrium ProofYangchun Zhang, Qiang Liu, Weiming Li et al.
Adversarial inverse reinforcement learning (AIRL) stands as a cornerstone approach in imitation learning, yet it faces criticisms from prior studies. In this paper, we rethink AIRL and respond to these criticisms. Criticism 1 lies in Inadequate Policy Imitation. We show that substituting the built-in algorithm with soft actor-critic (SAC) during policy updating (requires multi-iterations) significantly enhances the efficiency of policy imitation. Criticism 2 lies in Limited Performance in Transferable Reward Recovery Despite SAC Integration. While we find that SAC indeed exhibits a significant improvement in policy imitation, it introduces drawbacks to transferable reward recovery. We prove that the SAC algorithm itself is not feasible to disentangle the reward function comprehensively during the AIRL training process, and propose a hybrid framework, PPO-AIRL + SAC, for a satisfactory transfer effect. Criticism 3 lies in Unsatisfactory Proof from the Perspective of Potential Equilibrium. We reanalyze it from an algebraic theory perspective.
CVMar 20, 2024
DOR3D-Net: Dense Ordinal Regression Network for 3D Hand Pose EstimationYamin Mao, Zhihua Liu, Weiming Li et al.
Depth-based 3D hand pose estimation is an important but challenging research task in human-machine interaction community. Recently, dense regression methods have attracted increasing attention in 3D hand pose estimation task, which provide a low computational burden and high accuracy regression way by densely regressing hand joint offset maps. However, large-scale regression offset values are often affected by noise and outliers, leading to a significant drop in accuracy. To tackle this, we re-formulate 3D hand pose estimation as a dense ordinal regression problem and propose a novel Dense Ordinal Regression 3D Pose Network (DOR3D-Net). Specifically, we first decompose offset value regression into sub-tasks of binary classifications with ordinal constraints. Then, each binary classifier can predict the probability of a binary spatial relationship relative to joint, which is easier to train and yield much lower level of noise. The estimated hand joint positions are inferred by aggregating the ordinal regression results at local positions with a weighted sum. Furthermore, both joint regression loss and ordinal regression loss are used to train our DOR3D-Net in an end-to-end manner. Extensive experiments on public datasets (ICVL, MSRA, NYU and HANDS2017) show that our design provides significant improvements over SOTA methods.
COMP-PHOct 14, 2018
A Unified Gas-kinetic Particle Method for Multiscale Photon TransportWeiming Li, Chang Liu, Yajun Zhu et al.
In this work, we present a unified gas-kinetic particle (UGKP) method for the simulation of multiscale photon transport. The multiscale nature of the particle method mainly comes from the recovery of the time evolution flux function in the unified gas-kinetic scheme (UGKS) through a coupled dynamic process of particle transport and collision. This practice improves the original operator splitting approach in the Monte Carlo method, such as the separated treatment of particle transport and collision. As a result, with the variation of the ratio between numerical time step and local photon's collision time, different transport physics can be fully captured in a single computation. In the diffusive limit, the UGKP method could recover the solution of the diffusion equation with the cell size and time step being much larger than the photon's mean free path and the mean collision time. In the free transport limit, it presents an exact particle tracking process as the original Monte Carlo method. In the transition regime, the weights of particle free transport and collision are determined by the ratio of local numerical time step to the photon's collision time. Several one-dimensional numerical examples covering all transport regimes from the optically thin to optically thick are computed to validate the accuracy and efficiency of the current scheme. In comparison with the $S_N$ discrete ordinate method, the UGKP method is based on particles and avoids the discretization of particle velocity space, which does not suffer from the ray effect.
NAMay 30, 2017
3D B_2 Model for Radiative Transfer Equation Part I: ModellingRuo Li, Weiming Li
We extend to three-dimensional space the approximate M_2 model for the slab geometry studied in our previous paper. The B_2 model therein, as a special case of the second order extended quadrature method of moments (EQMOM), is proved to be globally hyperbolic. The model we proposed here extends EQMOM to multiple dimensions following the idea to approximate the maximum entropy closure for the slab geometry case. Like the M_2 closure, the ansatz of the new model has the capacity to capture both isotropic and beam-like solutions, while the new model has fluxes in closed-form, thus is applicable to practical numerical simulations. The rotational invariance, realizability, and hyperbolicity of the model are studied.