CVAug 12, 2022Code
Semantic decomposition Network with Contrastive and Structural Constraints for Dental Plaque SegmentationJian Shi, Baoli Sun, Xinchen Ye et al.
Segmenting dental plaque from images of medical reagent staining provides valuable information for diagnosis and the determination of follow-up treatment plan. However, accurate dental plaque segmentation is a challenging task that requires identifying teeth and dental plaque subjected to semantic-blur regions (i.e., confused boundaries in border regions between teeth and dental plaque) and complex variations of instance shapes, which are not fully addressed by existing methods. Therefore, we propose a semantic decomposition network (SDNet) that introduces two single-task branches to separately address the segmentation of teeth and dental plaque and designs additional constraints to learn category-specific features for each branch, thus facilitating the semantic decomposition and improving the performance of dental plaque segmentation. Specifically, SDNet learns two separate segmentation branches for teeth and dental plaque in a divide-and-conquer manner to decouple the entangled relation between them. Each branch that specifies a category tends to yield accurate segmentation. To help these two branches better focus on category-specific features, two constraint modules are further proposed: 1) contrastive constraint module (CCM) to learn discriminative feature representations by maximizing the distance between different category representations, so as to reduce the negative impact of semantic-blur regions on feature extraction; 2) structural constraint module (SCM) to provide complete structural information for dental plaque of various shapes by the supervision of an boundary-aware geometric constraint. Besides, we construct a large-scale open-source Stained Dental Plaque Segmentation dataset (SDPSeg), which provides high-quality annotations for teeth and dental plaque. Experimental results on SDPSeg datasets show SDNet achieves state-of-the-art performance.
CVJul 29, 2022
Fine-grained Retrieval Prompt TuningShijie Wang, Jianlong Chang, Zhihui Wang et al.
Fine-grained object retrieval aims to learn discriminative representation to retrieve visually similar objects. However, existing top-performing works usually impose pairwise similarities on the semantic embedding spaces or design a localization sub-network to continually fine-tune the entire model in limited data scenarios, thus resulting in convergence to suboptimal solutions. In this paper, we develop Fine-grained Retrieval Prompt Tuning (FRPT), which steers a frozen pre-trained model to perform the fine-grained retrieval task from the perspectives of sample prompting and feature adaptation. Specifically, FRPT only needs to learn fewer parameters in the prompt and adaptation instead of fine-tuning the entire model, thus solving the issue of convergence to suboptimal solutions caused by fine-tuning the entire model. Technically, a discriminative perturbation prompt (DPP) is introduced and deemed as a sample prompting process, which amplifies and even exaggerates some discriminative elements contributing to category prediction via a content-aware inhomogeneous sampling operation. In this way, DPP can make the fine-grained retrieval task aided by the perturbation prompts close to the solved task during the original pre-training. Thereby, it preserves the generalization and discrimination of representation extracted from input samples. Besides, a category-specific awareness head is proposed and regarded as feature adaptation, which removes the species discrepancies in features extracted by the pre-trained model using category-guided instance normalization. And thus, it makes the optimized features only include the discrepancies among subcategories. Extensive experiments demonstrate that our FRPT with fewer learnable parameters achieves the state-of-the-art performance on three widely-used fine-grained datasets.
CVOct 9, 2023Code
Towards Fair and Comprehensive Comparisons for Image-Based 3D Object DetectionXinzhu Ma, Yongtao Wang, Yinmin Zhang et al.
In this work, we build a modular-designed codebase, formulate strong training recipes, design an error diagnosis toolbox, and discuss current methods for image-based 3D object detection. In particular, different from other highly mature tasks, e.g., 2D object detection, the community of image-based 3D object detection is still evolving, where methods often adopt different training recipes and tricks resulting in unfair evaluations and comparisons. What is worse, these tricks may overwhelm their proposed designs in performance, even leading to wrong conclusions. To address this issue, we build a module-designed codebase and formulate unified training standards for the community. Furthermore, we also design an error diagnosis toolbox to measure the detailed characterization of detection models. Using these tools, we analyze current methods in-depth under varying settings and provide discussions for some open questions, e.g., discrepancies in conclusions on KITTI-3D and nuScenes datasets, which have led to different dominant methods for these datasets. We hope that this work will facilitate future research in image-based 3D object detection. Our codes will be released at \url{https://github.com/OpenGVLab/3dodi}
CVAug 31, 2022
TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based TransformersZengyuan Guo, Yuechen Yu, Pengyuan Lv et al.
Table structure recognition is a crucial part of document image analysis domain. Its difficulty lies in the need to parse the physical coordinates and logical indices of each cell at the same time. However, the existing methods are difficult to achieve both these goals, especially when the table splitting lines are blurred or tilted. In this paper, we propose an accurate and end-to-end transformer-based table structure recognition method, referred to as TRUST. Transformers are suitable for table structure recognition because of their global computations, perfect memory, and parallel computation. By introducing novel Transformer-based Query-based Splitting Module and Vertex-based Merging Module, the table structure recognition problem is decoupled into two joint optimization sub-tasks: multi-oriented table row/column splitting and table grid merging. The Query-based Splitting Module learns strong context information from long dependencies via Transformer networks, accurately predicts the multi-oriented table row/column separators, and obtains the basic grids of the table accordingly. The Vertex-based Merging Module is capable of aggregating local contextual information between adjacent basic grids, providing the ability to merge basic girds that belong to the same spanning cell accurately. We conduct experiments on several popular benchmarks including PubTabNet and SynthTable, our method achieves new state-of-the-art results. In particular, TRUST runs at 10 FPS on PubTabNet, surpassing the previous methods by a large margin.
LGOct 12, 2022
Deep Learning-Derived Optimal Aviation Strategies to Control PandemicsSyed Rizvi, Akash Awasthi, Maria J. Peláez et al.
The COVID-19 pandemic has affected countries across the world, demanding drastic public health policies to mitigate the spread of infection, leading to economic crisis as a collateral damage. In this work, we investigated the impact of human mobility (described via international commercial flights) on COVID-19 infection dynamics at the global scale. For this, we developed a graph neural network-based framework referred to as Dynamic Connectivity GraphSAGE (DCSAGE), which operates over spatiotemporal graphs and is well-suited for dynamically changing adjacency information. To obtain insights on the relative impact of different geographical locations, due to their associated air traffic, on the evolution of the pandemic, we conducted local sensitivity analysis on our model through node perturbation experiments. From our analyses, we identified Western Europe, North America, and Middle East as the leading geographical locations fueling the pandemic, attributed to the enormity of air traffic originating or transiting through these regions. We used these observations to identify tangible air traffic reduction strategies that can have a high impact on controlling the pandemic, with minimal interference to human mobility. Our work provides a robust deep learning-based tool to study global pandemics and is of key relevance to policy makers to take informed decisions regarding air traffic restrictions during future outbreaks.
CVAug 28, 2024Code
ConsistencyTrack: A Robust Multi-Object Tracker with a Generation Strategy of Consistency ModelLifan Jiang, Zhihui Wang, Siqi Yin et al.
Multi-object tracking (MOT) is a critical technology in computer vision, designed to detect multiple targets in video sequences and assign each target a unique ID per frame. Existed MOT methods excel at accurately tracking multiple objects in real-time across various scenarios. However, these methods still face challenges such as poor noise resistance and frequent ID switches. In this research, we propose a novel ConsistencyTrack, joint detection and tracking(JDT) framework that formulates detection and association as a denoising diffusion process on perturbed bounding boxes. This progressive denoising strategy significantly improves the model's noise resistance. During the training phase, paired object boxes within two adjacent frames are diffused from ground-truth boxes to a random distribution, and then the model learns to detect and track by reversing this process. In inference, the model refines randomly generated boxes into detection and tracking results through minimal denoising steps. ConsistencyTrack also introduces an innovative target association strategy to address target occlusion. Experiments on the MOT17 and DanceTrack datasets demonstrate that ConsistencyTrack outperforms other compared methods, especially better than DiffusionTrack in inference speed and other performance metrics. Our code is available at https://github.com/Tankowa/ConsistencyTrack.
CVMar 26
FD$^2$: A Dedicated Framework for Fine-Grained Dataset DistillationHongxu Ma, Guang Li, Shijie Wang et al.
Dataset distillation (DD) compresses a large training set into a small synthetic set, reducing storage and training cost, and has shown strong results on general benchmarks. Decoupled DD further improves efficiency by splitting the pipeline into pretraining, sample distillation, and soft-label generation. However, existing decoupled methods largely rely on coarse class-label supervision and optimize samples within each class in a nearly identical manner. On fine-grained datasets, this often yields distilled samples that (i) retain large intra-class variation with subtle inter-class differences and (ii) become overly similar within the same class, limiting localized discriminative cues and hurting recognition. To solve the above-mentioned problems, we propose FD$^{2}$, a dedicated framework for Fine-grained Dataset Distillation. FD$^{2}$ localizes discriminative regions and constructs fine-grained representations for distillation. During pretraining, counterfactual attention learning aggregates discriminative representations to update class prototypes. During distillation, a fine-grained characteristic constraint aligns each sample with its class prototype while repelling others, and a similarity constraint diversifies attention across same-class samples. Experiments on multiple fine-grained and general datasets show that FD$^{2}$ integrates seamlessly with decoupled DD and improves performance in most settings, indicating strong transferability.
CVJan 27
Towards Governance-Oriented Low-Altitude Intelligence: A Management-Centric Multi-Modal Benchmark With Implicitly Coordinated Vision-Language Reasoning FrameworkHao Chang, Zhihui Wang, Lingxiang Wu et al.
Low-altitude vision systems are becoming a critical infrastructure for smart city governance. However, existing object-centric perception paradigms and loosely coupled vision-language pipelines are still difficult to support management-oriented anomaly understanding required in real-world urban governance. To bridge this gap, we introduce GovLA-10K, the first management-oriented multi-modal benchmark for low-altitude intelligence, along with GovLA-Reasoner, a unified vision-language reasoning framework tailored for governance-aware aerial perception. Unlike existing studies that aim to exhaustively annotate all visible objects, GovLA-10K is deliberately designed around functionally salient targets that directly correspond to practical management needs, and further provides actionable management suggestions grounded in these observations. To effectively coordinate the fine-grained visual grounding with high-level contextual language reasoning, GovLA-Reasoner introduces an efficient feature adapter that implicitly coordinates discriminative representation sharing between the visual detector and the large language model (LLM). Extensive experiments show that our method significantly improves performance while avoiding the need of fine-tuning for any task-specific individual components. We believe our work offers a new perspective and foundation for future studies on management-aware low-altitude vision-language systems.
IVAug 5, 2023
Dynamic Dual-Graph Fusion Convolutional Network For Alzheimer's Disease DiagnosisFanshi Li, Zhihui Wang, Yifan Guo et al.
In this paper, a dynamic dual-graph fusion convolutional network is proposed to improve Alzheimer's disease (AD) diagnosis performance. The following are the paper's main contributions: (a) propose a novel dynamic GCN architecture, which is an end-to-end pipeline for diagnosis of the AD task; (b) the proposed architecture can dynamically adjust the graph structure for GCN to produce better diagnosis outcomes by learning the optimal underlying latent graph; (c) incorporate feature graph learning and dynamic graph learning, giving those useful features of subjects more weight while decreasing the weights of other noise features. Experiments indicate that our model provides flexibility and stability while achieving excellent classification results in AD diagnosis.
CVNov 26, 2025
Towards an Effective Action-Region Tracking Framework for Fine-grained Video Action RecognitionBaoli Sun, Yihan Wang, Xinzhu Ma et al.
Fine-grained action recognition (FGAR) aims to identify subtle and distinctive differences among fine-grained action categories. However, current recognition methods often capture coarse-grained motion patterns but struggle to identify subtle details in local regions evolving over time. In this work, we introduce the Action-Region Tracking (ART) framework, a novel solution leveraging a query-response mechanism to discover and track the dynamics of distinctive local details, enabling effective distinction of similar actions. Specifically, we propose a region-specific semantic activation module that employs discriminative and text-constrained semantics as queries to capture the most action-related region responses in each video frame, facilitating interaction among spatial and temporal dimensions with corresponding video features. The captured region responses are organized into action tracklets, which characterize region-based action dynamics by linking related responses across video frames in a coherent sequence. The text-constrained queries encode nuanced semantic representations derived from textual descriptions of action labels extracted by language branches within Visual Language Models (VLMs). To optimize the action tracklets, we design a multi-level tracklet contrastive constraint among region responses at spatial and temporal levels, enabling effective discrimination within each frame and correlation between adjacent frames. Additionally, a task-specific fine-tuning mechanism refines textual semantics such that semantic representations encoded by VLMs are preserved while optimized for task preferences. Comprehensive experiments on widely used action recognition benchmarks demonstrate the superiority to previous state-of-the-art baselines.
CVNov 26, 2025
Referring Video Object Segmentation with Cross-Modality Proxy QueriesBaoli Sun, Xinzhu Ma, Ning Wang et al.
Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred by given textual expressions. The main concept involves learning an accurate alignment of visual elements and language expressions within a semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response based mechanism built upon transformer structure. However, they exhibit two limitations: (1) these conditional queries lack inter-frame dependency and variation modeling, making accurate target tracking challenging amid significant frame-to-frame variations; and (2) they integrate textual constraints belatedly, which may cause the video features potentially focus on the non-referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics and facilitate the flow of semantics between them. By progressively updating and propagating proxy queries across multiple stages of video feature encoder, ProxyFormer ensures that the video features are focused on the object of interest. This dynamic evolution also enables the establishment of inter-frame dependencies, enhancing the accuracy and coherence of object tracking. To mitigate high computational costs, we decouple cross-modality interactions into temporal and spatial dimensions. Additionally, we design a Joint Semantic Consistency (JSC) training strategy to align semantic consensus between the proxy queries and the combined video-text pairs. Comprehensive experiments on four widely used RVOS benchmarks demonstrate the superiority of our ProxyFormer to the state-of-the-art methods.
CVDec 13, 2024Code
B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal TokensZhuqiang Lu, Zhenfei Yin, Mengwei He et al.
Recently, Vision Large Language Models (VLLMs) integrated with vision encoders have shown promising performance in vision understanding. The key of VLLMs is to encode visual content into sequences of visual tokens, enabling VLLMs to simultaneously process both visual and textual content. However, understanding videos, especially long videos, remain a challenge to VLLMs as the number of visual tokens grows rapidly when encoding videos, resulting in the risk of exceeding the context window of VLLMs and introducing heavy computation burden. To restrict the number of visual tokens, existing VLLMs either: (1) uniformly downsample videos into a fixed number of frames or (2) reducing the number of visual tokens encoded from each frame. We argue the former solution neglects the rich temporal cue in videos and the later overlooks the spatial details in each frame. In this work, we present Balanced-VLLM (B-VLLM): a novel VLLM framework that aims to effectively leverage task relevant spatio-temporal cues while restricting the number of visual tokens under the VLLM context window length. At the core of our method, we devise a text-conditioned adaptive frame selection module to identify frames relevant to the visual understanding task. The selected frames are then de-duplicated using a temporal frame token merging technique. The visual tokens of the selected frames are processed through a spatial token sampling module and an optional spatial token merging strategy to achieve precise control over the token count. Experimental results show that B-VLLM is effective in balancing the number of frames and visual tokens in video understanding, yielding superior performance on various video understanding benchmarks. Our code is available at https://github.com/zhuqiangLu/B-VLLM.
AIMar 31Code
Owl-AuraID 1.0: An Intelligent System for Autonomous Scientific Instrumentation and Scientific Data AnalysisHan Deng, Anqi Zou, Hanling Zhang et al.
Scientific discovery increasingly depends on high-throughput characterization, yet automation is hindered by proprietary GUIs and the limited generalizability of existing API-based systems. We present Owl-AuraID, a software-hardware collaborative embodied agent system that adopts a GUI-native paradigm to operate instruments through the same interfaces as human experts. Its skill-centric framework integrates Type-1 (GUI operation) and Type-2 (data analysis) skills into end-to-end workflows, connecting physical sample handling with scientific interpretation. Owl-AuraID demonstrates broad coverage across ten categories of precision instruments and diverse workflows, including multimodal spectral analysis, microscopic imaging, and crystallographic analysis, supporting modalities such as FTIR, NMR, AFM, and TGA. Overall, Owl-AuraID provides a practical, extensible foundation for autonomous laboratories and illustrates a path toward evolving laboratory intelligence through reusable operational and analytical skills. The code are available at https://github.com/OpenOwlab/AuraID.
CVApr 11, 2024Code
ConsistencyDet: A Few-step Denoising Framework for Object Detection Using the Consistency ModelLifan Jiang, Zhihui Wang, Changmiao Wang et al.
Object detection, a quintessential task in the realm of perceptual computing, can be tackled using a generative methodology. In the present study, we introduce a novel framework designed to articulate object detection as a denoising diffusion process, which operates on the perturbed bounding boxes of annotated entities. This framework, termed \textbf{ConsistencyDet}, leverages an innovative denoising concept known as the Consistency Model. The hallmark of this model is its self-consistency feature, which empowers the model to map distorted information from any time step back to its pristine state, thereby realizing a \textbf{``few-step denoising''} mechanism. Such an attribute markedly elevates the operational efficiency of the model, setting it apart from the conventional Diffusion Model. Throughout the training phase, ConsistencyDet initiates the diffusion sequence with noise-infused boxes derived from the ground-truth annotations and conditions the model to perform the denoising task. Subsequently, in the inference stage, the model employs a denoising sampling strategy that commences with bounding boxes randomly sampled from a normal distribution. Through iterative refinement, the model transforms an assortment of arbitrarily generated boxes into definitive detections. Comprehensive evaluations employing standard benchmarks, such as MS-COCO and LVIS, corroborate that ConsistencyDet surpasses other leading-edge detectors in performance metrics. Our code is available at https://anonymous.4open.science/r/ConsistencyDet-37D5.
CVAug 7, 2025Code
Propagating Sparse Depth via Depth Foundation Model for Out-of-Distribution Depth CompletionShenglun Chen, Xinzhu Ma, Hong Zhang et al.
Depth completion is a pivotal challenge in computer vision, aiming at reconstructing the dense depth map from a sparse one, typically with a paired RGB image. Existing learning based models rely on carefully prepared but limited data, leading to significant performance degradation in out-of-distribution (OOD) scenarios. Recent foundation models have demonstrated exceptional robustness in monocular depth estimation through large-scale training, and using such models to enhance the robustness of depth completion models is a promising solution. In this work, we propose a novel depth completion framework that leverages depth foundation models to attain remarkable robustness without large-scale training. Specifically, we leverage a depth foundation model to extract environmental cues, including structural and semantic context, from RGB images to guide the propagation of sparse depth information into missing regions. We further design a dual-space propagation approach, without any learnable parameters, to effectively propagates sparse depth in both 3D and 2D spaces to maintain geometric structure and local consistency. To refine the intricate structure, we introduce a learnable correction module to progressively adjust the depth prediction towards the real depth. We train our model on the NYUv2 and KITTI datasets as in-distribution datasets and extensively evaluate the framework on 16 other datasets. Our framework performs remarkably well in the OOD scenarios and outperforms existing state-of-the-art depth completion methods. Our models are released in https://github.com/shenglunch/PSD.
CVMay 23, 2024Code
Is a Pure Transformer Effective for Separated and Online Multi-Object Tracking?Chongwei Liu, Haojie Li, Zhihui Wang et al.
Recent advances in Multi-Object Tracking (MOT) have demonstrated significant success in short-term association within the separated tracking-by-detection online paradigm. However, long-term tracking remains challenging. While graph-based approaches address this by modeling trajectories as global graphs, these methods are unsuitable for real-time applications due to their non-online nature. In this paper, we review the concept of trajectory graphs and propose a novel perspective by representing them as directed acyclic graphs. This representation can be described using frame-ordered object sequences and binary adjacency matrices. We observe that this structure naturally aligns with Transformer attention mechanisms, enabling us to model the association problem using a classic Transformer architecture. Based on this insight, we introduce a concise Pure Transformer (PuTR) to validate the effectiveness of Transformer in unifying short- and long-term tracking for separated online MOT. Extensive experiments on four diverse datasets (SportsMOT, DanceTrack, MOT17, and MOT20) demonstrate that PuTR effectively establishes a solid baseline compared to existing foundational online methods while exhibiting superior domain adaptation capabilities. Furthermore, the separated nature enables efficient training and inference, making it suitable for practical applications. Implementation code and trained models are available at https://github.com/chongweiliu/PuTR .
CVJan 26, 2022Code
MonoDistill: Learning Spatial Features for Monocular 3D Object DetectionZhiyu Chong, Xinzhu Ma, Hong Zhang et al.
3D object detection is a fundamental and challenging task for 3D scene understanding, and the monocular-based methods can serve as an economical alternative to the stereo-based or LiDAR-based methods. However, accurately detecting objects in the 3D space from a single image is extremely difficult due to the lack of spatial cues. To mitigate this issue, we propose a simple and effective scheme to introduce the spatial information from LiDAR signals to the monocular 3D detectors, without introducing any extra cost in the inference phase. In particular, we first project the LiDAR signals into the image plane and align them with the RGB images. After that, we use the resulting data to train a 3D detector (LiDAR Net) with the same architecture as the baseline model. Finally, this LiDAR Net can serve as the teacher to transfer the learned knowledge to the baseline model. Experimental results show that the proposed method can significantly boost the performance of the baseline model and ranks the $1^{st}$ place among all monocular-based methods on the KITTI benchmark. Besides, extensive ablation studies are conducted, which further prove the effectiveness of each part of our designs and illustrate what the baseline model has learned from the LiDAR Net. Our code will be released at \url{https://github.com/monster-ghost/MonoDistill}.
CVAug 26, 2021Code
An Underwater Image Semantic Segmentation Method Focusing on Boundaries and a Real Underwater Scene Semantic Segmentation DatasetZhiwei Ma, Haojie Li, Zhihui Wang et al.
With the development of underwater object grabbing technology, underwater object recognition and segmentation of high accuracy has become a challenge. The existing underwater object detection technology can only give the general position of an object, unable to give more detailed information such as the outline of the object, which seriously affects the grabbing efficiency. To address this problem, we label and establish the first underwater semantic segmentation dataset of real scene(DUT-USEG:DUT Underwater Segmentation Dataset). The DUT- USEG dataset includes 6617 images, 1487 of which have semantic segmentation and instance segmentation annotations, and the remaining 5130 images have object detection box annotations. Based on this dataset, we propose a semi-supervised underwater semantic segmentation network focusing on the boundaries(US-Net: Underwater Segmentation Network). By designing a pseudo label generator and a boundary detection subnetwork, this network realizes the fine learning of boundaries between underwater objects and background, and improves the segmentation effect of boundary areas. Experiments show that the proposed method improves by 6.7% in three categories of holothurian, echinus, starfish in DUT-USEG dataset, and achieves state-of-the-art results. The DUT- USEG dataset will be released at https://github.com/baxiyi/DUT-USEG.
CVJul 29, 2021Code
Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object DetectionYinmin Zhang, Xinzhu Ma, Shuai Yi et al.
As a crucial task of autonomous driving, 3D object detection has made great progress in recent years. However, monocular 3D object detection remains a challenging problem due to the unsatisfactory performance in depth estimation. Most existing monocular methods typically directly regress the scene depth while ignoring important relationships between the depth and various geometric elements (e.g. bounding box sizes, 3D object dimensions, and object poses). In this paper, we propose to learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection. Specifically, a principled geometry formula with projective modeling of 2D and 3D depth predictions in the monocular 3D object detection network is devised. We further implement and embed the proposed formula to enable geometry-aware deep representation learning, allowing effective 2D and 3D interactions for boosting the depth estimation. Moreover, we provide a strong baseline through addressing substantial misalignment between 2D annotation and projected boxes to ensure robust learning with the proposed geometric formula. Experiments on the KITTI dataset show that our method remarkably improves the detection performance of the state-of-the-art monocular-based method without extra data by 2.80% on the moderate test setting. The model and code will be released at https://github.com/YinminZhang/MonoGeo.
CVMar 10, 2021Code
Quality-Aware Network for Human ParsingLu Yang, Qing Song, Zhihui Wang et al.
How to estimate the quality of the network output is an important issue, and currently there is no effective solution in the field of human parsing. In order to solve this problem, this work proposes a statistical method based on the output probability map to calculate the pixel quality information, which is called pixel score. In addition, the Quality-Aware Module (QAM) is proposed to fuse the different quality information, the purpose of which is to estimate the quality of human parsing results. We combine QAM with a concise and effective network design to propose Quality-Aware Network (QANet) for human parsing. Benefiting from the superiority of QAM and QANet, we achieve the best performance on three multiple and one single human parsing benchmarks, including CIHP, MHP-v2, Pascal-Person-Part and LIP. Without increasing the training and inference time, QAM improves the AP$^\text{r}$ criterion by more than 10 points in the multiple human parsing task. QAM can be extended to other tasks with good quality estimation, e.g. instance segmentation. Specifically, QAM improves Mask R-CNN by ~1% mAP on COCO and LVISv1.0 datasets. Based on the proposed QAM and QANet, our overall system wins 1st place in CVPR2019 COCO DensePose Challenge, and 1st place in Track 1 & 2 of CVPR2020 LIP Challenge. Code and models are available at https://github.com/soeaver/QANet.
CVSep 20, 2020Code
Renovating Parsing R-CNN for Accurate Multiple Human ParsingLu Yang, Qing Song, Zhihui Wang et al.
Multiple human parsing aims to segment various human parts and associate each part with the corresponding instance simultaneously. This is a very challenging task due to the diverse human appearance, semantic ambiguity of different body parts, and complex background. Through analysis of multiple human parsing task, we observe that human-centric global perception and accurate instance-level parsing scoring are crucial for obtaining high-quality results. But the most state-of-the-art methods have not paid enough attention to these issues. To reverse this phenomenon, we present Renovating Parsing R-CNN (RP R-CNN), which introduces a global semantic enhanced feature pyramid network and a parsing re-scoring network into the existing high-performance pipeline. The proposed RP R-CNN adopts global semantic representation to enhance multi-scale features for generating human parsing maps, and regresses a confidence score to represent its quality. Extensive experiments show that RP R-CNN performs favorably against state-of-the-art methods on CIHP and MHP-v2 datasets. Code and models are available at https://github.com/soeaver/RP-R-CNN.
CVJan 9, 2024
Uni3D-LLM: Unifying Point Cloud Perception, Generation and Editing with Large Language ModelsDingning Liu, Xiaoshui Huang, Yuenan Hou et al.
In this paper, we introduce Uni3D-LLM, a unified framework that leverages a Large Language Model (LLM) to integrate tasks of 3D perception, generation, and editing within point cloud scenes. This framework empowers users to effortlessly generate and modify objects at specified locations within a scene, guided by the versatility of natural language descriptions. Uni3D-LLM harnesses the expressive power of natural language to allow for precise command over the generation and editing of 3D objects, thereby significantly enhancing operational flexibility and controllability. By mapping point cloud into the unified representation space, Uni3D-LLM achieves cross-application functionality, enabling the seamless execution of a wide array of tasks, ranging from the accurate instantiation of 3D objects to the diverse requirements of interactive design. Through a comprehensive suite of rigorous experiments, the efficacy of Uni3D-LLM in the comprehension, generation, and editing of point cloud has been validated. Additionally, we have assessed the impact of integrating a point cloud perception module on the generation and editing processes, confirming the substantial potential of our approach for practical applications.
AIDec 15, 2023
3DAxiesPrompts: Unleashing the 3D Spatial Task Capabilities of GPT-4VDingning Liu, Xiaomeng Dong, Renrui Zhang et al.
In this work, we present a new visual prompting method called 3DAxiesPrompts (3DAP) to unleash the capabilities of GPT-4V in performing 3D spatial tasks. Our investigation reveals that while GPT-4V exhibits proficiency in discerning the position and interrelations of 2D entities through current visual prompting techniques, its abilities in handling 3D spatial tasks have yet to be explored. In our approach, we create a 3D coordinate system tailored to 3D imagery, complete with annotated scale information. By presenting images infused with the 3DAP visual prompt as inputs, we empower GPT-4V to ascertain the spatial positioning information of the given 3D target image with a high degree of precision. Through experiments, We identified three tasks that could be stably completed using the 3DAP method, namely, 2D to 3D Point Reconstruction, 2D to 3D point matching, and 3D Object Detection. We perform experiments on our proposed dataset 3DAP-Data, the results from these experiments validate the efficacy of 3DAP-enhanced GPT-4V inputs, marking a significant stride in 3D spatial task execution.
SEMar 13, 2025
Commenting Higher-level Code Unit: Full Code, Reduced Code, or Hierarchical Code SummarizationWeisong Sun, Yiran Zhang, Jie Zhu et al.
Commenting code is a crucial activity in software development, as it aids in facilitating future maintenance and updates. To enhance the efficiency of writing comments and reduce developers' workload, researchers has proposed various automated code summarization (ACS) techniques to automatically generate comments/summaries for given code units. However, these ACS techniques primarily focus on generating summaries for code units at the method level. There is a significant lack of research on summarizing higher-level code units, such as file-level and module-level code units, despite the fact that summaries of these higher-level code units are highly useful for quickly gaining a macro-level understanding of software components and architecture. To fill this gap, in this paper, we conduct a systematic study on how to use LLMs for commenting higher-level code units, including file level and module level. These higher-level units are significantly larger than method-level ones, which poses challenges in handling long code inputs within LLM constraints and maintaining efficiency. To address these issues, we explore various summarization strategies for ACS of higher-level code units, which can be divided into three types: full code summarization, reduced code summarization, and hierarchical code summarization. The experimental results suggest that for summarizing file-level code units, using the full code is the most effective approach, with reduced code serving as a cost-efficient alternative. However, for summarizing module-level code units, hierarchical code summarization becomes the most promising strategy. In addition, inspired by the research on method-level ACS, we also investigate using the LLM as an evaluator to evaluate the quality of summaries of higher-level code units. The experimental results demonstrate that the LLM's evaluation results strongly correlate with human evaluations.
CVMar 17, 2025
3DAxisPrompt: Promoting the 3D Grounding and Reasoning in GPT-4oDingning Liu, Cheng Wang, Peng Gao et al.
Multimodal Large Language Models (MLLMs) exhibit impressive capabilities across a variety of tasks, especially when equipped with carefully designed visual prompts. However, existing studies primarily focus on logical reasoning and visual understanding, while the capability of MLLMs to operate effectively in 3D vision remains an ongoing area of exploration. In this paper, we introduce a novel visual prompting method, called 3DAxisPrompt, to elicit the 3D understanding capabilities of MLLMs in real-world scenes. More specifically, our method leverages the 3D coordinate axis and masks generated from the Segment Anything Model (SAM) to provide explicit geometric priors to MLLMs and then extend their impressive 2D grounding and reasoning ability to real-world 3D scenarios. Besides, we first provide a thorough investigation of the potential visual prompting formats and conclude our findings to reveal the potential and limits of 3D understanding capabilities in GPT-4o, as a representative of MLLMs. Finally, we build evaluation environments with four datasets, i.e., ScanRefer, ScanNet, FMB, and nuScene datasets, covering various 3D tasks. Based on this, we conduct extensive quantitative and qualitative experiments, which demonstrate the effectiveness of the proposed method. Overall, our study reveals that MLLMs, with the help of 3DAxisPrompt, can effectively perceive an object's 3D position in real-world scenarios. Nevertheless, a single prompt engineering approach does not consistently achieve the best outcomes for all 3D tasks. This study highlights the feasibility of leveraging MLLMs for 3D vision grounding/reasoning with prompt engineering techniques.
LGFeb 27, 2025
SeisMoLLM: Advancing Seismic Monitoring via Cross-modal Transfer with Pre-trained Large Language ModelXinghao Wang, Feng Liu, Rui Su et al.
Recent advances in deep learning have revolutionized seismic monitoring, yet developing a foundation model that performs well across multiple complex tasks remains challenging, particularly when dealing with degraded signals or data scarcity. This work presents SeisMoLLM, the first foundation model that utilizes cross-modal transfer for seismic monitoring, to unleash the power of large-scale pre-training from a large language model without requiring direct pre-training on seismic datasets. Through elaborate waveform tokenization and fine-tuning of pre-trained GPT-2 model, SeisMoLLM achieves state-of-the-art performance on the DiTing and STEAD datasets across five critical tasks: back-azimuth estimation, epicentral distance estimation, magnitude estimation, phase picking, and first-motion polarity classification. It attains 36 best results out of 43 task metrics and 12 top scores out of 16 few-shot generalization metrics, with many relative improvements ranging from 10% to 50%. In addition to its superior performance, SeisMoLLM maintains efficiency comparable to or even better than lightweight models in both training and inference. These findings establish SeisMoLLM as a promising foundation model for practical seismic monitoring and highlight cross-modal transfer as an exciting new direction for earthquake studies, showcasing the potential of advanced deep learning techniques to propel seismology research forward.
LGDec 20, 2023
A self-attention-based differentially private tabular GAN with high data utilityZijian Li, Zhihui Wang
Generative Adversarial Networks (GANs) have become a ubiquitous technology for data generation, with their prowess in image generation being well-established. However, their application in generating tabular data has been less than ideal. Furthermore, attempting to incorporate differential privacy technology into these frameworks has often resulted in a degradation of data utility. To tackle these challenges, this paper introduces DP-SACTGAN, a novel Conditional Generative Adversarial Network (CGAN) framework for differentially private tabular data generation, aiming to surmount these obstacles. Experimental findings demonstrate that DP-SACTGAN not only accurately models the distribution of the original data but also effectively satisfies the requirements of differential privacy.
LGNov 14, 2025
Toward Multi-Fidelity Machine Learning Force Field for Cathode MaterialsGuangyi Dong, Zhihui Wang
Machine learning force fields (MLFFs), which employ neural networks to map atomic structures to system energies, effectively combine the high accuracy of first-principles calculation with the computational efficiency of empirical force fields. They are widely used in computational materials simulations. However, the development and application of MLFFs for lithium-ion battery cathode materials remain relatively limited. This is primarily due to the complex electronic structure characteristics of cathode materials and the resulting scarcity of high-quality computational datasets available for force field training. In this work, we develop a multi-fidelity machine learning force field framework to enhance the data efficiency of computational results, which can simultaneously utilize both low-fidelity non-magnetic and high-fidelity magnetic computational datasets of cathode materials for training. Tests conducted on the lithium manganese iron phosphate (LMFP) cathode material system demonstrate the effectiveness of this multi-fidelity approach. This work helps to achieve high-accuracy MLFF training for cathode materials at a lower training dataset cost, and offers new perspectives for applying MLFFs to computational simulations of cathode materials.
CVFeb 10, 2025
Towards Efficient and Intelligent Laser Weeding: Method and Dataset for Weed Stem DetectionDingning Liu, Jinzhe Li, Haoyang Su et al.
Weed control is a critical challenge in modern agriculture, as weeds compete with crops for essential nutrient resources, significantly reducing crop yield and quality. Traditional weed control methods, including chemical and mechanical approaches, have real-life limitations such as associated environmental impact and efficiency. An emerging yet effective approach is laser weeding, which uses a laser beam as the stem cutter. Although there have been studies that use deep learning in weed recognition, its application in intelligent laser weeding still requires a comprehensive understanding. Thus, this study represents the first empirical investigation of weed recognition for laser weeding. To increase the efficiency of laser beam cut and avoid damaging the crops of interest, the laser beam shall be directly aimed at the weed root. Yet, weed stem detection remains an under-explored problem. We integrate the detection of crop and weed with the localization of weed stem into one end-to-end system. To train and validate the proposed system in a real-life scenario, we curate and construct a high-quality weed stem detection dataset with human annotations. The dataset consists of 7,161 high-resolution pictures collected in the field with annotations of 11,151 instances of weed. Experimental results show that the proposed system improves weeding accuracy by 6.7% and reduces energy cost by 32.3% compared to existing weed recognition systems.
CVMay 9, 2024
Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social MediaZhizhen Zhang, Ning Wang, Haojie Li et al.
Semantic location prediction aims to derive meaningful location insights from multimodal social media posts, offering a more contextual understanding of daily activities than using GPS coordinates. This task faces significant challenges due to the noise and modality heterogeneity in "text-image" posts. Existing methods are generally constrained by inadequate feature representations and modal interaction, struggling to effectively reduce noise and modality heterogeneity. To address these challenges, we propose a Similarity-Guided Multimodal Fusion Transformer (SG-MFT) for predicting the semantic locations of users from their multimodal posts. First, we incorporate high-quality text and image representations by utilizing a pre-trained large vision-language model. Then, we devise a Similarity-Guided Interaction Module (SIM) to alleviate modality heterogeneity and noise interference by incorporating both coarse-grained and fine-grained similarity guidance for improving modality interactions. Specifically, we propose a novel similarity-aware feature interpolation attention mechanism at the coarse-grained level, leveraging modality-wise similarity to mitigate heterogeneity and reduce noise within each modality. At the fine-grained level, we utilize a similarity-aware feed-forward block and element-wise similarity to further address the issue of modality heterogeneity. Finally, building upon pre-processed features with minimal noise and modal interference, we devise a Similarity-aware Fusion Module (SFM) to fuse two modalities with a cross-attention mechanism. Comprehensive experimental results clearly demonstrate the superior performance of our proposed method.
CVJun 16, 2021
2nd Place Solution for Waymo Open Dataset Challenge -- Real-time 2D Object DetectionYueming Zhang, Xiaolin Song, Bing Bai et al.
In an autonomous driving system, it is essential to recognize vehicles, pedestrians and cyclists from images. Besides the high accuracy of the prediction, the requirement of real-time running brings new challenges for convolutional network models. In this report, we introduce a real-time method to detect the 2D objects from images. We aggregate several popular one-stage object detectors and train the models of variety input strategies independently, to yield better performance for accurate multi-scale detection of each category, especially for small objects. For model acceleration, we leverage TensorRT to optimize the inference time of our detection pipeline. As shown in the leaderboard, our proposed detection framework ranks the 2nd place with 75.00% L1 mAP and 69.72% L2 mAP in the real-time 2D detection track of the Waymo Open Dataset Challenges, while our framework achieves the latency of 45.8ms/frame on an Nvidia Tesla V100 GPU.
CVJun 10, 2021
A Dataset And Benchmark Of Underwater Object Detection For Robot PickingChongwei Liu, Haojie Li, Shuchang Wang et al.
Underwater object detection for robot picking has attracted a lot of interest. However, it is still an unsolved problem due to several challenges. We take steps towards making it more realistic by addressing the following challenges. Firstly, the currently available datasets basically lack the test set annotations, causing researchers must compare their method with other SOTAs on a self-divided test set (from the training set). Training other methods lead to an increase in workload and different researchers divide different datasets, resulting there is no unified benchmark to compare the performance of different algorithms. Secondly, these datasets also have other shortcomings, e.g., too many similar images or incomplete labels. Towards these challenges we introduce a dataset, Detecting Underwater Objects (DUO), and a corresponding benchmark, based on the collection and re-annotation of all relevant datasets. DUO contains a collection of diverse underwater images with more rational annotations. The corresponding benchmark provides indicators of both efficiency and accuracy of SOTAs (under the MMDtection framework) for academic research and industrial applications, where JETSON AGX XAVIER is used to assess detector speed to simulate the robot-embedded environment.
CVMar 24, 2021
Learning Scene Structure Guidance via Cross-Task Knowledge Transfer for Single Depth Super-ResolutionBaoli Sun, Xinchen Ye, Baopu Li et al.
Existing color-guided depth super-resolution (DSR) approaches require paired RGB-D data as training samples where the RGB image is used as structural guidance to recover the degraded depth map due to their geometrical similarity. However, the paired data may be limited or expensive to be collected in actual testing environment. Therefore, we explore for the first time to learn the cross-modality knowledge at training stage, where both RGB and depth modalities are available, but test on the target dataset, where only single depth modality exists. Our key idea is to distill the knowledge of scene structural guidance from RGB modality to the single DSR task without changing its network architecture. Specifically, we construct an auxiliary depth estimation (DE) task that takes an RGB image as input to estimate a depth map, and train both DSR task and DE task collaboratively to boost the performance of DSR. Upon this, a cross-task interaction module is proposed to realize bilateral cross task knowledge transfer. First, we design a cross-task distillation scheme that encourages DSR and DE networks to learn from each other in a teacher-student role-exchanging fashion. Then, we advance a structure prediction (SP) task that provides extra structure regularization to help both DSR and DE networks learn more informative structure representations for depth recovery. Extensive experiments demonstrate that our scheme achieves superior performance in comparison with other DSR methods.
LGJan 25, 2021
A Unified Joint Maximum Mean Discrepancy for Domain AdaptationWei Wang, Baopu Li, Shuhui Yang et al.
Domain adaptation has received a lot of attention in recent years, and many algorithms have been proposed with impressive progress. However, it is still not fully explored concerning the joint probability distribution (P(X, Y)) distance for this problem, since its empirical estimation derived from the maximum mean discrepancy (joint maximum mean discrepancy, JMMD) will involve complex tensor-product operator that is hard to manipulate. To solve this issue, this paper theoretically derives a unified form of JMMD that is easy to optimize, and proves that the marginal, class conditional and weighted class conditional probability distribution distances are our special cases with different label kernels, among which the weighted class conditional one not only can realize feature alignment across domains in the category level, but also deal with imbalance dataset using the class prior probabilities. From the revealed unified JMMD, we illustrate that JMMD degrades the feature-label dependence (discriminability) that benefits to classification, and it is sensitive to the label distribution shift when the label kernel is the weighted class conditional one. Therefore, we leverage Hilbert Schmidt independence criterion and propose a novel MMD matrix to promote the dependence, and devise a novel label kernel that is robust to label distribution shift. Finally, we conduct extensive experiments on several cross-domain datasets to demonstrate the validity and effectiveness of the revealed theoretical results.
CVDec 10, 2020
Full Matching on Low Resolution for Disparity EstimationHong Zhang, Shenglun Chen, Zhihui Wang et al.
A Multistage Full Matching disparity estimation scheme (MFM) is proposed in this work. We demonstrate that decouple all similarity scores directly from the low-resolution 4D volume step by step instead of estimating low-resolution 3D cost volume through focusing on optimizing the low-resolution 4D volume iteratively leads to more accurate disparity. To this end, we first propose to decompose the full matching task into multiple stages of the cost aggregation module. Specifically, we decompose the high-resolution predicted results into multiple groups, and every stage of the newly designed cost aggregation module learns only to estimate the results for a group of points. This alleviates the problem of feature internal competitive when learning similarity scores of all candidates from one low-resolution 4D volume output from one stage. Then, we propose the strategy of \emph{Stages Mutual Aid}, which takes advantage of the relationship of multiple stages to boost similarity scores estimation of each stage, to solve the unbalanced prediction of multiple stages caused by serial multistage framework. Experiment results demonstrate that the proposed method achieves more accurate disparity estimation results and outperforms state-of-the-art methods on Scene Flow, KITTI 2012 and KITTI 2015 datasets.
CVDec 10, 2020
Direct Depth Learning Network for Stereo MatchingHong Zhang, Haojie Li, Shenglun Chen et al.
Being a crucial task of autonomous driving, Stereo matching has made great progress in recent years. Existing stereo matching methods estimate disparity instead of depth. They treat the disparity errors as the evaluation metric of the depth estimation errors, since the depth can be calculated from the disparity according to the triangulation principle. However, we find that the error of the depth depends not only on the error of the disparity but also on the depth range of the points. Therefore, even if the disparity error is low, the depth error is still large, especially for the distant points. In this paper, a novel Direct Depth Learning Network (DDL-Net) is designed for stereo matching. DDL-Net consists of two stages: the Coarse Depth Estimation stage and the Adaptive-Grained Depth Refinement stage, which are all supervised by depth instead of disparity. Specifically, Coarse Depth Estimation stage uniformly samples the matching candidates according to depth range to construct cost volume and output coarse depth. Adaptive-Grained Depth Refinement stage performs further matching near the coarse depth to correct the imprecise matching and wrong matching. To make the Adaptive-Grained Depth Refinement stage robust to the coarse depth and adaptive to the depth range of the points, the Granularity Uncertainty is introduced to Adaptive-Grained Depth Refinement stage. Granularity Uncertainty adjusts the matching range and selects the candidates' features according to coarse prediction confidence and depth range. We verify the performance of DDL-Net on SceneFlow dataset and DrivingStereo dataset by different depth metrics. Results show that DDL-Net achieves an average improvement of 25% on the SceneFlow dataset and $12\%$ on the DrivingStereo dataset comparing the classical methods. More importantly, we achieve state-of-the-art accuracy at a large distance.
LGJul 1, 2020
Rethink Maximum Mean Discrepancy for Domain AdaptationWei Wang, Haojie Li, Zhengming Ding et al.
Existing domain adaptation methods aim to reduce the distributional difference between the source and target domains and respect their specific discriminative information, by establishing the Maximum Mean Discrepancy (MMD) and the discriminative distances. However, they usually accumulate to consider those statistics and deal with their relationships by estimating parameters blindly. This paper theoretically proves two essential facts: 1) minimizing the MMD equals to maximize the source and target intra-class distances respectively but jointly minimize their variance with some implicit weights, so that the feature discriminability degrades; 2) the relationship between the intra-class and inter-class distances is as one falls, another rises. Based on this, we propose a novel discriminative MMD. On one hand, we consider the intra-class and inter-class distances alone to remove a redundant parameter, and the revealed weights provide their approximate optimal ranges. On the other hand, we design two different strategies to boost the feature discriminability: 1) we directly impose a trade-off parameter on the implicit intra-class distance in MMD to regulate its change; 2) we impose the similar weights revealed in MMD on inter-class distance and maximize it, then a balanced factor could be introduced to quantitatively leverage the relative importance between the feature transferability and its discriminability. The experiments on several benchmark datasets not only prove the validity of theoretical results but also demonstrate that our approach could perform better than the comparative state-of-art methods substantially.
CVMay 8, 2020
Sparsely-Labeled Source Assisted Domain AdaptationWei Wang, Zhihui Wang, Yuankai Xiang et al.
Domain Adaptation (DA) aims to generalize the classifier learned from the source domain to the target domain. Existing DA methods usually assume that rich labels could be available in the source domain. However, there are usually a large number of unlabeled data but only a few labeled data in the source domain, and how to transfer knowledge from this sparsely-labeled source domain to the target domain is still a challenge, which greatly limits their application in the wild. This paper proposes a novel Sparsely-Labeled Source Assisted Domain Adaptation (SLSA-DA) algorithm to address the challenge with limited labeled source domain samples. Specifically, due to the label scarcity problem, the projected clustering is conducted on both the source and target domains, so that the discriminative structures of data could be leveraged elegantly. Then the label propagation is adopted to propagate the labels from those limited labeled source samples to the whole unlabeled data progressively, so that the cluster labels are revealed correctly. Finally, we jointly align the marginal and conditional distributions to mitigate the cross-domain mismatch problem, and optimize those three procedures iteratively. However, it is nontrivial to incorporate those three procedures into a unified optimization framework seamlessly since some variables to be optimized are implicitly involved in their formulas, thus they could not promote to each other. Remarkably, we prove that the projected clustering and conditional distribution alignment could be reformulated as different expressions, thus the implicit variables are revealed in different optimization steps. As such, the variables related to those three quantities could be optimized in a unified optimization framework and facilitate to each other, to improve the recognition performance obviously.
CVApr 23, 2020
Location-Aware Feature Selection Text Detection NetworkZengyuan Guo, Zilin Wang, Zhihui Wang et al.
Regression-based text detection methods have already achieved promising performances with simple network structure and high efficiency. However, they are behind in accuracy comparing with recent segmentation-based text detectors. In this work, we discover that one important reason to this case is that regression-based methods usually utilize a fixed feature selection way, i.e. selecting features in a single location or in neighbor regions, to predict components of the bounding box, such as the distances to the boundaries or the rotation angle. The features selected through this way sometimes are not the best choices for predicting every component of a text bounding box and thus degrade the accuracy performance. To address this issue, we propose a novel Location-Aware feature Selection text detection Network (LASNet). LASNet selects suitable features from different locations to separately predict the five components of a bounding box and gets the final bounding box through the combination of these components. Specifically, instead of using the classification score map to select one feature for predicting the whole bounding box as most of the existing methods did, the proposed LASNet first learn five new confidence score maps to indicate the prediction accuracy of the bounding box components, respectively. Then, a Location-Aware Feature Selection mechanism (LAFS) is designed to weightily fuse the top-$K$ prediction results for each component according to their confidence score, and to combine the all five fused components into a final bounding box. As a result, LASNet predicts the more accurate bounding boxes by using a learnable feature selection way. The experimental results demonstrate that our LASNet achieves state-of-the-art performance with single-model and single-scale testing, outperforming all existing regression-based detectors.
CVMar 7, 2020
CPM R-CNN: Calibrating Point-guided Misalignment in Object DetectionBin Zhu, Qing Song, Lu Yang et al.
In object detection, offset-guided and point-guided regression dominate anchor-based and anchor-free method separately. Recently, point-guided approach is introduced to anchor-based method. However, we observe points predicted by this way are misaligned with matched region of proposals and score of localization, causing a notable gap in performance. In this paper, we propose CPM R-CNN which contains three efficient modules to optimize anchor-based point-guided method. According to sufficient evaluations on the COCO dataset, CPM R-CNN is demonstrated efficient to improve the localization accuracy by calibrating mentioned misalignment. Compared with Faster R-CNN and Grid R-CNN based on ResNet-101 with FPN, our approach can substantially improve detection mAP by 3.3% and 1.5% respectively without whistles and bells. Moreover, our best model achieves improvement by a large margin to 49.9% on COCO test-dev. Code and models will be publicly available.
CVMar 3, 2020
A New Dataset, Poisson GAN and AquaNet for Underwater Object GrabbingChongwei Liu, Zhihui Wang, Shijie Wang et al.
To boost the object grabbing capability of underwater robots for open-sea farming, we propose a new dataset (UDD) consisting of three categories (seacucumber, seaurchin, and scallop) with 2,227 images. To the best of our knowledge, it is the first 4K HD dataset collected in a real open-sea farm. We also propose a novel Poisson-blending Generative Adversarial Network (Poisson GAN) and an efficient object detection network (AquaNet) to address two common issues within related datasets: the class-imbalance problem and the problem of mass small object, respectively. Specifically, Poisson GAN combines Poisson blending into its generator and employs a new loss called Dual Restriction loss (DR loss), which supervises both implicit space features and image-level features during training to generate more realistic images. By utilizing Poisson GAN, objects of minority class like seacucumber or scallop could be added into an image naturally and annotated automatically, which could increase the loss of minority classes during training detectors to eliminate the class-imbalance problem; AquaNet is a high-efficiency detector to address the problem of detecting mass small objects from cloudy underwater pictures. Within it, we design two efficient components: a depth-wise-convolution-based Multi-scale Contextual Features Fusion (MFF) block and a Multi-scale Blursampling (MBP) module to reduce the parameters of the network to 1.3 million. Both two components could provide multi-scale features of small objects under a short backbone configuration without any loss of accuracy. In addition, we construct a large-scale augmented dataset (AUDD) and a pre-training dataset via Poisson GAN from UDD. Extensive experiments show the effectiveness of the proposed Poisson GAN, AquaNet, UDD, AUDD, and pre-training dataset.
QUANT-PHFeb 23, 2020
Planning for Compilation of a Quantum Algorithm for Graph ColoringMinh Do, Zhihui Wang, Bryan O'Gorman et al.
The problem of compiling general quantum algorithms for implementation on near-term quantum processors has been introduced to the AI community. Previous work demonstrated that temporal planning is an attractive approach for part of this compilationtask, specifically, the routing of circuits that implement the Quantum Alternating Operator Ansatz (QAOA) applied to the MaxCut problem on a quantum processor architecture. In this paper, we extend the earlier work to route circuits that implement QAOA for Graph Coloring problems. QAOA for coloring requires execution of more, and more complex, operations on the chip, which makes routing a more challenging problem. We evaluate the approach on state-of-the-art hardware architectures from leading quantum computing companies. Additionally, we apply a planning approach to qubit initialization. Our empirical evaluation shows that temporal planning compares well to reasonable analytic upper bounds, and that solving qubit initialization with a classical planner generally helps temporal planners in finding shorter-makespan compilations for QAOA for Graph Coloring. These advances suggest that temporal planning can be an effective approach for more complex quantum computing algorithms and architectures.
LGDec 24, 2019
Importance Filtered Cross-Domain AdaptationWei Wang, Haojie Li, Zhihui Wang et al.
In Domain Adaptation (DA), the category-relevant losses usually occupy a dominant position, while they are usually built with hard or soft labels in existing models. We observed that hard labels are overconfident due to hard samples existed, and soft labels are ambiguous as too many small noisy probabilities involved, and both of them are easily to cause negative transfer. Besides, the category-irrelevant losses in Closed-Set DA (CSDA) paradigm fail to work in Open-Set DA (OSDA), and they also have to be in a category-relevant form, since target data samples are split into shared and private classes. To this end, we propose a newly-unified DA framework (i.e., Importance Filtered Cross-Domain Adaptation, IFCDA). Firstly, an importance filtered mechanism is devised to generate filtered soft labels to mitigate negative transfer desirably. Specifically, the soft labels are divided into confident and ambiguous ones. Then, only the maximum probability in each confident label is retained, and a threshold value is set to truncate each ambiguous label so that only prominent probabilities are reserved. Moreover, a general graph-based label propagation is contrived to attain soft labels in both CSDA and OSDA, where an extra component is embedded into label vector, so that it could detect target novel classes. Finally, the category-relevant losses in both scenarios are reformulated using filtered soft labels, while the category-irrelevant MMD loss in CSDA is reformulated as a form like class-wise MMD using newly-designed importance filtered soft labels. Notably, CSDA paradigm is a special case when all extra components are set to 0, thus the proposed approach is geared to both CSDA and OSDA. Comprehensive experiments on benchmark cross-domain object recognition datasets verify that the proposed approach outperforms several state-of-the-art methods in both scenarios.
CVMar 27, 2019
Accurate Monocular Object Detection via Color-Embedded 3D Reconstruction for Autonomous DrivingXinzhu Ma, Zhihui Wang, Haojie Li et al.
In this paper, we propose a monocular 3D object detection framework in the domain of autonomous driving. Unlike previous image-based methods which focus on RGB feature extracted from 2D images, our method solves this problem in the reconstructed 3D space in order to exploit 3D contexts explicitly. To this end, we first leverage a stand-alone module to transform the input data from 2D image plane to 3D point clouds space for a better input representation, then we perform the 3D detection using PointNet backbone net to obtain objects 3D locations, dimensions and orientations. To enhance the discriminative capability of point clouds, we propose a multi-modal feature fusion module to embed the complementary RGB cue into the generated point clouds representation. We argue that it is more effective to infer the 3D bounding boxes from the generated 3D scene space (i.e., X,Y, Z space) compared to the image plane (i.e., R,G,B image plane). Evaluation on the challenging KITTI dataset shows that our approach boosts the performance of state-of-the-art monocular approach by a large margin.
CVNov 30, 2018
Parsing R-CNN for Instance-Level Human AnalysisLu Yang, Qing Song, Zhihui Wang et al.
Instance-level human analysis is common in real-life scenarios and has multiple manifestations, such as human part segmentation, dense pose estimation, human-object interactions, etc. Models need to distinguish different human instances in the image panel and learn rich features to represent the details of each instance. In this paper, we present an end-to-end pipeline for solving the instance-level human analysis, named Parsing R-CNN. It processes a set of human instances simultaneously through comprehensive considering the characteristics of region-based approach and the appearance of a human, thus allowing representing the details of instances. Parsing R-CNN is very flexible and efficient, which is applicable to many issues in human instance analysis. Our approach outperforms all state-of-the-art methods on CIHP (Crowd Instance-level Human Parsing), MHP v2.0 (Multi-Human Parsing) and DensePose-COCO datasets. Based on the proposed Parsing R-CNN, we reach the 1st place in the COCO 2018 Challenge DensePose Estimation task. Code and models are public available.
CVAug 9, 2018
User-Guided Deep Anime Line Art Colorization with Conditional Adversarial NetworksYuanzheng Ci, Xinzhu Ma, Zhihui Wang et al.
Scribble colors based line art colorization is a challenging computer vision problem since neither greyscale values nor semantic information is presented in line arts, and the lack of authentic illustration-line art training pairs also increases difficulty of model generalization. Recently, several Generative Adversarial Nets (GANs) based methods have achieved great success. They can generate colorized illustrations conditioned on given line art and color hints. However, these methods fail to capture the authentic illustration distributions and are hence perceptually unsatisfying in the sense that they often lack accurate shading. To address these challenges, we propose a novel deep conditional adversarial architecture for scribble based anime line art colorization. Specifically, we integrate the conditional framework with WGAN-GP criteria as well as the perceptual loss to enable us to robustly train a deep network that makes the synthesized images more natural and real. We also introduce a local features network that is independent of synthetic data. With GANs conditioned on features from such network, we notably increase the generalization capability over "in the wild" line arts. Furthermore, we collect two datasets that provide high-quality colorful illustrations and authentic line arts for training and benchmarking. With the proposed model trained on our illustration dataset, we demonstrate that images synthesized by the presented approach are considerably more realistic and precise than alternative approaches.
CVJul 5, 2018
A Single Shot Text Detector with Scale-adaptive AnchorsQi Yuan, Bingwang Zhang, Haojie Li et al.
Currently, most top-performing text detection networks tend to employ fixed-size anchor boxes to guide the search for text instances. They usually rely on a large amount of anchors with different scales to discover texts in scene images, thus leading to high computational cost. In this paper, we propose an end-to-end box-based text detector with scale-adaptive anchors, which can dynamically adjust the scales of anchors according to the sizes of underlying texts by introducing an additional scale regression layer. The proposed scale-adaptive anchors allow us to use a few number of anchors to handle multi-scale texts and therefore significantly improve the computational efficiency. Moreover, compared to discrete scales used in previous methods, the learned continuous scales are more reliable, especially for small texts detection. Additionally, we propose Anchor convolution to better exploit necessary feature information by dynamically adjusting the sizes of receptive fields according to the learned scales. Extensive experiments demonstrate that the proposed detector is fast, taking only $0.28$ second per image, while outperforming most state-of-the-art methods in accuracy.