CV | Jul 5, 2022 | Code
OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers
Jialun Pei, Tianyang Cheng, Deng-Ping Fan et al.
We present OSFormer, the first one-stage transformer framework for camouflaged instance segmentation (CIS). OSFormer is based on two key designs. First, we design a location-sensing transformer (LST) to obtain the location label and instance-aware parameters by introducing the location-guided queries and the blend-convolution feedforward network. Second, we develop a coarse-to-fine fusion (CFF) to merge diverse context information from the LST encoder and CNN backbone. Coupling these two components enables OSFormer to efficiently blend local features and long-range context dependencies for predicting camouflaged instances. Compared with two-stage frameworks, our OSFormer reaches 41% AP and achieves good convergence efficiency without requiring enormous training data, i.e., only 3,040 samples under 60 epochs. Code link: https://github.com/PJLallen/OSFormer.
CV | Jul 16, 2023 | Code
CalibNet: Dual-branch Cross-modal Calibration for RGB-D Salient Instance Segmentation
Jialun Pei, Tao Jiang, He Tang et al.
We propose a novel approach for RGB-D salient instance segmentation using a dual-branch cross-modal feature calibration architecture called CalibNet. Our method simultaneously calibrates depth and RGB features in the kernel and mask branches to generate instance-aware kernels and mask features. CalibNet consists of three simple modules: a dynamic interactive kernel (DIK) and a weight-sharing fusion (WSF), which work together to generate effective instance-aware kernels and integrate cross-modal features, and a depth similarity assessment (DSA) module, which is placed before DIK and WSF to improve the quality of depth features. In addition, we contribute a new DSIS dataset, which contains 1,940 images with elaborate instance-level annotations. Extensive experiments on three challenging benchmarks show that CalibNet yields a promising result, i.e., 58.0% AP with 320×480 input size on the COME15K-N test set, which significantly surpasses the alternative frameworks. Our code and dataset are available at: https://github.com/PJLallen/CalibNet.
CV | Jul 26, 2023
Unite-Divide-Unite: Joint Boosting Trunk and Structure for High-accuracy Dichotomous Image Segmentation
Jialun Pei, Zhangjun Zhou, Yueming Jin et al.
High-accuracy Dichotomous Image Segmentation (DIS) aims to pinpoint category-agnostic foreground objects from natural scenes. The main challenge for DIS involves identifying the highly accurate dominant area while rendering detailed object structure. However, directly using a general encoder-decoder architecture may result in an oversupply of high-level features and neglect the shallow spatial information necessary for partitioning meticulous structures. To fill this gap, we introduce a novel Unite-Divide-Unite Network (UDUN) that restructures and bipartitely arranges complementary features to simultaneously boost the effectiveness of trunk and structure identification. The proposed UDUN proceeds from several strengths. First, a dual-size input feeds into the shared backbone to produce more holistic and detailed features while keeping the model lightweight. Second, a simple Divide-and-Conquer Module (DCM) is proposed to decouple multiscale low- and high-level features into our structure decoder and trunk decoder to obtain structure and trunk information respectively. Moreover, we design a Trunk-Structure Aggregation module (TSA) in our union decoder that performs cascade integration for uniform high-accuracy segmentation. As a result, UDUN performs favorably against state-of-the-art competitors in all six evaluation metrics on overall DIS-TE, i.e., achieving 0.772 weighted F-measure and 977 HCE. Using 1024×1024 input, our model enables real-time inference at 65.3 fps with ResNet-18.
CV | Oct 27, 2022 | Code
LeNo: Adversarial Robust Salient Object Detection Networks with Learnable Noise
He Wang, Lin Wan, He Tang
Pixel-wise prediction with deep neural networks has become an effective paradigm for salient object detection (SOD) and has achieved remarkable performance. However, very few SOD models are robust against adversarial attacks, which are visually imperceptible to human visual attention. The previous work, robust saliency (ROSA), shuffles the pre-segmented superpixels and then refines the coarse saliency map with a densely connected conditional random field (CRF). Unlike ROSA, which relies on various pre- and post-processing steps, this paper proposes a lightweight Learnable Noise (LeNo) to defend SOD models against adversarial attacks. LeNo preserves the accuracy of SOD models on both adversarial and clean images, as well as their inference speed. In general, LeNo consists of a simple shallow noise and a noise estimation that are embedded in the encoder and decoder of arbitrary SOD networks, respectively. Inspired by the center prior of the human visual attention mechanism, we initialize the shallow noise with a cross-shaped Gaussian distribution for better defense against adversarial attacks. Instead of adding additional network components for post-processing, the proposed noise estimation modifies only one channel of the decoder. With deeply supervised noise-decoupled training on state-of-the-art RGB and RGB-D SOD networks, LeNo outperforms previous works not only on adversarial images but also on clean images, which contributes stronger robustness for SOD. Our code is available at https://github.com/ssecv/LeNo.
CV | Aug 1, 2023 | Code
Partitioned Saliency Ranking with Dense Pyramid Transformers
Chengxiao Sun, Yan Xu, Jialun Pei et al.
In recent years, saliency ranking has emerged as a challenging task focusing on assessing the degree of saliency at the instance level. Because the task is subjective, even humans struggle to identify the precise order of all salient instances. Previous approaches undertake saliency ranking by directly sorting the rank scores of salient instances, which does not explicitly resolve the inherent ambiguities. To overcome this limitation, we propose the ranking by partition paradigm, which segments unordered salient instances into partitions and then ranks them based on the correlations among these partitions. The ranking by partition paradigm alleviates ranking ambiguities in a general sense, as it consistently improves the performance of other saliency ranking models. Additionally, we introduce the Dense Pyramid Transformer (DPT) to enable global cross-scale interactions, which significantly enhances feature interactions with reduced computational burden. Extensive experiments demonstrate that our approach outperforms all existing methods. The code for our method is available at https://github.com/ssecv/PSR.
CV | Jul 9, 2024
CoLA: Conditional Dropout and Language-driven Robust Dual-modal Salient Object Detection
Shuang Hao, Chunlin Zhong, He Tang
Depth/thermal information is beneficial for detecting salient objects alongside conventional RGB images. However, in dual-modal salient object detection (SOD) models, robustness against noisy inputs and missing modalities is crucial but rarely studied. To tackle this problem, we introduce the Conditional Dropout and LAnguage-driven (CoLA) framework, comprising two core components. 1) Language-driven Quality Assessment (LQA): Leveraging a pretrained vision-language model with a prompt learner, the LQA recalibrates image contributions without requiring additional quality annotations. This approach effectively mitigates the impact of noisy inputs. 2) Conditional Dropout (CD): A learning method to strengthen the model's adaptability in scenarios with missing modalities, while preserving its performance under complete modalities. The CD serves as a plug-in training scheme that treats modality-missing as conditions, strengthening the overall robustness of various dual-modal SOD models. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art dual-modal SOD models under both modality-complete and modality-missing conditions. We will release source code upon acceptance.
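The idea of treating a missing modality as a training condition can be illustrated with a small sketch. This is not the CoLA implementation: the function name `conditional_dropout`, the drop probability, the zero-filling of a dropped modality, and the three condition labels are all assumptions made for illustration, and features are represented as plain lists of floats.

```python
import random

def conditional_dropout(rgb, aux, p_drop=0.3, rng=None):
    """Hypothetical modality dropout: with probability p_drop, blank one
    modality (chosen uniformly) and report which condition was produced."""
    rng = rng or random.Random()
    if rng.random() >= p_drop:
        # Keep both modalities untouched.
        return rgb, aux, "complete"
    if rng.random() < 0.5:
        # Simulate a missing RGB input by zero-filling it (an assumption).
        return [0.0] * len(rgb), aux, "rgb_missing"
    # Simulate a missing depth/thermal input.
    return rgb, [0.0] * len(aux), "aux_missing"

rng = random.Random(0)
rgb_feat = [0.4, 0.9, 0.1]
aux_feat = [0.7, 0.2, 0.5]
batch = [conditional_dropout(rgb_feat, aux_feat, p_drop=0.5, rng=rng) for _ in range(8)]
conditions = [cond for _, _, cond in batch]
```

In a training loop, the returned condition label could be passed to the model alongside the inputs so that it learns a behavior for each case; how CoLA actually encodes the condition is not stated in the abstract.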
CV | Dec 14, 2024 | Code
Rethinking Detecting Salient and Camouflaged Objects in Unconstrained Scenes
Zhangjun Zhou, Yiping Li, Chunlin Zhong et al.
While the human visual system employs distinct mechanisms to perceive salient and camouflaged objects, existing models struggle to disentangle these tasks. Specifically, salient object detection (SOD) models frequently misclassify camouflaged objects as salient, while camouflaged object detection (COD) models conversely misinterpret salient objects as camouflaged. We hypothesize that this can be attributed to two factors: (i) the specific annotation paradigm of current SOD and COD datasets, and (ii) the lack of explicit attribute relationship modeling in current models. Prevalent SOD/COD datasets enforce a mutual exclusivity constraint, assuming scenes contain either salient or camouflaged objects, which poorly aligns with the real world. Furthermore, current SOD/COD methods are primarily designed for these highly constrained datasets and lack explicit modeling of the relationship between salient and camouflaged objects. In this paper, to promote the development of unconstrained salient and camouflaged object detection, we construct a large-scale dataset, USC12K, which features comprehensive labels and four different scenes that cover all possible logical existence scenarios of both salient and camouflaged objects. To explicitly model the relationship between salient and camouflaged objects, we propose a model called USCNet, which introduces two distinct prompt query mechanisms for modeling inter-sample and intra-sample attribute relationships. Additionally, to assess the model's ability to distinguish between salient and camouflaged objects, we design an evaluation metric called CSCS. The proposed method achieves state-of-the-art performance across all scenes in various metrics. The code and dataset will be available at https://github.com/ssecv/USCNet.
LG | Jan 3, 2024
Synthetic Data in AI: Challenges, Applications, and Ethical Implications
Shuang Hao, Wenfeng Han, Tao Jiang et al.
In the rapidly evolving field of artificial intelligence, the creation and utilization of synthetic datasets have become increasingly significant. This report delves into the multifaceted aspects of synthetic data, particularly emphasizing the challenges and potential biases these datasets may harbor. It explores the methodologies behind synthetic data generation, spanning traditional statistical models to advanced deep learning techniques, and examines their applications across diverse domains. The report also critically addresses the ethical considerations and legal implications associated with synthetic datasets, highlighting the urgent need for mechanisms to ensure fairness, mitigate biases, and uphold ethical standards in AI development.
CV | Nov 27, 2024
MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation
Haopeng Fang, Di Qiu, Binjie Mao et al.
Recent advancements in personalized Text-to-Video (T2V) generation highlight the importance of integrating character-specific identities and actions. However, previous T2V models struggle with identity consistency and controllable motion dynamics, mainly due to limited fine-grained facial and action-based textual prompts, and datasets that overlook key human attributes and actions. To address these challenges, we propose MotionCharacter, an efficient and high-fidelity human video generation framework designed for identity preservation and fine-grained motion control. We introduce an ID-preserving module to maintain identity fidelity while allowing flexible attribute modifications, and further integrate ID-consistency and region-aware loss mechanisms, significantly enhancing identity consistency and detail fidelity. Additionally, our approach incorporates a motion control module that prioritizes action-related text while maintaining subject consistency, along with a dataset, Human-Motion, which utilizes large language models to generate detailed motion descriptions. To simplify user control during inference, we parameterize motion intensity through a single coefficient, allowing for easy adjustments. Extensive experiments highlight the effectiveness of MotionCharacter, demonstrating significant improvements in ID-preserving, high-quality video generation.
CV | Apr 14, 2024
Tri-modal Confluence with Temporal Dynamics for Scene Graph Generation in Operating Rooms
Diandian Guo, Manxi Lin, Jialun Pei et al.
A comprehensive understanding of surgical scenes allows for monitoring of the surgical process, reducing the occurrence of accidents and enhancing efficiency for medical professionals. Semantic modeling within operating rooms, as a scene graph generation (SGG) task, is challenging since it involves consecutive recognition of subtle surgical actions over prolonged periods. To address this challenge, we propose a Tri-modal (i.e., images, point clouds, and language) confluence with Temporal dynamics framework, termed TriTemp-OR. Diverging from previous approaches that integrated temporal information via memory graphs, our method embraces two advantages: 1) we directly exploit bi-modal temporal information from the video streaming for hierarchical feature interaction, and 2) the prior knowledge from Large Language Models (LLMs) is embedded to alleviate the class-imbalance problem in the operating theatre. Specifically, our model performs temporal interactions across 2D frames and 3D point clouds, including a scale-adaptive multi-view temporal interaction (ViewTemp) and a geometric-temporal point aggregation (PointTemp). Furthermore, we transfer knowledge from the biomedical LLM, LLaVA-Med, to deepen the comprehension of intraoperative relations. The proposed TriTemp-OR enables the aggregation of tri-modal features through relation-aware unification to predict relations so as to generate scene graphs. Experimental results on the 4D-OR benchmark demonstrate the superior performance of our model for long-term OR streaming.
CV | Aug 26, 2025
OwlCap: Harmonizing Motion-Detail for Video Captioning via HMD-270K and Caption Set Equivalence Reward
Chunlin Zhong, Qiuxia Hou, Zhangjun Zhou et al.
Video captioning aims to generate comprehensive and coherent descriptions of the video content, contributing to the advancement of both video understanding and generation. However, existing methods often suffer from motion-detail imbalance, as models tend to overemphasize one aspect while neglecting the other. This imbalance results in incomplete captions, which in turn leads to a lack of consistency in video understanding and generation. To address this issue, we propose solutions from two aspects: 1) Data aspect: We constructed the Harmonizing Motion-Detail 270K (HMD-270K) dataset through a two-stage pipeline: Motion-Detail Fusion (MDF) and Fine-Grained Examination (FGE). 2) Optimization aspect: We introduce the Caption Set Equivalence Reward (CSER) based on Group Relative Policy Optimization (GRPO). CSER enhances completeness and accuracy in capturing both motion and details through unit-to-set matching and bidirectional validation. Based on the HMD-270K supervised fine-tuning and GRPO post-training with CSER, we developed OwlCap, a powerful video captioning multi-modal large language model (MLLM) with motion-detail balance. Experimental results demonstrate that OwlCap achieves significant improvements compared to baseline models on two benchmarks: the detail-focused VDC (+4.2 Acc) and the motion-focused DREAM-1K (+4.6 F1). The HMD-270K dataset and OwlCap model will be publicly released to facilitate video captioning research community advancements.
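The GRPO step referenced above scores each sampled caption relative to the other captions sampled for the same video. A minimal sketch of that group-relative normalization follows, assuming scalar rewards have already been computed. The reward values are made up and do not implement CSER, and the population standard deviation is used here although implementations differ on this detail.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same
    prompt: subtract the group mean and divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four captions sampled for one video.
rewards = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(rewards)
```

Captions that score above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged, without needing a separate value network.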
CV | Feb 28, 2025
PathVG: A New Benchmark and Dataset for Pathology Visual Grounding
Chunlin Zhong, Shuang Hao, Junhua Wu et al.
With the rapid development of computational pathology, many AI-assisted diagnostic tasks have emerged. Cellular nuclei segmentation can segment various types of cells for downstream analysis, but it relies on predefined categories and lacks flexibility. Moreover, pathology visual question answering can perform image-level understanding but lacks region-level detection capability. To address this, we propose a new benchmark called Pathology Visual Grounding (PathVG), which aims to detect regions based on expressions with different attributes. To evaluate PathVG, we create a new dataset named RefPath which contains 27,610 images with 33,500 language-grounded boxes. Compared to visual grounding in other domains, PathVG presents pathological images at multi-scale and contains expressions with pathological knowledge. In the experimental study, we found that the biggest challenge was the implicit information underlying the pathological expressions. Based on this, we proposed Pathology Knowledge-enhanced Network (PKNet) as the baseline model for PathVG. PKNet leverages the knowledge-enhancement capabilities of Large Language Models (LLMs) to convert pathological terms with implicit information into explicit visual features, and fuses knowledge features with expression features through the designed Knowledge Fusion Module (KFM). The proposed method achieves state-of-the-art performance on the PathVG benchmark.
CV | Aug 19, 2020
Salient Instance Segmentation with Region and Box-level Annotations
Jialun Pei, He Tang, Tianyang Cheng et al.
Salient instance segmentation is a new and challenging task that has received widespread attention in the saliency detection area. The new generation of saliency detection provides a strong theoretical and technical basis for video surveillance. Due to the limited scale of the existing dataset and the high cost of mask annotations, ample sources of supervision are urgently needed to train a well-performing salient instance model. In this paper, we aim to train a novel salient instance segmentation framework with inexact supervision, without resorting to laborious labeling. To this end, we present a cyclic global context salient instance segmentation network (CGCNet), which is supervised by the combination of salient regions and bounding boxes from ready-made salient object detection datasets. To locate salient instances more accurately, a global feature refining layer is proposed that dilates the features of the region of interest (ROI) to the global context in a scene. Meanwhile, a labeling updating scheme is embedded in the proposed framework to update the coarse-grained labels for the next iteration. Experimental results demonstrate that the proposed end-to-end framework, trained with inexact supervised annotations, is competitive with the existing fully supervised salient instance segmentation methods. Without bells and whistles, our proposed method achieves a mask AP of 58.3% on the test set of Dataset1K, outperforming the mainstream state-of-the-art methods.
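Mask AP figures such as the one above are built on the intersection-over-union (IoU) between a predicted mask and a ground-truth mask: a prediction counts as correct when its IoU exceeds a chosen threshold. A minimal sketch of mask IoU follows, with made-up binary masks stored as nested lists. The exact AP protocol used for Dataset1K is not stated in the abstract, so only the IoU building block is shown.

```python
def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Two empty masks have no overlap to measure; return 0.0 by convention here.
    return inter / union if union else 0.0

# Hypothetical 2x3 masks: the prediction covers the left two columns,
# the ground truth covers the right two columns.
pred = [[1, 1, 0],
        [1, 1, 0]]
gt = [[0, 1, 1],
      [0, 1, 1]]
iou = mask_iou(pred, gt)
```

Here the masks share 2 pixels out of 6 covered in total, so this prediction would be rejected at an IoU threshold of 0.5.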
IV | Oct 21, 2019
Automatic Lumbar Spinal CT Image Segmentation with a Dual Densely Connected U-Net
He Tang, Xiaobing Pei, Shilong Huang et al.
The clinical treatments of degenerative and developmental lumbar spinal stenosis (LSS) differ. Computed tomography (CT) is helpful in distinguishing degenerative and developmental LSS due to its advantage in imaging osseous and calcified tissues. However, the boundaries of the vertebral body, spinal canal and dural sac have low contrast and are hard to identify in a CT image, so the diagnosis depends heavily on the knowledge of expert surgeons and radiologists. In this paper, we develop an automatic lumbar spinal CT image segmentation method to assist LSS diagnosis. The main contributions of this paper are the following: 1) a new lumbar spinal CT image dataset is constructed that contains 2393 axial CT images collected from 279 patients, with ground-truth pixel-level segmentation labels; 2) a dual densely connected U-shaped neural network (DDU-Net) is used to segment the spinal canal, dural sac and vertebral body in an end-to-end manner; 3) DDU-Net is capable of segmenting tissues with large scale variation, inconspicuous edges (e.g., spinal canal) and extremely small size (e.g., dural sac); and 4) DDU-Net is practical, requiring no image preprocessing such as contrast enhancement, registration and denoising, and the running time reaches 12 FPS. In the experiment, we achieve state-of-the-art performance on the lumbar spinal image segmentation task. We expect that the technique will increase both radiology workflow efficiency and the perceived value of radiology reports for referring clinicians and patients.
CV | Sep 29, 2019
Salient Instance Segmentation via Subitizing and Clustering
Jialun Pei, He Tang, Chao Liu et al.
The goal of salient region detection is to identify the regions of an image that attract the most attention. Many methods have achieved state-of-the-art performance levels on this task. Recently, salient instance segmentation has become an even more challenging task than traditional salient region detection; however, few of the existing methods have concentrated on this underexplored problem. Unlike the existing methods, which usually employ object proposals to roughly count and locate object instances, our method applies salient object subitizing to predict an accurate number of instances for salient instance segmentation. In this paper, we propose a multitask densely connected neural network (MDNN) to segment salient instances in an image. In contrast to existing approaches, our framework is proposal-free and category-independent. The MDNN contains two parallel branches: the first is a densely connected subitizing network (DSN) used for subitizing prediction; the second is a densely connected fully convolutional network (DFCN) used for salient region detection. The MDNN simultaneously outputs saliency maps and salient object subitizing. Then, an adaptive deep feature-based spectral clustering operation segments the salient regions into instances based on the subitizing and saliency maps. The experimental results on both salient region detection and salient instance segmentation datasets demonstrate the satisfactory performance of our framework. Notably, its APr@0.5 and APr@0.7 reach 73.46% and 60.14% on the salient instance dataset, substantially higher than the results achieved by the state-of-the-art algorithm.
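The last stage described above splits the salient region into as many groups as the subitizing branch predicts. The sketch below illustrates only that idea, and it deliberately swaps in plain k-means on pixel coordinates for the paper's adaptive deep feature-based spectral clustering. The function name `split_instances`, the farthest-point initialization, and the toy mask are assumptions made for illustration.

```python
def dist2(a, b):
    """Squared Euclidean distance between two (row, col) points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def split_instances(mask, count, iters=10):
    """Group salient pixels into `count` clusters with k-means on coordinates.
    Returns a dict mapping (row, col) -> cluster index."""
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels or count < 1:
        return {}
    # Farthest-point initialization: deterministic and spreads the centers out.
    centers = [pixels[0]]
    while len(centers) < count:
        centers.append(max(pixels, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        # Assign every pixel to its nearest center.
        labels = [min(range(count), key=lambda k: dist2(p, centers[k])) for p in pixels]
        # Move each center to the mean of its assigned pixels.
        for k in range(count):
            members = [p for p, lab in zip(pixels, labels) if lab == k]
            if members:
                centers[k] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return dict(zip(pixels, labels))

# Hypothetical binary saliency mask with two blobs and a predicted count of 2.
mask = [[1, 1, 0, 0, 0, 1, 1],
        [1, 1, 0, 0, 0, 1, 1]]
instances = split_instances(mask, count=2)
```

Clustering on coordinates alone cannot separate touching or overlapping objects, which is one reason a feature-based affinity, as in the abstract, is the more capable choice.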