CVOct 25, 2022Code
Synthetic Data Supervised Salient Object DetectionZhenyu Wu, Lin Wang, Wei Wang et al.
Although deep salient object detection (SOD) has achieved remarkable progress, deep SOD models are extremely data-hungry, requiring large-scale pixel-wise annotations to deliver such promising results. In this paper, we propose a novel yet effective method for SOD, coined SODGAN, which can generate infinite high-quality image-mask pairs requiring only a few labeled data, and these synthesized pairs can replace the human-labeled DUTS-TR to train any off-the-shelf SOD model. Its contribution is three-fold. 1) Our proposed diffusion embedding network can address the manifold mismatch and is tractable for the latent code generation, better matching with the ImageNet latent space. 2) For the first time, our proposed few-shot saliency mask generator can synthesize infinite accurate image synchronized saliency masks with a few labeled data. 3) Our proposed quality-aware discriminator can select highquality synthesized image-mask pairs from noisy synthetic data pool, improving the quality of synthetic data. For the first time, our SODGAN tackles SOD with synthetic data directly generated from the generative model, which opens up a new research paradigm for SOD. Extensive experimental results show that the saliency model trained on synthetic data can achieve $98.4\%$ F-measure of the saliency model trained on the DUTS-TR. Moreover, our approach achieves a new SOTA performance in semi/weakly-supervised methods, and even outperforms several fully-supervised SOTA methods. Code is available at https://github.com/wuzhenyubuaa/SODGAN
CVJul 2, 2022Code
Boundary-Guided Camouflaged Object DetectionYujia Sun, Shuo Wang, Chenglizhao Chen et al.
Camouflaged object detection (COD), segmenting objects that are elegantly blended into their surroundings, is a valuable yet challenging task. Existing deep-learning methods often fall into the difficulty of accurately identifying the camouflaged object with complete and fine object structure. To this end, in this paper, we propose a novel boundary-guided network (BGNet) for camouflaged object detection. Our method explores valuable and extra object-related edge semantics to guide representation learning of COD, which forces the model to generate features that highlight object structure, thereby promoting camouflaged object detection of accurate boundary localization. Extensive experiments on three challenging benchmark datasets demonstrate that our BGNet significantly outperforms the existing 18 state-of-the-art methods under four widely-used evaluation metrics. Our code is publicly available at: https://github.com/thograce/BGNet.
CVOct 25, 2022Code
Salient Object Detection via Dynamic Scale RoutingZhenyu Wu, Shuai Li, Chenglizhao Chen et al.
Recent research advances in salient object detection (SOD) could largely be attributed to ever-stronger multi-scale feature representation empowered by the deep learning technologies. The existing SOD deep models extract multi-scale features via the off-the-shelf encoders and combine them smartly via various delicate decoders. However, the kernel sizes in this commonly-used thread are usually "fixed". In our new experiments, we have observed that kernels of small size are preferable in scenarios containing tiny salient objects. In contrast, large kernel sizes could perform better for images with large salient objects. Inspired by this observation, we advocate the "dynamic" scale routing (as a brand-new idea) in this paper. It will result in a generic plug-in that could directly fit the existing feature backbone. This paper's key technical innovations are two-fold. First, instead of using the vanilla convolution with fixed kernel sizes for the encoder design, we propose the dynamic pyramid convolution (DPConv), which dynamically selects the best-suited kernel sizes w.r.t. the given input. Second, we provide a self-adaptive bidirectional decoder design to accommodate the DPConv-based encoder best. The most significant highlight is its capability of routing between feature scales and their dynamic collection, making the inference process scale-aware. As a result, this paper continues to enhance the current SOTA performance. Both the code and dataset are publicly available at https://github.com/wuzhenyubuaa/DPNet.
CVDec 13, 2022
Pixel is All You Need: Adversarial Trajectory-Ensemble Active Learning for Salient Object DetectionZhenyu Wu, Lin Wang, Wei Wang et al.
Although weakly-supervised techniques can reduce the labeling effort, it is unclear whether a saliency model trained with weakly-supervised data (e.g., point annotation) can achieve the equivalent performance of its fully-supervised version. This paper attempts to answer this unexplored question by proving a hypothesis: there is a point-labeled dataset where saliency models trained on it can achieve equivalent performance when trained on the densely annotated dataset. To prove this conjecture, we proposed a novel yet effective adversarial trajectory-ensemble active learning (ATAL). Our contributions are three-fold: 1) Our proposed adversarial attack triggering uncertainty can conquer the overconfidence of existing active learning methods and accurately locate these uncertain pixels. {2)} Our proposed trajectory-ensemble uncertainty estimation method maintains the advantages of the ensemble networks while significantly reducing the computational cost. {3)} Our proposed relationship-aware diversity sampling algorithm can conquer oversampling while boosting performance. Experimental results show that our ATAL can find such a point-labeled dataset, where a saliency model trained on it obtained $97\%$ -- $99\%$ performance of its fully-supervised version with only ten annotated points per image.
CVJun 20, 2022
A Novel Long-term Iterative Mining Scheme for Video Salient Object DetectionChenglizhao Chen, Hengsen Wang, Yuming Fang et al.
The existing state-of-the-art (SOTA) video salient object detection (VSOD) models have widely followed short-term methodology, which dynamically determines the balance between spatial and temporal saliency fusion by solely considering the current consecutive limited frames. However, the short-term methodology has one critical limitation, which conflicts with the real mechanism of our visual system -- a typical long-term methodology. As a result, failure cases keep showing up in the results of the current SOTA models, and the short-term methodology becomes the major technical bottleneck. To solve this problem, this paper proposes a novel VSOD approach, which performs VSOD in a complete long-term way. Our approach converts the sequential VSOD, a sequential task, to a data mining problem, i.e., decomposing the input video sequence to object proposals in advance and then mining salient object proposals as much as possible in an easy-to-hard way. Since all object proposals are simultaneously available, the proposed approach is a complete long-term approach, which can alleviate some difficulties rooted in conventional short-term approaches. In addition, we devised an online updating scheme that can grasp the most representative and trustworthy pattern profile of the salient objects, outputting framewise saliency maps with rich details and smoothing both spatially and temporally. The proposed approach outperforms almost all SOTA models on five widely used benchmark datasets.
LGApr 22, 2022
Log-based Sparse Nonnegative Matrix Factorization for Data RepresentationChong Peng, Yiqun Zhang, Yongyong Chen et al.
Nonnegative matrix factorization (NMF) has been widely studied in recent years due to its effectiveness in representing nonnegative data with parts-based representations. For NMF, a sparser solution implies better parts-based representation.However, current NMF methods do not always generate sparse solutions.In this paper, we propose a new NMF method with log-norm imposed on the factor matrices to enhance the sparseness.Moreover, we propose a novel column-wisely sparse norm, named $\ell_{2,\log}$-(pseudo) norm to enhance the robustness of the proposed method.The $\ell_{2,\log}$-(pseudo) norm is invariant, continuous, and differentiable.For the $\ell_{2,\log}$ regularized shrinkage problem, we derive a closed-form solution, which can be used for other general problems.Efficient multiplicative updating rules are developed for the optimization, which theoretically guarantees the convergence of the objective value sequence.Extensive experimental results confirm the effectiveness of the proposed method, as well as the enhanced sparseness and robustness.
CVJun 20, 2022
A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key!Chenglizhao Chen, Mengke Song, Wenfeng Song et al.
Video saliency detection (VSD) aims at fast locating the most attractive objects/things/patterns in a given video clip. Existing VSD-related works have mainly relied on the visual system but paid less attention to the audio aspect, while, actually, our audio system is the most vital complementary part to our visual system. Also, audio-visual saliency detection (AVSD), one of the most representative research topics for mimicking human perceptual mechanisms, is currently in its infancy, and none of the existing survey papers have touched on it, especially from the perspective of saliency detection. Thus, the ultimate goal of this paper is to provide an extensive review to bridge the gap between audio-visual fusion and saliency detection. In addition, as another highlight of this review, we have provided a deep insight into key factors which could directly determine the performances of AVSD deep models, and we claim that the audio-visual consistency degree (AVC) -- a long-overlooked issue, can directly influence the effectiveness of using audio to benefit its visual counterpart when performing saliency detection. Moreover, in order to make the AVC issue more practical and valuable for future followers, we have newly equipped almost all existing publicly available AVSD datasets with additional frame-wise AVC labels. Based on these upgraded datasets, we have conducted extensive quantitative evaluations to ground our claim on the importance of AVC in the AVSD task. In a word, both our ideas and new sets serve as a convenient platform with preliminaries and guidelines, all of which are very potential to facilitate future works in promoting state-of-the-art (SOTA) performance further.
CVSep 5, 2024
Unveiling Context-Related Anomalies: Knowledge Graph Empowered Decoupling of Scene and Action for Human-Related Video Anomaly DetectionChenglizhao Chen, Xinyu Liu, Mengke Song et al.
Detecting anomalies in human-related videos is crucial for surveillance applications. Current methods primarily include appearance-based and action-based techniques. Appearance-based methods rely on low-level visual features such as color, texture, and shape. They learn a large number of pixel patterns and features related to known scenes during training, making them effective in detecting anomalies within these familiar contexts. However, when encountering new or significantly changed scenes, i.e., unknown scenes, they often fail because existing SOTA methods do not effectively capture the relationship between actions and their surrounding scenes, resulting in low generalization. In contrast, action-based methods focus on detecting anomalies in human actions but are usually less informative because they tend to overlook the relationship between actions and their scenes, leading to incorrect detection. For instance, the normal event of running on the beach and the abnormal event of running on the street might both be considered normal due to the lack of scene information. In short, current methods struggle to integrate low-level visual and high-level action features, leading to poor anomaly detection in varied and complex scenes. To address this challenge, we propose a novel decoupling-based architecture for human-related video anomaly detection (DecoAD). DecoAD significantly improves the integration of visual and action features through the decoupling and interweaving of scenes and actions, thereby enabling a more intuitive and accurate understanding of complex behaviors and scenes. DecoAD supports fully supervised, weakly supervised, and unsupervised settings.
CVMay 16
Controlling Decision Drift in Multimodal Sentiment Analysis with Missing ModalitiesChenglizhao Chen, Yuchen Cao, Xinyu Liu et al.
Multimodal sentiment analysis relies on textual, acoustic, and visual signals, yet real-world data often suffer from modality missing and quality imbalance. Existing methods generate features for modality missing from available ones, but differences in expression mechanisms and sentiment dynamics across modalities may cause the generated features to deviate from true distributions and mislead prediction. In addition, unreliable modalities may dominate fusion, resulting in representation shift across modality combinations and unstable sentiment representations. To address these challenges, we propose a two-level reference alignment framework. The framework introduces stable references at the feature representation and sentiment decision levels to improve robustness under modality missing. First-level reference alignment leverages complete-modality samples to constrain representations and align different modality combinations into a shared sentiment space. Second-level reference alignment enforces cross-modal consistency at the decision level by suppressing unreliable modalities through prototype retrieval and voting. As a result, the framework maintains stable and reliable sentiment predictions under diverse missing-modality patterns. Experiments on CMU-MOSI and CMU-MOSEI show consistent improvements across various missing-modality settings. Under full-modality input, the proposed method achieves state-of-the-art performance, with ACC of 86.28% and 85.88%, and F1 of 86.24% and 85.86%.
CVDec 27, 2021Code
Weakly Supervised Visual-Auditory Fixation Prediction with Multigranularity PerceptionGuotao Wang, Chenglizhao Chen, Deng-Ping Fan et al.
Thanks to the rapid advances in deep learning techniques and the wide availability of large-scale training sets, the performance of video saliency detection models has been improving steadily and significantly. However, deep learning-based visualaudio fixation prediction is still in its infancy. At present, only a few visual-audio sequences have been furnished, with real fixations being recorded in real visual-audio environments. Hence, it would be neither efficient nor necessary to recollect real fixations under the same visual-audio circumstances. To address this problem, this paper promotes a novel approach in a weakly supervised manner to alleviate the demand of large-scale training sets for visual-audio model training. By using only the video category tags, we propose the selective class activation mapping (SCAM) and its upgrade (SCAM+). In the spatial-temporal-audio circumstance, the former follows a coarse-to-fine strategy to select the most discriminative regions, and these regions are usually capable of exhibiting high consistency with the real human-eye fixations. The latter equips the SCAM with an additional multi-granularity perception mechanism, making the whole process more consistent with that of the real human visual system. Moreover, we distill knowledge from these regions to obtain complete new spatial-temporal-audio (STA) fixation prediction (FP) networks, enabling broad applications in cases where video tags are not available. Without resorting to any real human-eye fixation, the performances of these STA FP networks are comparable to those of fully supervised networks. The code and results are publicly available at https://github.com/guotaowang/STANet.
CVJan 21
READ-Net: Clarifying Emotional Ambiguity via Adaptive Feature Recalibration for Audio-Visual Depression DetectionChenglizhao Chen, Boze Li, Mengke Song et al.
Depression is a severe global mental health issue that impairs daily functioning and overall quality of life. Although recent audio-visual approaches have improved automatic depression detection, methods that ignore emotional cues often fail to capture subtle depressive signals hidden within emotional expressions. Conversely, those incorporating emotions frequently confuse transient emotional expressions with stable depressive symptoms in feature representations, a phenomenon termed \emph{Emotional Ambiguity}, thereby leading to detection errors. To address this critical issue, we propose READ-Net, the first audio-visual depression detection framework explicitly designed to resolve Emotional Ambiguity through Adaptive Feature Recalibration (AFR). The core insight of AFR is to dynamically adjust the weights of emotional features to enhance depression-related signals. Rather than merely overlooking or naively combining emotional information, READ-Net innovatively identifies and preserves depressive-relevant cues within emotional features, while adaptively filtering out irrelevant emotional noise. This recalibration strategy significantly clarifies feature representations, and effectively mitigates the persistent challenge of emotional interference. Additionally, READ-Net can be easily integrated into existing frameworks for improved performance. Extensive evaluations on three publicly available datasets show that READ-Net outperforms state-of-the-art methods, with average gains of 4.55\% in accuracy and 1.26\% in F1-score, demonstrating its robustness to emotional disturbances and improving audio-visual depression detection.
CVApr 4
Determined by User Needs: A Salient Object Detection Rationale Beyond Conventional Visual StimuliChenglizhao Chen, Shujian Zhang, Luming Li et al.
Existing \textbf{s}alient \textbf{o}bject \textbf{d}etection (SOD) methods adopt a \textbf{passive} visual stimulus-based rationale--objects with the strongest visual stimuli are perceived as the user's primary focus (i.e., salient objects). They ignore the decisive role of users' \textbf{proactive needs} in segmenting salient objects--if a user has a need before seeing an image, the user's salient objects align with their needs, e.g., if a user's need is ``white apple'', when this user sees an image, the user's primary focus is on the ``white apple'' or ``the most white apple-like'' objects in the image. Such an oversight not only \textbf{fails to satisfy users}, but also \textbf{limits the development of downstream tasks}. For instance, in salient object ranking tasks, focusing solely on visual stimuli-based salient objects is insufficient for conducting an analysis of fine-grained relationships between users' viewing order (usually determined by user's needs) and scenes, which may result in wrong ranking results. Clearly, it is essential to detect salient objects based on user needs. Thus, we advocate a \textbf{User} \textbf{S}alient \textbf{O}bject \textbf{D}etection (UserSOD) task, which focuses on \textbf{detecting salient objects align with users' proactive needs when user have needs}. The main challenge for this new task is the lack of datasets for model training and testing.
CVJan 22, 2024
Colorectal Polyp Segmentation in the Deep Learning Era: A Comprehensive SurveyZhenyu Wu, Fengmao Lv, Chenglizhao Chen et al.
Colorectal polyp segmentation (CPS), an essential problem in medical image analysis, has garnered growing research attention. Recently, the deep learning-based model completely overwhelmed traditional methods in the field of CPS, and more and more deep CPS methods have emerged, bringing the CPS into the deep learning era. To help the researchers quickly grasp the main techniques, datasets, evaluation metrics, challenges, and trending of deep CPS, this paper presents a systematic and comprehensive review of deep-learning-based CPS methods from 2014 to 2023, a total of 115 technical papers. In particular, we first provide a comprehensive review of the current deep CPS with a novel taxonomy, including network architectures, level of supervision, and learning paradigm. More specifically, network architectures include eight subcategories, the level of supervision comprises six subcategories, and the learning paradigm encompasses 12 subcategories, totaling 26 subcategories. Then, we provided a comprehensive analysis the characteristics of each dataset, including the number of datasets, annotation types, image resolution, polyp size, contrast values, and polyp location. Following that, we summarized CPS's commonly used evaluation metrics and conducted a detailed analysis of 40 deep SOTA models, including out-of-distribution generalization and attribute-based performance analysis. Finally, we discussed deep learning-based CPS methods' main challenges and opportunities.
CVDec 6, 2023
Rethinking Object Saliency Ranking: A Novel Whole-flow Processing ParadigmMengke Song, Linfeng Li, Dunquan Wu et al.
Existing salient object detection methods are capable of predicting binary maps that highlight visually salient regions. However, these methods are limited in their ability to differentiate the relative importance of multiple objects and the relationships among them, which can lead to errors and reduced accuracy in downstream tasks that depend on the relative importance of multiple objects. To conquer, this paper proposes a new paradigm for saliency ranking, which aims to completely focus on ranking salient objects by their "importance order". While previous works have shown promising performance, they still face ill-posed problems. First, the saliency ranking ground truth (GT) orders generation methods are unreasonable since determining the correct ranking order is not well-defined, resulting in false alarms. Second, training a ranking model remains challenging because most saliency ranking methods follow the multi-task paradigm, leading to conflicts and trade-offs among different tasks. Third, existing regression-based saliency ranking methods are complex for saliency ranking models due to their reliance on instance mask-based saliency ranking orders. These methods require a significant amount of data to perform accurately and can be challenging to implement effectively. To solve these problems, this paper conducts an in-depth analysis of the causes and proposes a whole-flow processing paradigm of saliency ranking task from the perspective of "GT data generation", "network structure design" and "training protocol". The proposed approach outperforms existing state-of-the-art methods on the widely-used SALICON set, as demonstrated by extensive experiments with fair and reasonable comparisons. The saliency ranking task is still in its infancy, and our proposed unified framework can serve as a fundamental strategy to guide future work.
MMJan 15
EditEmoTalk: Controllable Speech-Driven 3D Facial Animation with Continuous Expression EditingDiqiong Jiang, Kai Zhu, Dan Song et al.
Speech-driven 3D facial animation aims to generate realistic and expressive facial motions directly from audio. While recent methods achieve high-quality lip synchronization, they often rely on discrete emotion categories, limiting continuous and fine-grained emotional control. We present EditEmoTalk, a controllable speech-driven 3D facial animation framework with continuous emotion editing. The key idea is a boundary-aware semantic embedding that learns the normal directions of inter-emotion decision boundaries, enabling a continuous expression manifold for smooth emotion manipulation. Moreover, we introduce an emotional consistency loss that enforces semantic alignment between the generated motion dynamics and the target emotion embedding through a mapping network, ensuring faithful emotional expression. Extensive experiments demonstrate that EditEmoTalk achieves superior controllability, expressiveness, and generalization while maintaining accurate lip synchronization. Code and pretrained models will be released.
RONov 26, 2025
AerialMind: Towards Referring Multi-Object Tracking in UAV ScenariosChenglizhao Chen, Shaofeng Liang, Runwei Guan et al.
Referring Multi-Object Tracking (RMOT) aims to achieve precise object detection and tracking through natural language instructions, representing a fundamental capability for intelligent robotic systems. However, current RMOT research remains mostly confined to ground-level scenarios, which constrains their ability to capture broad-scale scene contexts and perform comprehensive tracking and path planning. In contrast, Unmanned Aerial Vehicles (UAVs) leverage their expansive aerial perspectives and superior maneuverability to enable wide-area surveillance. Moreover, UAVs have emerged as critical platforms for Embodied Intelligence, which has given rise to an unprecedented demand for intelligent aerial systems capable of natural language interaction. To this end, we introduce AerialMind, the first large-scale RMOT benchmark in UAV scenarios, which aims to bridge this research gap. To facilitate its construction, we develop an innovative semi-automated collaborative agent-based labeling assistant (COALA) framework that significantly reduces labor costs while maintaining annotation quality. Furthermore, we propose HawkEyeTrack (HETrack), a novel method that collaboratively enhances vision-language representation learning and improves the perception of UAV scenarios. Comprehensive experiments validated the challenging nature of our dataset and the effectiveness of our method.
CVJul 17, 2025
Camera-based implicit mind reading by capturing higher-order semantic dynamics of human gaze within environmental contextMengke Song, Yuge Xie, Qi Cui et al.
Emotion recognition,as a step toward mind reading,seeks to infer internal states from external cues.Most existing methods rely on explicit signals-such as facial expressions,speech,or gestures-that reflect only bodily responses and overlook the influence of environmental context.These cues are often voluntary,easy to mask,and insufficient for capturing deeper,implicit emotions. Physiological signal-based approaches offer more direct access to internal states but require complex sensors that compromise natural behavior and limit scalability.Gaze-based methods typically rely on static fixation analysis and fail to capture the rich,dynamic interactions between gaze and the environment,and thus cannot uncover the deep connection between emotion and implicit behavior.To address these limitations,we propose a novel camera-based,user-unaware emotion recognition approach that integrates gaze fixation patterns with environmental semantics and temporal dynamics.Leveraging standard HD cameras,our method unobtrusively captures users'eye appearance and head movements in natural settings-without the need for specialized hardware or active user participation.From these visual cues,the system estimates gaze trajectories over time and space, providing the basis for modeling the spatial, semantic,and temporal dimensions of gaze behavior. This allows us to capture the dynamic interplay between visual attention and the surrounding environment,revealing that emotions are not merely physiological responses but complex outcomes of human-environment interactions.The proposed approach enables user-unaware,real-time,and continuous emotion recognition,offering high generalizability and low deployment cost.
CVMay 23, 2023
WinDB: HMD-free and Distortion-free Panoptic Video Fixation LearningGuotao Wang, Chenglizhao Chen, Aimin Hao et al.
To date, the widely adopted way to perform fixation collection in panoptic video is based on a head-mounted display (HMD), where users' fixations are collected while wearing an HMD to explore the given panoptic scene freely. However, this widely-used data collection method is insufficient for training deep models to accurately predict which regions in a given panoptic are most important when it contains intermittent salient events. The main reason is that there always exist "blind zooms" when using HMD to collect fixations since the users cannot keep spinning their heads to explore the entire panoptic scene all the time. Consequently, the collected fixations tend to be trapped in some local views, leaving the remaining areas to be the "blind zooms". Therefore, fixation data collected using HMD-based methods that accumulate local views cannot accurately represent the overall global importance - the main purpose of fixations - of complex panoptic scenes. To conquer, this paper introduces the auxiliary window with a dynamic blurring (WinDB) fixation collection approach for panoptic video, which doesn't need HMD and is able to well reflect the regional-wise importance degree. Using our WinDB approach, we have released a new PanopticVideo-300 dataset, containing 300 panoptic clips covering over 225 categories. Specifically, since using WinDB to collect fixations is blind zoom free, there exists frequent and intensive "fixation shifting" - a very special phenomenon that has long been overlooked by the previous research - in our new set. Thus, we present an effective fixation shifting network (FishNet) to conquer it. All these new fixation collection tool, dataset, and network could be very potential to open a new age for fixation-related research and applications in 360o environments.
IVJan 8, 2022
Hyperspectral Image Denoising Using Non-convex Local Low-rank and Sparse Separation with Spatial-Spectral Total Variation RegularizationChong Peng, Yang Liu, Yongyong Chen et al.
In this paper, we propose a novel nonconvex approach to robust principal component analysis for HSI denoising, which focuses on simultaneously developing more accurate approximations to both rank and column-wise sparsity for the low-rank and sparse components, respectively. In particular, the new method adopts the log-determinant rank approximation and a novel $\ell_{2,\log}$ norm, to restrict the local low-rank or column-wisely sparse properties for the component matrices, respectively. For the $\ell_{2,\log}$-regularized shrinkage problem, we develop an efficient, closed-form solution, which is named $\ell_{2,\log}$-shrinkage operator. The new regularization and the corresponding operator can be generally used in other problems that require column-wise sparsity. Moreover, we impose the spatial-spectral total variation regularization in the log-based nonconvex RPCA model, which enhances the global piece-wise smoothness and spectral consistency from the spatial and spectral views in the recovered HSI. Extensive experiments on both simulated and real HSIs demonstrate the effectiveness of the proposed method in denoising HSIs.
CVApr 3, 2021
Mutual Graph Learning for Camouflaged Object DetectionQiang Zhai, Xin Li, Fan Yang et al.
Automatically detecting/segmenting object(s) that blend in with their surroundings is difficult for current models. A major challenge is that the intrinsic similarities between such foreground objects and background surroundings make the features extracted by deep model indistinguishable. To overcome this challenge, an ideal model should be able to seek valuable, extra clues from the given scene and incorporate them into a joint learning framework for representation co-enhancement. With this inspiration, we design a novel Mutual Graph Learning (MGL) model, which generalizes the idea of conventional mutual learning from regular grids to the graph domain. Specifically, MGL decouples an image into two task-specific feature maps -- one for roughly locating the target and the other for accurately capturing its boundary details -- and fully exploits the mutual benefits by recurrently reasoning their high-order relations through graphs. Importantly, in contrast to most mutual learning approaches that use a shared function to model all between-task interactions, MGL is equipped with typed functions for handling different complementary relations to maximize information interactions. Experiments on challenging datasets, including CHAMELEON, CAMO and COD10K, demonstrate the effectiveness of our MGL with superior performance to existing state-of-the-art methods.
CVNov 3, 2020
Kernel Two-Dimensional Ridge Regression for Subspace ClusteringChong Peng, Qian Zhang, Zhao Kang et al.
Subspace clustering methods have been widely studied recently. When the inputs are 2-dimensional (2D) data, existing subspace clustering methods usually convert them into vectors, which severely damages inherent structures and relationships from original data. In this paper, we propose a novel subspace clustering method for 2D data. It directly uses 2D data as inputs such that the learning of representations benefits from inherent structures and relationships of the data. It simultaneously seeks image projection and representation coefficients such that they mutually enhance each other and lead to powerful data representations. An efficient algorithm is developed to solve the proposed objective function with provable decreasing and convergence property. Extensive experimental results verify the effectiveness of the new method.
CVAug 10, 2020
Rethinking of the Image Salient Object Detection: Object-level Semantic Saliency Re-ranking First, Pixel-wise Saliency Refinement LatterZhenyu Wu, Shuai Li, Chenglizhao Chen et al.
The real human attention is an interactive activity between our visual system and our brain, using both low-level visual stimulus and high-level semantic information. Previous image salient object detection (SOD) works conduct their saliency predictions in a multi-task manner, i.e., performing pixel-wise saliency regression and segmentation-like saliency refinement at the same time, which degenerates their feature backbones in revealing semantic information. However, given an image, we tend to pay more attention to those regions which are semantically salient even in the case that these regions are perceptually not the most salient ones at first glance. In this paper, we divide the SOD problem into two sequential tasks: 1) we propose a lightweight, weakly supervised deep network to coarsely locate those semantically salient regions first; 2) then, as a post-processing procedure, we selectively fuse multiple off-the-shelf deep models on these semantically salient regions as the pixel-wise saliency refinement. In sharp contrast to the state-of-the-art (SOTA) methods that focus on learning pixel-wise saliency in "single image" using perceptual clues mainly, our method has investigated the "object-level semantic ranks between multiple images", of which the methodology is more consistent with the real human attention mechanism. Our method is simple yet effective, which is the first attempt to consider the salient object detection mainly as an object-level semantic re-ranking problem.
CVAug 7, 2020
Knowing Depth Quality In Advance: A Depth Quality Assessment Method For RGB-D Salient Object DetectionXuehao Wang, Shuai Li, Chenglizhao Chen et al.
Previous RGB-D salient object detection (SOD) methods have widely adopted deep learning tools to automatically strike a trade-off between RGB and D (depth), whose key rationale is to take full advantage of their complementary nature, aiming for a much-improved SOD performance than that of using either of them solely. However, such fully automatic fusions may not always be helpful for the SOD task because the D quality itself usually varies from scene to scene. It may easily lead to a suboptimal fusion result if the D quality is not considered beforehand. Moreover, as an objective factor, the D quality has long been overlooked by previous work. As a result, it is becoming a clear performance bottleneck. Thus, we propose a simple yet effective scheme to measure D quality in advance, the key idea of which is to devise a series of features in accordance with the common attributes of high-quality D regions. To be more concrete, we conduct D quality assessments for each image region, following a multi-scale methodology that includes low-level edge consistency, mid-level regional uncertainty and high-level model variance. All these components will be computed independently and then be assembled with RGB and D features, applied as implicit indicators, to guide the selective fusion. Compared with the state-of-the-art fusion schemes, our method can achieve a more reasonable fusion status between RGB and D. Specifically, the proposed D quality measurement method achieves steady performance improvements for almost 2.0\% in general.
CVAug 7, 2020
Recursive Multi-model Complementary Deep Fusion forRobust Salient Object Detection via Parallel Sub NetworksZhenyu Wu, Shuai Li, Chenglizhao Chen et al.
Fully convolutional networks have shown outstanding performance in the salient object detection (SOD) field. The state-of-the-art (SOTA) methods have a tendency to become deeper and more complex, which easily homogenize their learned deep features, resulting in a clear performance bottleneck. In sharp contrast to the conventional ``deeper'' schemes, this paper proposes a ``wider'' network architecture which consists of parallel sub networks with totally different network architectures. In this way, those deep features obtained via these two sub networks will exhibit large diversity, which will have large potential to be able to complement with each other. However, a large diversity may easily lead to the feature conflictions, thus we use the dense short-connections to enable a recursively interaction between the parallel sub networks, pursuing an optimal complementary status between multi-model deep features. Finally, all these complementary multi-model deep features will be selectively fused to make high-performance salient object detections. Extensive experiments on several famous benchmarks clearly demonstrate the superior performance, good generalization, and powerful learning ability of the proposed wider framework.
CVAug 7, 2020
Full Reference Screen Content Image Quality Assessment by Fusing Multi-level Structure SimilarityChenglizhao Chen, Hongmeng Zhao, Huan Yang et al.
The screen content images (SCIs) usually comprise various content types with sharp edges, in which the artifacts or distortions can be well sensed by the vanilla structure similarity measurement in a full reference manner. Nonetheless, almost all of the current SOTA structure similarity metrics are "locally" formulated in a single-level manner, while the true human visual system (HVS) follows the multi-level manner, and such mismatch could eventually prevent these metrics from achieving trustworthy quality assessment. To ameliorate, this paper advocates a novel solution to measure structure similarity "globally" from the perspective of sparse representation. To perform multi-level quality assessment in accordance with the real HVS, the above-mentioned global metric will be integrated with the conventional local ones by resorting to the newly devised selective deep fusion network. To validate its efficacy and effectiveness, we have compared our method with 12 SOTA methods over two widely-used large-scale public SCI datasets, and the quantitative results indicate that our method yields significantly higher consistency with subjective quality score than the currently leading works. Both the source code and data are also publicly available to gain widespread acceptance and facilitate new advancement and its validation.
CVAug 7, 2020
Data-Level Recombination and Lightweight Fusion Scheme for RGB-D Salient Object DetectionXuehao Wang, Shuai Li, Chenglizhao Chen et al.
Existing RGB-D salient object detection methods treat depth information as an independent component to complement its RGB part, and widely follow the bi-stream parallel network architecture. To selectively fuse the CNNs features extracted from both RGB and depth as a final result, the state-of-the-art (SOTA) bi-stream networks usually consist of two independent subbranches; i.e., one subbranch is used for RGB saliency and the other aims for depth saliency. However, its depth saliency is persistently inferior to the RGB saliency because the RGB component is intrinsically more informative than the depth component. The bi-stream architecture easily biases its subsequent fusion procedure to the RGB subbranch, leading to a performance bottleneck. In this paper, we propose a novel data-level recombination strategy to fuse RGB with D (depth) before deep feature extraction, where we cyclically convert the original 4-dimensional RGB-D into \textbf{D}GB, R\textbf{D}B and RG\textbf{D}. Then, a newly lightweight designed triple-stream network is applied over these novel formulated data to achieve an optimal channel-wise complementary fusion status between the RGB and D, achieving a new SOTA performance.
CVAug 7, 2020
Depth Quality Aware Salient Object DetectionChenglizhao Chen, Jipeng Wei, Chong Peng et al.
The existing fusion based RGB-D salient object detection methods usually adopt the bi-stream structure to strike the fusion trade-off between RGB and depth (D). The D quality usually varies from scene to scene, while the SOTA bi-stream approaches are depth quality unaware, which easily result in substantial difficulties in achieving complementary fusion status between RGB and D, leading to poor fusion results in facing of low-quality D. Thus, this paper attempts to integrate a novel depth quality aware subnet into the classic bi-stream structure, aiming to assess the depth quality before conducting the selective RGB-D fusion. Compared with the SOTA bi-stream methods, the major highlight of our method is its ability to lessen the importance of those low-quality, no-contribution, or even negative-contribution D regions during the RGB-D fusion, achieving a much improved complementary status between RGB and D.
CVAug 7, 2020
Exploring Rich and Efficient Spatial Temporal Interactions for Real Time Video Salient Object DetectionChenglizhao Chen, Guotao Wang, Chong Peng et al.
The current main stream methods formulate their video saliency mainly from two independent venues, i.e., the spatial and temporal branches. As a complementary component, the main task for the temporal branch is to intermittently focus the spatial branch on those regions with salient movements. In this way, even though the overall video saliency quality is heavily dependent on its spatial branch, however, the performance of the temporal branch still matter. Thus, the key factor to improve the overall video saliency is how to further boost the performance of these branches efficiently. In this paper, we propose a novel spatiotemporal network to achieve such improvement in a full interactive fashion. We integrate a lightweight temporal model into the spatial branch to coarsely locate those spatially salient regions which are correlated with trustworthy salient movements. Meanwhile, the spatial branch itself is able to recurrently refine the temporal model in a multi-scale manner. In this way, both the spatial and temporal branches are able to interact with each other, achieving the mutual performance improvement. Our method is easy to implement yet effective, achieving high quality video saliency detection in real-time speed with 50 FPS.
CVAug 7, 2020
A Novel Video Salient Object Detection Method via Semi-supervised Motion Quality PerceptionChenglizhao Chen, Jia Song, Chong Peng et al.
Previous video salient object detection (VSOD) approaches have mainly focused on designing fancy networks to achieve their performance improvements. However, with the slow-down in development of deep learning techniques recently, it may become more and more difficult to anticipate another breakthrough via fancy networks solely. To this end, this paper proposes a universal learning scheme to get a further 3\% performance improvement for all state-of-the-art (SOTA) methods. The major highlight of our method is that we resort the "motion quality"---a brand new concept, to select a sub-group of video frames from the original testing set to construct a new training set. The selected frames in this new training set should all contain high-quality motions, in which the salient objects will have large probability to be successfully detected by the "target SOTA method"---the one we want to improve. Consequently, we can achieve a significant performance improvement by using this new training set to start a new round of network training. During this new round training, the VSOD results of the target SOTA method will be applied as the pseudo training objectives. Our novel learning scheme is simple yet effective, and its semi-supervised methodology may have large potential to inspire the VSOD community in the future.
CVAug 7, 2020
A Deeper Look at Salient Object Detection: Bi-stream Network with a Small Training DatasetZhenyu Wu, Shuai Li, Chenglizhao Chen et al.
Compared with the conventional hand-crafted approaches, the deep learning based methods have achieved tremendous performance improvements by training exquisitely crafted fancy networks over large-scale training sets. However, do we really need large-scale training set for salient object detection (SOD)? In this paper, we provide a deeper insight into the interrelationship between the SOD performances and the training sets. To alleviate the conventional demands for large-scale training data, we provide a feasible way to construct a novel small-scale training set, which only contains 4K images. Moreover, we propose a novel bi-stream network to take full advantage of our proposed small training set, which is consisted of two feature backbones with different structures, achieving complementary semantical saliency fusion via the proposed gate control unit. To our best knowledge, this is the first attempt to use a small-scale training set to outperform state-of-the-art models which are trained on large-scale training sets; nevertheless, our method can still achieve the leading state-of-the-art performance on five benchmark datasets.
CVAug 2, 2020
A Plug-and-play Scheme to Adapt Image Saliency Deep Model for Video DataYunxiao Li, Shuai Li, Chenglizhao Chen et al.
With the rapid development of deep learning techniques, image saliency deep models trained solely by spatial information have occasionally achieved detection performance for video data comparable to that of the models trained by both spatial and temporal information. However, due to the lesser consideration of temporal information, the image saliency deep models may become fragile in the video sequences dominated by temporal information. Thus, the most recent video saliency detection approaches have adopted the network architecture starting with a spatial deep model that is followed by an elaborately designed temporal deep model. However, such methods easily encounter the performance bottleneck arising from the single stream learning methodology, so the overall detection performance is largely determined by the spatial deep model. In sharp contrast to the current mainstream methods, this paper proposes a novel plug-and-play scheme to weakly retrain a pretrained image saliency deep model for video data by using the newly sensed and coded temporal information. Thus, the retrained image saliency deep model will be able to maintain temporal saliency awareness, achieving much improved detection performance. Moreover, our method is simple yet effective for adapting any off-the-shelf pre-trained image saliency deep model to obtain high-quality video saliency detection. Additionally, both the data and source code of our method are publicly available.
LGMay 19, 2020
Two-Dimensional Semi-Nonnegative Matrix Factorization for ClusteringChong Peng, Zhilu Zhang, Zhao Kang et al.
In this paper, we propose a new Semi-Nonnegative Matrix Factorization method for 2-dimensional (2D) data, named TS-NMF. It overcomes the drawback of existing methods that seriously damage the spatial information of the data by converting 2D data to vectors in a preprocessing step. In particular, projection matrices are sought under the guidance of building new data representations, such that the spatial information is retained and projections are enhanced by the goal of clustering, which helps construct optimal projection directions. Moreover, to exploit nonlinear structures of the data, manifold is constructed in the projected subspace, which is adaptively updated according to the projections and less afflicted with noise and outliers of the data and thus more representative in the projected space. Hence, seeking projections, building new data representations, and learning manifold are seamlessly integrated in a single model, which mutually enhance other and lead to a powerful data representation. Comprehensive experimental results verify the effectiveness of TS-NMF in comparison with several state-of-the-art algorithms, which suggests high potential of the proposed method for real world applications.
LGJul 9, 2019
Nonnegative Matrix Factorization with Local Similarity LearningChong Peng, Zhao Kang, Chenglizhao Chen et al.
Existing nonnegative matrix factorization methods focus on learning global structure of the data to construct basis and coefficient matrices, which ignores the local structure that commonly exists among data. In this paper, we propose a new type of nonnegative matrix factorization method, which learns local similarity and clustering in a mutually enhancing way. The learned new representation is more representative in that it better reveals inherent geometric property of the data. Nonlinear expansion is given and efficient multiplicative updates are developed with theoretical convergence guarantees. Extensive experimental results have confirmed the effectiveness of the proposed model.
LGApr 16, 2019
RES-PCA: A Scalable Approach to Recovering Low-rank MatricesChong Peng, Chenglizhao Chen, Zhao Kang et al.
Robust principal component analysis (RPCA) has drawn significant attentions due to its powerful capability in recovering low-rank matrices as well as successful appplications in various real world problems. The current state-of-the-art algorithms usually need to solve singular value decomposition of large matrices, which generally has at least a quadratic or even cubic complexity. This drawback has limited the application of RPCA in solving real world problems. To combat this drawback, in this paper we propose a new type of RPCA method, RES-PCA, which is linearly efficient and scalable in both data size and dimension. For comparison purpose, AltProj, an existing scalable approach to RPCA requires the precise knowlwdge of the true rank; otherwise, it may fail to recover low-rank matrices. By contrast, our method works with or without knowing the true rank; even when both methods work, our method is faster. Extensive experiments have been performed and testified to the effectiveness of proposed method quantitatively and in visual quality, which suggests that our method is suitable to be employed as a light-weight, scalable component for RPCA in any application pipelines.