CVJul 8, 2023Code
Camouflaged Object Detection with Feature Grafting and Distractor AwareYuxuan Song, Xinyue Li, Lin Qi
The task of Camouflaged Object Detection (COD) aims to accurately segment camouflaged objects that integrated into the environment, which is more challenging than ordinary detection as the texture between the target and background is visually indistinguishable. In this paper, we proposed a novel Feature Grafting and Distractor Aware network (FDNet) to handle the COD task. Specifically, we use CNN and Transformer to encode multi-scale images in parallel. In order to better explore the advantages of the two encoders, we design a cross-attention-based Feature Grafting Module to graft features extracted from Transformer branch into CNN branch, after which the features are aggregated in the Feature Fusion Module. A Distractor Aware Module is designed to explicitly model the two possible distractors in the COD task to refine the coarse camouflage map. We also proposed the largest artificial camouflaged object dataset which contains 2000 images with annotations, named ACOD2K. We conducted extensive experiments on four widely used benchmark datasets and the ACOD2K dataset. The results show that our method significantly outperforms other state-of-the-art methods. The code and the ACOD2K will be available at https://github.com/syxvision/FDNet.
IVApr 22, 2023
SAWU-Net: Spatial Attention Weighted Unmixing Network for Hyperspectral ImagesLin Qi, Xuewen Qin, Feng Gao et al.
Hyperspectral unmixing is a critical yet challenging task in hyperspectral image interpretation. Recently, great efforts have been made to solve the hyperspectral unmixing task via deep autoencoders. However, existing networks mainly focus on extracting spectral features from mixed pixels, and the employment of spatial feature prior knowledge is still insufficient. To this end, we put forward a spatial attention weighted unmixing network, dubbed as SAWU-Net, which learns a spatial attention network and a weighted unmixing network in an end-to-end manner for better spatial feature exploitation. In particular, we design a spatial attention module, which consists of a pixel attention block and a window attention block to efficiently model pixel-based spectral information and patch-based spatial information, respectively. While in the weighted unmixing framework, the central pixel abundance is dynamically weighted by the coarse-grained abundances of surrounding pixels. In addition, SAWU-Net generates dynamically adaptive spatial weights through the spatial attention mechanism, so as to dynamically integrate surrounding pixels more effectively. Experimental results on real and synthetic datasets demonstrate the better accuracy and superiority of SAWU-Net, which reflects the effectiveness of the proposed spatial attention mechanism.
CVJul 8, 2023Code
Edge-Aware Mirror Network for Camouflaged Object DetectionDongyue Sun, Shiyao Jiang, Lin Qi
Existing edge-aware camouflaged object detection (COD) methods normally output the edge prediction in the early stage. However, edges are important and fundamental factors in the following segmentation task. Due to the high visual similarity between camouflaged targets and the surroundings, edge prior predicted in early stage usually introduces erroneous foreground-background and contaminates features for segmentation. To tackle this problem, we propose a novel Edge-aware Mirror Network (EAMNet), which models edge detection and camouflaged object segmentation as a cross refinement process. More specifically, EAMNet has a two-branch architecture, where a segmentation-induced edge aggregation module and an edge-induced integrity aggregation module are designed to cross-guide the segmentation branch and edge detection branch. A guided-residual channel attention module which leverages the residual connection and gated convolution finally better extracts structural details from low-level features. Quantitative and qualitative experiment results show that EAMNet outperforms existing cutting-edge baselines on three widely used COD datasets. Codes are available at https://github.com/sdy1999/EAMNet.
CVMay 25
Weakly Supervised Camouflaged Object Detection Based on the SAM Model and Mask GuidanceXia Li, Xinran Liu, Lin Qi et al.
Camouflaged object detection (COD) from a single image is a challenging task due to the high similarity between objects and their surroundings. Existing fully supervised methods require labor-intensive pixel-level annotations, making weakly supervised methods a viable compromise that balances accuracy and annotation efficiency. However, weakly supervised methods often experience performance degradation due to the use of coarse annotations. In this paper, we introduce a new weakly supervised approach for camouflaged object detection to overcome these limitations. Specifically, we propose a novel network, MGNet, which tackles edge ambiguity and missed detections by utilizing initial masks generated by our custom-designed Cascaded Mask Decoder (CMD) to guide the segmentation process and enhance edge predictions. We introduce a Context Enhancement Module(CEM) to reduce the missing detection, and a Mask-guided Feature Aggregation Module (MFAM) for effective feature aggregation. For the weak supervision challenge, we propose BoxSAM, which leverages the Segment Anything Model (SAM) with bounding-box prompts to generate pseudo-labels. By employing a redundant processing strategy, high quality pixel-level pseudo-labels are provided for training MGNet. Extensive experiments demonstrate that our method delivers competitive performance against current state-of-the-art methods.
MMSep 28, 2023
CLIP-Hand3D: Exploiting 3D Hand Pose Estimation via Context-Aware PromptingShaoxiang Guo, Qing Cai, Lin Qi et al.
Contrastive Language-Image Pre-training (CLIP) starts to emerge in many computer vision tasks and has achieved promising performance. However, it remains underexplored whether CLIP can be generalized to 3D hand pose estimation, as bridging text prompts with pose-aware features presents significant challenges due to the discrete nature of joint positions in 3D space. In this paper, we make one of the first attempts to propose a novel 3D hand pose estimator from monocular images, dubbed as CLIP-Hand3D, which successfully bridges the gap between text prompts and irregular detailed pose distribution. In particular, the distribution order of hand joints in various 3D space directions is derived from pose labels, forming corresponding text prompts that are subsequently encoded into text representations. Simultaneously, 21 hand joints in the 3D space are retrieved, and their spatial distribution (in x, y, and z axes) is encoded to form pose-aware features. Subsequently, we maximize semantic consistency for a pair of pose-text features following a CLIP-based contrastive learning paradigm. Furthermore, a coarse-to-fine mesh regressor is designed, which is capable of effectively querying joint-aware cues from the feature pyramid. Extensive experiments on several public hand benchmarks show that the proposed model attains a significantly faster inference speed while achieving state-of-the-art performance compared to methods utilizing the similar scale backbone.
CVMay 21
Multi-scale interaction network for stereo image super-resolutionLiyi Xu, Lin Qi
Stereo image super-resolution aims to generate high-resolution images by leveraging complementary information from binocular systems. Although previous studies have achieved impressive results, the potential of intra-view and cross-view information has not been fully exploited. To address this issue, we propose a novel multi-scale interaction network for stereo image super-resolution. Specifically, we design a Multi-scale Spatial-Channel Attention Module that utilizes multi-scale large separable kernel attention and simple channel attention to improve intra-view feature extraction. Additionally, we propose a Dual-View Epipolar Attention Module, utilizing an optimal transport algorithm to achieve more accurate matching along the epipolar line. Extensive experimental and ablation studies show that our method achieves competitive results that outperform most SOTA methods.
IVApr 30Code
Spectral Dynamic Attention Network for Hyperspectral Image Super-ResolutionTengya Zhang, Feng Gao, Lin Qi et al.
Hyperspectral image super-resolution is essential for enhancing the spatial fidelity of HSI data, yet existing deep learning methods often struggle with substantial spectral redundancy and the limited non-linear modeling capacity of standard feed-forward networks (FFNs). To address these challenges, we propose Spectral Dynamic Attention Network (SDANet), a framework designed to adaptively suppress redundant spectral interactions. SDANet integrates two key components: 1) Dynamic Channel Sparse Attention (DCSA) module that computes channel-wise correlations and selectively preserves the most informative attention responses through dynamic and data-dependent sparsification. 2) Frequency-Enhanced Feed-Forward Network (FE-FFN) that jointly models spatial and frequency-domain representations to enhance non-linear expressiveness. Extensive experiments on two benchmark datasets demonstrate that SDANet achieves state-of-the-art HISR performance while maintaining competitive efficiency. The code will be made publicly available at https://github.com/oucailab/SDANet.
IVAug 22, 2024
Hierarchical Attention and Parallel Filter Fusion Network for Multi-Source Data ClassificationHan Luo, Feng Gao, Junyu Dong et al.
Hyperspectral image (HSI) and synthetic aperture radar (SAR) data joint classification is a crucial and yet challenging task in the field of remote sensing image interpretation. However, feature modeling in existing methods is deficient to exploit the abundant global, spectral, and local features simultaneously, leading to sub-optimal classification performance. To solve the problem, we propose a hierarchical attention and parallel filter fusion network for multi-source data classification. Concretely, we design a hierarchical attention module for hyperspectral feature extraction. This module integrates global, spectral, and local features simultaneously to provide more comprehensive feature representation. In addition, we develop parallel filter fusion module which enhances cross-modal feature interactions among different spatial locations in the frequency domain. Extensive experiments on two multi-source remote sensing data classification datasets verify the superiority of our proposed method over current state-of-the-art classification approaches. Specifically, our proposed method achieves 91.44% and 80.51% of overall accuracy (OA) on the respective datasets, highlighting its superior performance.
IVMar 10, 2025Code
Dynamic Cross-Modal Feature Interaction Network for Hyperspectral and LiDAR Data ClassificationJunyan Lin, Feng Gap, Lin Qi et al.
Hyperspectral image (HSI) and LiDAR data joint classification is a challenging task. Existing multi-source remote sensing data classification methods often rely on human-designed frameworks for feature extraction, which heavily depend on expert knowledge. To address these limitations, we propose a novel Dynamic Cross-Modal Feature Interaction Network (DCMNet), the first framework leveraging a dynamic routing mechanism for HSI and LiDAR classification. Specifically, our approach introduces three feature interaction blocks: Bilinear Spatial Attention Block (BSAB), Bilinear Channel Attention Block (BCAB), and Integration Convolutional Block (ICB). These blocks are designed to effectively enhance spatial, spectral, and discriminative feature interactions. A multi-layer routing space with routing gates is designed to determine optimal computational paths, enabling data-dependent feature fusion. Additionally, bilinear attention mechanisms are employed to enhance feature interactions in spatial and channel representations. Extensive experiments on three public HSI and LiDAR datasets demonstrate the superiority of DCMNet over state-of-the-art methods. Our code will be available at https://github.com/oucailab/DCMNet.
CVMar 5Code
Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation ModelYulong Shi, Shijie Li, Ziyi Li et al.
Source Free Unsupervised Domain Adaptation (SFUDA) is critical for deploying deep learning models across diverse clinical settings. However, existing methods are typically designed for low-gap, specific domain shifts and cannot generalize into a unified, multi-modalities, and multi-target framework, which presents a major barrier to real-world application. To overcome this issue, we introduce Tell2Adapt, a novel SFUDA framework that harnesses the vast, generalizable knowledge of the Vision Foundation Model (VFM). Our approach ensures high-fidelity VFM prompts through Context-Aware Prompts Regularization (CAPR), which robustly translates varied text prompts into canonical instructions. This enables the generation of high-quality pseudo-labels for efficiently adapting the lightweight student model to target domain. To guarantee clinical reliability, the framework incorporates Visual Plausibility Refinement (VPR), which leverages the VFM's anatomical knowledge to re-ground the adapted model's predictions in target image's low-level visual features, effectively removing noise and false positives. We conduct one of the most extensive SFUDA evaluations to date, validating our framework across 10 domain adaptation directions and 22 anatomical targets, including brain, cardiac, polyp, and abdominal targets. Our results demonstrate that Tell2Adapt consistently outperforms existing approaches, achieving SOTA for a unified SFUDA framework in medical image segmentation. Code are avaliable at https://github.com/derekshiii/Tell2Adapt.
CVNov 22, 2025Code
HEAL: Learning-Free Source Free Unsupervised Domain Adaptation for Cross-Modality Medical Image SegmentationYulong Shi, Jiapeng Li, Lin Qi
Growing demands for clinical data privacy and storage constraints have spurred advances in Source Free Unsupervised Domain Adaptation (SFUDA). SFUDA addresses the domain shift by adapting models from the source domain to the unseen target domain without accessing source data, even when target-domain labels are unavailable. However, SFUDA faces significant challenges: the absence of source domain data and label supervision in the target domain due to source free and unsupervised settings. To address these issues, we propose HEAL, a novel SFUDA framework that integrates Hierarchical denoising, Edge-guided selection, size-Aware fusion, and Learning-free characteristic. Large-scale cross-modality experiments demonstrate that our method outperforms existing SFUDA approaches, achieving state-of-the-art (SOTA) performance. The source code is publicly available at: https://github.com/derekshiii/HEAL.
CVJun 24, 2024Code
Exploring Cross-Domain Few-Shot Classification via Frequency-Aware PromptingTiange Zhang, Qing Cai, Feng Gao et al.
Cross-Domain Few-Shot Learning has witnessed great stride with the development of meta-learning. However, most existing methods pay more attention to learning domain-adaptive inductive bias (meta-knowledge) through feature-wise manipulation or task diversity improvement while neglecting the phenomenon that deep networks tend to rely more on high-frequency cues to make the classification decision, which thus degenerates the robustness of learned inductive bias since high-frequency information is vulnerable and easy to be disturbed by noisy information. Hence in this paper, we make one of the first attempts to propose a Frequency-Aware Prompting method with mutual attention for Cross-Domain Few-Shot classification, which can let networks simulate the human visual perception of selecting different frequency cues when facing new recognition tasks. Specifically, a frequency-aware prompting mechanism is first proposed, in which high-frequency components of the decomposed source image are switched either with normal distribution sampling or zeroing to get frequency-aware augment samples. Then, a mutual attention module is designed to learn generalizable inductive bias under CD-FSL settings. More importantly, the proposed method is a plug-and-play module that can be directly applied to most off-the-shelf CD-FLS methods. Experimental results on CD-FSL benchmarks demonstrate the effectiveness of our proposed method as well as robustly improve the performance of existing CD-FLS methods. Resources at https://github.com/tinkez/FAP_CDFSC.
IRMay 1
Intelligent Elastic Feature Fading: Enabling Model Retrain-Free Feature Efficiency Rollouts at ScaleJieming Di, Xiaoyu Chen, Ying She et al.
Large-scale ranking systems depend on thousands of features derived from user behavior across multiple time horizons. Typically requires model retraining -- resulting in long iteration cycles (3--6 months), substantial GPU resource consumption, and limited rollout throughput. We introduce Intelligent Elastic Feature Fading (IEFF), a production infrastructure system that enables retrain-free feature efficiency rollouts by elastically controlling feature coverage and distribution at serving time. IEFF supports incremental feature coverage adjustments while models adapt through recurring training, eliminating dependencies on explicit retraining cycles. The system incorporates strict safety guardrails, reversibility mechanisms, and comprehensive monitoring to ensure stability at scale. Across multiple production use cases, IEFF accelerates efficiency-related rollouts by 5$\times$, eliminates retraining-related GPU overhead, and enables faster capacity recycling. Extensive offline and online experiments demonstrate that gradual feature fading prevents 50--55\% of online performance degradation compared to abrupt feature removal, while maintaining stable model behavior. These results establish elastic, system-level feature fading as a practical and scalable approach for managing feature efficiency in modern industrial ranking systems.
CVDec 11, 2024
A Deep Semantic Segmentation Network with Semantic and Contextual RefinementsZhiyan Wang, Deyin Liu, Lin Yuanbo Wu et al.
Semantic segmentation is a fundamental task in multimedia processing, which can be used for analyzing, understanding, editing contents of images and videos, among others. To accelerate the analysis of multimedia data, existing segmentation researches tend to extract semantic information by progressively reducing the spatial resolutions of feature maps. However, this approach introduces a misalignment problem when restoring the resolution of high-level feature maps. In this paper, we design a Semantic Refinement Module (SRM) to address this issue within the segmentation network. Specifically, SRM is designed to learn a transformation offset for each pixel in the upsampled feature maps, guided by high-resolution feature maps and neighboring offsets. By applying these offsets to the upsampled feature maps, SRM enhances the semantic representation of the segmentation network, particularly for pixels around object boundaries. Furthermore, a Contextual Refinement Module (CRM) is presented to capture global context information across both spatial and channel dimensions. To balance dimensions between channel and space, we aggregate the semantic maps from all four stages of the backbone to enrich channel context information. The efficacy of these proposed modules is validated on three widely used datasets-Cityscapes, Bdd100K, and ADE20K-demonstrating superior performance compared to state-of-the-art methods. Additionally, this paper extends these modules to a lightweight segmentation network, achieving an mIoU of 82.5% on the Cityscapes validation set with only 137.9 GFLOPs.
CVApr 11, 2024
RMAFF-PSN: A Residual Multi-Scale Attention Feature Fusion Photometric Stereo NetworkKai Luo, Yakun Ju, Lin Qi et al.
Predicting accurate normal maps of objects from two-dimensional images in regions of complex structure and spatial material variations is challenging using photometric stereo methods due to the influence of surface reflection properties caused by variations in object geometry and surface materials. To address this issue, we propose a photometric stereo network called a RMAFF-PSN that uses residual multiscale attentional feature fusion to handle the ``difficult'' regions of the object. Unlike previous approaches that only use stacked convolutional layers to extract deep features from the input image, our method integrates feature information from different resolution stages and scales of the image. This approach preserves more physical information, such as texture and geometry of the object in complex regions, through shallow-deep stage feature extraction, double branching enhancement, and attention optimization. To test the network structure under real-world conditions, we propose a new real dataset called Simple PS data, which contains multiple objects with varying structures and materials. Experimental results on a publicly available benchmark dataset demonstrate that our method outperforms most existing calibrated photometric stereo methods for the same number of input images, especially in the case of highly non-convex object structures. Our method also obtains good results under sparse lighting conditions.
CVApr 5
1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching DistillationHaoyu Li, Tingyan Wen, Lin Qi et al.
Diffusion models produce high-quality text-to-image results, but their iterative denoising is computationally expensive.Distribution Matching Distillation (DMD) emerges as a promising path to few-step distillation, but suffers from diversity collapse and fidelity degradation when reduced to two steps or fewer. We present 1.x-Distill, the first fractional-step distillation framework that breaks the integer-step constraint of prior few-step methods and establishes 1.x-step generation as a practical regime for distilled diffusion models.Specifically, we first analyze the overlooked role of teacher CFG in DMD and introduce a simple yet effective modification to suppress mode collapse. Then, to improve performance under extreme steps, we introduce Stagewise Focused Distillation, a two-stage strategy that learns coarse structure through diversity-preserving distribution matching and refines details with inference-consistent adversarial distillation. Furthermore, we design a lightweight compensation module for Distill--Cache co-Training, which naturally incorporates block-level caching into our distillation pipeline.Experiments on SD3-Medium and SD3.5-Large show that 1.x-Distill surpasses prior few-step methods, achieving better quality and diversity at 1.67 and 1.74 effective NFEs, respectively, with up to 33x speedup over original 28x2 NFE sampling.
CVDec 16, 2024
Image Gradient-Aided Photometric Stereo NetworkKaixuan Wang, Lin Qi, Shiyu Qin et al.
Photometric stereo (PS) endeavors to ascertain surface normals using shading clues from photometric images under various illuminations. Recent deep learning-based PS methods often overlook the complexity of object surfaces. These neural network models, which exclusively rely on photometric images for training, often produce blurred results in high-frequency regions characterized by local discontinuities, such as wrinkles and edges with significant gradient changes. To address this, we propose the Image Gradient-Aided Photometric Stereo Network (IGA-PSN), a dual-branch framework extracting features from both photometric images and their gradients. Furthermore, we incorporate an hourglass regression network along with supervision to regularize normal regression. Experiments on DiLiGenT benchmarks show that IGA-PSN outperforms previous methods in surface normal estimation, achieving a mean angular error of 6.46 while preserving textures and geometric shapes in complex regions.
CVDec 11, 2024
A feature refinement module for light-weight semantic segmentation networkZhiyan Wang, Xin Guo, Song Wang et al.
Low computational complexity and high segmentation accuracy are both essential to the real-world semantic segmentation tasks. However, to speed up the model inference, most existing approaches tend to design light-weight networks with a very limited number of parameters, leading to a considerable degradation in accuracy due to the decrease of the representation ability of the networks. To solve the problem, this paper proposes a novel semantic segmentation method to improve the capacity of obtaining semantic information for the light-weight network. Specifically, a feature refinement module (FRM) is proposed to extract semantics from multi-stage feature maps generated by the backbone and capture non-local contextual information by utilizing a transformer block. On Cityscapes and Bdd100K datasets, the experimental results demonstrate that the proposed method achieves a promising trade-off between accuracy and computational cost, especially for Cityscapes test set where 80.4% mIoU is achieved and only 214.82 GFLOPs are required.
AIJan 26
Stability as a Liability:Systematic Breakdown of Linguistic Structure in LLMsXianzhe Meng, Qiangsheng Zeng, Ling Luo et al.
Training stability is typically regarded as a prerequisite for reliable optimization in large language models. In this work, we analyze how stabilizing training dynamics affects the induced generation distribution. We show that under standard maximum likelihood training, stable parameter trajectories lead stationary solutions to approximately minimize the forward KL divergence to the empirical distribution, while implicitly reducing generative entropy. As a consequence, the learned model can concentrate probability mass on a limited subset of empirical modes, exhibiting systematic degeneration despite smooth loss convergence. We empirically validate this effect using a controlled feedback-based training framework that stabilizes internal generation statistics, observing consistent low-entropy outputs and repetitive behavior across architectures and random seeds. It indicates that optimization stability and generative expressivity are not inherently aligned, and that stability alone is an insufficient indicator of generative quality.
CVOct 29, 2025
MSF-Net: Multi-Stage Feature Extraction and Fusion for Robust Photometric StereoShiyu Qin, Zhihao Cai, Kaixuan Wang et al.
Photometric stereo is a technique aimed at determining surface normals through the utilization of shading cues derived from images taken under different lighting conditions. However, existing learning-based approaches often fail to accurately capture features at multiple stages and do not adequately promote interaction between these features. Consequently, these models tend to extract redundant features, especially in areas with intricate details such as wrinkles and edges. To tackle these issues, we propose MSF-Net, a novel framework for extracting information at multiple stages, paired with selective update strategy, aiming to extract high-quality feature information, which is critical for accurate normal construction. Additionally, we have developed a feature fusion module to improve the interplay among different features. Experimental results on the DiLiGenT benchmark show that our proposed MSF-Net significantly surpasses previous state-of-the-art methods in the accuracy of surface normal estimation.
CVNov 20, 2024
Superpixel Cost Volume Excitation for Stereo MatchingShanglong Liu, Lin Qi, Junyu Dong et al.
In this work, we concentrate on exciting the intrinsic local consistency of stereo matching through the incorporation of superpixel soft constraints, with the objective of mitigating inaccuracies at the boundaries of predicted disparity maps. Our approach capitalizes on the observation that neighboring pixels are predisposed to belong to the same object and exhibit closely similar intensities within the probability volume of superpixels. By incorporating this insight, our method encourages the network to generate consistent probability distributions of disparity within each superpixel, aiming to improve the overall accuracy and coherence of predicted disparity maps. Experimental evalua tions on widely-used datasets validate the efficacy of our proposed approach, demonstrating its ability to assist cost volume-based matching networks in restoring competitive performance.
SENov 2, 2021
SO{U}RCERER: Developer-Driven Security Testing Framework for Android AppsMuhammad Sajidur Rahman, Blas Kojusner, Ryon Kennedy et al.
Frequently advised secure development recommendations often fall short in practice for app developers. Tool-driven (e.g., using static analysis tools) approaches lack context and domain-specific requirements of an app being tested. App developers struggle to find an actionable and prioritized list of vulnerabilities from a laundry list of security warnings reported by static analysis tools. Process-driven (e.g., applying threat modeling methods) approaches require substantial resources (e.g., security testing team, budget) and security expertise, which small to medium-scale app dev teams could barely afford. To help app developers securing their apps, we propose SO{U}RCERER, a guiding framework for Android app developers for security testing. SO{U}RCERER guides developers to identify domain-specific assets of an app, detect and prioritize vulnerabilities, and mitigate those vulnerabilities based on secure development guidelines. We evaluated SO{U}RCERER with a case study on analyzing and testing 36 Android mobile money apps. We found that by following activities guided by SO{U}RCERER, an app developer could get a concise and actionable list of vulnerabilities (24-61% fewer security warnings produced by SO{U}RCERER than a standalone static analyzer), directly affecting a mobile money app's critical assets, and devise a mitigation plan. Our findings from this preliminary study indicate a viable approach to Android app security testing without being overwhelmingly complex for app developers.
CVJun 22, 2021
Wallpaper Texture Generation and Style Transfer Based on Multi-label SemanticsYing Gao, Xiaohan Feng, Tiange Zhang et al.
Textures contain a wealth of image information and are widely used in various fields such as computer graphics and computer vision. With the development of machine learning, the texture synthesis and generation have been greatly improved. As a very common element in everyday life, wallpapers contain a wealth of texture information, making it difficult to annotate with a simple single label. Moreover, wallpaper designers spend significant time to create different styles of wallpaper. For this purpose, this paper proposes to describe wallpaper texture images by using multi-label semantics. Based on these labels and generative adversarial networks, we present a framework for perception driven wallpaper texture generation and style transfer. In this framework, a perceptual model is trained to recognize whether the wallpapers produced by the generator network are sufficiently realistic and have the attribute designated by given perceptual description; these multi-label semantic attributes are treated as condition variables to generate wallpaper images. The generated wallpaper images can be converted to those with well-known artist styles using CycleGAN. Finally, using the aesthetic evaluation method, the generated wallpaper images are quantitatively measured. The experimental results demonstrate that the proposed method can generate wallpaper textures conforming to human aesthetics and have artistic characteristics.
LGFeb 28, 2021
Discriminative Multiple Canonical Correlation Analysis for Information FusionLei Gao, Lin Qi, Enqing Chen et al.
In this paper, we propose the Discriminative Multiple Canonical Correlation Analysis (DMCCA) for multimodal information analysis and fusion. DMCCA is capable of extracting more discriminative characteristics from multimodal information representations. Specifically, it finds the projected directions which simultaneously maximize the within-class correlation and minimize the between-class correlation, leading to better utilization of the multimodal information. In the process, we analytically demonstrate that the optimally projected dimension by DMCCA can be quite accurately predicted, leading to both superior performance and substantial reduction in computational cost. We further verify that Canonical Correlation Analysis (CCA), Multiple Canonical Correlation Analysis (MCCA) and Discriminative Canonical Correlation Analysis (DCCA) are special cases of DMCCA, thus establishing a unified framework for Canonical Correlation Analysis. We implement a prototype of DMCCA to demonstrate its performance in handwritten digit recognition and human emotion recognition. Extensive experiments show that DMCCA outperforms the traditional methods of serial fusion, CCA, MCCA and DCCA.
CVFeb 28, 2021
The Labeled Multiple Canonical Correlation Analysis for Information FusionLei Gao, Rui Zhang, Lin Qi et al.
The objective of multimodal information fusion is to mathematically analyze information carried in different sources and create a new representation which will be more effectively utilized in pattern recognition and other multimedia information processing tasks. In this paper, we introduce a new method for multimodal information fusion and representation based on the Labeled Multiple Canonical Correlation Analysis (LMCCA). By incorporating class label information of the training samples,the proposed LMCCA ensures that the fused features carry discriminative characteristics of the multimodal information representations, and are capable of providing superior recognition performance. We implement a prototype of LMCCA to demonstrate its effectiveness on handwritten digit recognition,face recognition and object recognition utilizing multiple features,bimodal human emotion recognition involving information from both audio and visual domains. The generic nature of LMCCA allows it to take as input features extracted by any means,including those by deep learning (DL) methods. Experimental results show that the proposed method enhanced the performance of both statistical machine learning (SML) methods, and methods based on DL.
CVFeb 27, 2021
Online Behavioral Analysis with Application to Emotion State IdentificationLei Gao, Lin Qi, Ling Guan
In this paper, we propose a novel discriminative model for online behavioral analysis with application to emotion state identification. The proposed model is able to extract more discriminative characteristics from behavioral data effectively and find the direction of optimal projection efficiently to satisfy requirements of online data analysis, leading to better utilization of the behavioral information to produce more accurate recognition results.
CVJan 27, 2021
Edge-Labeling based Directed Gated Graph Network for Few-shot LearningPeixiao Zheng, Xin Guo, Lin Qi
Existing graph-network-based few-shot learning methods obtain similarity between nodes through a convolution neural network (CNN). However, the CNN is designed for image data with spatial information rather than vector form node feature. In this paper, we proposed an edge-labeling-based directed gated graph network (DGGN) for few-shot learning, which utilizes gated recurrent units to implicitly update the similarity between nodes. DGGN is composed of a gated node aggregation module and an improved gated recurrent unit (GRU) based edge update module. Specifically, the node update module adopts a gate mechanism using activation of edge feature, making a learnable node aggregation process. Besides, improved GRU cells are employed in the edge update procedure to compute the similarity between nodes. Further, this mechanism is beneficial to gradient backpropagation through the GRU sequence across layers. Experiment results conducted on two benchmark datasets show that our DGGN achieves a comparable performance to the-state-of-art methods.
CVJan 7, 2021
Multimodal Gait Recognition for Neurodegenerative DiseasesAite Zhao, Jianbo Li, Junyu Dong et al.
In recent years, single modality based gait recognition has been extensively explored in the analysis of medical images or other sensory data, and it is recognised that each of the established approaches has different strengths and weaknesses. As an important motor symptom, gait disturbance is usually used for diagnosis and evaluation of diseases; moreover, the use of multi-modality analysis of the patient's walking pattern compensates for the one-sidedness of single modality gait recognition methods that only learn gait changes in a single measurement dimension. The fusion of multiple measurement resources has demonstrated promising performance in the identification of gait patterns associated with individual diseases. In this paper, as a useful tool, we propose a novel hybrid model to learn the gait differences between three neurodegenerative diseases, between patients with different severity levels of Parkinson's disease and between healthy individuals and patients, by fusing and aggregating data from multiple sensors. A spatial feature extractor (SFE) is applied to generating representative features of images or signals. In order to capture temporal information from the two modality data, a new correlative memory neural network (CorrMNN) architecture is designed for extracting temporal features. Afterwards, we embed a multi-switch discriminator to associate the observations with individual state estimations. Compared with several state-of-the-art techniques, our proposed framework shows more accurate classification results.
CVJan 7, 2021
Associated Spatio-Temporal Capsule Network for Gait RecognitionAite Zhao, Junyu Dong, Jianbo Li et al.
It is a challenging task to identify a person based on her/his gait patterns. State-of-the-art approaches rely on the analysis of temporal or spatial characteristics of gait, and gait recognition is usually performed on single modality data (such as images, skeleton joint coordinates, or force signals). Evidence has shown that using multi-modality data is more conducive to gait research. Therefore, we here establish an automated learning system, with an associated spatio-temporal capsule network (ASTCapsNet) trained on multi-sensor datasets, to analyze multimodal information for gait recognition. Specifically, we first design a low-level feature extractor and a high-level feature extractor for spatio-temporal feature extraction of gait with a novel recurrent memory unit and a relationship layer. Subsequently, a Bayesian model is employed for the decision-making of class labels. Extensive experiments on several public datasets (normal and abnormal gait) validate the effectiveness of the proposed ASTCapsNet, compared against several state-of-the-art methods.
CVDec 29, 2019
Infant brain MRI segmentation with dilated convolution pyramid downsampling and self-attentionZhihao Lei, Lin Qi, Ying Wei et al.
In this paper, we propose a dual aggregation network to adaptively aggregate different information in infant brain MRI segmentation. More precisely, we added two modules based on 3D-UNet to better model information at different levels and locations. The dilated convolution pyramid downsampling module is mainly to solve the problem of loss of spatial information on the downsampling process, and it can effectively save details while reducing the resolution. The self-attention module can integrate the remote dependence on the feature maps in two dimensions of spatial and channel, effectively improving the representation ability and discriminating ability of the model. Our results are compared to the winners of iseg2017's first evaluation, the DICE ratio of WM and GM increased by 0.7%, and CSF is comparable.In the latest evaluation of the iseg-2019 cross-dataset challenge,we achieve the first place in the DICE of WM and GM, and the DICE of CSF is second.
CVJul 6, 2018
Combining SLAM with muti-spectral photometric stereo for real-time dense 3D reconstructionYuanhong Xu, Pei Dong, Junyu Dong et al.
Obtaining dense 3D reconstrution with low computational cost is one of the important goals in the field of SLAM. In this paper we propose a dense 3D reconstruction framework from monocular multispectral video sequences using jointly semi-dense SLAM and Multispectral Photometric Stereo approaches. Starting from multispectral video, SALM (a) reconstructs a semi-dense 3D shape that will be densified;(b) recovers relative sparse depth map that is then fed as prioris into optimization-based multispectral photometric stereo for a more accurate dense surface normal recovery;(c)obtains camera pose that is subsequently used for conversion of view in the process of fusion where we combine the relative sparse point cloud with the dense surface normal using the automated cross-scale fusion method proposed in this paper to get a dense point cloud with subtle texture information. Experiments show that our method can effectively obtain denser 3D reconstructions.
LGMay 22, 2017
Sparse hierarchical interaction learning with epigraphical projectionMingyuan Jiu, Nelly Pustelnik, Stefan Janaqi et al.
This work focuses on learning optimization problems with quadratical interactions between variables, which go beyond the additive models of traditional linear learning. We investigate more specifically two different methods encountered in the literature to deal with this problem: "hierNet" and structured-sparsity regularization, and study their connections. We propose a primal-dual proximal algorithm based on an epigraphical projection to optimize a general formulation of these learning problems. The experimental setting first highlights the improvement of the proposed procedure compared to state-of-the-art methods based on fast iterative shrinkage-thresholding algorithm (i.e. FISTA) or alternating direction method of multipliers (i.e. ADMM), and then, using the proposed flexible optimization framework, we provide fair comparisons between the different hierarchical penalizations and their improvement over the standard $\ell_1$-norm penalization. The experiments are conducted both on synthetic and real data, and they clearly show that the proposed primal-dual proximal algorithm based on epigraphical projection is efficient and effective to solve and investigate the problem of hierarchical interaction learning.