IV Feb 24, 2023 Code
FedDBL: Communication and Data Efficient Federated Deep-Broad Learning for Histopathological Tissue Classification
Tianpeng Deng, Yanqi Huang, Guoqiang Han et al.
Histopathological tissue classification is a fundamental task in computational pathology. Deep learning-based models have achieved superior performance, but centralized training requires pooling data and therefore suffers from privacy leakage. Federated learning (FL) can safeguard privacy by keeping training samples local, but existing FL-based frameworks require a large number of well-annotated training samples and numerous rounds of communication, which hinders their practicality in real-world clinical scenarios. In this paper, we propose a universal and lightweight federated learning framework, named Federated Deep-Broad Learning (FedDBL), to achieve superior classification performance with limited training samples and only one round of communication. By simply combining a pre-trained deep learning feature extractor, a fast and lightweight broad learning inference system, and a classical federated aggregation approach, FedDBL can dramatically reduce data dependency and improve communication efficiency. Five-fold cross-validation demonstrates that FedDBL greatly outperforms its competitors with only one round of communication and limited training samples, and even achieves performance comparable to methods that use multiple rounds of communication. Furthermore, owing to the lightweight design and one-round communication, FedDBL reduces the communication burden from 4.6GB to only 276.5KB per client using the ResNet-50 backbone at 50-round training. Since no data or deep models are shared across clients, the privacy issue is well addressed and model security is guaranteed, with no risk of model inversion attacks. Code is available at https://github.com/tianpeng-deng/FedDBL.
81.1 CV Jun 3
P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization
Ruipeng Zhang, Zhihao Li, Haozhang Yuan et al.
Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) aims to learn directly from the corrected preferences provided by humans, thereby addressing the hallucination issue. Despite its success, this paradigm has yet to specifically target the perceptual bottleneck in attended regions or address insufficient Visual Robustness against image degradation. Furthermore, existing preference pairs are often vision-agnostic and their inherently off-policy nature limits their effectiveness in guiding model learning. To address these challenges, we propose Perceptual Processing Direct Preference Optimization (P$^2$-DPO), a novel training paradigm in which the model generates and learns from its own preference pairs, thereby directly addressing the identified visual bottlenecks while inherently avoiding the issues of vision-agnostic and off-policy data. It introduces: (1) an on-policy preference pairs construction method targeting Focus-and-Enhance perception and Visual Robustness, and (2) a well-designed Calibration Loss to precisely align visual signals with the causal generation of text. Experimental results demonstrate that with a comparable amount of training data and cost, P$^2$-DPO outperforms strong baselines that rely on costly human feedback on benchmarks. Furthermore, evaluations on Attention Region Fidelity (ARF) and image degradation scenarios validate the effectiveness of P$^2$-DPO in addressing perceptual bottleneck in attended regions and improving Visual Robustness against degraded inputs.
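For orientation, the standard DPO objective that this line of work builds on can be sketched as follows. This is a minimal illustration of the generic DPO loss for a single preference pair, not the Calibration Loss proposed in the paper, whose form the abstract does not give. The function name, the argument names, and the default `beta=0.1` are assumptions made for this example.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the trained policy and under
    a frozen reference model. beta is a hypothetical default value.
    """
    # Implicit reward margin: how much more strongly the policy prefers the
    # chosen response over the rejected one, relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# A policy identical to the reference gives the neutral loss log(2).
neutral = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# A policy that raises the chosen response and lowers the rejected one does better.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The loss falls below log(2) only when the policy separates the chosen and rejected responses more than the reference model does, which is why the quality of the preference pairs matters so much.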
IR Apr 20, 2022
Broad Recommender System: An Efficient Nonlinear Collaborative Filtering Approach
Ling Huang, Can-Rong Guan, Zhen-Wei Huang et al.
Recently, Deep Neural Networks (DNNs) have been widely introduced into Collaborative Filtering (CF) to produce more accurate recommendation results due to their capability of capturing the complex nonlinear relationships between items and users. However, DNN-based models usually suffer from high computational complexity, i.e., they require very long training times and store a huge number of trainable parameters. To address these problems, we propose a new broad recommender system called Broad Collaborative Filtering (BroadCF), which is an efficient nonlinear collaborative filtering approach. Instead of DNNs, a Broad Learning System (BLS) is used as the mapping function to learn the complex nonlinear relationships between users and items, which avoids the above issues while achieving very satisfactory recommendation performance. However, it is not feasible to directly feed the original rating data into BLS. To this end, we propose a user-item rating collaborative vector preprocessing procedure to generate low-dimensional user-item input data, which is able to harness quality judgments of the most similar users/items. Extensive experiments conducted on seven benchmark datasets have confirmed the effectiveness of the proposed BroadCF algorithm.
CV Oct 12, 2023
Consistent123: Improve Consistency for One Image to 3D Object Synthesis
Haohan Weng, Tianyu Yang, Jianan Wang et al.
Large image diffusion models enable novel view synthesis with high quality and excellent zero-shot capability. However, such models based on image-to-image translation have no guarantee of view consistency, limiting the performance for downstream tasks like 3D reconstruction and image-to-3D generation. To empower consistency, we propose Consistent123 to synthesize novel views simultaneously by incorporating additional cross-view attention layers and the shared self-attention mechanism. The proposed attention mechanism improves the interaction across all synthesized views, as well as the alignment between the condition view and novel views. In the sampling stage, such architecture supports simultaneously generating an arbitrary number of views while training at a fixed length. We also introduce a progressive classifier-free guidance strategy to achieve the trade-off between texture and geometry for synthesized object views. Qualitative and quantitative experiments show that Consistent123 outperforms baselines in view consistency by a large margin. Furthermore, we demonstrate a significant improvement of Consistent123 on varying downstream tasks, showing its great potential in the 3D generation field. The project page is available at consistent-123.github.io.
CV Dec 2, 2022
Global Learnable Attention for Single Image Super-Resolution
Jian-Nan Su, Min Gan, Guang-Yong Chen et al.
Self-similarity is valuable to the exploration of non-local textures in single image super-resolution (SISR). Researchers usually assume that the importance of non-local textures is positively related to their similarity scores. In this paper, we found, surprisingly, that when repairing severely damaged query textures, some low-similarity non-local textures that are closer to the target can provide more accurate and richer details than the high-similarity ones. In these cases, low similarity does not mean inferior but is usually caused by different scales or orientations. Utilizing this finding, we propose a Global Learnable Attention (GLA) to adaptively modify similarity scores of non-local textures during training instead of only using a fixed similarity scoring function such as the dot product. The proposed GLA can explore non-local textures with low similarity but more accurate details to repair severely damaged textures. Furthermore, we propose to adopt Super-Bit Locality-Sensitive Hashing (SB-LSH) as a preprocessing method for our GLA. With SB-LSH, the computational complexity of our GLA is reduced from quadratic to asymptotically linear with respect to the image size. In addition, the proposed GLA can be integrated into existing deep SISR models as an efficient general building block. Based on the GLA, we construct a Deep Learnable Similarity Network (DLSN), which achieves state-of-the-art performance for SISR tasks of different degradation types (e.g., blur and noise). Our code and a pre-trained DLSN have been uploaded to GitHub for validation.
CV Sep 7, 2023
BroadCAM: Outcome-agnostic Class Activation Mapping for Small-scale Weakly Supervised Applications
Jiatai Lin, Guoqiang Han, Xuemiao Xu et al.
Class activation mapping (CAM), a visualization technique for interpreting deep learning models, is now commonly used for weakly supervised semantic segmentation (WSSS) and object localization (WSOL). It is a weighted aggregation of the feature maps that activates the ones with high class relevance. Current CAM methods compute the weights from training outcomes, such as predicted scores (forward information) or gradients (backward information). However, with small-scale data, unstable training may lead to less effective model outcomes and generate unreliable weights, finally resulting in incorrect activation and noisy CAM seeds. In this paper, we propose an outcome-agnostic CAM approach, called BroadCAM, for small-scale weakly supervised applications. Since the broad learning system (BLS) is independent of the model learning, BroadCAM can avoid the weights being affected by unreliable model outcomes with small-scale data. Evaluated on VOC2012 (natural images) and BCSS-WSSS (medical images) for WSSS and on OpenImages30k for WSOL, BroadCAM demonstrates superior performance over existing CAM methods with small-scale data (less than 5%) across different CNN architectures. It also achieves SOTA performance with large-scale training data. Extensive qualitative comparisons are conducted to demonstrate how BroadCAM activates the high class-relevance feature maps and generates reliable CAMs with small-scale training data.
CV Jan 22 Code
Consistency-Regularized GAN for Few-Shot SAR Target Recognition
Yikui Zhai, Shikuang Liu, Wenlve Zhou et al.
Few-shot recognition in synthetic aperture radar (SAR) imagery remains a critical bottleneck for real-world applications due to extreme data scarcity. A promising strategy involves synthesizing a large dataset with a generative adversarial network (GAN), pre-training a model via self-supervised learning (SSL), and then fine-tuning on the few labeled samples. However, this approach faces a fundamental paradox: conventional GANs themselves require abundant data for stable training, contradicting the premise of few-shot learning. To resolve this, we propose the consistency-regularized generative adversarial network (Cr-GAN), a novel framework designed to synthesize diverse, high-fidelity samples even when trained under these severe data limitations. Cr-GAN introduces a dual-branch discriminator that decouples adversarial training from representation learning. This architecture enables a channel-wise feature interpolation strategy to create novel latent features, complemented by a dual-domain cycle consistency mechanism that ensures semantic integrity. Our Cr-GAN framework is adaptable to various GAN architectures, and its synthesized data effectively boosts multiple SSL algorithms. Extensive experiments on the MSTAR and SRSDD datasets validate our approach, with Cr-GAN achieving a highly competitive accuracy of 71.21% and 51.64%, respectively, in the 8-shot setting, significantly outperforming leading baselines, while requiring only ~5 of the parameters of state-of-the-art diffusion models. Code is available at: https://github.com/yikuizhai/Cr-GAN.
LG Aug 20, 2023
Rethinking Client Drift in Federated Learning: A Logit Perspective
Yunlu Yan, Chun-Mei Feng, Mang Ye et al.
Federated Learning (FL) enables multiple clients to collaboratively learn in a distributed way, allowing for privacy protection. However, real-world non-IID data lead to client drift, which degrades the performance of FL. Interestingly, we find that the difference in logits between the local and global models increases as the model is continuously updated, thus seriously deteriorating FL performance. This is mainly due to catastrophic forgetting caused by data heterogeneity between clients. To alleviate this problem, we propose a new algorithm, named FedCSD, a Class prototype Similarity Distillation in a federated framework to align the local and global models. FedCSD does not simply transfer global knowledge to local clients, as an undertrained global model cannot provide reliable knowledge, i.e., class similarity information, and its wrong soft labels will mislead the optimization of local models. Concretely, FedCSD introduces a class prototype similarity distillation to align the local logits with the refined global logits that are weighted by the similarity between local logits and the global prototype. To enhance the quality of global logits, FedCSD adopts an adaptive mask to filter out the unreliable soft labels of the global model, thereby preventing them from misleading local optimization. Extensive experiments demonstrate the superiority of our method over state-of-the-art federated learning approaches in various heterogeneous settings. The source code will be released.
LG Apr 1, 2023
ConvBLS: An Effective and Efficient Incremental Convolutional Broad Learning System for Image Classification
Chunyu Lei, C. L. Philip Chen, Jifeng Guo et al.
Deep learning generally requires enormous computational resources and time-consuming training processes. Broad Learning System (BLS) and its convolutional variants have been proposed to mitigate these issues and have achieved superb performance in image classification. However, existing convolution-based broad learning systems (C-BLS) either lack an efficient training method and incremental learning capability or suffer from poor performance. To this end, we propose a convolutional broad learning system (ConvBLS) based on the spherical K-means (SKM) algorithm and two-stage multi-scale (TSMS) feature fusion, which consists of the convolutional feature (CF) layer, convolutional enhancement (CE) layer, TSMS feature fusion layer, and output layer. First, unlike the current C-BLS, the simple yet efficient SKM algorithm is utilized to learn the weights of CF layers. Compared with random filters, the SKM algorithm makes the CF layer learn more comprehensive spatial features. Second, similar to the vanilla BLS, CE layers are established to expand the feature space. Third, the TSMS feature fusion layer is proposed to extract more effective multi-scale features through the integration of CF layers and CE layers. Thanks to the above design and the pseudo-inverse calculation of the output layer weights, our proposed ConvBLS method is unprecedentedly efficient and effective. Finally, the corresponding incremental learning algorithms are presented for rapid remodeling when the model needs to be expanded. Experiments and comparisons demonstrate the superiority of our method.
CV Apr 26, 2024 Code
Spatial-frequency Dual-Domain Feature Fusion Network for Low-Light Remote Sensing Image Enhancement
Zishu Yao, Guodong Fan, Jinfu Fan et al.
Low-light remote sensing images generally feature high resolution and high spatial complexity, with continuously distributed surface features in space. This continuity in scenes leads to extensive long-range correlations in spatial domains within remote sensing images. Convolutional Neural Networks, which rely on local correlations for long-distance modeling, struggle to establish long-range correlations in such images. On the other hand, transformer-based methods that focus on global information face high computational complexities when processing high-resolution remote sensing images. From another perspective, Fourier transform can compute global information without introducing a large number of parameters, enabling the network to more efficiently capture the overall image structure and establish long-range correlations. Therefore, we propose a Dual-Domain Feature Fusion Network (DFFN) for low-light remote sensing image enhancement. Specifically, this challenging task of low-light enhancement is divided into two more manageable sub-tasks: the first phase learns amplitude information to restore image brightness, and the second phase learns phase information to refine details. To facilitate information exchange between the two phases, we designed an information fusion affine block that combines data from different phases and scales. Additionally, we have constructed two dark light remote sensing datasets to address the current lack of datasets in dark light remote sensing image enhancement. Extensive evaluations show that our method outperforms existing state-of-the-art methods. The code is available at https://github.com/iijjlk/DFFN.
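The amplitude/phase split described above can be illustrated on a toy signal. The following is a minimal sketch, assuming a one-dimensional signal and a naive discrete Fourier transform in place of a 2-D image FFT; it is not the DFFN architecture, and every function name here is hypothetical. It shows only the underlying property: scaling the amplitude spectrum while keeping the phase spectrum changes overall brightness without moving structure.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform of a real-valued list."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def split_amplitude_phase(signal):
    """Return the amplitude and phase spectra of a signal."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]

def recombine(amplitude, phase):
    """Rebuild a signal from amplitude and phase spectra."""
    return idft([cmath.rect(a, p) for a, p in zip(amplitude, phase)])

dark = [0.10, 0.20, 0.15, 0.05]           # toy "low-light" intensities
amplitude, phase = split_amplitude_phase(dark)
# Doubling the amplitude while keeping the phase doubles every intensity.
brighter = recombine([2.0 * a for a in amplitude], phase)
```

In a learned model the amplitude correction would be predicted by the network rather than being a fixed factor of 2; the fixed factor is used here only to make the effect easy to verify.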
LG Apr 3, 2023
Properties and Potential Applications of Random Functional-Linked Types of Neural Networks
Guang-Yong Chen, Yong-Hang Yu, Min Gan et al.
Random functional-linked types of neural networks (RFLNNs), e.g., the extreme learning machine (ELM) and broad learning system (BLS), which avoid a time-consuming training process, offer an alternative to learning in deep structures. RFLNNs have achieved excellent performance in various classification and regression tasks; however, the properties and explanations of these networks have been ignored in previous research. This paper gives some insights into the properties of RFLNNs from the viewpoint of the frequency domain, and discovers the presence of the frequency principle in these networks, that is, they preferentially capture low frequencies quickly and then fit the high-frequency components during the training process. These findings are valuable for understanding RFLNNs and expanding their applications. Guided by the frequency principle, we propose a method to generate a BLS network with better performance, and design an efficient algorithm for solving Poisson's equation in view of the different frequency principles exhibited by the Jacobi iterative method and the BLS network.
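As background for the Jacobi iterative method mentioned above, here is a minimal sketch of plain Jacobi iteration on the one-dimensional Poisson problem -u''(x) = f(x) with u(0) = u(1) = 0. It does not reproduce the paper's algorithm, which the abstract does not specify. The grid size, iteration count, and test problem are assumptions chosen so that the exact solution sin(pi x) is known.

```python
import math

def jacobi_poisson_1d(f, n_interior, iterations):
    """Solve -u'' = f on (0, 1) with zero boundary values by Jacobi iteration."""
    h = 1.0 / (n_interior + 1)
    xs = [(i + 1) * h for i in range(n_interior)]
    u = [0.0] * (n_interior + 2)          # u[0] and u[-1] are the boundaries
    for _ in range(iterations):
        new_u = u[:]
        for i in range(1, n_interior + 1):
            # Finite-difference stencil solved for the centre point,
            # using only values from the previous sweep.
            new_u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f(xs[i - 1]))
        u = new_u
    return xs, u[1:-1]

# Test problem with known solution u(x) = sin(pi x).
xs, u = jacobi_poisson_1d(lambda x: math.pi ** 2 * math.sin(math.pi * x),
                          n_interior=15, iterations=2000)
max_error = max(abs(ui - math.sin(math.pi * x)) for x, ui in zip(xs, u))
```

Even on this small grid, thousands of sweeps are needed before the smooth error component dies out, which is the kind of slow convergence a hybrid solver would aim to shorten.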
CV Nov 19, 2024 Code
Motif Channel Opened in a White-Box: Stereo Matching via Motif Correlation Graph
Ziyang Chen, Yongjun Zhang, Wenting Li et al.
Real-world applications of stereo matching, such as autonomous driving, place stringent demands on both safety and accuracy. However, learning-based stereo matching methods inherently suffer from the loss of geometric structures in certain feature channels, creating a bottleneck in achieving precise detail matching. Additionally, these methods lack interpretability due to the black-box nature of deep learning. In this paper, we propose MoCha-V2, a novel learning-based paradigm for stereo matching. MoCha-V2 introduces the Motif Correlation Graph (MCG) to capture recurring textures, which are referred to as "motifs" within feature channels. These motifs reconstruct geometric structures and are learned in a more interpretable way. Subsequently, we integrate features from multiple frequency domains through wavelet inverse transformation. The resulting motif features are utilized to restore geometric structures in the stereo matching process. Experimental results demonstrate the effectiveness of MoCha-V2. MoCha-V2 achieved 1st place on the Middlebury benchmark at the time of its release. Code is available at https://github.com/ZYangChen/MoCha-Stereo.
LG Dec 24, 2025
DiEC: Diffusion Embedded Clustering
Haidong Hu, Xiaoyu Zheng, Jin Zhou et al.
Deep clustering methods typically rely on a single, well-defined representation for clustering. In contrast, pretrained diffusion models provide abundant and diverse multi-scale representations across network layers and noise timesteps. However, a key challenge is how to efficiently identify the most clustering-friendly representation in the layer*timestep space. To address this issue, we propose Diffusion Embedded Clustering (DiEC), an unsupervised framework that performs clustering by leveraging optimal intermediate representations from pretrained diffusion models. DiEC systematically evaluates the clusterability of representations along the trajectory of network depth and noise timesteps. Meanwhile, an unsupervised search strategy is designed for recognizing the Clustering-optimal Layer (COL) and Clustering-optimal Timestep (COT) in the layer*timestep space of pretrained diffusion models, aiming to promote clustering performance and reduce computational overhead. DiEC is fine-tuned primarily with a structure-preserving DEC-style KL-divergence objective at the fixed COL + COT, together with a random-timestep diffusion denoising objective to maintain the generative capability of the pretrained model. Without relying on augmentation-based consistency constraints or contrastive learning, DiEC achieves excellent clustering performance across multiple benchmark datasets.
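The "DEC-style KL-divergence objective" refers to the clustering loss from Deep Embedded Clustering. The sketch below follows that well-known formulation with a Student's t kernel with one degree of freedom. How DiEC extracts embeddings from the diffusion model is not described in the abstract, so the embeddings and centroids here are made-up toy values and the function names are hypothetical.

```python
import math

def soft_assign(embeddings, centroids):
    """Soft cluster assignments q_ij from a Student's t kernel (1 d.o.f.)."""
    q = []
    for z in embeddings:
        sims = [1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(z, mu)))
                for mu in centroids]
        total = sum(sims)
        q.append([s / total for s in sims])
    return q

def target_distribution(q):
    """Sharpened targets p_ij proportional to q_ij^2 / (soft cluster size)."""
    cluster_size = [sum(row[j] for row in q) for j in range(len(q[0]))]
    p = []
    for row in q:
        weights = [row[j] ** 2 / cluster_size[j] for j in range(len(row))]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

def kl_clustering_loss(p, q):
    """KL(P || Q) summed over all samples and clusters."""
    return sum(pij * math.log(pij / qij)
               for p_row, q_row in zip(p, q)
               for pij, qij in zip(p_row, q_row) if pij > 0)

# Toy 2-D embeddings forming two loose groups, and two centroids.
embeddings = [[0.0, 0.0], [0.2, 0.0], [3.0, 3.0], [3.1, 2.9]]
centroids = [[0.0, 0.0], [3.0, 3.0]]
q = soft_assign(embeddings, centroids)
p = target_distribution(q)
loss = kl_clustering_loss(p, q)
```

Minimising this loss pulls each embedding toward the centroid it is already most confident about, because the target distribution is a sharpened copy of the current assignments.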
CV Jan 23
A Cosine Network for Image Super-Resolution
Chunwei Tian, Chengyuan Zhang, Bob Zhang et al.
Deep convolutional neural networks can use hierarchical information to progressively extract structural information to recover high-quality images. However, preserving the effectiveness of the obtained structural information is important in image super-resolution. In this paper, we propose a cosine network for image super-resolution (CSRNet) by improving a network architecture and optimizing the training strategy. To extract complementary homologous structural information, odd and even heterogeneous blocks are designed to enlarge the architectural differences and improve the performance of image super-resolution. Combining linear and non-linear structural information can overcome the drawback of homologous information and enhance the robustness of the obtained structural information in image super-resolution. Taking into account the local minimum of gradient descent, a cosine annealing mechanism is used to optimize the training procedure by performing warm restarts and adjusting the learning rate. Experimental results illustrate that the proposed CSRNet is competitive with state-of-the-art methods in image super-resolution.
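Cosine annealing with warm restarts can be written in a few lines. This is a sketch of the commonly used schedule with a fixed restart period. The abstract does not give CSRNet's actual period or learning-rate bounds, so `period`, `lr_max`, and `lr_min` are placeholder values.

```python
import math

def cosine_annealing_lr(step, period, lr_max=1e-3, lr_min=1e-6):
    """Learning rate at a given step under cosine annealing with warm restarts.

    The rate decays from lr_max toward lr_min along a half cosine and jumps
    back to lr_max every `period` steps. All default values are hypothetical.
    """
    steps_since_restart = step % period
    cosine = math.cos(math.pi * steps_since_restart / period)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)

# Two full cycles of a 10-step schedule.
schedule = [cosine_annealing_lr(s, period=10) for s in range(20)]
```

Each restart raises the learning rate again, which gives the optimizer a chance to leave a poor local minimum before the next decay phase.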
CV Sep 4, 2025 Code
Multimodal Feature Fusion Network with Text Difference Enhancement for Remote Sensing Change Detection
Yijun Zhou, Yikui Zhai, Zilu Ying et al.
Although deep learning has advanced remote sensing change detection (RSCD), most methods rely solely on the image modality, limiting feature representation, change pattern modeling, and generalization, especially under illumination and noise disturbances. To address this, we propose MMChange, a multimodal RSCD method that combines image and text modalities to enhance accuracy and robustness. An Image Feature Refinement (IFR) module is introduced to highlight key regions and suppress environmental noise. To overcome the semantic limitations of image features, we employ a vision-language model (VLM) to generate semantic descriptions of bitemporal images. A Textual Difference Enhancement (TDE) module then captures fine-grained semantic shifts, guiding the model toward meaningful changes. To bridge the heterogeneity between modalities, we design an Image Text Feature Fusion (ITFF) module that enables deep cross-modal integration. Extensive experiments on LEVIR-CD, WHU-CD, and SYSU-CD demonstrate that MMChange consistently surpasses state-of-the-art methods across multiple metrics, validating its effectiveness for multimodal RSCD. Code is available at: https://github.com/yikuizhai/MMChange.
LG Nov 20, 2025 Code
Labels Matter More Than Models: Quantifying the Benefit of Supervised Time Series Anomaly Detection
Zhijie Zhong, Zhiwen Yu, Kaixiang Yang et al.
Time series anomaly detection (TSAD) is a critical data mining task often constrained by label scarcity. Consequently, current research predominantly focuses on Unsupervised Time-series Anomaly Detection (UTAD), relying on complex architectures to model normal data distributions. However, this approach often overlooks the significant performance gains available from limited anomaly labels achievable in practical scenarios. This paper challenges the premise that architectural complexity is the optimal path for TSAD. We conduct the first methodical comparison between supervised and unsupervised paradigms and introduce STAND, a streamlined supervised baseline. Extensive experiments on five public datasets demonstrate that: (1) Labels matter more than models: under a limited labeling budget, simple supervised models significantly outperform complex state-of-the-art unsupervised methods; (2) Supervision yields higher returns: the performance gain from minimal supervision far exceeds that from architectural innovations; and (3) Practicality: STAND exhibits superior prediction consistency and anomaly localization compared to unsupervised counterparts. These findings advocate for a data-centric shift in TSAD research, emphasizing label utilization over purely algorithmic complexity. The code is publicly available at https://github.com/EmorZz1G/STAND.
SD Dec 16, 2025
Memo2496: Expert-Annotated Dataset and Dual-View Adaptive Framework for Music Emotion Recognition
Qilin Li, C. L. Philip Chen, Tong Zhang
Music Emotion Recognition (MER) research faces challenges due to limited high-quality annotated datasets and difficulties in addressing cross-track feature drift. This work presents two primary contributions to address these issues. Memo2496, a large-scale dataset, offers 2496 instrumental music tracks with continuous valence-arousal labels, annotated by 30 certified music specialists. Annotation quality is ensured through calibration with extreme emotion exemplars and a consistency threshold of 0.25, measured by Euclidean distance in the valence-arousal space. Furthermore, the Dual-view Adaptive Music Emotion Recogniser (DAMER) is introduced. DAMER integrates three synergistic modules: Dual Stream Attention Fusion (DSAF) facilitates token-level bidirectional interaction between Mel spectrograms and cochleagrams via cross-attention mechanisms; Progressive Confidence Labelling (PCL) generates reliable pseudo labels employing curriculum-based temperature scheduling and consistency quantification using Jensen-Shannon divergence; and Style Anchored Memory Learning (SAML) maintains a contrastive memory queue to mitigate cross-track feature drift. Extensive experiments on the Memo2496, 1000songs, and PMEmo datasets demonstrate DAMER's state-of-the-art performance, improving arousal dimension accuracy by 3.43%, 2.25%, and 0.17%, respectively. Ablation studies and visualisation analyses validate each module's contribution. Both the dataset and source code are publicly available.
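Two of the quantities named above have simple closed forms: the Euclidean consistency check in valence-arousal space, and the Jensen-Shannon divergence between two predicted distributions. The sketch below shows both. How the paper applies the 0.25 threshold and how PCL uses the divergence are not stated in the abstract, so the "less than or equal" comparison, the function names, and the sample values are assumptions.

```python
import math

def annotations_consistent(label_a, label_b, threshold=0.25):
    """Check whether two (valence, arousal) labels lie within the threshold.

    The threshold value comes from the abstract; treating it as an inclusive
    upper bound on pairwise distance is an assumption.
    """
    return math.dist(label_a, label_b) <= threshold

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, skipping zero-probability terms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    mixture = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)

close_pair = annotations_consistent((0.5, 0.5), (0.6, 0.6))   # distance ~0.141
far_pair = annotations_consistent((0.0, 0.0), (0.3, 0.3))     # distance ~0.424
agreement = js_divergence([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
disagreement = js_divergence([1.0, 0.0], [0.0, 1.0])
```

A low divergence between two views of the same track indicates that their predictions agree, which is the kind of signal a pseudo-labelling scheme can use to decide whether a label is trustworthy.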
CV Aug 7, 2025 Code
Wavelet-Guided Dual-Frequency Encoding for Remote Sensing Change Detection
Xiaoyang Zhang, Guodong Fan, Guang-Yong Chen et al.
Change detection in remote sensing imagery plays a vital role in various engineering applications, such as natural disaster monitoring, urban expansion tracking, and infrastructure management. Despite the remarkable progress of deep learning in recent years, most existing methods still rely on spatial-domain modeling, where the limited diversity of feature representations hinders the detection of subtle change regions. We observe that frequency-domain feature modeling, particularly in the wavelet domain, can amplify fine-grained differences in frequency components, enhancing the perception of edge changes that are challenging to capture in the spatial domain. Thus, we propose a method called Wavelet-Guided Dual-Frequency Encoding (WGDF). Specifically, we first apply Discrete Wavelet Transform (DWT) to decompose the input images into high-frequency and low-frequency components, which are used to model local details and global structures, respectively. In the high-frequency branch, we design a Dual-Frequency Feature Enhancement (DFFE) module to strengthen edge detail representation and introduce a Frequency-Domain Interactive Difference (FDID) module to enhance the modeling of fine-grained changes. In the low-frequency branch, we exploit Transformers to capture global semantic relationships and employ a Progressive Contextual Difference Module (PCDM) to progressively refine change regions, enabling precise structural semantic characterization. Finally, the high- and low-frequency features are synergistically fused to unify local sensitivity with global discriminability. Extensive experiments on multiple remote sensing datasets demonstrate that WGDF significantly alleviates edge ambiguity and achieves superior detection accuracy and robustness compared to state-of-the-art methods. The code will be available at https://github.com/boshizhang123/WGDF.
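The DWT step that separates low- and high-frequency components can be illustrated with a single-level 2-D Haar transform. This is a minimal sketch assuming the Haar wavelet and an image with even height and width. The abstract does not say which wavelet WGDF uses, and the sub-band names below are descriptive labels rather than the paper's terminology.

```python
def haar_dwt2(image):
    """Single-level 2-D Haar transform of a list-of-rows image (even sizes).

    Returns the low-frequency approximation and three high-frequency detail
    sub-bands, each half the height and width of the input.
    """
    approx, horiz_diff, vert_diff, diag_diff = [], [], [], []
    for r in range(0, len(image), 2):
        rows = ([], [], [], [])
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            rows[0].append((a + b + d + e) / 2.0)   # local average (low frequency)
            rows[1].append((a - b + d - e) / 2.0)   # left-right difference
            rows[2].append((a + b - d - e) / 2.0)   # top-bottom difference
            rows[3].append((a - b - d + e) / 2.0)   # diagonal difference
        approx.append(rows[0])
        horiz_diff.append(rows[1])
        vert_diff.append(rows[2])
        diag_diff.append(rows[3])
    return approx, (horiz_diff, vert_diff, diag_diff)

flat = [[3.0, 3.0], [3.0, 3.0]]     # no edges: all detail bands are zero
edge = [[1.0, 0.0], [1.0, 0.0]]     # vertical edge: left-right band responds
flat_low, flat_high = haar_dwt2(flat)
edge_low, edge_high = haar_dwt2(edge)
```

The detail sub-bands respond only where neighbouring pixels differ, which is why a high-frequency branch is a natural place to look for subtle edge changes between two acquisition dates.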
IV Aug 3, 2025 Code
MGCR-Net: Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection
Chengming Wang, Guodong Fan, Jinjiang Li et al.
With the advancement of remote sensing satellite technology and the rapid progress of deep learning, remote sensing change detection (RSCD) has become a key technique for regional monitoring. Traditional change detection (CD) methods and deep learning-based approaches have made significant contributions to change analysis and detection; however, many outstanding methods still face limitations in the exploration and application of multimodal data. To address this, we propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to further explore the semantic interaction capabilities of multimodal data. Multimodal large language models (MLLMs) have attracted widespread attention for their outstanding performance in computer vision, particularly due to their powerful visual-language understanding and dialogic interaction capabilities. Specifically, we design an MLLM-based optimization strategy to generate multimodal textual data from the original CD images, which serve as textual input to MGCR. Visual and textual features are extracted through a dual encoder framework. For the first time in the RSCD task, we introduce a multimodal graph-conditioned vision-language reconstruction mechanism, which is integrated with graph attention to construct a semantic graph-conditioned reconstruction module (SGCM). This module generates vision-language (VL) tokens through graph-based conditions and enables cross-dimensional interaction between visual and textual features via multihead attention. The reconstructed VL features are then deeply fused using the language vision transformer (LViT), achieving fine-grained feature alignment and high-level semantic interaction. Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream CD methods. Our code is available at https://github.com/cn-xvkong/MGCR.
CV Jul 2, 2025 Code
DocShaDiffusion: Diffusion Model in Latent Space for Document Image Shadow Removal
Wenjie Liu, Bingshu Wang, Ze Wang et al.
Document shadow removal is a crucial task in the field of document image enhancement. However, existing methods tend to remove shadows with constant color background and ignore color shadows. In this paper, we first design a diffusion model in latent space for document image shadow removal, called DocShaDiffusion. It translates shadow images from pixel space to latent space, enabling the model to more easily capture essential features. To address the issue of color shadows, we design a shadow soft-mask generation module (SSGM). It is able to produce accurate shadow mask and add noise into shadow regions specially. Guided by the shadow mask, a shadow mask-aware guided diffusion module (SMGDM) is proposed to remove shadows from document images by supervising the diffusion and denoising process. We also propose a shadow-robust perceptual feature loss to preserve details and structures in document images. Moreover, we develop a large-scale synthetic document color shadow removal dataset (SDCSRD). It simulates the distribution of realistic color shadows and provides powerful supports for the training of models. Experiments on three public datasets validate the proposed method's superiority over state-of-the-art. Our code and dataset will be publicly available.
CVJan 2, 2025Code
Hadamard Attention Recurrent Transformer: A Strong Baseline for Stereo Matching TransformerZiyang Chen, Wenting Li, Yongjun Zhang et al.
Constrained by the low-rank bottleneck inherent in attention mechanisms, current stereo matching transformers suffer from limited nonlinear expressivity, which renders their feature representations sensitive to challenging conditions such as reflections. To overcome this difficulty, we present the Hadamard Attention Recurrent Stereo Transformer (HART). HART includes a novel attention mechanism that incorporates the following components: 1) The Dense Attention Kernel (DAK) maps the attention weight distribution into a high-dimensional space over $(0, +\infty)$. By removing the upper bound constraint on attention weights, DAK enables more flexible modeling of complex feature interactions. This reduces feature collinearity. 2) The Multi Kernel & Order Interaction (MKOI) module extends the attention mechanism by unifying semantic and spatial knowledge learning. This integration improves the ability of HART to learn features in binocular images. Experimental results demonstrate the effectiveness of HART. In reflective areas, HART ranked 1st on the KITTI 2012 benchmark among all published methods at the time of submission. Code is available at https://github.com/ZYangChen/HART.
CVDec 8, 2020Code
Towards Uncovering the Intrinsic Data Structures for Unsupervised Domain Adaptation using Structurally Regularized Deep ClusteringHui Tang, Xiatian Zhu, Ke Chen et al.
Unsupervised domain adaptation (UDA) aims to learn classification models that make predictions for unlabeled data on a target domain, given labeled data on a source domain whose distribution diverges from the target one. Mainstream UDA methods strive to learn domain-aligned features such that classifiers trained on the source features can be readily applied to the target ones. Although impressive results have been achieved, these methods have a potential risk of damaging the intrinsic data structures of target discrimination, raising an issue of generalization particularly for UDA tasks in an inductive setting. To address this issue, we are motivated by a UDA assumption of structural similarity across domains, and propose to directly uncover the intrinsic target discrimination via constrained clustering, where we constrain the clustering solutions using structural source regularization that hinges on the very same assumption. Technically, we propose a hybrid model of Structurally Regularized Deep Clustering, which integrates the regularized discriminative clustering of target data with a generative one, and we thus term our method H-SRDC. Our hybrid model is based on a deep clustering framework that minimizes the Kullback-Leibler divergence between the distribution of network prediction and an auxiliary one, where we impose structural regularization by learning a domain-shared classifier and cluster centroids. By enriching the structural similarity assumption, we are able to extend H-SRDC to the pixel-level UDA task of semantic segmentation. We conduct extensive experiments on seven UDA benchmarks of image classification and semantic segmentation. With no explicit feature alignment, our proposed H-SRDC outperforms all the existing methods under both the inductive and transductive settings. We make our implementation code publicly available at https://github.com/huitangtang/H-SRDC.
CRApr 11, 2012Code
A Novel Latin Square Image CipherYue Wu, Yicong Zhou, Joseph P. Noonan et al.
In this paper, we introduce a symmetric-key Latin square image cipher (LSIC) for grayscale and color images. Our contributions to the image encryption community are threefold: 1) we develop new Latin square image encryption primitives, including Latin Square Whitening, Latin Square S-box and Latin Square P-box; 2) we provide a new way of integrating probabilistic encryption in image encryption by embedding random noise in the least significant image bit-plane; and 3) we construct LSIC with these Latin square image encryption primitives, all on one keyed Latin square, in a new loom-like substitution-permutation network. Consequently, the proposed LSIC achieves many desired properties of a secure cipher, including a large key space, high key sensitivity, uniformly distributed ciphertext, excellent confusion and diffusion properties, semantic security, and robustness against channel noise. Theoretical analysis shows that the LSIC has good resistance to many attack models, including brute-force attacks, ciphertext-only attacks, known-plaintext attacks and chosen-plaintext attacks. Experimental analysis under extensive simulation results using the complete USC-SIPI Miscellaneous image dataset demonstrates that LSIC outperforms or reaches the state of the art suggested by many peer algorithms. All these analyses and results demonstrate that the LSIC is very suitable for digital image encryption. Finally, we open source the LSIC MATLAB code at https://sites.google.com/site/tuftsyuewu/source-code.
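To make the idea of a keyed Latin square and a whitening step concrete, here is a minimal Python sketch. It is not the LSIC construction from the paper: the function names, the way the square is derived from the key (shuffling rows, columns and symbols of a cyclic square with Python's non-cryptographic `random` module), and the modular-addition whitening are all illustrative assumptions.

```python
import random


def keyed_latin_square(n, key):
    """Build an n x n Latin square from an integer key (illustrative only).

    Assumption: rows, columns and symbols of the cyclic square (i + j) mod n
    are permuted with a seeded, non-cryptographic PRNG. A real cipher would
    need a cryptographically secure generator.
    """
    rng = random.Random(key)
    rows, cols, syms = list(range(n)), list(range(n)), list(range(n))
    rng.shuffle(rows)
    rng.shuffle(cols)
    rng.shuffle(syms)
    # Permuting rows, columns and symbols preserves the Latin property.
    return [[syms[(rows[i] + cols[j]) % n] for j in range(n)] for i in range(n)]


def is_latin(square):
    """Check that every symbol appears exactly once in each row and column."""
    n = len(square)
    full = set(range(n))
    rows_ok = all(set(row) == full for row in square)
    cols_ok = all({square[i][j] for i in range(n)} == full for j in range(n))
    return rows_ok and cols_ok


def whiten(block, square):
    """Add the Latin square to an n x n block of values in [0, n), modulo n."""
    n = len(square)
    return [[(block[i][j] + square[i][j]) % n for j in range(n)] for i in range(n)]


def unwhiten(block, square):
    """Invert whiten() by subtracting the same Latin square modulo n."""
    n = len(square)
    return [[(block[i][j] - square[i][j]) % n for j in range(n)] for i in range(n)]
```

For 8-bit images one would use `n = 256` so that each entry of the square is a valid pixel value; the same square can then drive substitution and permutation steps, which this sketch does not attempt.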
GRNov 11, 2024
Scaling Mesh Generation via Compressive TokenizationHaohan Weng, Zibo Zhao, Biwen Lei et al.
We propose a compressive yet effective mesh representation, Blocked and Patchified Tokenization (BPT), facilitating the generation of meshes exceeding 8k faces. BPT compresses mesh sequences by employing block-wise indexing and patch aggregation, reducing their length by approximately 75% compared to the original sequences. This compression milestone unlocks the potential to utilize mesh data with significantly more faces, thereby enhancing detail richness and improving generation robustness. Empowered by BPT, we have built a foundation mesh generative model trained on scaled mesh data to support flexible control for point clouds and images. Our model demonstrates the capability to generate meshes with intricate details and accurate topology, achieving SoTA performance on mesh generation and reaching the level for direct product usage.
CVApr 27, 2024
FDCE-Net: Underwater Image Enhancement with Embedding Frequency and Dual Color EncoderZheng Cheng, Guodong Fan, Jingchun Zhou et al.
Underwater images often suffer from various issues such as low brightness, color shift, blurred details, and noise due to light absorption and scattering caused by water and suspended particles. Previous underwater image enhancement (UIE) methods have primarily focused on spatial domain enhancement, neglecting the frequency domain information inherent in the images. However, the degradation factors of underwater images are closely intertwined in the spatial domain. Although certain methods focus on enhancing images in the frequency domain, they overlook the inherent relationship between the image degradation factors and the information present in the frequency domain. As a result, these methods frequently enhance certain attributes of the improved image while inadequately addressing or even exacerbating other attributes. Moreover, many existing methods heavily rely on prior knowledge to address color shift problems in underwater images, limiting their flexibility and robustness. In order to overcome these limitations, we propose the Embedding Frequency and Dual Color Encoder Network (FDCE-Net) in our paper. The FDCE-Net consists of two main structures: (1) Frequency Spatial Network (FS-Net) aims to achieve initial enhancement by utilizing our designed Frequency Spatial Residual Block (FSRB) to decouple image degradation factors in the frequency domain and enhance different attributes separately. (2) To tackle the color shift issue, we introduce the Dual-Color Encoder (DCE). The DCE establishes correlations between color and semantic representations through cross-attention and leverages multi-scale image features to guide the optimization of adaptive color query. The final enhanced images are generated by combining the outputs of FS-Net and DCE through a fusion network. These images exhibit rich details, clear textures, low noise and natural colors.
CVFeb 8, 2024
SpirDet: Towards Efficient, Accurate and Lightweight Infrared Small Target DetectorQianchen Mao, Qiang Li, Bingshu Wang et al.
In recent years, the detection of infrared small targets using deep learning methods has garnered substantial attention due to notable advancements. To improve the detection capability of small targets, these methods commonly maintain a pathway that preserves high-resolution features of sparse and tiny targets. However, it can result in redundant and expensive computations. To tackle this challenge, we propose SpirDet, a novel approach for efficient detection of infrared small targets. Specifically, to cope with the computational redundancy issue, we employ a new dual-branch sparse decoder to restore the feature map. Firstly, the fast branch directly predicts a sparse map indicating potential small target locations (occupying only 0.5% of the map area). Secondly, the slow branch conducts fine-grained adjustments at the positions indicated by the sparse map. Additionally, we design a lightweight DO-RepEncoder based on reparameterization with the Downsampling Orthogonality, which can effectively reduce memory consumption and inference latency. Extensive experiments show that the proposed SpirDet significantly outperforms state-of-the-art models while achieving faster inference speed and fewer parameters. For example, on the IRSTD-1K dataset, SpirDet improves MIoU by 4.7 and achieves a $7\times$ FPS acceleration compared to the previous state-of-the-art model. The code will be open to the public.
LGDec 18, 2023
AI-Based Energy Transportation Safety: Pipeline Radial Threat Estimation Using Intelligent Sensing SystemChengyuan Zhu, Yiyuan Yang, Kaixiang Yang et al.
The application of artificial intelligence technology has greatly enhanced the safety of energy pipelines, particularly in safeguarding against external threats. The predominant methods involve the integration of intelligent sensors to detect external vibration, enabling the identification of event types and locations, thereby replacing manual detection methods. However, practical implementation has exposed a limitation in current methods: their constrained ability to accurately discern the spatial dimensions of external signals, which complicates the authentication of threat events. Our research endeavors to overcome the above issues by harnessing deep learning techniques to achieve a more fine-grained recognition and localization process. This refinement is crucial in effectively identifying genuine threats to pipelines, thus enhancing the safety of energy transportation. This paper proposes a radial threat estimation method for energy pipelines based on distributed optical fiber sensing technology. Specifically, we introduce a continuous multi-view and multi-domain feature fusion methodology to extract comprehensive signal features and construct a threat estimation and recognition network. The utilization of collected acoustic signal data is optimized, and the underlying principle is elucidated. Moreover, we incorporate the concept of transfer learning through a pre-trained model, enhancing both recognition accuracy and training efficiency. Empirical evidence gathered from real-world scenarios underscores the efficacy of our method, notably in its substantial reduction of false alarms and remarkable gains in recognition accuracy. More generally, our method exhibits versatility and can be extrapolated to a broader spectrum of recognition tasks and scenarios.
LGApr 14, 2024
Incremental Self-training for Semi-supervised LearningJifeng Guo, Zhulin Liu, Tong Zhang et al.
Semi-supervised learning provides a solution to reduce the dependency of machine learning on labeled data. As one of the efficient semi-supervised techniques, self-training (ST) has received increasing attention. Several advancements have emerged to address challenges associated with noisy pseudo-labels. Previous works on self-training acknowledge the importance of unlabeled data but have not delved into their efficient utilization, nor have they paid attention to the problem of high time consumption caused by iterative learning. This paper proposes Incremental Self-training (IST) for semi-supervised learning to fill these gaps. Unlike ST, which processes all data indiscriminately, IST processes data in batches and prioritizes assigning pseudo-labels to unlabeled samples with high certainty. Then, it processes the data around the decision boundary after the model is stabilized, enhancing classifier performance. Our IST is simple yet effective and fits existing self-training-based semi-supervised learning methods. We verify the proposed IST on five datasets and two types of backbones, effectively improving the recognition accuracy and learning speed. Significantly, it outperforms state-of-the-art competitors on three challenging image classification tasks.
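The prioritization described above can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' algorithm: the function name, the fixed confidence threshold, and the use of the maximum class probability as the certainty score are all assumptions.

```python
def prioritized_pseudo_labels(probs, threshold=0.9):
    """Split unlabeled samples into confident pseudo-labels and deferred ones.

    probs: list of per-sample class-probability lists from the current model.
    Returns (confident, deferred):
      confident - list of (sample_index, predicted_label), most certain first
      deferred  - indices of low-certainty samples (assumed to lie near the
                  decision boundary), to be handled once the model stabilizes
    """
    scored = []
    for idx, p in enumerate(probs):
        conf = max(p)  # assumption: certainty = maximum class probability
        scored.append((conf, idx, p.index(conf)))
    scored.sort(key=lambda t: -t[0])  # highest certainty first
    confident = [(idx, label) for conf, idx, label in scored if conf >= threshold]
    deferred = [idx for conf, idx, label in scored if conf < threshold]
    return confident, deferred
```

In a training loop under these assumptions, each batch of unlabeled data would pass through this function, the confident pairs would be added to the labeled pool immediately, and the deferred indices would be revisited in a later stage.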
LGDec 7, 2024
A New Perspective on Time Series Anomaly Detection: Faster Patch-based Broad Learning SystemPengyu Li, Zhijie Zhong, Tong Zhang et al.
Time series anomaly detection (TSAD) has been a research hotspot in both academia and industry in recent years. Deep learning methods have become the mainstream research direction due to their excellent performance. However, recent TSAD research has raised a new viewpoint: deep learning is not necessarily required for TSAD, given limitations such as its slow speed. The Broad Learning System (BLS) is a shallow network framework that benefits from its ease of optimization and speed. It has been shown to outperform machine learning approaches while remaining competitive with deep learning. Based on the current situation of TSAD, we propose the Contrastive Patch-based Broad Learning System (CPatchBLS). This is a new exploration of the patching technique and BLS, providing a new perspective for TSAD. We construct Dual-PatchBLS as a base through patching and Simple Kernel Perturbation (SKP) and utilize contrastive learning to capture the differences between normal and abnormal data under different representations. To compensate for the temporal semantic loss caused by various patching, we propose CPatchBLS with model-level integration, which takes advantage of the speed of BLS to build model-level integration and improve model detection. Using five real-world time series anomaly detection datasets, we confirmed the method's efficacy, outperforming previous deep learning and machine learning methods while retaining a high level of computing efficiency.
LGJun 19, 2025
CRIA: A Cross-View Interaction and Instance-Adapted Pre-training Framework for Generalizable EEG RepresentationsPuchun Liu, C. L. Philip Chen, Yubin He et al.
The difficulty of extracting deep features from EEG data and effectively integrating information from multiple views presents significant challenges for developing a generalizable pretraining framework for EEG representation learning. However, most existing pre-training methods rely solely on the contextual semantics of a single view, failing to capture the complex and synergistic interactions among different perspectives, limiting the expressiveness and generalization of learned representations. To address these issues, this paper proposes CRIA, an adaptive framework that utilizes variable-length and variable-channel coding to achieve a unified representation of EEG data across different datasets. In this work, we define cross-view information as the integrated representation that emerges from the interaction among temporal, spectral, and spatial views of EEG signals. The model employs a cross-attention mechanism to fuse temporal, spectral, and spatial features effectively, and combines an attention matrix masking strategy based on the information bottleneck principle with a novel viewpoint masking pre-training scheme. Experimental results on the Temple University EEG corpus and the CHB-MIT dataset show that CRIA outperforms existing methods with the same pre-training conditions, achieving a balanced accuracy of 57.02% for multi-class event classification and 80.03% for anomaly detection, highlighting its strong generalization ability.
IVMar 24, 2025
FACE: Few-shot Adapter with Cross-view Fusion for Cross-subject EEG Emotion RecognitionHaiqi Liu, C. L. Philip Chen, Tong Zhang
Cross-subject EEG emotion recognition is challenged by significant inter-subject variability and intricately entangled intra-subject variability. Existing works have primarily addressed these challenges through domain adaptation or generalization strategies. However, they typically require extensive target subject data or demonstrate limited generalization performance to unseen subjects. Recent few-shot learning paradigms attempt to address these limitations but often encounter catastrophic overfitting during subject-specific adaptation with limited samples. This article introduces the few-shot adapter with a cross-view fusion method called FACE for cross-subject EEG emotion recognition, which leverages dynamic multi-view fusion and effective subject-specific adaptation. Specifically, FACE incorporates a cross-view fusion module that dynamically integrates global brain connectivity with localized patterns via subject-specific fusion weights to provide complementary emotional information. Moreover, the few-shot adapter module is proposed to enable rapid adaptation for unseen subjects while reducing overfitting by enhancing adapter structures with meta-learning. Experimental results on three public EEG emotion recognition benchmarks demonstrate FACE's superior generalization performance over state-of-the-art methods. FACE provides a practical solution for cross-subject scenarios with limited labeled data.
CVMar 4, 2025
10K is Enough: An Ultra-Lightweight Binarized Network for Infrared Small-Target DetectionBiqiao Xin, Qianchen Mao, Bingshu Wang et al.
The widespread deployment of Infrared Small-Target Detection (IRSTD) algorithms on edge devices necessitates the exploration of model compression techniques. Binarized neural networks (BNNs) are distinguished by their exceptional efficiency in model compression. However, the small size of infrared targets introduces stringent precision requirements for the IRSTD task, while the inherent precision loss during binarization presents a significant challenge. To address this, we propose the Binarized Infrared Small-Target Detection Network (BiisNet), which preserves the core operations of binarized convolutions while integrating full-precision features into the network's information flow. Specifically, we propose the Dot Binary Convolution, which retains fine-grained semantic information in feature maps while still leveraging the binarized convolution operations. In addition, we introduce a smooth and adaptive Dynamic Softsign function, which provides more comprehensive and progressively finer gradient during backpropagation, enhancing model stability and promoting an optimal weight distribution. Experimental results demonstrate that BiisNet not only significantly outperforms other binary architectures but also has strong competitiveness among state-of-the-art full-precision models.
CLFeb 1, 2025
DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active LearningJiaxin Guo, C. L. Philip Chen, Shuzhen Li et al.
Cold-start active learning (CSAL) selects valuable instances from an unlabeled dataset for manual annotation. It provides high-quality data at a low annotation cost for label-scarce text classification. However, existing CSAL methods overlook weak classes and hard representative examples, resulting in biased learning. To address these issues, this paper proposes a novel dual-diversity enhancing and uncertainty-aware (DEUCE) framework for CSAL. Specifically, DEUCE leverages a pretrained language model (PLM) to efficiently extract textual representations, class predictions, and predictive uncertainty. Then, it constructs a Dual-Neighbor Graph (DNG) to combine information on both textual diversity and class diversity, ensuring a balanced data distribution. It further propagates uncertainty information via density-based clustering to select hard representative instances. DEUCE performs well in selecting class-balanced and hard representative data by dual-diversity and informativeness. Experiments on six NLP datasets demonstrate the superiority and efficiency of DEUCE.
LGDec 16, 2025
PathFinder: Advancing Path Loss Prediction for Single-to-Multi-Transmitter ScenarioZhijie Zhong, Zhiwen Yu, Pengyu Li et al.
Radio path loss prediction (RPP) is critical for optimizing 5G networks and enabling IoT, smart city, and similar applications. However, current deep learning-based RPP methods lack proactive environmental modeling, struggle with realistic multi-transmitter scenarios, and generalize poorly under distribution shifts, particularly when training/testing environments differ in building density or transmitter configurations. This paper identifies three key issues: (1) passive environmental modeling that overlooks transmitters and key environmental features; (2) overemphasis on single-transmitter scenarios despite real-world multi-transmitter prevalence; (3) excessive focus on in-distribution performance while neglecting distribution shift challenges. To address these, we propose PathFinder, a novel architecture that actively models buildings and transmitters via disentangled feature encoding and integrates Mask-Guided Low-rank Attention to independently focus on receiver and building regions. We also introduce a Transmitter-Oriented Mixup strategy for robust training and a new benchmark, single-to-multi-transmitter RPP (S2MT-RPP), tailored to evaluate extrapolation performance (multi-transmitter testing after single-transmitter training). Experimental results show PathFinder outperforms state-of-the-art methods significantly, especially in challenging multi-transmitter scenarios. Our code and project site are publicly available at: https://emorzz1g.github.io/PathFinder/.
CVJan 12
Test-time Adaptive Hierarchical Co-enhanced Denoising Network for Reliable Multimodal ClassificationShu Shen, C. L. Philip Chen, Tong Zhang
Reliable learning on low-quality multimodal data is an issue of wide concern, especially in safety-critical applications. However, multimodal noise poses a major challenge in this domain and leads existing methods to suffer from two key limitations. First, they struggle to reliably remove heterogeneous data noise, hindering robust multimodal representation learning. Second, they exhibit limited adaptability and generalization when encountering previously unseen noise. To address these issues, we propose Test-time Adaptive Hierarchical Co-enhanced Denoising Network (TAHCD). On one hand, TAHCD introduces the Adaptive Stable Subspace Alignment and Sample-Adaptive Confidence Alignment to reliably remove heterogeneous noise. They account for noise at both global and instance levels and enable joint removal of modality-specific and cross-modality noise, achieving robust learning. On the other hand, TAHCD introduces test-time cooperative enhancement, which adaptively updates the model in response to input noise in a label-free manner, improving adaptability and generalization. This is achieved by collaboratively enhancing the joint removal process of modality-specific and cross-modality noise across global and instance levels according to sample noise. Experiments on multiple benchmarks demonstrate that the proposed method achieves superior classification performance, robustness, and generalization compared with state-of-the-art reliable multimodal learning approaches.
CVAug 27, 2025
AIM: Adaptive Intra-Network Modulation for Balanced Multimodal LearningShu Shen, C. L. Philip Chen, Tong Zhang
Multimodal learning has significantly enhanced machine learning performance but still faces numerous challenges and limitations. Imbalanced multimodal learning is one of the problems extensively studied in recent works and is typically mitigated by modulating the learning of each modality. However, we find that these methods typically hinder the dominant modality's learning to promote weaker modalities, which affects overall multimodal performance. We analyze the cause of this issue and highlight a commonly overlooked problem: optimization bias within networks. To address this, we propose Adaptive Intra-Network Modulation (AIM) to improve balanced modality learning. AIM accounts for differences in optimization state across parameters and depths within the network during modulation, achieving balanced multimodal learning without hindering either dominant or weak modalities for the first time. Specifically, AIM decouples the dominant modality's under-optimized parameters into Auxiliary Blocks and encourages reliance on these performance-degraded blocks for joint training with weaker modalities. This approach effectively prevents suppression of weaker modalities while enabling targeted optimization of under-optimized parameters to improve the dominant modality. Additionally, AIM assesses modality imbalance level across network depths and adaptively adjusts modulation strength at each depth. Experimental results demonstrate that AIM outperforms state-of-the-art imbalanced modality learning methods across multiple benchmarks and exhibits strong generalizability across different backbones, fusion strategies, and optimizers.
CVMar 4, 2025
Exploring Token-Level Augmentation in Vision Transformer for Semi-Supervised Semantic SegmentationDengke Zhang, Quan Tang, Fagui Liu et al.
Semi-supervised semantic segmentation has witnessed remarkable advancements in recent years. However, existing algorithms are based on convolutional neural networks, and directly applying them to Vision Transformers poses certain limitations due to conceptual disparities. To this end, we propose TokenMix, a data augmentation technique specifically designed for semi-supervised semantic segmentation with Vision Transformers. TokenMix aligns well with the global attention mechanism by mixing images at the token level, enhancing learning capability for contextual information among image patches. We further incorporate image augmentation and feature augmentation to promote the diversity of augmentation. Moreover, to enhance consistency regularization, we propose a dual-branch framework where each branch applies image and feature augmentation to the input image. We conduct extensive experiments across multiple benchmark datasets, including Pascal VOC 2012, Cityscapes, and COCO. Results suggest that the proposed method outperforms state-of-the-art algorithms with notable accuracy improvements, especially under limited fine annotations.
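As a rough illustration of what mixing images at the token level can look like, the sketch below swaps a random subset of patch-aligned tokens between two single-channel images stored as nested lists. The abstract does not specify how tokens are selected or how labels are mixed, so the function name, the uniform random selection, the `ratio` parameter and the returned token mask are all assumptions.

```python
import random


def token_mix(img_a, img_b, patch, ratio=0.5, seed=0):
    """Replace a random subset of patch-sized tokens of img_a with img_b's.

    img_a, img_b: equally sized 2D lists whose height and width are assumed
                  to be multiples of `patch` (the token size of the ViT).
    Returns (mixed, mask) where mask[r][c] == 1 marks tokens taken from img_b.
    """
    h, w = len(img_a), len(img_a[0])
    gh, gw = h // patch, w // patch
    rng = random.Random(seed)
    tokens = [(r, c) for r in range(gh) for c in range(gw)]
    swapped = set(rng.sample(tokens, int(round(ratio * len(tokens)))))
    mixed = [row[:] for row in img_a]  # copy so img_a is left untouched
    for r, c in swapped:
        for y in range(r * patch, (r + 1) * patch):
            for x in range(c * patch, (c + 1) * patch):
                mixed[y][x] = img_b[y][x]
    mask = [[1 if (r, c) in swapped else 0 for c in range(gw)] for r in range(gh)]
    return mixed, mask
```

Because the mask is defined on the token grid rather than on pixels, every token fed to the transformer comes entirely from one of the two images; the same mask could be upsampled to mix pseudo-label maps, although that step is not shown here.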
CVFeb 27, 2025
MICINet: Multi-Level Inter-Class Confusing Information Removal for Reliable Multimodal ClassificationTong Zhang, Shu Shen, C. L. Philip Chen
Reliable multimodal learning in the presence of noisy data is an issue of wide concern, especially in safety-critical applications. Many reliable multimodal methods delve into addressing modality-specific or cross-modality noise. However, they fail to handle the coexistence of both types of noise efficiently. Moreover, the lack of comprehensive consideration for noise at both global and individual levels limits their reliability. To address these issues, a reliable multimodal classification method dubbed Multi-Level Inter-Class Confusing Information Removal Network (MICINet) is proposed. MICINet achieves the reliable removal of both types of noise by unifying them into the concept of Inter-class Confusing Information (ICI) and eliminating it at both global and individual levels. Specifically, MICINet first reliably learns the global ICI distribution through the proposed Global ICI Learning Module. Then, it introduces the Global-guided Sample ICI Learning module to efficiently remove global-level ICI from sample features utilizing the learned global ICI distribution. Subsequently, the Sample-adaptive Cross-modality Information Compensation module is designed to remove individual-level ICI from each sample reliably. This is achieved through interpretable cross-modality information compensation based on the complementary relationship between discriminative features and ICI and the perception of the relative quality of modalities introduced by the relative discriminative power. Experiments on four datasets demonstrate that MICINet outperforms other state-of-the-art reliable multimodal classification methods under various noise conditions.
LGJan 28, 2025
Online-BLS: An Accurate and Efficient Online Broad Learning System for Data Stream ClassificationChunyu Lei, Guang-Ze Chen, C. L. Philip Chen et al.
State-of-the-art online learning models generally conduct a single online gradient descent step when a new sample arrives and thus suffer from suboptimal model weights. To this end, we introduce an online broad learning system framework with closed-form solutions for each online update. Rather than employing existing incremental broad learning algorithms for online learning tasks, which tend to incur degraded accuracy and expensive online update overhead, we design an effective weight estimation algorithm and an efficient online updating strategy to remedy these two deficiencies, respectively. Specifically, an effective weight estimation algorithm is first developed by replacing notorious matrix inverse operations with Cholesky decomposition and forward-backward substitution to improve model accuracy. Second, we devise an efficient online updating strategy that dramatically reduces online update time. Theoretical analysis establishes a favorable error bound and low time complexity for our model. Experiments under the widely used test-then-training evaluation protocol on various real-world datasets demonstrate its superiority and efficiency. Furthermore, our framework is naturally extended to data stream scenarios with concept drift and exceeds state-of-the-art baselines.
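The replacement of an explicit matrix inverse by Cholesky decomposition plus forward and backward substitution can be sketched in a few lines of plain Python. The sketch assumes the usual ridge-regression form of broad learning output weights, $W = (H^\top H + \lambda I)^{-1} H^\top T$; the function names are hypothetical, and the paper's online update rule is not reproduced here.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L L^T = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def solve_spd(a, b):
    """Solve a x = b using Cholesky, then forward and backward substitution."""
    n = len(a)
    low = cholesky(a)
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = b
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # backward substitution: L^T x = y
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def ridge_weights(h, t, lam=1e-3):
    """Ridge solution of (H^T H + lam I) w = H^T t without forming an inverse.

    h: list of feature rows, t: list of scalar targets (single output assumed).
    """
    n_feat = len(h[0])
    a = [[sum(row[i] * row[j] for row in h) + (lam if i == j else 0.0)
          for j in range(n_feat)] for i in range(n_feat)]
    b = [sum(row[i] * ti for row, ti in zip(h, t)) for i in range(n_feat)]
    return solve_spd(a, b)
```

Solving the triangular systems avoids computing the inverse itself, which is the numerical motivation the abstract alludes to; a production implementation would rely on a tested linear algebra library rather than these hand-written loops.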
CVDec 19, 2024
Multi-QuAD: Multi-Level Quality-Adaptive Dynamic Network for Reliable Multimodal ClassificationShu Shen, C. L. Philip Chen, Tong Zhang
Multimodal machine learning has achieved remarkable progress in many scenarios, but its reliability is undermined by varying sample quality. This paper finds that existing reliable multimodal classification methods not only fail to provide robust estimation of data quality, but also lack dynamic networks for sample-specific depth and parameters to achieve reliable inference. To this end, a novel framework for multimodal reliable classification termed Multi-level Quality-Adaptive Dynamic multimodal network (Multi-QuAD) is proposed. Multi-QuAD first adopts a novel approach based on noise-free prototypes and a classifier-free design to reliably estimate the quality of each sample at both modality and feature levels. It then achieves sample-specific network depth via the Global Confidence Normalized Depth (GCND) mechanism. By normalizing depth across modalities and samples, GCND effectively mitigates the impact of challenging modality inputs on dynamic depth reliability. Furthermore, Multi-QuAD provides sample-adaptive network parameters via the Layer-wise Greedy Parameter (LGP) mechanism driven by feature-level quality. The cross-modality layer-wise greedy strategy in LGP designs a reliable parameter prediction paradigm for multimodal networks with variable architecture for the first time. Experiments conducted on four datasets demonstrate that Multi-QuAD significantly outperforms state-of-the-art methods in classification performance and reliability, exhibiting strong adaptability to data with diverse quality.
OCApr 3, 2024
Deep Reinforcement Learning for Traveling Purchaser ProblemsHaofeng Yuan, Rongping Zhu, Wanlu Yang et al.
The traveling purchaser problem (TPP) is an important combinatorial optimization problem with broad applications. Due to the coupling between routing and purchasing, existing works on TPPs commonly address route construction and purchase planning simultaneously, which, however, leads to exact methods with high computational cost and heuristics with sophisticated design but limited performance. In sharp contrast, we propose a novel approach based on deep reinforcement learning (DRL), which addresses route construction and purchase planning separately, while evaluating and optimizing the solution from a global perspective. The key components of our approach include a bipartite graph representation for TPPs to capture the market-product relations, and a policy network that extracts information from the bipartite graph and uses it to sequentially construct the route. One significant advantage of our framework is that we can efficiently construct the route using the policy network, and once the route is determined, the associated purchasing plan can be easily derived through linear programming, while, by leveraging DRL, we can train the policy network towards optimizing the global solution objective. Furthermore, by introducing a meta-learning strategy, the policy network can be trained stably on large-sized TPP instances, and generalize well across instances of varying sizes and distributions, even to much larger instances that are never seen during training. Experiments on various synthetic TPP instances and the TPPLIB benchmark demonstrate that our DRL-based approach can significantly outperform well-established TPP heuristics, reducing the optimality gap by 40%-90%, and also showing an advantage in runtime, especially on large-sized instances.
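To make the "route first, purchases second" decomposition concrete, here is a minimal sketch of deriving a purchasing plan once the set of visited markets is fixed. It assumes a formulation with per-market unit prices and limited supplies and a fixed demand per product; under that assumption the purchasing linear program separates by product, and buying from the cheapest visited markets first solves it. The function name and data layout are hypothetical, and the paper itself describes solving this step through linear programming.

```python
def purchase_plan(visited, demand, price, supply):
    """Cheapest-first purchasing for a fixed set of visited markets.

    visited: list of market ids on the route
    demand:  {product: required quantity}
    price:   {(market, product): unit price}
    supply:  {(market, product): available quantity}
    Returns (plan, total_cost) with plan = {(market, product): quantity}.
    """
    plan = {}
    total_cost = 0.0
    for product, need in demand.items():
        # Offers for this product at visited markets, cheapest first.
        offers = sorted(
            (price[(m, product)], m)
            for m in visited
            if supply.get((m, product), 0) > 0
        )
        remaining = need
        for unit_price, market in offers:
            if remaining <= 0:
                break
            qty = min(remaining, supply[(market, product)])
            plan[(market, product)] = qty
            total_cost += unit_price * qty
            remaining -= qty
        if remaining > 0:
            raise ValueError(f"route cannot satisfy demand for {product!r}")
    return plan, total_cost
```

A route that cannot cover some product's demand raises an error, which is one way a route-construction policy could be told that the route is infeasible.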
CV Mar 14, 2024
Desigen: A Pipeline for Controllable Design Template Generation
Haohan Weng, Danqing Huang, Yu Qiao et al.
Templates serve as a good starting point to implement a design (e.g., banner, slide) but it takes great effort from designers to manually create. In this paper, we present Desigen, an automatic template creation pipeline which generates background images as well as harmonious layout elements over the background. Different from natural images, a background image should preserve enough non-salient space for the overlaying layout elements. To equip existing advanced diffusion-based models with stronger spatial control, we propose two simple but effective techniques to constrain the saliency distribution and reduce the attention weight in desired regions during the background generation process. Then conditioned on the background, we synthesize the layout with a Transformer-based autoregressive generator. To achieve a more harmonious composition, we propose an iterative inference strategy to adjust the synthesized background and layout in multiple rounds. We constructed a design dataset with more than 40k advertisement banners to verify our approach. Extensive experiments demonstrate that the proposed pipeline generates high-quality templates comparable to human designers. More than a single-page design, we further show an application of presentation generation that outputs a set of theme-consistent slides. The data and code are available at https://whaohan.github.io/desigen.
CV May 25, 2023
High-Similarity-Pass Attention for Single Image Super-Resolution
Jian-Nan Su, Min Gan, Guang-Yong Chen et al.
Recent developments in the field of non-local attention (NLA) have led to a renewed interest in self-similarity-based single image super-resolution (SISR). Researchers usually used the NLA to explore non-local self-similarity (NSS) in SISR and achieve satisfactory reconstruction results. However, a surprising phenomenon that the reconstruction performance of the standard NLA is similar to the NLA with randomly selected regions stimulated our interest to revisit NLA. In this paper, we first analyzed the attention map of the standard NLA from different perspectives and discovered that the resulting probability distribution always has full support for every local feature, which implies a statistical waste of assigning values to irrelevant non-local features, especially for SISR which needs to model long-range dependence with a large number of redundant non-local features. Based on these findings, we introduced a concise yet effective soft thresholding operation to obtain high-similarity-pass attention (HSPA), which is beneficial for generating a more compact and interpretable distribution. Furthermore, we derived some key properties of the soft thresholding operation that enable training our HSPA in an end-to-end manner. The HSPA can be integrated into existing deep SISR models as an efficient general building block. In addition, to demonstrate the effectiveness of the HSPA, we constructed a deep high-similarity-pass attention network (HSPAN) by integrating a few HSPAs in a simple backbone. Extensive experimental results demonstrate that HSPAN outperforms state-of-the-art approaches on both quantitative and qualitative evaluations.
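The contrast between full-support softmax attention and a thresholded alternative can be sketched as follows. This is a generic soft-thresholding example, not the paper's HSPA operator: the fixed threshold `tau`, the renormalization step, and the uniform fallback are assumptions chosen here to keep the example short.

```python
import numpy as np

def softmax(scores):
    """Standard softmax over the last axis; every entry receives nonzero weight."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def soft_threshold_attention(scores, tau):
    """Illustrative sparse attention: shrink scores by tau, clip at zero, renormalize.

    Entries with similarity <= tau receive exactly zero weight.
    tau is a hypothetical fixed threshold used only for this sketch.
    """
    kept = np.maximum(scores - tau, 0.0)
    total = kept.sum(axis=-1, keepdims=True)
    safe_total = np.where(total > 0, total, 1.0)
    uniform = 1.0 / scores.shape[-1]
    # Fall back to uniform weights for rows where nothing passes the threshold.
    return np.where(total > 0, kept / safe_total, uniform)

scores = np.array([[0.9, 0.1, 0.8, -0.2]])
dense = softmax(scores)
sparse = soft_threshold_attention(scores, tau=0.5)
```

Here `dense` assigns positive weight to all four positions, while `sparse` keeps only the two high-similarity positions and zeroes out the rest.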
CV May 12, 2023
Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual Recognition
Haiqi Liu, C. L. Philip Chen, Xinrong Gong et al.
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision. Existing literature addresses this challenge by employing local-based representation approaches, which may not sufficiently facilitate meaningful object-specific semantic understanding, leading to a reliance on apparent background correlations. Moreover, they primarily rely on high-dimensional local descriptors to construct complex embedding space, potentially limiting the generalization. To address the above challenges, this article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition. RSaD introduces additional saliency-aware supervision via saliency detection to guide the model toward focusing on the intrinsic discriminative regions. Specifically, RSaD utilizes the saliency detection model to emphasize the critical regions of each sub-category, providing additional object-specific information for fine-grained prediction. RSaD transfers such information with two symmetric branches in a mutual learning paradigm. Furthermore, RSaD exploits inter-regional relationships to enhance the informativeness of the representation and subsequently summarize the highlighted details into contextual embeddings to facilitate the effective transfer, enabling quick generalization to novel sub-categories. The proposed approach is empirically evaluated on three widely used benchmarks, demonstrating its superior performance.
SD Jan 15, 2022
A Novel Multi-Task Learning Method for Symbolic Music Emotion Recognition
Jibao Qiu, C. L. Philip Chen, Tong Zhang
Symbolic Music Emotion Recognition (SMER) aims to predict music emotion from symbolic data, such as MIDI and MusicXML. Previous work mainly focused on learning better representations via (masked) language model pre-training but ignored the intrinsic structure of the music, which is extremely important to the emotional expression of music. In this paper, we present a simple multi-task framework for SMER, which incorporates the emotion recognition task with other emotion-related auxiliary tasks derived from the intrinsic structure of the music. The results show that our multi-task framework can be adapted to different models. Moreover, the labels of the auxiliary tasks are easy to obtain, which means our multi-task methods do not require manually annotated labels other than emotion. Experiments conducted on two publicly available datasets (EMOPIA and VGMIDI) show that our methods perform better on the SMER task. Specifically, accuracy increases by 4.17 absolute points to 67.58 on the EMOPIA dataset, and by 1.97 absolute points to 55.85 on the VGMIDI dataset. Ablation studies also show the effectiveness of the multi-task methods designed in this paper.
CV Jan 15, 2022
OneDConv: Generalized Convolution For Transform-Invariant Representation
Tong Zhang, Haohan Weng, Ke Yi et al.
Convolutional Neural Networks (CNNs) have exhibited great power in a variety of vision tasks. However, the lack of a transform-invariant property limits their further application in complicated real-world scenarios. In this work, we propose a novel generalized one-dimensional convolutional operator (OneDConv), which dynamically transforms the convolution kernels based on the input features in a computationally and parametrically efficient manner. The proposed operator can extract transform-invariant features naturally. It improves the robustness and generalization of convolution without sacrificing performance on common images. The proposed OneDConv operator can substitute for the vanilla convolution, so it can be readily incorporated into current popular convolutional architectures and trained end-to-end. On several popular benchmarks, OneDConv outperforms the original convolution operation and other proposed models on both canonical and distorted images.
CV Jan 13, 2022
A Survey on Masked Facial Detection Methods and Datasets for Fighting Against COVID-19
Bingshu Wang, Jiangbin Zheng, C. L. Philip Chen
Coronavirus disease 2019 (COVID-19) continues to pose a great challenge to the world since its outbreak. To fight against the disease, a series of artificial intelligence (AI) techniques have been developed and applied to real-world scenarios such as safety monitoring, disease diagnosis, infection risk assessment, lesion segmentation of COVID-19 CT scans, etc. The coronavirus epidemic has forced people to wear masks to counteract the transmission of the virus, which also makes it difficult to monitor large groups of people wearing masks. In this paper, we primarily focus on the AI techniques of masked facial detection and related datasets. We survey the recent advances, beginning with descriptions of masked facial detection datasets. Thirteen available datasets are described and discussed in detail. Then, the methods are roughly categorized into two classes: conventional methods and neural network-based methods. Conventional methods, which account for a small proportion, are usually trained by boosting algorithms with hand-crafted features. Neural network-based methods are further classified into three parts according to the number of processing stages. Representative algorithms are described in detail, coupled with some typical techniques that are described briefly. Finally, we summarize the recent benchmarking results, discuss the limitations of datasets and methods, and outline future research directions. To our knowledge, this is the first survey about masked facial detection methods and datasets. We hope our survey can provide some help in the fight against epidemics.
LG Dec 15, 2021
Graph Representation Learning via Contrasting Cluster Assignments
Chunyang Zhang, Hongyu Yao, C. L. Philip Chen et al.
With the rise of contrastive learning, unsupervised graph representation learning has been booming recently, even surpassing its supervised counterparts in some machine learning tasks. Most existing contrastive models for graph representation learning either focus on maximizing mutual information between local and global embeddings, or primarily depend on contrasting embeddings at the node level. However, they are still not refined enough to comprehensively explore the local and global views of network topology. Although the former considers the local-global relationship, its coarse global information leads to weak cooperation between local and global views. The latter pays attention to node-level feature alignment, so the role of the global view appears inconspicuous. To avoid falling into these two extreme cases, we propose a novel unsupervised graph representation model based on contrasting cluster assignments, called GRCCA. It is motivated by making good use of both local and global information through combining clustering algorithms and contrastive learning. This not only facilitates the contrastive effect, but also provides higher-quality graph information. Meanwhile, GRCCA further excavates cluster-level information, which gives it insight into the elusive association between nodes beyond graph topology. Specifically, we first generate two augmented graphs with distinct graph augmentation strategies, then employ clustering algorithms to obtain their cluster assignments and prototypes respectively. The proposed GRCCA further compels the identical nodes from different augmented graphs to recognize their cluster assignments mutually by minimizing a cross-entropy loss. To demonstrate its effectiveness, we compare it with state-of-the-art models on three different downstream tasks. The experimental results show that GRCCA is strongly competitive in most tasks.
CV Nov 15, 2021
Stacked BNAS: Rethinking Broad Convolutional Neural Network for Neural Architecture Search
Zixiang Ding, Yaran Chen, Nannan Li et al.
Different from other deep scalable architecture-based NAS approaches, Broad Neural Architecture Search (BNAS) proposes a broad scalable architecture which consists of convolution and enhancement blocks, dubbed Broad Convolutional Neural Network (BCNN), as the search space for amazing efficiency improvement. BCNN reuses the topologies of cells in the convolution block so that BNAS can employ few cells for efficient search. Moreover, multi-scale feature fusion and knowledge embedding are proposed to improve the performance of BCNN with shallow topology. However, BNAS suffers some drawbacks: 1) insufficient representation diversity for feature fusion and enhancement and 2) time consumption of knowledge embedding design by human experts. This paper proposes Stacked BNAS, whose search space is a developed broad scalable architecture named Stacked BCNN, with better performance than BNAS. On the one hand, Stacked BCNN treats mini BCNN as a basic block to preserve comprehensive representation and deliver powerful feature extraction ability. For multi-scale feature enhancement, each mini BCNN feeds the outputs of deep and broad cells to the enhancement cell. For multi-scale feature fusion, each mini BCNN feeds the outputs of deep, broad and enhancement cells to the output node. On the other hand, Knowledge Embedding Search (KES) is proposed to learn appropriate knowledge embeddings in a differentiable way. Moreover, the basic unit of KES is an over-parameterized knowledge embedding module that consists of all possible candidate knowledge embeddings. Experimental results show that 1) Stacked BNAS obtains better performance than BNAS-v2 on both CIFAR-10 and ImageNet, 2) the proposed KES algorithm contributes to reducing the parameters of the learned architecture with satisfactory performance, and 3) Stacked BNAS delivers a state-of-the-art efficiency of 0.02 GPU days.
AI Feb 27, 2021
Siamese Labels Auxiliary Learning
Wenrui Gan, Zhulin Liu, C. L. Philip Chen et al.
In deep learning, auxiliary training has been widely used to assist the training of models. During the training phase, using auxiliary modules to assist training can improve the performance of the model. During the testing phase, auxiliary modules can be removed, so the test parameters are not increased. In this paper, we propose a novel auxiliary training method, Siamese Labels Auxiliary Learning (SiLa). Unlike Deep Mutual Learning (DML), SiLa emphasizes auxiliary learning and can be easily combined with DML. In general, the main contributions of this paper are: (1) we propose SiLa, which improves the performance of common models without increasing test parameters; (2) we compare SiLa with DML and demonstrate that SiLa can improve the generalization of the model; (3) we apply SiLa to Dynamic Neural Networks and demonstrate that SiLa can be used for various types of network structures.