LGJan 13, 2023Code
A Survey on Self-supervised Learning: Algorithms, Applications, and Future TrendsJie Gui, Tuo Chen, Jing Zhang et al.
Deep supervised learning algorithms typically require a large volume of labeled data to achieve satisfactory performance. However, the process of collecting and labeling such data can be expensive and time-consuming. Self-supervised learning (SSL), a subset of unsupervised learning, aims to learn discriminative features from unlabeled data without relying on human-annotated labels. SSL has garnered significant attention recently, leading to the development of numerous related algorithms. However, there is a dearth of comprehensive studies that elucidate the connections and evolution of different SSL variants. This paper presents a review of diverse SSL methods, encompassing algorithmic aspects, application domains, three key trends, and open research questions. Firstly, we provide a detailed introduction to the motivations behind most SSL algorithms and compare their commonalities and differences. Secondly, we explore representative applications of SSL in domains such as image processing, computer vision, and natural language processing. Lastly, we discuss the three primary trends observed in SSL research and highlight the open questions that remain. A curated collection of valuable resources can be accessed at https://github.com/guijiejie/SSL.
CVNov 28, 2022
Exploring the Coordination of Frequency and Attention in Masked Image ModelingJie Gui, Tuo Chen, Minjing Dong et al.
Recently, masked image modeling (MIM), which learns visual representations by reconstructing the masked patches of an image, has dominated self-supervised learning in computer vision. However, the pre-training of MIM always takes massive time due to the large-scale data and large-size backbones. We mainly attribute it to the random patch masking in previous MIM works, which fails to leverage the crucial semantic information for effective visual representation learning. To tackle this issue, we propose the Frequency \& Attention-driven Masking and Throwing Strategy (FAMT), which can extract semantic patches and reduce the number of training patches to boost model performance and training efficiency simultaneously. Specifically, FAMT utilizes the self-attention mechanism to extract semantic information from the image for masking during training in an unsupervised manner. However, attention alone could sometimes focus on inappropriate areas regarding the semantic information. Thus, we are motivated to incorporate the information from the frequency domain into the self-attention mechanism to derive the sampling weights for masking, which captures semantic patches for visual representation learning. Furthermore, we introduce a patch throwing strategy based on the derived sampling weights to reduce the training cost. FAMT can be seamlessly integrated as a plug-and-play module and surpasses previous works, \emph{e.g.} reducing the training phase time by nearly $50\%$ and improving the linear probing accuracy of MAE by $1.3\% \sim 3.9\%$ across various datasets, including CIFAR-10/100, Tiny ImageNet, and ImageNet-1K. FAMT also demonstrates superior performance in downstream detection and segmentation tasks.
CVAug 19, 2025Code
Backdooring Self-Supervised Contrastive Learning by Noisy AlignmentTuo Chen, Jie Gui, Minjing Dong et al.
Self-supervised contrastive learning (CL) effectively learns transferable representations from unlabeled data containing images or image-text pairs but suffers vulnerability to data poisoning backdoor attacks (DPCLs). An adversary can inject poisoned images into pretraining datasets, causing compromised CL encoders to exhibit targeted misbehavior in downstream tasks. Existing DPCLs, however, achieve limited efficacy due to their dependence on fragile implicit co-occurrence between backdoor and target object and inadequate suppression of discriminative features in backdoored images. We propose Noisy Alignment (NA), a DPCL method that explicitly suppresses noise components in poisoned images. Inspired by powerful training-controllable CL attacks, we identify and extract the critical objective of noisy alignment, adapting it effectively into data-poisoning scenarios. Our method implements noisy alignment by strategically manipulating contrastive learning's random cropping mechanism, formulating this process as an image layout optimization problem with theoretically derived optimal parameters. The resulting method is simple yet effective, achieving state-of-the-art performance compared to existing DPCLs, while maintaining clean-data accuracy. Furthermore, Noisy Alignment demonstrates robustness against common backdoor defenses. Codes can be found at https://github.com/jsrdcht/Noisy-Alignment.
CVJun 3, 2025
NTIRE 2025 XGC Quality Assessment Challenge: Methods and ResultsXiaohong Liu, Xiongkuo Min, Qiang Hu et al.
This paper reports on the NTIRE 2025 XGC Quality Assessment Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. This challenge is to address a major challenge in the field of video and talking head processing. The challenge is divided into three tracks, including user generated video, AI generated video and talking head. The user-generated video track uses the FineVD-GC, which contains 6,284 user generated videos. The user-generated video track has a total of 125 registered participants. A total of 242 submissions are received in the development phase, and 136 submissions are received in the test phase. Finally, 5 participating teams submitted their models and fact sheets. The AI generated video track uses the Q-Eval-Video, which contains 34,029 AI-Generated Videos (AIGVs) generated by 11 popular Text-to-Video (T2V) models. A total of 133 participants have registered in this track. A total of 396 submissions are received in the development phase, and 226 submissions are received in the test phase. Finally, 6 participating teams submitted their models and fact sheets. The talking head track uses the THQA-NTIRE, which contains 12,247 2D and 3D talking heads. A total of 89 participants have registered in this track. A total of 225 submissions are received in the development phase, and 118 submissions are received in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Each participating team in every track has proposed a method that outperforms the baseline, which has contributed to the development of fields in three tracks.
CVJun 24, 2024
Multi-threshold Deep Metric Learning for Facial Expression RecognitionWenwu Yang, Jinyi Yu, Tuo Chen et al.
Effective expression feature representations generated by a triplet-based deep metric learning are highly advantageous for facial expression recognition (FER). The performance of triplet-based deep metric learning is contingent upon identifying the best threshold for triplet loss. Threshold validation, however, is tough and challenging, as the ideal threshold changes among datasets and even across classes within the same dataset. In this paper, we present the multi-threshold deep metric learning technique, which not only avoids the difficult threshold validation but also vastly increases the capacity of triplet loss learning to construct expression feature representations. We find that each threshold of the triplet loss intrinsically determines a distinctive distribution of inter-class variations and corresponds, thus, to a unique expression feature representation. Therefore, rather than selecting a single optimal threshold from a valid threshold range, we thoroughly sample thresholds across the range, allowing the representation characteristics manifested by thresholds within the range to be fully extracted and leveraged for FER. To realize this approach, we partition the embedding layer of the deep metric learning network into a collection of slices and model training these embedding slices as an end-to-end multi-threshold deep metric learning problem. Each embedding slice corresponds to a sample threshold and is learned by enforcing the corresponding triplet loss, yielding a set of distinct expression features, one for each embedding slice. It makes the embedding layer, which is composed of a set of slices, a more informative and discriminative feature, hence enhancing the FER accuracy. Extensive evaluations demonstrate the superior performance of the proposed approach on both posed and spontaneous facial expression datasets.