AIJul 31, 2024
The Llama 3 Herd of ModelsAaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri et al. · allen-ai, berkeley
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
CVMar 23, 2022Code
Towards Semi-Supervised Deep Facial Expression Recognition with An Adaptive Confidence MarginHangyu Li, Nannan Wang, Xi Yang et al.
Only parts of unlabeled data are selected to train models for most semi-supervised learning methods, whose confidence scores are usually higher than the pre-defined threshold (i.e., the confidence margin). We argue that the recognition performance should be further improved by making full use of all unlabeled data. In this paper, we learn an Adaptive Confidence Margin (Ada-CM) to fully leverage all unlabeled data for semi-supervised deep facial expression recognition. All unlabeled samples are partitioned into two subsets by comparing their confidence scores with the adaptively learned confidence margin at each training epoch: (1) subset I including samples whose confidence scores are no lower than the margin; (2) subset II including samples whose confidence scores are lower than the margin. For samples in subset I, we constrain their predictions to match pseudo labels. Meanwhile, samples in subset II participate in the feature-level contrastive objective to learn effective facial expression features. We extensively evaluate Ada-CM on four challenging datasets, showing that our method achieves state-of-the-art performance, especially surpassing fully-supervised baselines in a semi-supervised manner. Ablation study further proves the effectiveness of our method. The source code is available at https://github.com/hangyu94/Ada-CM.
IVApr 19, 2022Code
Two-Stream Graph Convolutional Network for Intra-oral Scanner Image SegmentationYue Zhao, Lingming Zhang, Yang Liu et al.
Precise segmentation of teeth from intra-oral scanner images is an essential task in computer-aided orthodontic surgical planning. The state-of-the-art deep learning-based methods often simply concatenate the raw geometric attributes (i.e., coordinates and normal vectors) of mesh cells to train a single-stream network for automatic intra-oral scanner image segmentation. However, since different raw attributes reveal completely different geometric information, the naive concatenation of different raw attributes at the (low-level) input stage may bring unnecessary confusion in describing and differentiating between mesh cells, thus hampering the learning of high-level geometric representations for the segmentation task. To address this issue, we design a two-stream graph convolutional network (i.e., TSGCN), which can effectively handle inter-view confusion between different raw attributes to more effectively fuse their complementary information and learn discriminative multi-view geometric representations. Specifically, our TSGCN adopts two input-specific graph-learning streams to extract complementary high-level geometric representations from coordinates and normal vectors, respectively. Then, these single-view representations are further fused by a self-attention module to adaptively balance the contributions of different views in learning more discriminative multi-view representations for accurate and fully automatic tooth segmentation. We have evaluated our TSGCN on a real-patient dataset of dental (mesh) models acquired by 3D intraoral scanners. Experimental results show that our TSGCN significantly outperforms state-of-the-art methods in 3D tooth (surface) segmentation. Github: https://github.com/ZhangLingMing1/TSGCNet.
CVMar 4, 2022Code
Semi-parametric Makeup Transfer via Semantic-aware CorrespondenceMingrui Zhu, Yun Yi, Nannan Wang et al.
The large discrepancy between the source non-makeup image and the reference makeup image is one of the key challenges in makeup transfer. Conventional approaches for makeup transfer either learn disentangled representation or perform pixel-wise correspondence in a parametric way between two images. We argue that non-parametric techniques have a high potential for addressing the pose, expression, and occlusion discrepancies. To this end, this paper proposes a \textbf{S}emi-\textbf{p}arametric \textbf{M}akeup \textbf{T}ransfer (SpMT) method, which combines the reciprocal strengths of non-parametric and parametric mechanisms. The non-parametric component is a novel \textbf{S}emantic-\textbf{a}ware \textbf{C}orrespondence (SaC) module that explicitly reconstructs content representation with makeup representation under the strong constraint of component semantics. The reconstructed representation is desired to preserve the spatial and identity information of the source image while "wearing" the makeup of the reference image. The output image is synthesized via a parametric decoder that draws on the reconstructed representation. Extensive experiments demonstrate the superiority of our method in terms of visual quality, robustness, and flexibility. Code and pre-trained model are available at \url{https://github.com/AnonymScholar/SpMT.
CVDec 30, 2022Code
Hierarchical Forgery Classifier On Multi-modality Face Forgery CluesDecheng Liu, Zeyang Zheng, Chunlei Peng et al.
Face forgery detection plays an important role in personal privacy and social security. With the development of adversarial generative models, high-quality forgery images become more and more indistinguishable from real to humans. Existing methods always regard as forgery detection task as the common binary or multi-label classification, and ignore exploring diverse multi-modality forgery image types, e.g. visible light spectrum and near-infrared scenarios. In this paper, we propose a novel Hierarchical Forgery Classifier for Multi-modality Face Forgery Detection (HFC-MFFD), which could effectively learn robust patches-based hybrid domain representation to enhance forgery authentication in multiple-modality scenarios. The local spatial hybrid domain feature module is designed to explore strong discriminative forgery clues both in the image and frequency domain in local distinct face regions. Furthermore, the specific hierarchical face forgery classifier is proposed to alleviate the class imbalance problem and further boost detection performance. Experimental results on representative multi-modality face forgery datasets demonstrate the superior performance of the proposed HFC-MFFD compared with state-of-the-art algorithms. The source code and models are publicly available at https://github.com/EdWhites/HFC-MFFD.
CVSep 5, 2022Code
LKD-Net: Large Kernel Convolution Network for Single Image DehazingPinjun Luo, Guoqiang Xiao, Xinbo Gao et al.
The deep convolutional neural networks (CNNs)-based single image dehazing methods have achieved significant success. The previous methods are devoted to improving the network's performance by increasing the network's depth and width. The current methods focus on increasing the convolutional kernel size to enhance its performance by benefiting from the larger receptive field. However, directly increasing the size of the convolutional kernel introduces a massive amount of computational overhead and parameters. Thus, a novel Large Kernel Convolution Dehaze Block (LKD Block) consisting of the Decomposition deep-wise Large Kernel Convolution Block (DLKCB) and the Channel Enhanced Feed-forward Network (CEFN) is devised in this paper. The designed DLKCB can split the deep-wise large kernel convolution into a smaller depth-wise convolution and a depth-wise dilated convolution without introducing massive parameters and computational overhead. Meanwhile, the designed CEFN incorporates a channel attention mechanism into Feed-forward Network to exploit significant channels and enhance robustness. By combining multiple LKD Blocks and Up-Down sampling modules, the Large Kernel Convolution Dehaze Network (LKD-Net) is conducted. The evaluation results demonstrate the effectiveness of the designed DLKCB and CEFN, and our LKD-Net outperforms the state-of-the-art. On the SOTS indoor dataset, our LKD-Net dramatically outperforms the Transformer-based method Dehamer with only 1.79% #Param and 48.9% FLOPs. The source code of our LKD-Net is available at https://github.com/SWU-CS-MediaLab/LKD-Net.
CVJul 10, 2024Code
IRSAM: Advancing Segment Anything Model for Infrared Small Target DetectionMingjin Zhang, Yuchun Wang, Jie Guo et al.
The recent Segment Anything Model (SAM) is a significant advancement in natural image segmentation, exhibiting potent zero-shot performance suitable for various downstream image segmentation tasks. However, directly utilizing the pretrained SAM for Infrared Small Target Detection (IRSTD) task falls short in achieving satisfying performance due to a notable domain gap between natural and infrared images. Unlike a visible light camera, a thermal imager reveals an object's temperature distribution by capturing infrared radiation. Small targets often show a subtle temperature transition at the object's boundaries. To address this issue, we propose the IRSAM model for IRSTD, which improves SAM's encoder-decoder architecture to learn better feature representation of infrared small objects. Specifically, we design a Perona-Malik diffusion (PMD)-based block and incorporate it into multiple levels of SAM's encoder to help it capture essential structural features while suppressing noise. Additionally, we devise a Granularity-Aware Decoder (GAD) to fuse the multi-granularity feature from the encoder to capture structural information that may be lost in long-distance modeling. Extensive experiments on the public datasets, including NUAA-SIRST, NUDT-SIRST, and IRSTD-1K, validate the design choice of IRSAM and its significant superiority over representative state-of-the-art methods. The source code are available at: github.com/IPIC-Lab/IRSAM.
CVApr 4, 2023Code
Hierarchical Supervision and Shuffle Data Augmentation for 3D Semi-Supervised Object DetectionChuandong Liu, Chenqiang Gao, Fangcen Liu et al.
State-of-the-art 3D object detectors are usually trained on large-scale datasets with high-quality 3D annotations. However, such 3D annotations are often expensive and time-consuming, which may not be practical for real applications. A natural remedy is to adopt semi-supervised learning (SSL) by leveraging a limited amount of labeled samples and abundant unlabeled samples. Current pseudolabeling-based SSL object detection methods mainly adopt a teacher-student framework, with a single fixed threshold strategy to generate supervision signals, which inevitably brings confused supervision when guiding the student network training. Besides, the data augmentation of the point cloud in the typical teacher-student framework is too weak, and only contains basic down sampling and flip-and-shift (i.e., rotate and scaling), which hinders the effective learning of feature information. Hence, we address these issues by introducing a novel approach of Hierarchical Supervision and Shuffle Data Augmentation (HSSDA), which is a simple yet effective teacher-student framework. The teacher network generates more reasonable supervision for the student network by designing a dynamic dual-threshold strategy. Besides, the shuffle data augmentation strategy is designed to strengthen the feature representation ability of the student network. Extensive experiments show that HSSDA consistently outperforms the recent state-of-the-art methods on different datasets. The code will be released at https://github.com/azhuantou/HSSDA.
CVAug 22, 2023Code
Hierarchical Point-based Active Learning for Semi-supervised Point Cloud Semantic SegmentationZongyi Xu, Bo Yuan, Shanshan Zhao et al.
Impressive performance on point cloud semantic segmentation has been achieved by fully-supervised methods with large amounts of labelled data. As it is labour-intensive to acquire large-scale point cloud data with point-wise labels, many attempts have been made to explore learning 3D point cloud segmentation with limited annotations. Active learning is one of the effective strategies to achieve this purpose but is still under-explored. The most recent methods of this kind measure the uncertainty of each pre-divided region for manual labelling but they suffer from redundant information and require additional efforts for region division. This paper aims at addressing this issue by developing a hierarchical point-based active learning strategy. Specifically, we measure the uncertainty for each point by a hierarchical minimum margin uncertainty module which considers the contextual information at multiple levels. Then, a feature-distance suppression strategy is designed to select important and representative points for manual labelling. Besides, to better exploit the unlabelled data, we build a semi-supervised segmentation framework based on our active strategy. Extensive experiments on the S3DIS and ScanNetV2 datasets demonstrate that the proposed framework achieves 96.5% and 100% performance of fully-supervised baseline with only 0.07% and 0.1% training data, respectively, outperforming the state-of-the-art weakly-supervised and active learning methods. The code will be available at https://github.com/SmiletoE/HPAL.
CVSep 7, 2024Code
Unleashing the Power of Generic Segmentation Models: A Simple Baseline for Infrared Small Target DetectionMingjin Zhang, Chi Zhang, Qiming Zhang et al.
Recent advancements in deep learning have greatly advanced the field of infrared small object detection (IRSTD). Despite their remarkable success, a notable gap persists between these IRSTD methods and generic segmentation approaches in natural image domains. This gap primarily arises from the significant modality differences and the limited availability of infrared data. In this study, we aim to bridge this divergence by investigating the adaptation of generic segmentation models, such as the Segment Anything Model (SAM), to IRSTD tasks. Our investigation reveals that many generic segmentation models can achieve comparable performance to state-of-the-art IRSTD methods. However, their full potential in IRSTD remains untapped. To address this, we propose a simple, lightweight, yet effective baseline model for segmenting small infrared objects. Through appropriate distillation strategies, we empower smaller student models to outperform state-of-the-art methods, even surpassing fine-tuned teacher results. Furthermore, we enhance the model's performance by introducing a novel query design comprising dense and sparse queries to effectively encode multi-scale features. Through extensive experimentation across four popular IRSTD datasets, our model demonstrates significantly improved performance in both accuracy and throughput compared to existing approaches, surpassing SAM and Semantic-SAM by over 14 IoU on NUDT and 4 IoU on IRSTD1k. The source code and models will be released at https://github.com/O937-blip/SimIR.
CVSep 30, 2022Code
Hiding Visual Information via Obfuscating Adversarial PerturbationsZhigang Su, Dawei Zhou, Nannan Wangu et al.
Growing leakage and misuse of visual information raise security and privacy concerns, which promotes the development of information protection. Existing adversarial perturbations-based methods mainly focus on the de-identification against deep learning models. However, the inherent visual information of the data has not been well protected. In this work, inspired by the Type-I adversarial attack, we propose an adversarial visual information hiding method to protect the visual privacy of data. Specifically, the method generates obfuscating adversarial perturbations to obscure the visual information of the data. Meanwhile, it maintains the hidden objectives to be correctly predicted by models. In addition, our method does not modify the parameters of the applied model, which makes it flexible for different scenarios. Experimental results on the recognition and classification tasks demonstrate that the proposed method can effectively hide visual information and hardly affect the performances of models. The code is available in the supplementary material.
LGJun 6, 2022
Instance-Dependent Label-Noise Learning with Manifold-Regularized Transition Matrix EstimationDe Cheng, Tongliang Liu, Yixiong Ning et al.
In label-noise learning, estimating the transition matrix has attracted more and more attention as the matrix plays an important role in building statistically consistent classifiers. However, it is very challenging to estimate the transition matrix T(x), where x denotes the instance, because it is unidentifiable under the instance-dependent noise(IDN). To address this problem, we have noticed that, there are psychological and physiological evidences showing that we humans are more likely to annotate instances of similar appearances to the same classes, and thus poor-quality or ambiguous instances of similar appearances are easier to be mislabeled to the correlated or same noisy classes. Therefore, we propose assumption on the geometry of T(x) that "the closer two instances are, the more similar their corresponding transition matrices should be". More specifically, we formulate above assumption into the manifold embedding, to effectively reduce the degree of freedom of T(x) and make it stably estimable in practice. The proposed manifold-regularized technique works by directly reducing the estimation error without hurting the approximation error about the estimation problem of T(x). Experimental evaluations on four synthetic and two real-world datasets demonstrate that our method is superior to state-of-the-art approaches for label-noise learning under the challenging IDN.
LGOct 6, 2022Code
GBSVM: Granular-ball Support Vector MachineShuyin Xia, Xiaoyu Lian, Guoyin Wang et al.
GBSVM (Granular-ball Support Vector Machine) is a significant attempt to construct a classifier using the coarse-to-fine granularity of a granular-ball as input, rather than a single data point. It is the first classifier whose input contains no points. However, the existing model has some errors, and its dual model has not been derived. As a result, the current algorithm cannot be implemented or applied. To address these problems, this paper has fixed the errors of the original model of the existing GBSVM, and derived its dual model. Furthermore, a particle swarm optimization algorithm is designed to solve the dual model. The sequential minimal optimization algorithm is also carefully designed to solve the dual model. The solution is faster and more stable than the particle swarm optimization based version. The experimental results on the UCI benchmark datasets demonstrate that GBSVM has good robustness and efficiency. All codes have been released in the open source library at http://www.cquptshuyinxia.com/GBSVM.html or https://github.com/syxiaa/GBSVM.
CVJul 10, 2024Code
SUMix: Mixup with Semantic and Uncertain InformationHuafeng Qin, Xin Jin, Hongyu Zhu et al.
Mixup data augmentation approaches have been applied for various tasks of deep learning to improve the generalization ability of deep neural networks. Some existing approaches CutMix, SaliencyMix, etc. randomly replace a patch in one image with patches from another to generate the mixed image. Similarly, the corresponding labels are linearly combined by a fixed ratio $λ$ by l. The objects in two images may be overlapped during the mixing process, so some semantic information is corrupted in the mixed samples. In this case, the mixed image does not match the mixed label information. Besides, such a label may mislead the deep learning model training, which results in poor performance. To solve this problem, we proposed a novel approach named SUMix to learn the mixing ratio as well as the uncertainty for the mixed samples during the training process. First, we design a learnable similarity function to compute an accurate mix ratio. Second, an approach is investigated as a regularized term to model the uncertainty of the mixed samples. We conduct experiments on five image benchmarks, and extensive experimental results imply that our method is capable of improving the performance of classifiers with different cutting-based mixup approaches. The source code is available at https://github.com/JinXins/SUMix.
CLDec 4, 2022
Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUEQihuang Zhong, Liang Ding, Yibing Zhan et al.
This technical report briefly describes our JDExplore d-team's Vega v2 submission on the SuperGLUE leaderboard. SuperGLUE is more challenging than the widely used general language understanding evaluation (GLUE) benchmark, containing eight difficult language understanding tasks, including question answering, natural language inference, word sense disambiguation, coreference resolution, and reasoning. [Method] Instead of arbitrarily increasing the size of a pretrained language model (PLM), our aim is to 1) fully extract knowledge from the input pretraining data given a certain parameter budget, e.g., 6B, and 2) effectively transfer this knowledge to downstream tasks. To achieve goal 1), we propose self-evolution learning for PLMs to wisely predict the informative tokens that should be masked, and supervise the masked language modeling (MLM) process with rectified smooth labels. For goal 2), we leverage the prompt transfer technique to improve the low-resource tasks by transferring the knowledge from the foundation model and related downstream tasks to the target task. [Results] According to our submission record (Oct. 2022), with our optimized pretraining and fine-tuning strategies, our 6B Vega method achieved new state-of-the-art performance on 4/8 tasks, sitting atop the SuperGLUE leaderboard on Oct. 8, 2022, with an average score of 91.3.
CVAug 14, 2024Code
One Step Diffusion-based Super-Resolution with Time-Aware DistillationXiao He, Huaao Tang, Zhijun Tu et al.
Diffusion-based image super-resolution (SR) methods have shown promise in reconstructing high-resolution images with fine details from low-resolution counterparts. However, these approaches typically require tens or even hundreds of iterative samplings, resulting in significant latency. Recently, techniques have been devised to enhance the sampling efficiency of diffusion-based SR models via knowledge distillation. Nonetheless, when aligning the knowledge of student and teacher models, these solutions either solely rely on pixel-level loss constraints or neglect the fact that diffusion models prioritize varying levels of information at different time steps. To accomplish effective and efficient image super-resolution, we propose a time-aware diffusion distillation method, named TAD-SR. Specifically, we introduce a novel score distillation strategy to align the data distribution between the outputs of the student and teacher models after minor noise perturbation. This distillation strategy enables the student network to concentrate more on the high-frequency details. Furthermore, to mitigate performance limitations stemming from distillation, we integrate a latent adversarial loss and devise a time-aware discriminator that leverages diffusion priors to effectively distinguish between real images and generated images. Extensive experiments conducted on synthetic and real-world datasets demonstrate that the proposed method achieves comparable or even superior performance compared to both previous state-of-the-art (SOTA) methods and the teacher model in just one sampling step. Codes are available at https://github.com/LearningHx/TAD-SR.
CVOct 18, 2022
FedForgery: Generalized Face Forgery Detection with Residual Federated LearningDecheng Liu, Zhan Dang, Chunlei Peng et al.
With the continuous development of deep learning in the field of image generation models, a large number of vivid forged faces have been generated and spread on the Internet. These high-authenticity artifacts could grow into a threat to society security. Existing face forgery detection methods directly utilize the obtained public shared or centralized data for training but ignore the personal privacy and security issues when personal data couldn't be centralizedly shared in real-world scenarios. Additionally, different distributions caused by diverse artifact types would further bring adverse influences on the forgery detection task. To solve the mentioned problems, the paper proposes a novel generalized residual Federated learning for face Forgery detection (FedForgery). The designed variational autoencoder aims to learn robust discriminative residual feature maps to detect forgery faces (with diverse or even unknown artifact types). Furthermore, the general federated learning strategy is introduced to construct distributed detection model trained collaboratively with multiple local decentralized devices, which could further boost the representation generalization. Experiments conducted on publicly available face forgery detection datasets prove the superior performance of the proposed FedForgery. The designed novel generalized face forgery detection protocols and source code would be publicly available.
CVJul 25, 2022
Seeking Subjectivity in Visual Emotion Distribution LearningJingyuan Yang, Jie Li, Leida Li et al.
Visual Emotion Analysis (VEA), which aims to predict people's emotions towards different visual stimuli, has become an attractive research topic recently. Rather than a single label classification task, it is more rational to regard VEA as a Label Distribution Learning (LDL) problem by voting from different individuals. Existing methods often predict visual emotion distribution in a unified network, neglecting the inherent subjectivity in its crowd voting process. In psychology, the \textit{Object-Appraisal-Emotion} model has demonstrated that each individual's emotion is affected by his/her subjective appraisal, which is further formed by the affective memory. Inspired by this, we propose a novel \textit{Subjectivity Appraise-and-Match Network (SAMNet)} to investigate the subjectivity in visual emotion distribution. To depict the diversity in crowd voting process, we first propose the \textit{Subjectivity Appraising} with multiple branches, where each branch simulates the emotion evocation process of a specific individual. Specifically, we construct the affective memory with an attention-based mechanism to preserve each individual's unique emotional experience. A subjectivity loss is further proposed to guarantee the divergence between different individuals. Moreover, we propose the \textit{Subjectivity Matching} with a matching loss, aiming at assigning unordered emotion labels to ordered individual predictions in a one-to-one correspondence with the Hungarian algorithm. Extensive experiments and comparisons are conducted on public visual emotion distribution datasets, and the results demonstrate that the proposed SAMNet consistently outperforms the state-of-the-art methods. Ablation study verifies the effectiveness of our method and visualization proves its interpretability.
LGJul 25, 2022
Improving Adversarial Robustness via Mutual Information EstimationDawei Zhou, Nannan Wang, Xinbo Gao et al.
Deep neural networks (DNNs) are found to be vulnerable to adversarial noise. They are typically misled by adversarial samples to make wrong predictions. To alleviate this negative effect, in this paper, we investigate the dependence between outputs of the target model and input adversarial samples from the perspective of information theory, and propose an adversarial defense method. Specifically, we first measure the dependence by estimating the mutual information (MI) between outputs and the natural patterns of inputs (called natural MI) and MI between outputs and the adversarial patterns of inputs (called adversarial MI), respectively. We find that adversarial samples usually have larger adversarial MI and smaller natural MI compared with those w.r.t. natural samples. Motivated by this observation, we propose to enhance the adversarial robustness by maximizing the natural MI and minimizing the adversarial MI during the training process. In this way, the target model is expected to pay more attention to the natural pattern that contains objective semantics. Empirical evaluations demonstrate that our method could effectively improve the adversarial accuracy against multiple attacks.
CVSep 11, 2023
Diff-Privacy: Diffusion-based Face Privacy ProtectionXiao He, Mingrui Zhu, Dongxin Chen et al.
Privacy protection has become a top priority as the proliferation of AI techniques has led to widespread collection and misuse of personal data. Anonymization and visual identity information hiding are two important facial privacy protection tasks that aim to remove identification characteristics from facial images at the human perception level. However, they have a significant difference in that the former aims to prevent the machine from recognizing correctly, while the latter needs to ensure the accuracy of machine recognition. Therefore, it is difficult to train a model to complete these two tasks simultaneously. In this paper, we unify the task of anonymization and visual identity information hiding and propose a novel face privacy protection method based on diffusion models, dubbed Diff-Privacy. Specifically, we train our proposed multi-scale image inversion module (MSI) to obtain a set of SDM format conditional embeddings of the original image. Based on the conditional embeddings, we design corresponding embedding scheduling strategies and construct different energy functions during the denoising process to achieve anonymization and visual identity information hiding. Extensive experiments have been conducted to validate the effectiveness of our proposed framework in protecting facial privacy.
LGMar 2, 2023Code
GBMST: An Efficient Minimum Spanning Tree Clustering Based on Granular-Ball ComputingJiang Xie, Shuyin Xia, Guoyin Wang et al.
Most of the existing clustering methods are based on a single granularity of information, such as the distance and density of each data. This most fine-grained based approach is usually inefficient and susceptible to noise. Therefore, we propose a clustering algorithm that combines multi-granularity Granular-Ball and minimum spanning tree (MST). We construct coarsegrained granular-balls, and then use granular-balls and MST to implement the clustering method based on "large-scale priority", which can greatly avoid the influence of outliers and accelerate the construction process of MST. Experimental results on several data sets demonstrate the power of the algorithm. All codes have been released at https://github.com/xjnine/GBMST.
CVDec 8, 2022
All-to-key Attention for Arbitrary Style TransferMingrui Zhu, Xiao He, Nannan Wang et al.
Attention-based arbitrary style transfer studies have shown promising performance in synthesizing vivid local style details. They typically use the all-to-all attention mechanism -- each position of content features is fully matched to all positions of style features. However, all-to-all attention tends to generate distorted style patterns and has quadratic complexity, limiting the effectiveness and efficiency of arbitrary style transfer. In this paper, we propose a novel all-to-key attention mechanism -- each position of content features is matched to stable key positions of style features -- that is more in line with the characteristics of style transfer. Specifically, it integrates two newly proposed attention forms: distributed and progressive attention. Distributed attention assigns attention to key style representations that depict the style distribution of local regions; Progressive attention pays attention from coarse-grained regions to fine-grained key positions. The resultant module, dubbed StyA2K, shows extraordinary performance in preserving the semantic structure and rendering consistent style patterns. Qualitative and quantitative comparisons with state-of-the-art methods demonstrate the superior performance of our approach.
83.2CVMar 18Code
Omni-I2C: A Holistic Benchmark for High-Fidelity Image-to-Code GenerationJiawei Zhou, Chi Zhang, Xiang Feng et al.
We present Omni-I2C, a comprehensive benchmark designed to evaluate the capability of Large Multimodal Models (LMMs) in converting complex, structured digital graphics into executable code. We argue that this task represents a non-trivial challenge for the current generation of LMMs: it demands an unprecedented synergy between high-fidelity visual perception -- to parse intricate spatial hierarchies and symbolic details -- and precise generative expression -- to synthesize syntactically sound and logically consistent code. Unlike traditional descriptive tasks, Omni-I2C requires a holistic understanding where any minor perceptual hallucination or coding error leads to a complete failure in visual reconstruction. Omni-I2C features 1080 meticulously curated samples, defined by its breadth across subjects, image modalities, and programming languages. By incorporating authentic user-sourced cases, the benchmark spans a vast spectrum of digital content -- from scientific visualizations to complex symbolic notations -- each paired with executable reference code. To complement this diversity, our evaluation framework provides necessary depth; by decoupling performance into perceptual fidelity and symbolic precision, it transcends surface-level accuracy to expose the granular structural failures and reasoning bottlenecks of current LMMs. Our evaluation reveals a substantial performance gap among leading LMMs; even state-of-the-art models struggle to preserve structural integrity in complex scenarios, underscoring that multimodal code generation remains a formidable challenge. Data and code are available at https://github.com/MiliLab/Omni-I2C.
CVMar 29, 2022
Robust Single Image Dehazing Based on Consistent and Contrast-Assisted ReconstructionDe Cheng, Yan Li, Dingwen Zhang et al.
Single image dehazing as a fundamental low-level vision task, is essential for the development of robust intelligent surveillance system. In this paper, we make an early effort to consider dehazing robustness under variational haze density, which is a realistic while under-studied problem in the research filed of singe image dehazing. To properly address this problem, we propose a novel density-variational learning framework to improve the robustness of the image dehzing model assisted by a variety of negative hazy images, to better deal with various complex hazy scenarios. Specifically, the dehazing network is optimized under the consistency-regularized framework with the proposed Contrast-Assisted Reconstruction Loss (CARL). The CARL can fully exploit the negative information to facilitate the traditional positive-orient dehazing objective function, by squeezing the dehazed image to its clean target from different directions. Meanwhile, the consistency regularization keeps consistent outputs given multi-level hazy images, thus improving the model robustness. Extensive experimental results on two synthetic and three real-world datasets demonstrate that our method significantly surpasses the state-of-the-art approaches.
CVJul 6, 2023
MMNet: Multi-Collaboration and Multi-Supervision Network for Sequential Deepfake DetectionRuiyang Xia, Decheng Liu, Jie Li et al.
Advanced manipulation techniques have provided criminals with opportunities to make social panic or gain illicit profits through the generation of deceptive media, such as forged face images. In response, various deepfake detection methods have been proposed to assess image authenticity. Sequential deepfake detection, which is an extension of deepfake detection, aims to identify forged facial regions with the correct sequence for recovery. Nonetheless, due to the different combinations of spatial and sequential manipulations, forged face images exhibit substantial discrepancies that severely impact detection performance. Additionally, the recovery of forged images requires knowledge of the manipulation model to implement inverse transformations, which is difficult to ascertain as relevant techniques are often concealed by attackers. To address these issues, we propose Multi-Collaboration and Multi-Supervision Network (MMNet) that handles various spatial scales and sequential permutations in forged face images and achieve recovery without requiring knowledge of the corresponding manipulation method. Furthermore, existing evaluation metrics only consider detection accuracy at a single inferring step, without accounting for the matching degree with ground-truth under continuous multiple steps. To overcome this limitation, we propose a novel evaluation metric called Complete Sequence Matching (CSM), which considers the detection accuracy at multiple inferring steps, reflecting the ability to detect integrally forged sequences. Extensive experiments on several typical datasets demonstrate that MMNet achieves state-of-the-art detection performance and independent recovery performance.
CVJul 5, 2022
Spatial-Temporal Frequency Forgery Clue for Video Forgery Detection in VIS and NIR ScenarioYukai Wang, Chunlei Peng, Decheng Liu et al.
In recent years, with the rapid development of face editing and generation, more and more fake videos are circulating on social media, which has caused extreme public concerns. Existing face forgery detection methods based on frequency domain find that the GAN forged images have obvious grid-like visual artifacts in the frequency spectrum compared to the real images. But for synthesized videos, these methods only confine to single frame and pay little attention to the most discriminative part and temporal frequency clue among different frames. To take full advantage of the rich information in video sequences, this paper performs video forgery detection on both spatial and temporal frequency domains and proposes a Discrete Cosine Transform-based Forgery Clue Augmentation Network (FCAN-DCT) to achieve a more comprehensive spatial-temporal feature representation. FCAN-DCT consists of a backbone network and two branches: Compact Feature Extraction (CFE) module and Frequency Temporal Attention (FTA) module. We conduct thorough experimental assessments on two visible light (VIS) based datasets WildDeepfake and Celeb-DF (v2), and our self-built video forgery dataset DeepfakeNIR, which is the first video forgery dataset on near-infrared modality. The experimental results demonstrate the effectiveness of our method on detecting forgery videos in both VIS and NIR scenarios.
CVJul 26, 2023
ESSAformer: Efficient Transformer for Hyperspectral Image Super-resolutionMingjin Zhang, Chi Zhang, Qiming Zhang et al.
Single hyperspectral image super-resolution (single-HSI-SR) aims to restore a high-resolution hyperspectral image from a low-resolution observation. However, the prevailing CNN-based approaches have shown limitations in building long-range dependencies and capturing interaction information between spectral features. This results in inadequate utilization of spectral information and artifacts after upsampling. To address this issue, we propose ESSAformer, an ESSA attention-embedded Transformer network for single-HSI-SR with an iterative refining structure. Specifically, we first introduce a robust and spectral-friendly similarity metric, \ie, the spectral correlation coefficient of the spectrum (SCC), to replace the original attention matrix and incorporates inductive biases into the model to facilitate training. Built upon it, we further utilize the kernelizable attention technique with theoretical support to form a novel efficient SCC-kernel-based self-attention (ESSA) and reduce attention computation to linear complexity. ESSA enlarges the receptive field for features after upsampling without bringing much computation and allows the model to effectively utilize spatial-spectral information from different scales, resulting in the generation of more natural high-resolution images. Without the need for pretraining on large-scale datasets, our experiments demonstrate ESSA's effectiveness in both visual quality and quantitative results.
IVApr 22, 2023
SAWU-Net: Spatial Attention Weighted Unmixing Network for Hyperspectral ImagesLin Qi, Xuewen Qin, Feng Gao et al.
Hyperspectral unmixing is a critical yet challenging task in hyperspectral image interpretation. Recently, great efforts have been made to solve the hyperspectral unmixing task via deep autoencoders. However, existing networks mainly focus on extracting spectral features from mixed pixels, and the employment of spatial feature prior knowledge is still insufficient. To this end, we put forward a spatial attention weighted unmixing network, dubbed as SAWU-Net, which learns a spatial attention network and a weighted unmixing network in an end-to-end manner for better spatial feature exploitation. In particular, we design a spatial attention module, which consists of a pixel attention block and a window attention block to efficiently model pixel-based spectral information and patch-based spatial information, respectively. While in the weighted unmixing framework, the central pixel abundance is dynamically weighted by the coarse-grained abundances of surrounding pixels. In addition, SAWU-Net generates dynamically adaptive spatial weights through the spatial attention mechanism, so as to dynamically integrate surrounding pixels more effectively. Experimental results on real and synthetic datasets demonstrate the better accuracy and superiority of SAWU-Net, which reflects the effectiveness of the proposed spatial attention mechanism.
CVJan 24, 2023
Few-shot Font Generation by Learning Style Difference and SimilarityXiao He, Mingrui Zhu, Nannan Wang et al.
Few-shot font generation (FFG) aims to preserve the underlying global structure of the original character while generating target fonts by referring to a few samples. It has been applied to font library creation, a personalized signature, and other scenarios. Existing FFG methods explicitly disentangle content and style of reference glyphs universally or component-wisely. However, they ignore the difference between glyphs in different styles and the similarity of glyphs in the same style, which results in artifacts such as local distortions and style inconsistency. To address this issue, we propose a novel font generation approach by learning the Difference between different styles and the Similarity of the same style (DS-Font). We introduce contrastive learning to consider the positive and negative relationship between styles. Specifically, we propose a multi-layer style projector for style encoding and realize a distinctive style representation via our proposed Cluster-level Contrastive Style (CCS) loss. In addition, we design a multi-task patch discriminator, which comprehensively considers different areas of the image and ensures that each style can be distinguished independently. We conduct qualitative and quantitative evaluations comprehensively to demonstrate that our approach achieves significantly better results than state-of-the-art methods.
CVApr 25, 2023
Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency ConstraintGuozhang Li, De Cheng, Xinpeng Ding et al.
Weakly Supervised Temporal Action Localization (WTAL) aims to classify and localize temporal boundaries of actions for the video, given only video-level category labels in the training datasets. Due to the lack of boundary information during training, existing approaches formulate WTAL as a classificationproblem, i.e., generating the temporal class activation map (T-CAM) for localization. However, with only classification loss, the model would be sub-optimized, i.e., the action-related scenes are enough to distinguish different class labels. Regarding other actions in the action-related scene ( i.e., the scene same as positive actions) as co-scene actions, this sub-optimized model would misclassify the co-scene actions as positive actions. To address this misclassification, we propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate the positive actions from co-scene actions. The proposed Bi-SCC firstly adopts a temporal context augmentation to generate an augmented video that breaks the correlation between positive actions and their co-scene actions in the inter-video; Then, a semantic consistency constraint (SCC) is used to enforce the predictions of the original video and augmented video to be consistent, hence suppressing the co-scene actions. However, we find that this augmented video would destroy the original temporal context. Simply applying the consistency constraint would affect the completeness of localized positive actions. Hence, we boost the SCC in a bidirectional way to suppress co-scene actions while ensuring the integrity of positive actions, by cross-supervising the original and augmented videos. Finally, our proposed Bi-SCC can be applied to current WTAL approaches, and improve their performance. Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet.
LGMar 29, 2023
Multi-View Clustering via Semi-non-negative Tensor FactorizationJing Li, Quanxue Gao, Qianqian Wang et al.
Multi-view clustering (MVC) based on non-negative matrix factorization (NMF) and its variants have received a huge amount of attention in recent years due to their advantages in clustering interpretability. However, existing NMF-based multi-view clustering methods perform NMF on each view data respectively and ignore the impact of between-view. Thus, they can't well exploit the within-view spatial structure and between-view complementary information. To resolve this issue, we present semi-non-negative tensor factorization (Semi-NTF) and develop a novel multi-view clustering based on Semi-NTF with one-side orthogonal constraint. Our model directly performs Semi-NTF on the 3rd-order tensor which is composed of anchor graphs of views. Thus, our model directly considers the between-view relationship. Moreover, we use the tensor Schatten p-norm regularization as a rank approximation of the 3rd-order tensor which characterizes the cluster structure of multi-view data and exploits the between-view complementary information. In addition, we provide an optimization algorithm for the proposed method and prove mathematically that the algorithm always converges to the stationary KKT point. Extensive experiments on various benchmark datasets indicate that our proposed method is able to achieve satisfactory clustering performance.
CVNov 30, 2022
Neighbour Consistency Guided Pseudo-Label Refinement for Unsupervised Person Re-IdentificationDe Cheng, Haichun Tai, Nannan Wang et al.
Unsupervised person re-identification (ReID) aims at learning discriminative identity features for person retrieval without any annotations. Recent advances accomplish this task by leveraging clustering-based pseudo labels, but these pseudo labels are inevitably noisy which deteriorate model performance. In this paper, we propose a Neighbour Consistency guided Pseudo Label Refinement (NCPLR) framework, which can be regarded as a transductive form of label propagation under the assumption that the prediction of each example should be similar to its nearest neighbours'. Specifically, the refined label for each training instance can be obtained by the original clustering result and a weighted ensemble of its neighbours' predictions, with weights determined according to their similarities in the feature space. In addition, we consider the clustering-based unsupervised person ReID as a label-noise learning problem. Then, we proposed an explicit neighbour consistency regularization to reduce model susceptibility to over-fitting while improving the training stability. The NCPLR method is simple yet effective, and can be seamlessly integrated into existing clustering-based unsupervised algorithms. Extensive experimental results on five ReID datasets demonstrate the effectiveness of the proposed method, and showing superior performance to state-of-the-art methods by a large margin.
CVNov 24, 2023
CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image PersonalizationRuoyu Zhao, Mingrui Zhu, Shiyin Dong et al.
We propose CatVersion, an inversion-based method that learns the personalized concept through a handful of examples. Subsequently, users can utilize text prompts to generate images that embody the personalized concept, thereby achieving text-to-image personalization. In contrast to existing approaches that emphasize word embedding learning or parameter fine-tuning for the diffusion model, which potentially causes concept dilution or overfitting, our method concatenates embeddings on the feature-dense space of the text encoder in the diffusion model to learn the gap between the personalized concept and its base class, aiming to maximize the preservation of prior knowledge in diffusion models while restoring the personalized concepts. To this end, we first dissect the text encoder's integration in the image generation process to identify the feature-dense space of the encoder. Afterward, we concatenate embeddings on the Keys and Values in this space to learn the gap between the personalized concept and its base class. In this way, the concatenated embeddings ultimately manifest as a residual on the original attention output. To more accurately and unbiasedly quantify the results of personalized image generation, we improve the CLIP image alignment score based on masks. Qualitatively and quantitatively, CatVersion helps to restore personalization concepts more faithfully and enables more robust editing.
CVNov 13, 2023
GazeForensics: DeepFake Detection via Gaze-guided Spatial Inconsistency LearningQinlin He, Chunlei Peng, Decheng Liu et al.
DeepFake detection is pivotal in personal privacy and public safety. With the iterative advancement of DeepFake techniques, high-quality forged videos and images are becoming increasingly deceptive. Prior research has seen numerous attempts by scholars to incorporate biometric features into the field of DeepFake detection. However, traditional biometric-based approaches tend to segregate biometric features from general ones and freeze the biometric feature extractor. These approaches resulted in the exclusion of valuable general features, potentially leading to a performance decline and, consequently, a failure to fully exploit the potential of biometric information in assisting DeepFake detection. Moreover, insufficient attention has been dedicated to scrutinizing gaze authenticity within the realm of DeepFake detection in recent years. In this paper, we introduce GazeForensics, an innovative DeepFake detection method that utilizes gaze representation obtained from a 3D gaze estimation model to regularize the corresponding representation within our DeepFake detection model, while concurrently integrating general features to further enhance the performance of our model. Experiment results reveal that our proposed GazeForensics outperforms the current state-of-the-art methods.
CVJun 14, 2023
SMC-UDA: Structure-Modal Constraint for Unsupervised Cross-Domain Renal SegmentationZhusi Zhong, Jie Li, Lulu Bi et al.
Medical image segmentation based on deep learning often fails when deployed on images from a different domain. The domain adaptation methods aim to solve domain-shift challenges, but still face some problems. The transfer learning methods require annotation on the target domain, and the generative unsupervised domain adaptation (UDA) models ignore domain-specific representations, whose generated quality highly restricts segmentation performance. In this study, we propose a novel Structure-Modal Constrained (SMC) UDA framework based on a discriminative paradigm and introduce edge structure as a bridge between domains. The proposed multi-modal learning backbone distills structure information from image texture to distinguish domain-invariant edge structure. With the structure-constrained self-learning and progressive ROI, our methods segment the kidney by locating the 3D spatial structure of the edge. We evaluated SMC-UDA on public renal segmentation datasets, adapting from the labeled source domain (CT) to the unlabeled target domain (CT/MRI). The experiments show that our proposed SMC-UDA has a strong generalization and outperforms generative UDA methods.
CVJul 21, 2023
Attention Consistency Refined Masked Frequency Forgery Representation for Generalizing Face Forgery DetectionDecheng Liu, Tao Chen, Chunlei Peng et al.
Due to the successful development of deep image generation technology, visual data forgery detection would play a more important role in social and economic security. Existing forgery detection methods suffer from unsatisfactory generalization ability to determine the authenticity in the unseen domain. In this paper, we propose a novel Attention Consistency Refined masked frequency forgery representation model toward generalizing face forgery detection algorithm (ACMF). Most forgery technologies always bring in high-frequency aware cues, which make it easy to distinguish source authenticity but difficult to generalize to unseen artifact types. The masked frequency forgery representation module is designed to explore robust forgery cues by randomly discarding high-frequency information. In addition, we find that the forgery attention map inconsistency through the detection network could affect the generalizability. Thus, the forgery attention consistency is introduced to force detectors to focus on similar attention regions for better generalization ability. Experiment results on several public face forgery datasets (FaceForensic++, DFD, Celeb-DF, and WDF datasets) demonstrate the superior performance of the proposed method compared with the state-of-the-art methods.
LGApr 21, 2023
Granular-ball computing: an efficient, robust, and interpretable adaptive multi-granularity representation and computation methodShuyin Xia, Guoyin Wang, Xinbo Gao et al.
Human cognition operates on a "Global-first" cognitive mechanism, prioritizing information processing based on coarse-grained details. This mechanism inherently possesses an adaptive multi-granularity description capacity, resulting in computational traits such as efficiency, robustness, and interpretability. The analysis pattern reliance on the finest granularity and single-granularity makes most existing computational methods less efficient, robust, and interpretable, which is an important reason for the current lack of interpretability in neural networks. Multi-granularity granular-ball computing employs granular-balls of varying sizes to daptively represent and envelop the sample space, facilitating learning based on these granular-balls. Given that the number of coarse-grained "granular-balls" is fewer than sample points, granular-ball computing proves more efficient. Moreover, the inherent coarse-grained nature of granular-balls reduces susceptibility to fine-grained sample disturbances, enhancing robustness. The multi-granularity construct of granular-balls generates topological structures and coarse-grained descriptions, naturally augmenting interpretability. Granular-ball computing has successfully ventured into diverse AI domains, fostering the development of innovative theoretical methods, including granular-ball classifiers, clustering techniques, neural networks, rough sets, and evolutionary computing. This has notably ameliorated the efficiency, noise robustness, and interpretability of traditional methods. Overall, granular-ball computing is a rare and innovative theoretical approach in AI that can adaptively and simultaneously enhance efficiency, robustness, and interpretability. This article delves into the main application landscapes for granular-ball computing, aiming to equip future researchers with references and insights to refine and expand this promising theory.
LGMay 30, 2022
Do Deep Neural Networks Always Perform Better When Eating More Data?Jiachen Yang, Zhuo Zhang, Yicheng Gong et al.
Data has now become a shortcoming of deep learning. Researchers in their own fields share the thinking that "deep neural networks might not always perform better when they eat more data," which still lacks experimental validation and a convincing guiding theory. Here to fill this lack, we design experiments from Identically Independent Distribution(IID) and Out of Distribution(OOD), which give powerful answers. For the purpose of guidance, based on the discussion of results, two theories are proposed: under IID condition, the amount of information determines the effectivity of each sample, the contribution of samples and difference between classes determine the amount of sample information and the amount of class information; under OOD condition, the cross-domain degree of samples determine the contributions, and the bias-fitting caused by irrelevant elements is a significant factor of cross-domain. The above theories provide guidance from the perspective of data, which can promote a wide range of practical applications of artificial intelligence.
88.7LGMay 2
Task-Driven Subspace Decomposition for Knowledge Sharing and Isolation in LoRA-based Continual LearningLingfeng He, De Cheng, Huaijie Wang et al.
Continual Learning (CL) requires models to sequentially adapt to new tasks without forgetting old knowledge. Recently, Low-Rank Adaptation (LoRA), a representative Parameter-Efficient Fine-Tuning (PEFT) method, has gained increasing attention in CL. Several LoRA-based CL methods reduce interference across tasks by separating their update spaces, typically building the new space from the estimated null space of past tasks. However, they (i) overlook task-shared directions, which suppresses knowledge transfer, and (ii) fail to capture truly effective task-specific directions since these ``null bases" of old tasks can remain nearly inactive for new task under correlated tasks. To address this, we study LoRA learning capability from a projection energy perspective, and propose Low-rank Decomposition and Adaptation (LoDA). It performs a task-driven decomposition to build general and truly task-specific LoRA subspaces by solving two energy-based objectives, decoupling directions for knowledge sharing and isolation. LoDA fixes LoRA down-projections on two subspaces and learns robust up-projections via a Gradient-Aligned Optimization (GAO) approach. After each task, before integrating the LoRA updates into the backbone, LoDA derives a closed-form recalibration for the general update, approximating a feature-level joint optimum along this task-shared direction. Experiments indicate that LoDA outperforms existing CL methods.
LGOct 30, 2022
Multi-view Multi-label Anomaly Network Traffic Classification based on MLP-Mixer Neural NetworkYu Zheng, Zhangxuan Dang, Chunlei Peng et al.
Network traffic classification is the basis of many network security applications and has attracted enough attention in the field of cyberspace security. Existing network traffic classification based on convolutional neural networks (CNNs) often emphasizes local patterns of traffic data while ignoring global information associations. In this paper, we propose an MLP-Mixer based multi-view multi-label neural network for network traffic classification. Compared with the existing CNN-based methods, our method adopts the MLP-Mixer structure, which is more in line with the structure of the packet than the conventional convolution operation. In our method, one packet is divided into the packet header and the packet body, together with the flow features of the packet as input from different views. We utilize a multi-label setting to learn different scenarios simultaneously to improve the classification performance by exploiting the correlations between different scenarios. Taking advantage of the above characteristics, we propose an end-to-end network traffic classification method. We conduct experiments on three public datasets, and the experimental results show that our method can achieve superior performance.
CVSep 29, 2024
Effective Diffusion Transformer Architecture for Image Super-ResolutionKun Cheng, Lei Yu, Zhijun Tu et al.
Recent advances indicate that diffusion models hold great promise in image super-resolution. While the latest methods are primarily based on latent diffusion models with convolutional neural networks, there are few attempts to explore transformers, which have demonstrated remarkable performance in image generation. In this work, we design an effective diffusion transformer for image super-resolution (DiT-SR) that achieves the visual quality of prior-based methods, but through a training-from-scratch manner. In practice, DiT-SR leverages an overall U-shaped architecture, and adopts a uniform isotropic design for all the transformer blocks across different stages. The former facilitates multi-scale hierarchical feature extraction, while the latter reallocates the computational resources to critical layers to further enhance performance. Moreover, we thoroughly analyze the limitation of the widely used AdaLN, and present a frequency-adaptive time-step conditioning module, enhancing the model's capacity to process distinct frequency information at different time steps. Extensive experiments demonstrate that DiT-SR outperforms the existing training-from-scratch diffusion-based SR methods significantly, and even beats some of the prior-based methods on pretrained Stable Diffusion, proving the superiority of diffusion transformer in image super-resolution.
CVJul 19, 2024
Thinking Racial Bias in Fair Forgery Detection: Models, Datasets and EvaluationsDecheng Liu, Zongqi Wang, Chunlei Peng et al.
Due to the successful development of deep image generation technology, forgery detection plays a more important role in social and economic security. Racial bias has not been explored thoroughly in the deep forgery detection field. In the paper, we first contribute a dedicated dataset called the Fair Forgery Detection (FairFD) dataset, where we prove the racial bias of public state-of-the-art (SOTA) methods. Different from existing forgery detection datasets, the self-constructed FairFD dataset contains a balanced racial ratio and diverse forgery generation images with the largest-scale subjects. Additionally, we identify the problems with naive fairness metrics when benchmarking forgery detection models. To comprehensively evaluate fairness, we design novel metrics including Approach Averaged Metric and Utility Regularized Metric, which can avoid deceptive results. We also present an effective and robust post-processing technique, Bias Pruning with Fair Activations (BPFA), which improves fairness without requiring retraining or weight updates. Extensive experiments conducted with 12 representative forgery detection models demonstrate the value of the proposed dataset and the reasonability of the designed fairness metrics. By applying the BPFA to the existing fairest detector, we achieve a new SOTA. Furthermore, we conduct more in-depth analyses to offer more insights to inspire researchers in the community.
CVJul 18, 2023
PRO-Face S: Privacy-preserving Reversible Obfuscation of Face Images via Secure FlowLin Yuan, Kai Liang, Xiao Pu et al.
This paper proposes a novel paradigm for facial privacy protection that unifies multiple characteristics including anonymity, diversity, reversibility and security within a single lightweight framework. We name it PRO-Face S, short for Privacy-preserving Reversible Obfuscation of Face images via Secure flow-based model. In the framework, an Invertible Neural Network (INN) is utilized to process the input image along with its pre-obfuscated form, and generate the privacy protected image that visually approximates to the pre-obfuscated one, thus ensuring privacy. The pre-obfuscation applied can be in diversified form with different strengths and styles specified by users. Along protection, a secret key is injected into the network such that the original image can only be recovered from the protection image via the same model given the correct key provided. Two modes of image recovery are devised to deal with malicious recovery attempts in different scenarios. Finally, extensive experiments conducted on three public image datasets demonstrate the superiority of the proposed framework over multiple state-of-the-art approaches.
CVOct 28, 2022
Contextual Learning in Fourier Complex Field for VHR Remote Sensing ImagesYan Zhang, Xiyuan Gao, Qingyan Duan et al.
Very high-resolution (VHR) remote sensing (RS) image classification is the fundamental task for RS image analysis and understanding. Recently, transformer-based models demonstrated outstanding potential for learning high-order contextual relationships from natural images with general resolution (224x224 pixels) and achieved remarkable results on general image classification tasks. However, the complexity of the naive transformer grows quadratically with the increase in image size, which prevents transformer-based models from VHR RS image (500x500 pixels) classification and other computationally expensive downstream tasks. To this end, we propose to decompose the expensive self-attention (SA) into real and imaginary parts via discrete Fourier transform (DFT) and therefore propose an efficient complex self-attention (CSA) mechanism. Benefiting from the conjugated symmetric property of DFT, CSA is capable to model the high-order contextual information with less than half computations of naive SA. To overcome the gradient explosion in Fourier complex field, we replace the Softmax function with the carefully designed Logmax function to normalize the attention map of CSA and stabilize the gradient propagation. By stacking various layers of CSA blocks, we propose the Fourier Complex Transformer (FCT) model to learn global contextual information from VHR aerial images following the hierarchical manners. Universal experiments conducted on commonly used RS classification data sets demonstrate the effectiveness and efficiency of FCT, especially on very high-resolution RS images.
CVJul 12, 2022
TransFA: Transformer-based Representation for Face Attribute EvaluationDecheng Liu, Weijie He, Chunlei Peng et al.
Face attribute evaluation plays an important role in video surveillance and face analysis. Although methods based on convolution neural networks have made great progress, they inevitably only deal with one local neighborhood with convolutions at a time. Besides, existing methods mostly regard face attribute evaluation as the individual multi-label classification task, ignoring the inherent relationship between semantic attributes and face identity information. In this paper, we propose a novel \textbf{trans}former-based representation for \textbf{f}ace \textbf{a}ttribute evaluation method (\textbf{TransFA}), which could effectively enhance the attribute discriminative representation learning in the context of attention mechanism. The multiple branches transformer is employed to explore the inter-correlation between different attributes in similar semantic regions for attribute feature learning. Specially, the hierarchical identity-constraint attribute loss is designed to train the end-to-end architecture, which could further integrate face identity discriminative information to boost performance. Experimental results on multiple face attribute benchmarks demonstrate that the proposed TransFA achieves superior performances compared with state-of-the-art methods.
CVOct 27, 2023
Shape-centered Representation Learning for Visible-Infrared Person Re-identificationShuang Li, Jiaxu Leng, Ji Gan et al.
Visible-Infrared Person Re-Identification (VI-ReID) plays a critical role in all-day surveillance systems. However, existing methods primarily focus on learning appearance features while overlooking body shape features, which not only complement appearance features but also exhibit inherent robustness to modality variations. Despite their potential, effectively integrating shape and appearance features remains challenging. Appearance features are highly susceptible to modality variations and background noise, while shape features often suffer from inaccurate infrared shape estimation due to the limitations of auxiliary models. To address these challenges, we propose the Shape-centered Representation Learning (ScRL) framework, which enhances VI-ReID performance by innovatively integrating shape and appearance features. Specifically, we introduce Infrared Shape Restoration (ISR) to restore inaccuracies in infrared body shape representations at the feature level by leveraging infrared appearance features. In addition, we propose Shape Feature Propagation (SFP), which enables the direct extraction of shape features from original images during inference with minimal computational complexity. Furthermore, we design Appearance Feature Enhancement (AFE), which utilizes shape features to emphasize shape-related appearance features while effectively suppressing identity-unrelated noise. Benefiting from the effective integration of shape and appearance features, ScRL demonstrates superior performance through extensive experiments. On the SYSU-MM01, HITSZ-VCM, and RegDB datasets, it achieves Rank-1 (mAP) accuracies of 76.1% (72.6%), 71.2% (52.9%), and 92.4% (86.7%), respectively, surpassing existing state-of-the-art methods.
CVSep 14, 2023
Gradient constrained sharpness-aware prompt learning for vision-language modelsLiangchen Liu, Nannan Wang, Dawei Zhou et al.
This paper targets a novel trade-off problem in generalizable prompt learning for vision-language models (VLM), i.e., improving the performance on unseen classes while maintaining the performance on seen classes. Comparing with existing generalizable methods that neglect the seen classes degradation, the setting of this problem is more strict and fits more closely with practical applications. To solve this problem, we start from the optimization perspective, and leverage the relationship between loss landscape geometry and model generalization ability. By analyzing the loss landscapes of the state-of-the-art method and vanilla Sharpness-aware Minimization (SAM) based method, we conclude that the trade-off performance correlates to both loss value and loss sharpness, while each of them is indispensable. However, we find the optimizing gradient of existing methods cannot maintain high relevance to both loss value and loss sharpness during optimization, which severely affects their trade-off performance. To this end, we propose a novel SAM-based method for prompt learning, denoted as Gradient Constrained Sharpness-aware Context Optimization (GCSCoOp), to dynamically constrain the optimizing gradient, thus achieving above two-fold optimization objective simultaneously. Extensive experiments verify the effectiveness of GCSCoOp in the trade-off problem.
CVSep 28, 2024
Beyond Euclidean: Dual-Space Representation Learning for Weakly Supervised Video Violence DetectionJiaxu Leng, Zhanjie Wu, Mingpi Tan et al.
While numerous Video Violence Detection (VVD) methods have focused on representation learning in Euclidean space, they struggle to learn sufficiently discriminative features, leading to weaknesses in recognizing normal events that are visually similar to violent events (\emph{i.e.}, ambiguous violence). In contrast, hyperbolic representation learning, renowned for its ability to model hierarchical and complex relationships between events, has the potential to amplify the discrimination between visually similar events. Inspired by these, we develop a novel Dual-Space Representation Learning (DSRL) method for weakly supervised VVD to utilize the strength of both Euclidean and hyperbolic geometries, capturing the visual features of events while also exploring the intrinsic relations between events, thereby enhancing the discriminative capacity of the features. DSRL employs a novel information aggregation strategy to progressively learn event context in hyperbolic spaces, which selects aggregation nodes through layer-sensitive hyperbolic association degrees constrained by hyperbolic Dirichlet energy. Furthermore, DSRL attempts to break the cyber-balkanization of different spaces, utilizing cross-space attention to facilitate information interactions between Euclidean and hyperbolic space to capture better discriminative features for final violence detection. Comprehensive experiments demonstrate the effectiveness of our proposed DSRL.
CVOct 5, 2023
Enhancing Robust Representation in Adversarial Training: Alignment and Exclusion CriteriaNuoyan Zhou, Nannan Wang, Decheng Liu et al.
Deep neural networks are vulnerable to adversarial noise. Adversarial Training (AT) has been demonstrated to be the most effective defense strategy to protect neural networks from being fooled. However, we find AT omits to learning robust features, resulting in poor performance of adversarial robustness. To address this issue, we highlight two criteria of robust representation: (1) Exclusion: \emph{the feature of examples keeps away from that of other classes}; (2) Alignment: \emph{the feature of natural and corresponding adversarial examples is close to each other}. These motivate us to propose a generic framework of AT to gain robust representation, by the asymmetric negative contrast and reverse attention. Specifically, we design an asymmetric negative contrast based on predicted probabilities, to push away examples of different classes in the feature space. Moreover, we propose to weight feature by parameters of the linear classifier as the reverse attention, to obtain class-aware feature and pull close the feature of the same class. Empirical evaluations on three benchmark datasets show our methods greatly advance the robustness of AT and achieve state-of-the-art performance.
LGFeb 4, 2024Code
Jailbreaking Attack against Multimodal Large Language ModelZhenxing Niu, Haodong Ren, Xinbo Gao et al.
This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs), seeking to elicit MLLMs to generate objectionable responses to harmful user queries. A maximum likelihood-based algorithm is proposed to find an \emph{image Jailbreaking Prompt} (imgJP), enabling jailbreaks against MLLMs across multiple unseen prompts and images (i.e., data-universal property). Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models, including MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2, in a black-box manner. Moreover, we reveal a connection between MLLM-jailbreaks and LLM-jailbreaks. As a result, we introduce a construction-based method to harness our approach for LLM-jailbreaks, demonstrating greater efficiency than current state-of-the-art methods. The code is available here. \textbf{Warning: some content generated by language models may be offensive to some readers.}