DreamDiffusion: Generating High-Quality Images from Brain EEG SignalsYunpeng Bai, Xintao Wang, Yan-pei Cao et al. · tencent-ai
This paper introduces DreamDiffusion, a novel method for generating high-quality images directly from brain electroencephalogram (EEG) signals, without the need to translate thoughts into text. DreamDiffusion leverages pre-trained text-to-image models and employs temporal masked signal modeling to pre-train the EEG encoder for effective and robust EEG representations. Additionally, the method further leverages the CLIP image encoder to provide extra supervision to better align EEG, text, and image embeddings with limited EEG-image pairs. Overall, the proposed method overcomes the challenges of using EEG signals for image generation, such as noise, limited information, and individual differences, and achieves promising results. Quantitative and qualitative results demonstrate the effectiveness of the proposed method as a significant step towards portable and low-cost ``thoughts-to-image'', with potential applications in neuroscience and computer vision. The code is available here \url{https://github.com/bbaaii/DreamDiffusion}.
Learning Imbalanced Data with Vision TransformersZhengzhuo Xu, Ruikang Liu, Shuo Yang et al.
The real-world data tends to be heavily imbalanced and severely skew the data-driven deep neural networks, which makes Long-Tailed Recognition (LTR) a massive challenging task. Existing LTR methods seldom train Vision Transformers (ViTs) with Long-Tailed (LT) data, while the off-the-shelf pretrain weight of ViTs always leads to unfair comparisons. In this paper, we systematically investigate the ViTs' performance in LTR and propose LiVT to train ViTs from scratch only with LT data. With the observation that ViTs suffer more severe LTR problems, we conduct Masked Generative Pretraining (MGP) to learn generalized features. With ample and solid evidence, we show that MGP is more robust than supervised manners. In addition, Binary Cross Entropy (BCE) loss, which shows conspicuous performance with ViTs, encounters predicaments in LTR. We further propose the balanced BCE to ameliorate it with strong theoretical groundings. Specially, we derive the unbiased extension of Sigmoid and compensate extra logit margins to deploy it. Our Bal-BCE contributes to the quick convergence of ViTs in just a few epochs. Extensive experiments demonstrate that with MGP and Bal-BCE, LiVT successfully trains ViTs well without any additional data and outperforms comparable state-of-the-art methods significantly, e.g., our ViT-B achieves 81.0% Top-1 accuracy in iNaturalist 2018 without bells and whistles. Code is available at https://github.com/XuZhengzhuo/LiVT.
Masked Generative DistillationZhendong Yang, Zhe Li, Mingqi Shao et al.
Knowledge distillation has been applied to various tasks successfully. The current distillation algorithm usually improves students' performance by imitating the output of the teacher. This paper shows that teachers can also improve students' representation power by guiding students' feature recovery. From this point of view, we propose Masked Generative Distillation (MGD), which is simple: we mask random pixels of the student's feature and force it to generate the teacher's full feature through a simple block. MGD is a truly general feature-based distillation method, which can be utilized on various tasks, including image classification, object detection, semantic segmentation and instance segmentation. We experiment on different models with extensive datasets and the results show that all the students achieve excellent improvements. Notably, we boost ResNet-18 from 69.90% to 71.69% ImageNet top-1 accuracy, RetinaNet with ResNet-50 backbone from 37.4 to 41.0 Boundingbox mAP, SOLO based on ResNet-50 from 33.1 to 36.2 Mask mAP and DeepLabV3 based on ResNet-18 from 73.20 to 76.02 mIoU. Our codes are available at https://github.com/yzd-v/MGD.
Effective Whole-body Pose Estimation with Two-stages DistillationZhendong Yang, Ailing Zeng, Chun Yuan et al.
Whole-body pose estimation localizes the human body, hand, face, and foot keypoints in an image. This task is challenging due to multi-scale body parts, fine-grained localization for low-resolution regions, and data scarcity. Meanwhile, applying a highly efficient and accurate pose estimator to widely human-centric understanding and generation tasks is urgent. In this work, we present a two-stage pose \textbf{D}istillation for \textbf{W}hole-body \textbf{P}ose estimators, named \textbf{DWPose}, to improve their effectiveness and efficiency. The first-stage distillation designs a weight-decay strategy while utilizing a teacher's intermediate feature and final logits with both visible and invisible keypoints to supervise the student from scratch. The second stage distills the student model itself to further improve performance. Different from the previous self-knowledge distillation, this stage finetunes the student's head with only 20% training time as a plug-and-play training strategy. For data limitations, we explore the UBody dataset that contains diverse facial expressions and hand gestures for real-life applications. Comprehensive experiments show the superiority of our proposed simple yet effective methods. We achieve new state-of-the-art performance on COCO-WholeBody, significantly boosting the whole-body AP of RTMPose-l from 64.8% to 66.5%, even surpassing RTMPose-x teacher with 65.3% AP. We release a series of models with different sizes, from tiny to large, for satisfying various downstream tasks. Our codes and models are available at https://github.com/IDEA-Research/DWPose.
SWEM: Towards Real-Time Video Object Segmentation with Sequential Weighted Expectation-MaximizationZhihui Lin, Tianyu Yang, Maomao Li et al.
Matching-based methods, especially those based on space-time memory, are significantly ahead of other solutions in semi-supervised video object segmentation (VOS). However, continuously growing and redundant template features lead to an inefficient inference. To alleviate this, we propose a novel Sequential Weighted Expectation-Maximization (SWEM) network to greatly reduce the redundancy of memory features. Different from the previous methods which only detect feature redundancy between frames, SWEM merges both intra-frame and inter-frame similar features by leveraging the sequential weighted EM algorithm. Further, adaptive weights for frame features endow SWEM with the flexibility to represent hard samples, improving the discrimination of templates. Besides, the proposed method maintains a fixed number of template features in memory, which ensures the stable inference complexity of the VOS system. Extensive experiments on commonly used DAVIS and YouTube-VOS datasets verify the high efficiency (36 FPS) and high performance (84.3\% $\mathcal{J}\&\mathcal{F}$ on DAVIS 2017 validation dataset) of SWEM. Code is available at: https://github.com/lmm077/SWEM.
From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft LabelsZhendong Yang, Ailing Zeng, Zhe Li et al.
Knowledge Distillation (KD) uses the teacher's prediction logits as soft labels to guide the student, while self-KD does not need a real teacher to require the soft labels. This work unifies the formulations of the two tasks by decomposing and reorganizing the generic KD loss into a Normalized KD (NKD) loss and customized soft labels for both target class (image's category) and non-target classes named Universal Self-Knowledge Distillation (USKD). We decompose the KD loss and find the non-target loss from it forces the student's non-target logits to match the teacher's, but the sum of the two non-target logits is different, preventing them from being identical. NKD normalizes the non-target logits to equalize their sum. It can be generally used for KD and self-KD to better use the soft labels for distillation loss. USKD generates customized soft labels for both target and non-target classes without a teacher. It smooths the target logit of the student as the soft target label and uses the rank of the intermediate feature to generate the soft non-target labels with Zipf's law. For KD with teachers, our NKD achieves state-of-the-art performance on CIFAR-100 and ImageNet datasets, boosting the ImageNet Top-1 accuracy of ResNet18 from 69.90% to 71.96% with a ResNet-34 teacher. For self-KD without teachers, USKD is the first self-KD method that can be effectively applied to both CNN and ViT models with negligible additional time and memory cost, resulting in new state-of-the-art results, such as 1.17% and 0.55% accuracy gains on ImageNet for MobileNet and DeiT-Tiny, respectively. Our codes are available at https://github.com/yzd-v/cls_KD.
One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic ModulationsYiming Zhu, Hongyu Liu, Yibing Song et al.
Free-form text prompts allow users to describe their intentions during image manipulation conveniently. Based on the visual latent space of StyleGAN[21] and text embedding space of CLIP[34], studies focus on how to map these two latent spaces for text-driven attribute manipulations. Currently, the latent mapping between these two spaces is empirically designed and confines that each manipulation model can only handle one fixed text prompt. In this paper, we propose a method named Free-Form CLIP (FFCLIP), aiming to establish an automatic latent mapping so that one manipulation model handles free-form text prompts. Our FFCLIP has a cross-modality semantic modulation module containing semantic alignment and injection. The semantic alignment performs the automatic latent mapping via linear transformations with a cross attention mechanism. After alignment, we inject semantics from text prompt embeddings to the StyleGAN latent space. For one type of image (e.g., `human portrait'), one FFCLIP model can be learned to handle free-form text prompts. Meanwhile, we observe that although each training text prompt only contains a single semantic meaning, FFCLIP can leverage text prompts with multiple semantic meanings for image manipulation. In the experiments, we evaluate FFCLIP on three types of images (i.e., `human portraits', `cars', and `churches'). Both visual and numerical results show that FFCLIP effectively produces semantically accurate and visually realistic images. Project page: https://github.com/KumapowerLIU/FFCLIP.
16.1CVMar 20, 2023
HiFace: High-Fidelity 3D Face Reconstruction by Learning Static and Dynamic DetailsZenghao Chai, Tianke Zhang, Tianyu He et al. · microsoft-research
3D Morphable Models (3DMMs) demonstrate great potential for reconstructing faithful and animatable 3D facial surfaces from a single image. The facial surface is influenced by the coarse shape, as well as the static detail (e,g., person-specific appearance) and dynamic detail (e.g., expression-driven wrinkles). Previous work struggles to decouple the static and dynamic details through image-level supervision, leading to reconstructions that are not realistic. In this paper, we aim at high-fidelity 3D face reconstruction and propose HiFace to explicitly model the static and dynamic details. Specifically, the static detail is modeled as the linear combination of a displacement basis, while the dynamic detail is modeled as the linear interpolation of two displacement maps with polarized expressions. We exploit several loss functions to jointly learn the coarse shape and fine details with both synthetic and real-world datasets, which enable HiFace to reconstruct high-fidelity 3D shapes with animatable details. Extensive quantitative and qualitative experiments demonstrate that HiFace presents state-of-the-art reconstruction quality and faithfully recovers both the static and dynamic details. Our project page can be found at https://project-hiface.github.io.
ViTKD: Practical Guidelines for ViT feature knowledge distillationZhendong Yang, Zhe Li, Ailing Zeng et al.
Knowledge Distillation (KD) for Convolutional Neural Network (CNN) is extensively studied as a way to boost the performance of a small model. Recently, Vision Transformer (ViT) has achieved great success on many computer vision tasks and KD for ViT is also desired. However, besides the output logit-based KD, other feature-based KD methods for CNNs cannot be directly applied to ViT due to the huge structure gap. In this paper, we explore the way of feature-based distillation for ViT. Based on the nature of feature maps in ViT, we design a series of controlled experiments and derive three practical guidelines for ViT's feature distillation. Some of our findings are even opposite to the practices in the CNN era. Based on the three guidelines, we propose our feature-based method ViTKD which brings consistent and considerable improvement to the student. On ImageNet-1k, we boost DeiT-Tiny from 74.42% to 76.06%, DeiT-Small from 80.55% to 81.95%, and DeiT-Base from 81.76% to 83.46%. Moreover, ViTKD and the logit-based KD method are complementary and can be applied together directly. This combination can further improve the performance of the student. Specifically, the student DeiT-Tiny, Small, and Base achieve 77.78%, 83.59%, and 85.41%, respectively. The code is available at https://github.com/yzd-v/cls_KD.
UPop: Unified and Progressive Pruning for Compressing Vision-Language TransformersDachuan Shi, Chaofan Tao, Ying Jin et al.
Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Moreover, increasingly heavier models, \textit{e}.\textit{g}., Transformers, have attracted the attention of researchers to model compression. However, how to compress multimodal models, especially vison-language Transformers, is still under-explored. This paper proposes the \textbf{U}nified and \textbf{P}r\textbf{o}gressive \textbf{P}runing (\textbf{\emph{UPop}}) as a universal vison-language Transformer compression framework, which incorporates 1) unifiedly searching multimodal subnets in a continuous optimization space from the original model, which enables automatic assignment of pruning ratios among compressible modalities and structures; 2) progressively searching and retraining the subnet, which maintains convergence between the search and retrain to attain higher compression ratios. Experiments on various tasks, datasets, and model architectures demonstrate the effectiveness and versatility of the proposed UPop framework. The code is available at https://github.com/sdc17/UPop.
16.7CVNov 28, 2022
High-fidelity Facial Avatar Reconstruction from Monocular Video with Generative PriorsYunpeng Bai, Yanbo Fan, Xuan Wang et al. · tsinghua
High-fidelity facial avatar reconstruction from a monocular video is a significant research problem in computer graphics and computer vision. Recently, Neural Radiance Field (NeRF) has shown impressive novel view rendering results and has been considered for facial avatar reconstruction. However, the complex facial dynamics and missing 3D information in monocular videos raise significant challenges for faithful facial reconstruction. In this work, we propose a new method for NeRF-based facial avatar reconstruction that utilizes 3D-aware generative prior. Different from existing works that depend on a conditional deformation field for dynamic modeling, we propose to learn a personalized generative prior, which is formulated as a local and low dimensional subspace in the latent space of 3D-GAN. We propose an efficient method to construct the personalized generative prior based on a small set of facial images of a given individual. After learning, it allows for photo-realistic rendering with novel views and the face reenactment can be realized by performing navigation in the latent space. Our proposed method is applicable for different driven signals, including RGB images, 3DMM coefficients, and audios. Compared with existing works, we obtain superior novel view synthesis results and faithfully face reenactment performance.
HyP$^2$ Loss: Beyond Hypersphere Metric Space for Multi-label Image RetrievalChengyin Xu, Zenghao Chai, Zhengzhuo Xu et al.
Image retrieval has become an increasingly appealing technique with broad multimedia application prospects, where deep hashing serves as the dominant branch towards low storage and efficient retrieval. In this paper, we carried out in-depth investigations on metric learning in deep hashing for establishing a powerful metric space in multi-label scenarios, where the pair loss suffers high computational overhead and converge difficulty, while the proxy loss is theoretically incapable of expressing the profound label dependencies and exhibits conflicts in the constructed hypersphere space. To address the problems, we propose a novel metric learning framework with Hybrid Proxy-Pair Loss (HyP$^2$ Loss) that constructs an expressive metric space with efficient training complexity w.r.t. the whole dataset. The proposed HyP$^2$ Loss focuses on optimizing the hypersphere space by learnable proxies and excavating data-to-data correlations of irrelevant pairs, which integrates sufficient data correspondence of pair-based methods and high-efficiency of proxy-based methods. Extensive experiments on four standard multi-label benchmarks justify the proposed method outperforms the state-of-the-art, is robust among different hash bits and achieves significant performance gains with a faster, more stable convergence speed. Our code is available at https://github.com/JerryXu0129/HyP2-Loss.
Rethinking Knowledge Distillation via Cross-EntropyZhendong Yang, Zhe Li, Yuan Gong et al.
Knowledge Distillation (KD) has developed extensively and boosted various tasks. The classical KD method adds the KD loss to the original cross-entropy (CE) loss. We try to decompose the KD loss to explore its relation with the CE loss. Surprisingly, we find it can be regarded as a combination of the CE loss and an extra loss which has the identical form as the CE loss. However, we notice the extra loss forces the student's relative probability to learn the teacher's absolute probability. Moreover, the sum of the two probabilities is different, making it hard to optimize. To address this issue, we revise the formulation and propose a distributed loss. In addition, we utilize teachers' target output as the soft target, proposing the soft loss. Combining the soft loss and the distributed loss, we propose a new KD loss (NKD). Furthermore, we smooth students' target output to treat it as the soft target for training without teachers and propose a teacher-free new KD loss (tf-NKD). Our method achieves state-of-the-art performance on CIFAR-100 and ImageNet. For example, with ResNet-34 as the teacher, we boost the ImageNet Top-1 accuracy of ResNet18 from 69.90% to 71.96%. In training without teachers, MobileNet, ResNet-18 and SwinTransformer-Tiny achieve 70.04%, 70.76%, and 81.48%, which are 0.83%, 0.86%, and 0.30% higher than the baseline, respectively. The code is available at https://github.com/yzd-v/cls_KD.
10.1CVApr 29, 2022
Privacy-Preserving Model Upgrades with Bidirectional Compatible Training in Image RetrievalShupeng Su, Binjie Zhang, Yixiao Ge et al. · tencent-ai
The task of privacy-preserving model upgrades in image retrieval desires to reap the benefits of rapidly evolving new models without accessing the raw gallery images. A pioneering work introduced backward-compatible training, where the new model can be directly deployed in a backfill-free manner, i.e., the new query can be directly compared to the old gallery features. Despite a possible solution, its improvement in sequential model upgrades is gradually limited by the fixed and under-quality old gallery embeddings. To this end, we propose a new model upgrade paradigm, termed Bidirectional Compatible Training (BiCT), which will upgrade the old gallery embeddings by forward-compatible training towards the embedding space of the backward-compatible new model. We conduct comprehensive experiments to verify the prominent improvement by BiCT and interestingly observe that the inconspicuous loss weight of backward compatibility actually plays an essential role for both backward and forward retrieval performance. To summarize, we introduce a new and valuable problem named privacy-preserving model upgrades, with a proper solution BiCT. Several intriguing insights are further proposed to get the most out of our method.
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language TasksHao Li, Jinguo Zhu, Xiaohu Jiang et al.
Despite the remarkable success of foundation models, their task-specific fine-tuning paradigm makes them inconsistent with the goal of general perception modeling. The key to eliminating this inconsistency is to use generalist models for general task modeling. However, existing attempts at generalist models are inadequate in both versatility and performance. In this paper, we propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance. Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model. The encoded representations are transformed by a task-agnostic decoder. Different tasks are formulated as a unified maximum likelihood estimation problem. We further propose an improved optimizer to ensure stable multi-task learning with an unmixed sampling strategy, which is helpful for tasks requiring large batch-size training. After being jointly trained on various tasks, Uni-Perceiver v2 is capable of directly handling downstream tasks without any task-specific adaptation. Results show that Uni-Perceiver v2 outperforms all existing generalist models in both versatility and performance. Meanwhile, compared with the commonly-recognized strong baselines that require tasks-specific fine-tuning, Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
4.8CVOct 13, 2022
Darwinian Model Upgrades: Model Evolving with Selective CompatibilityBinjie Zhang, Shupeng Su, Yixiao Ge et al. · tencent-ai
The traditional model upgrading paradigm for retrieval requires recomputing all gallery embeddings before deploying the new model (dubbed as "backfilling"), which is quite expensive and time-consuming considering billions of instances in industrial applications. BCT presents the first step towards backward-compatible model upgrades to get rid of backfilling. It is workable but leaves the new model in a dilemma between new feature discriminativeness and new-to-old compatibility due to the undifferentiated compatibility constraints. In this work, we propose Darwinian Model Upgrades (DMU), which disentangle the inheritance and variation in the model evolving with selective backward compatibility and forward adaptation, respectively. The old-to-new heritable knowledge is measured by old feature discriminativeness, and the gallery features, especially those of poor quality, are evolved in a lightweight manner to become more adaptive in the new latent space. We demonstrate the superiority of DMU through comprehensive experiments on large-scale landmark retrieval and face recognition benchmarks. DMU effectively alleviates the new-to-new degradation and improves new-to-old compatibility, rendering a more proper model upgrading paradigm in large-scale retrieval systems.
8.1IVOct 9, 2022
Super-Resolution by Predicting Offsets: An Ultra-Efficient Super-Resolution Network for Rasterized ImagesJinjin Gu, Haoming Cai, Chenyu Dong et al. · eth-zurich
Rendering high-resolution (HR) graphics brings substantial computational costs. Efficient graphics super-resolution (SR) methods may achieve HR rendering with small computing resources and have attracted extensive research interests in industry and research communities. We present a new method for real-time SR for computer graphics, namely Super-Resolution by Predicting Offsets (SRPO). Our algorithm divides the image into two parts for processing, i.e., sharp edges and flatter areas. For edges, different from the previous SR methods that take the anti-aliased images as inputs, our proposed SRPO takes advantage of the characteristics of rasterized images to conduct SR on the rasterized images. To complement the residual between HR and low-resolution (LR) rasterized images, we train an ultra-efficient network to predict the offset maps to move the appropriate surrounding pixels to the new positions. For flat areas, we found simple interpolation methods can already generate reasonable output. We finally use a guided fusion operation to integrate the sharp edges generated by the network and flat areas by the interpolation method to get the final SR image. The proposed network only contains 8,434 parameters and can be accelerated by network quantization. Extensive experiments show that the proposed SRPO can achieve superior visual effects at a smaller computational cost than the existing state-of-the-art methods.
AANet: Aggregation and Alignment Network with Semi-hard Positive Sample Mining for Hierarchical Place RecognitionFeng Lu, Lijun Zhang, Shuting Dong et al.
Visual place recognition (VPR) is one of the research hotspots in robotics, which uses visual information to locate robots. Recently, the hierarchical two-stage VPR methods have become popular in this field due to the trade-off between accuracy and efficiency. These methods retrieve the top-k candidate images using the global features in the first stage, then re-rank the candidates by matching the local features in the second stage. However, they usually require additional algorithms (e.g. RANSAC) for geometric consistency verification in re-ranking, which is time-consuming. Here we propose a Dynamically Aligning Local Features (DALF) algorithm to align the local features under spatial constraints. It is significantly more efficient than the methods that need geometric consistency verification. We present a unified network capable of extracting global features for retrieving candidates via an aggregation module and aligning local features for re-ranking via the DALF alignment module. We call this network AANet. Meanwhile, many works use the simplest positive samples in triplet for weakly supervised training, which limits the ability of the network to recognize harder positive pairs. To address this issue, we propose a Semi-hard Positive Sample Mining (ShPSM) strategy to select appropriate hard positive images for training more robust VPR networks. Extensive experiments on four benchmark VPR datasets show that the proposed AANet can outperform several state-of-the-art methods with less time consumption. The code is released at https://github.com/Lu-Feng/AANet.
6.8CVMar 20, 2023
HMC: Hierarchical Mesh Coarsening for Skeleton-free Motion RetargetingHaoyu Wang, Shaoli Huang, Fang Zhao et al.
We present a simple yet effective method for skeleton-free motion retargeting. Previous methods transfer motion between high-resolution meshes, failing to preserve the inherent local-part motions in the mesh. Addressing this issue, our proposed method learns the correspondence in a coarse-to-fine fashion by integrating the retargeting process with a mesh-coarsening pipeline. First, we propose a mesh-coarsening module that coarsens the mesh representations for better motion transfer. This module improves the ability to handle small-part motion and preserves the local motion interdependence between neighboring mesh vertices. Furthermore, we leverage a hierarchical refinement procedure to complement missing mesh details by gradually improving the low-resolution mesh output with a higher-resolution one. We evaluate our method on several well-known 3D character datasets, and it yields an average improvement of 25% on point-wise mesh euclidean distance (PMD) against the start-of-art method. Moreover, our qualitative results show that our method is significantly helpful in preserving the moving consistency of different body parts on the target character due to disentangling body-part structures and mesh details in a hierarchical way.
Architecture, Dataset and Model-Scale Agnostic Data-free Meta-LearningZixuan Hu, Li Shen, Zhenyi Wang et al.
The goal of data-free meta-learning is to learn useful prior knowledge from a collection of pre-trained models without accessing their training data. However, existing works only solve the problem in parameter space, which (i) ignore the fruitful data knowledge contained in the pre-trained models; (ii) can not scale to large-scale pre-trained models; (iii) can only meta-learn pre-trained models with the same network architecture. To address those issues, we propose a unified framework, dubbed PURER, which contains: (1) ePisode cUrriculum inveRsion (ECI) during data-free meta training; and (2) invErsion calibRation following inner loop (ICFIL) during meta testing. During meta training, we propose ECI to perform pseudo episode training for learning to adapt fast to new unseen tasks. Specifically, we progressively synthesize a sequence of pseudo episodes by distilling the training data from each pre-trained model. The ECI adaptively increases the difficulty level of pseudo episodes according to the real-time feedback of the meta model. We formulate the optimization process of meta training with ECI as an adversarial form in an end-to-end manner. During meta testing, we further propose a simple plug-and-play supplement-ICFIL-only used during meta testing to narrow the gap between meta training and meta testing task distribution. Extensive experiments in various real-world scenarios show the superior performance of ours.
2.8CVJun 17, 2023
MA-NeRF: Motion-Assisted Neural Radiance Fields for Face Synthesis from Sparse ImagesWeichen Zhang, Xiang Zhou, Yukang Cao et al.
We address the problem of photorealistic 3D face avatar synthesis from sparse images. Existing Parametric models for face avatar reconstruction struggle to generate details that originate from inputs. Meanwhile, although current NeRF-based avatar methods provide promising results for novel view synthesis, they fail to generalize well for unseen expressions. We improve from NeRF and propose a novel framework that, by leveraging the parametric 3DMM models, can reconstruct a high-fidelity drivable face avatar and successfully handle the unseen expressions. At the core of our implementation are structured displacement feature and semantic-aware learning module. Our structured displacement feature will introduce the motion prior as an additional constraints and help perform better for unseen expressions, by constructing displacement volume. Besides, the semantic-aware learning incorporates multi-level prior, e.g., semantic embedding, learnable latent code, to lift the performance to a higher level. Thorough experiments have been doen both quantitatively and qualitatively to demonstrate the design of our framework, and our method achieves much better results than the current state-of-the-arts.
18.4CVOct 30, 2023
CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion ModelsZiyang Yuan, Mingdeng Cao, Xintao Wang et al.
Incorporating a customized object into image generation presents an attractive feature in text-to-image generation. However, existing optimization-based and encoder-based methods are hindered by drawbacks such as time-consuming optimization, insufficient identity preservation, and a prevalent copy-pasting effect. To overcome these limitations, we introduce CustomNet, a novel object customization approach that explicitly incorporates 3D novel view synthesis capabilities into the object customization process. This integration facilitates the adjustment of spatial position relationships and viewpoints, yielding diverse outputs while effectively preserving object identity. Moreover, we introduce delicate designs to enable location control and flexible background control through textual descriptions or specific user-defined images, overcoming the limitations of existing 3D novel view synthesis methods. We further leverage a dataset construction pipeline that can better handle real-world objects and complex backgrounds. Equipped with these designs, our method facilitates zero-shot object customization without test-time optimization, offering simultaneous control over the viewpoints, location, and background. As a result, our CustomNet ensures enhanced identity preservation and generates diverse, harmonious outputs.
17.3LGNov 20, 2022
When Noisy Labels Meet Long Tail Dilemmas: A Representation Calibration MethodManyi Zhang, Xuyang Zhao, Jun Yao et al.
Real-world large-scale datasets are both noisily labeled and class-imbalanced. The issues seriously hurt the generalization of trained models. It is hence significant to address the simultaneous incorrect labeling and class-imbalance, i.e., the problem of learning with noisy labels on long-tailed data. Previous works develop several methods for the problem. However, they always rely on strong assumptions that are invalid or hard to be checked in practice. In this paper, to handle the problem and address the limitations of prior works, we propose a representation calibration method RCAL. Specifically, RCAL works with the representations extracted by unsupervised contrastive learning. We assume that without incorrect labeling and class imbalance, the representations of instances in each class conform to a multivariate Gaussian distribution, which is much milder and easier to be checked. Based on the assumption, we recover underlying representation distributions from polluted ones resulting from mislabeled and class-imbalanced data. Additional data points are then sampled from the recovered distributions to help generalization. Moreover, during classifier training, representation learning takes advantage of representation robustness brought by contrastive learning, which further improves the classifier performance. We derive theoretical results to discuss the effectiveness of our representation calibration. Experiments on multiple benchmarks justify our claims and confirm the superiority of the proposed method.
20.6CVMar 22, 2023
Make Encoder Great Again in 3D GAN Inversion through Geometry and Occlusion-Aware EncodingZiyang Yuan, Yiming Zhu, Yu Li et al.
3D GAN inversion aims to achieve high reconstruction fidelity and reasonable 3D geometry simultaneously from a single image input. However, existing 3D GAN inversion methods rely on time-consuming optimization for each individual case. In this work, we introduce a novel encoder-based inversion framework based on EG3D, one of the most widely-used 3D GAN models. We leverage the inherent properties of EG3D's latent space to design a discriminator and a background depth regularization. This enables us to train a geometry-aware encoder capable of converting the input image into corresponding latent code. Additionally, we explore the feature space of EG3D and develop an adaptive refinement stage that improves the representation ability of features in EG3D to enhance the recovery of fine-grained textural details. Finally, we propose an occlusion-aware fusion operation to prevent distortion in unobserved regions. Our method achieves impressive results comparable to optimization-based methods while operating up to 500 times faster. Our framework is well-suited for applications such as semantic editing.
4.6LGOct 11, 2022
Tackling Instance-Dependent Label Noise with Dynamic Distribution CalibrationManyi Zhang, Yuxin Ren, Zihao Wang et al. · tsinghua
Instance-dependent label noise is realistic but rather challenging, where the label-corruption process depends on instances directly. It causes a severe distribution shift between the distributions of training and test data, which impairs the generalization of trained models. Prior works put great effort into tackling the issue. Unfortunately, these works always highly rely on strong assumptions or remain heuristic without theoretical guarantees. In this paper, to address the distribution shift in learning with instance-dependent label noise, a dynamic distribution-calibration strategy is adopted. Specifically, we hypothesize that, before training data are corrupted by label noise, each class conforms to a multivariate Gaussian distribution at the feature level. Label noise produces outliers to shift the Gaussian distribution. During training, to calibrate the shifted distribution, we propose two methods based on the mean and covariance of multivariate Gaussian distribution respectively. The mean-based method works in a recursive dimension-reduction manner for robust mean estimation, which is theoretically guaranteed to train a high-quality model against label noise. The covariance-based method works in a distribution disturbance manner, which is experimentally verified to improve the model robustness. We demonstrate the utility and effectiveness of our methods on datasets with synthetic label noise and real-world unknown noise.
Towards Implicit Aggregation: Robust Image Representation for Place Recognition in the Transformer EraFeng Lu, Tong Jin, Canming Ye et al.
Visual place recognition (VPR) is typically regarded as a specific image retrieval task, whose core lies in representing images as global descriptors. Over the past decade, dominant VPR methods (e.g., NetVLAD) have followed a paradigm that first extracts the patch features/tokens of the input image using a backbone, and then aggregates these patch features into a global descriptor via an aggregator. This backbone-plus-aggregator paradigm has achieved overwhelming dominance in the CNN era and remains widely used in transformer-based models. In this paper, however, we argue that a dedicated aggregator is not necessary in the transformer era, that is, we can obtain robust global descriptors only with the backbone. Specifically, we introduce some learnable aggregation tokens, which are prepended to the patch tokens before a particular transformer block. All these tokens will be jointly processed and interact globally via the intrinsic self-attention mechanism, implicitly aggregating useful information within the patch tokens to the aggregation tokens. Finally, we only take these aggregation tokens from the last output tokens and concatenate them as the global representation. Although implicit aggregation can provide robust global descriptors in an extremely simple manner, where and how to insert additional tokens, as well as the initialization of tokens, remains an open issue worthy of further exploration. To this end, we also propose the optimal token insertion strategy and token initialization method derived from empirical studies. Experimental results show that our method outperforms state-of-the-art methods on several VPR datasets with higher efficiency and ranks 1st on the MSLS challenge leaderboard. The code is available at https://github.com/lu-feng/image.
1.5CVJan 25, 2023
Towards Arbitrary Text-driven Image Manipulation via Space AlignmentYunpeng Bai, Zihan Zhong, Chao Dong et al.
The recent GAN inversion methods have been able to successfully invert the real image input to the corresponding editable latent code in StyleGAN. By combining with the language-vision model (CLIP), some text-driven image manipulation methods are proposed. However, these methods require extra costs to perform optimization for a certain image or a new attribute editing mode. To achieve a more efficient editing method, we propose a new Text-driven image Manipulation framework via Space Alignment (TMSA). The Space Alignment module aims to align the same semantic regions in CLIP and StyleGAN spaces. Then, the text input can be directly accessed into the StyleGAN space and be used to find the semantic shift according to the text description. The framework can support arbitrary image editing mode without additional cost. Our work provides the user with an interface to control the attributes of a given image according to text input and get the result in real time. Ex tensive experiments demonstrate our superior performance over prior works.
24.5AISep 5, 2024
ChartMoE: Mixture of Diversely Aligned Expert Connector for Chart UnderstandingZhengzhuo Xu, Bowen Qu, Yiyan Qi et al.
Automatic chart understanding is crucial for content comprehension and document parsing. Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in chart understanding through domain-specific alignment and fine-tuning. However, current MLLMs still struggle to provide faithful data and reliable analysis only based on charts. To address it, we propose ChartMoE, which employs the Mixture of Expert (MoE) architecture to replace the traditional linear projector to bridge the modality gap. Specifically, we train several linear connectors through distinct alignment tasks, which are utilized as the foundational initialization parameters for different experts. Additionally, we introduce ChartMoE-Align, a dataset with nearly 1 million chart-table-JSON-code quadruples to conduct three alignment tasks (chart-table/JSON/code). Combined with the vanilla connector, we initialize different experts diversely and adopt high-quality knowledge learning to further refine the MoE connector and LLM parameters. Extensive experiments demonstrate the effectiveness of the MoE connector and our initialization strategy, e.g., ChartMoE improves the accuracy of the previous state-of-the-art from 80.48\% to 84.64\% on the ChartQA benchmark.
3.9CVFeb 28, 2023
Rethink Long-tailed Recognition with Vision TransformersZhengzhuo Xu, Shuo Yang, Xingjun Wang et al.
In the real world, data tends to follow long-tailed distributions w.r.t. class or attribution, motivating the challenging Long-Tailed Recognition (LTR) problem. In this paper, we revisit recent LTR methods with promising Vision Transformers (ViT). We figure out that 1) ViT is hard to train with long-tailed data. 2) ViT learns generalized features in an unsupervised manner, like mask generative training, either on long-tailed or balanced datasets. Hence, we propose to adopt unsupervised learning to utilize long-tailed data. Furthermore, we propose the Predictive Distribution Calibration (PDC) as a novel metric for LTR, where the model tends to simply classify inputs into common classes. Our PDC can measure the model calibration of predictive preferences quantitatively. On this basis, we find many LTR approaches alleviate it slightly, despite the accuracy improvement. Extensive experiments on benchmark datasets validate that PDC reflects the model's predictive preference precisely, which is consistent with the visualization.
5.7CVDec 27, 2022
Truncate-Split-Contrast: A Framework for Learning from Mislabeled VideosZixiao Wang, Junwu Weng, Chun Yuan et al.
Learning with noisy label (LNL) is a classic problem that has been extensively studied for image tasks, but much less for video in the literature. A straightforward migration from images to videos without considering the properties of videos, such as computational cost and redundant information, is not a sound choice. In this paper, we propose two new strategies for video analysis with noisy labels: 1) A lightweight channel selection method dubbed as Channel Truncation for feature-based label noise detection. This method selects the most discriminative channels to split clean and noisy instances in each category; 2) A novel contrastive strategy dubbed as Noise Contrastive Learning, which constructs the relationship between clean and noisy instances to regularize model training. Experiments on three well-known benchmark datasets for video classification show that our proposed tru{\bf N}cat{\bf E}-split-contr{\bf A}s{\bf T} (NEAT) significantly outperforms the existing baselines. By reducing the dimension to 10\% of it, our method achieves over 0.4 noise detection F1-score and 5\% classification accuracy improvement on Mini-Kinetics dataset under severe noise (symmetric-80\%). Thanks to Noise Contrastive Learning, the average classification accuracy improvement on Mini-Kinetics and Sth-Sth-V1 is over 1.6\%.
30.3CVDec 26, 2023Code
ChartBench: A Benchmark for Complex Visual Reasoning in ChartsZhengzhuo Xu, Sinan Du, Yiyan Qi et al.
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in image understanding and generation. However, current benchmarks fail to accurately evaluate the chart comprehension of MLLMs due to limited chart types and inappropriate metrics. To address this, we propose ChartBench, a comprehensive benchmark designed to assess chart comprehension and data reliability through complex visual reasoning. ChartBench includes 42 categories, 66.6k charts, and 600k question-answer pairs. Notably, many charts lack data point annotations, which requires MLLMs to derive values similar to human understanding by leveraging inherent chart elements such as color, legends, and coordinate systems. We also design an enhanced evaluation metric, Acc+, to evaluate MLLMs without extensive manual or costly LLM-based evaluations. Furthermore, we propose two baselines based on the chain of thought and supervised fine-tuning to improve model performance on unannotated charts. Extensive experimental evaluations of 18 open-sourced and 3 proprietary MLLMs reveal their limitations in chart comprehension and offer valuable insights for further research. Code and dataset are publicly available at https://chartbench.github.io.
6.6LGNov 23, 2023
Task-Distributionally Robust Data-Free Meta-LearningZixuan Hu, Li Shen, Zhenyi Wang et al.
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data. Existing inversion-based DFML methods construct pseudo tasks from a learnable dataset, which is inversely generated from the pre-trained model pool. For the first time, we reveal two major challenges hindering their practical deployments: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC). TDS leads to a biased meta-learner because of the skewed task distribution towards newly generated tasks. TDC occurs when untrusted models characterized by misleading labels or poor quality pollute the task distribution. To tackle these issues, we introduce a robust DFML framework that ensures task distributional robustness. We propose to meta-learn from a pseudo task distribution, diversified through task interpolation within a compact task-memory buffer. This approach reduces the meta-learner's overreliance on newly generated tasks by maintaining consistent performance across a broader range of interpolated memory tasks, thus ensuring its generalization for unseen tasks. Additionally, our framework seamlessly incorporates an automated model selection mechanism into the meta-training phase, parameterizing each model's reliability as a learnable weight. This is optimized with a policy gradient algorithm inspired by reinforcement learning, effectively addressing the non-differentiable challenge posed by model selection. Comprehensive experiments across various datasets demonstrate the framework's effectiveness in mitigating TDS and TDC, underscoring its potential to improve DFML in real-world scenarios.
CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place RecognitionFeng Lu, Xiangyuan Lan, Lijun Zhang et al.
Over the past decade, most methods in visual place recognition (VPR) have used neural networks to produce feature representations. These networks typically produce a global representation of a place image using only this image itself and neglect the cross-image variations (e.g. viewpoint and illumination), which limits their robustness in challenging scenes. In this paper, we propose a robust global representation method with cross-image correlation awareness for VPR, named CricaVPR. Our method uses the attention mechanism to correlate multiple images within a batch. These images can be taken in the same place with different conditions or viewpoints, or even captured from different places. Therefore, our method can utilize the cross-image variations as a cue to guide the representation learning, which ensures more robust features are produced. To further facilitate the robustness, we propose a multi-scale convolution-enhanced adaptation method to adapt pre-trained visual foundation models to the VPR task, which introduces the multi-scale local information to further enhance the cross-image correlation-aware representation. Experimental results show that our method outperforms state-of-the-art methods by a large margin with significantly less training time. The code is released at https://github.com/Lu-Feng/CricaVPR.
14.5CVFeb 28, 2023
TextIR: A Simple Framework for Text-based Editable Image RestorationYunpeng Bai, Cairong Wang, Shuzhao Xie et al.
Most existing image restoration methods use neural networks to learn strong image-level priors from huge data to estimate the lost information. However, these works still struggle in cases when images have severe information deficits. Introducing external priors or using reference images to provide information also have limitations in the application domain. In contrast, text input is more readily available and provides information with higher flexibility. In this work, we design an effective framework that allows the user to control the restoration process of degraded images with text descriptions. We use the text-image feature compatibility of the CLIP to alleviate the difficulty of fusing text and image features. Our framework can be used for various image restoration tasks, including image inpainting, image super-resolution, and image colorization. Extensive experiments demonstrate the effectiveness of our method.
Towards Seamless Adaptation of Pre-trained Models for Visual Place RecognitionFeng Lu, Lijun Zhang, Xiangyuan Lan et al.
Recent studies show that vision models pre-trained in generic visual learning tasks with large-scale data can provide useful feature representations for a wide range of visual perception problems. However, few attempts have been made to exploit pre-trained foundation models in visual place recognition (VPR). Due to the inherent difference in training objectives and data between the tasks of model pre-training and VPR, how to bridge the gap and fully unleash the capability of pre-trained models for VPR is still a key issue to address. To this end, we propose a novel method to realize seamless adaptation of pre-trained models for VPR. Specifically, to obtain both global and local features that focus on salient landmarks for discriminating places, we design a hybrid adaptation method to achieve both global and local adaptation efficiently, in which only lightweight adapters are tuned without adjusting the pre-trained model. Besides, to guide effective adaptation, we propose a mutual nearest neighbor local feature loss, which ensures proper dense local features are produced for local matching and avoids time-consuming spatial verification in re-ranking. Experimental results show that our method outperforms the state-of-the-art methods with less training data and training time, and uses about only 3% retrieval runtime of the two-stage VPR methods with RANSAC-based spatial verification. It ranks 1st on the MSLS challenge leaderboard (at the time of submission). The code is released at https://github.com/Lu-Feng/SelaVPR.
IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens IntactRuikang Liu, Haoli Bai, Haokun Lin et al.
Large language models (LLMs) excel in natural language processing but demand intensive computation. To mitigate this, various quantization methods have been explored, yet they compromise LLM performance. This paper unveils a previously overlooked type of outliers in LLMs. Such outliers are found to allocate most of the attention scores on initial tokens of input, termed as pivot tokens, which are crucial to the performance of quantized LLMs. Given that, we propose IntactKV to generate the KV cache of pivot tokens losslessly from the full-precision model. The approach is simple and easy to combine with existing quantization solutions with no extra inference overhead. Besides, IntactKV can be calibrated as additional LLM parameters to boost the quantized LLMs further with minimal training costs. Mathematical analysis also proves that IntactKV effectively reduces the upper bound of quantization error. Empirical results show that IntactKV brings consistent improvement over various quantization methods across different LLMs and downstream tasks, leading to the new state-of-the-art for LLM quantization. The codes are available at https://github.com/ruikangliu/IntactKV.
FlatQuant: Flatness Matters for LLM QuantizationYuxuan Sun, Ruikang Liu, Haoli Bai et al.
Recently, quantization has been widely used for the compression and acceleration of large language models (LLMs). Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with equally spaced quantization points. Prior research explores various pre-quantization transformations to suppress outliers, such as per-channel scaling and Hadamard transformation. However, we observe that these transformed weights and activations can still exhibit steep and dispersed distributions. In this paper, we propose FlatQuant (Fast and Learnable Affine Transformation), a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective. To reduce runtime overhead of affine transformation, we apply Kronecker product with two lightweight matrices, and fuse all operations in FlatQuant into a single kernel. Extensive experiments demonstrate that FlatQuant establishes a new state-of-the-art benchmark for quantization. For example, it achieves less than 1\% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5\%. Additionally, it provides up to 2.3x prefill speedup and 1.7x decoding speedup compared to the FP16 model. Code is available at: https://github.com/ruikangliu/FlatQuant.
Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning ModelsRuikang Liu, Yuxuan Sun, Manyi Zhang et al.
Recent advancements in reasoning language models have demonstrated remarkable performance in complex tasks, but their extended chain-of-thought reasoning process increases inference overhead. While quantization has been widely adopted to reduce the inference cost of large language models, its impact on reasoning models remains understudied. In this paper, we conduct the first systematic study on quantized reasoning models, evaluating the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families ranging from 1.5B to 70B parameters, QwQ-32B, and Qwen3-8B. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths, with extensive evaluation across mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks. Our findings reveal that while lossless quantization can be achieved with W8A8 or W4A16 quantization, lower bit-widths introduce significant accuracy risks. We further identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. In addition, strategically scaling the model sizes or reasoning steps can effectively enhance the performance. All quantized models and codes are open-sourced in https://github.com/ruikangliu/Quantized-Reasoning-Models.
5.9CVJan 26, 2023
ITstyler: Image-optimized Text-based Style TransferYunpeng Bai, Jiayue Liu, Chao Dong et al.
Text-based style transfer is a newly-emerging research topic that uses text information instead of style image to guide the transfer process, significantly extending the application scenario of style transfer. However, previous methods require extra time for optimization or text-image paired data, leading to limited effectiveness. In this work, we achieve a data-efficient text-based style transfer method that does not require optimization at the inference stage. Specifically, we convert text input to the style space of the pre-trained VGG network to realize a more effective style swap. We also leverage CLIP's multi-modal embedding space to learn the text-to-style mapping with the image dataset only. Our method can transfer arbitrary new styles of text input in real-time and synthesize high-quality artistic images.
1.4CVMay 24, 2022
Improving the Latent Space of Image Style TransferYunpeng Bai, Cairong Wang, Chun Yuan et al.
Existing neural style transfer researches have studied to match statistical information between the deep features of content and style images, which were extracted by a pre-trained VGG, and achieved significant improvement in synthesizing artistic images. However, in some cases, the feature statistics from the pre-trained encoder may not be consistent with the visual style we perceived. For example, the style distance between images of different styles is less than that of the same style. In such an inappropriate latent space, the objective function of the existing methods will be optimized in the wrong direction, resulting in bad stylization results. In addition, the lack of content details in the features extracted by the pre-trained encoder also leads to the content leak problem. In order to solve these issues in the latent space used by style transfer, we propose two contrastive training schemes to get a refined encoder that is more suitable for this task. The style contrastive loss pulls the stylized result closer to the same visual style image and pushes it away from the content image. The content contrastive loss enables the encoder to retain more available details. We can directly add our training scheme to some existing style transfer methods and significantly improve their results. Extensive experimental results demonstrate the effectiveness and superiority of our methods.
Deep Homography Estimation for Visual Place RecognitionFeng Lu, Shuting Dong, Lijun Zhang et al.
Visual place recognition (VPR) is a fundamental task for many applications such as robot localization and augmented reality. Recently, the hierarchical VPR methods have received considerable attention due to the trade-off between accuracy and efficiency. They usually first use global features to retrieve the candidate images, then verify the spatial consistency of matched local features for re-ranking. However, the latter typically relies on the RANSAC algorithm for fitting homography, which is time-consuming and non-differentiable. This makes existing methods compromise to train the network only in global feature extraction. Here, we propose a transformer-based deep homography estimation (DHE) network that takes the dense feature map extracted by a backbone network as input and fits homography for fast and learnable geometric verification. Moreover, we design a re-projection error of inliers loss to train the DHE network without additional homography labels, which can also be jointly trained with the backbone network to help it extract the features that are more suitable for local matching. Extensive experiments on benchmark datasets show that our method can outperform several state-of-the-art methods. And it is more than one order of magnitude faster than the mainstream hierarchical VPR methods using RANSAC. The code is released at https://github.com/Lu-Feng/DHE-VPR.
Efficient Conditional Diffusion Model with Probability Flow Sampling for Image Super-resolutionYutao Yuan, Chun Yuan
Image super-resolution is a fundamentally ill-posed problem because multiple valid high-resolution images exist for one low-resolution image. Super-resolution methods based on diffusion probabilistic models can deal with the ill-posed nature by learning the distribution of high-resolution images conditioned on low-resolution images, avoiding the problem of blurry images in PSNR-oriented methods. However, existing diffusion-based super-resolution methods have high time consumption with the use of iterative sampling, while the quality and consistency of generated images are less than ideal due to problems like color shifting. In this paper, we propose Efficient Conditional Diffusion Model with Probability Flow Sampling (ECDP) for image super-resolution. To reduce the time consumption, we design a continuous-time conditional diffusion model for image super-resolution, which enables the use of probability flow sampling for efficient generation. Additionally, to improve the consistency of generated images, we propose a hybrid parametrization for the denoiser network, which interpolates between the data-predicting parametrization and the noise-predicting parametrization for different noise scales. Moreover, we design an image quality loss as a complement to the score matching loss of diffusion models, further improving the consistency and quality of super-resolution. Extensive experiments on DIV2K, ImageNet, and CelebA demonstrate that our method achieves higher super-resolution quality than existing diffusion-based image super-resolution methods while having lower time consumption. Our code is available at https://github.com/Yuan-Yutao/ECDP.
Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative ModelsHongyang Wei, Shuaizheng Liu, Chun Yuan et al.
By leveraging the generative priors from pre-trained text-to-image diffusion models, significant progress has been made in real-world image super-resolution (Real-ISR). However, these methods tend to generate inaccurate and unnatural reconstructions in complex and/or heavily degraded scenes, primarily due to their limited perception and understanding capability of the input low-quality image. To address these limitations, we propose, for the first time to our knowledge, to adapt the pre-trained autoregressive multimodal model such as Lumina-mGPT into a robust Real-ISR model, namely PURE, which Perceives and Understands the input low-quality image, then REstores its high-quality counterpart. Specifically, we implement instruction tuning on Lumina-mGPT to perceive the image degradation level and the relationships between previously generated image tokens and the next token, understand the image content by generating image semantic descriptions, and consequently restore the image by generating high-quality image tokens autoregressively with the collected information. In addition, we reveal that the image token entropy reflects the image structure and present a entropy-based Top-k sampling strategy to optimize the local structure of the image during inference. Experimental results demonstrate that PURE preserves image content while generating realistic details, especially in complex scenes with multiple objects, showcasing the potential of autoregressive multimodal generative models for robust Real-ISR. The model and code will be available at https://github.com/nonwhy/PURE.
16.4CVApr 16, 2025Code
Cobra: Efficient Line Art COlorization with BRoAder ReferencesJunhao Zhuang, Lingen Li, Xuan Ju et al.
The comic production industry requires reference-based line art colorization with high accuracy, efficiency, contextual consistency, and flexible control. A comic page often involves diverse characters, objects, and backgrounds, which complicates the coloring process. Despite advancements in diffusion models for image generation, their application in line art colorization remains limited, facing challenges related to handling extensive reference images, time-consuming inference, and flexible control. We investigate the necessity of extensive contextual image guidance on the quality of line art colorization. To address these challenges, we introduce Cobra, an efficient and versatile method that supports color hints and utilizes over 200 reference images while maintaining low latency. Central to Cobra is a Causal Sparse DiT architecture, which leverages specially designed positional encodings, causal sparse attention, and Key-Value Cache to effectively manage long-context references and ensure color identity consistency. Results demonstrate that Cobra achieves accurate line art colorization through extensive contextual reference, significantly enhancing inference speed and interactivity, thereby meeting critical industrial demands. We release our codes and models on our project page: https://zhuang2002.github.io/Cobra/.
11.0CVOct 24, 2023
Mean Teacher DETR with Masked Feature Alignment: A Robust Domain Adaptive Detection Transformer FrameworkWeixi Weng, Chun Yuan
Unsupervised domain adaptation object detection (UDAOD) research on Detection Transformer(DETR) mainly focuses on feature alignment and existing methods can be divided into two kinds, each of which has its unresolved issues. One-stage feature alignment methods can easily lead to performance fluctuation and training stagnation. Two-stage feature alignment method based on mean teacher comprises a pretraining stage followed by a self-training stage, each facing problems in obtaining reliable pretrained model and achieving consistent performance gains. Methods mentioned above have not yet explore how to utilize the third related domain such as target-like domain to assist adaptation. To address these issues, we propose a two-stage framework named MTM, i.e. Mean Teacher-DETR with Masked Feature Alignment. In the pretraining stage, we utilize labeled target-like images produced by image style transfer to avoid performance fluctuation. In the self-training stage, we leverage unlabeled target images by pseudo labels based on mean teacher and propose a module called Object Queries Knowledge Transfer (OQKT) to ensure consistent performance gains of the student model. Most importantly, we propose masked feature alignment methods including Masked Domain Query-based Feature Alignment (MDQFA) and Masked Token-wise Feature Alignment (MTWFA) to alleviate domain shift in a more robust way, which not only prevent training stagnation and lead to a robust pretrained model in the pretraining stage, but also enhance the model's target performance in the self-training stage. Experiments on three challenging scenarios and a theoretical analysis verify the effectiveness of MTM.
ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle MatricesHao Yu, Tangyu Jiang, Shuning Jia et al.
The Transformer architecture has revolutionized various regions since it was proposed, and its effectiveness largely depends on the ability to encode positional information. Traditional position encoding methods exhibit significant limitations due to lack of robustness and flexibility of position. Therefore, Rotary Positional Encoding (RoPE) was proposed to alleviate these issues, which integrates positional information by rotating the embeddings in the attention mechanism. However, RoPE requires manually defined rotation matrices with limited transformation space, constraining the model's capacity. In this work, we propose ComRoPE, which generalizes RoPE by defining it in terms of trainable commuting angle matrices. Specifically, we demonstrate that pairwise commutativity of these matrices is essential for RoPE to achieve scalability and positional robustness. We formally define the RoPE Equation, which is an essential condition that ensures consistent performance with position offsets. Based on the theoretical analysis, we present two types of trainable commuting angle matrices as sufficient solutions to the RoPE equation, which significantly improve performance, surpassing the current state-of-the-art method by 1.6% at training resolution and 2.9% at higher resolution on the ImageNet-1K dataset. Furthermore, our framework shows versatility in generalizing to existing RoPE formulations and offering new insights for future positional encoding research. To ensure reproducibility, the source code and instructions are available at https://github.com/Longin-Yu/ComRoPE
A Simple Linear Patch Revives Layer-Pruned Large Language ModelsXinrui Chen, Haoli Bai, Tao Yuan et al.
Layer pruning has emerged as a widely used technique for compressing large language models (LLMs). However, existing layer pruning approaches often incur substantial performance degradation. We identify the majority of this degradation to a single yet previously overlooked issue: \textit{the mismatch of activation magnitudes at the pruning interface}. The pre-interface activations exhibit significantly different scales from the post-interface ones, causing the distributional shift as it propagates through the remaining layers. To address this issue, we introduce \textsc{LinearPatch}, a lightweight and plug-and-play technique that fuses two operations into one matrix multiply at the pruning interface: (i) a Hadamard transformation that suppresses massive outliers at particular tokens and (ii) a channel-wise scaling that aligns activation statistics. On LLaMA-3-8B, \textsc{LinearPatch} preserves up to \textbf{94.15\%} of the original model's performance when pruning 5 out of 32 layers, outperforming the previous state of the art by \textbf{4\%}. The patch can be further refined with 5K unlabeled samples via memory-efficient offline distillation, pushing the retention to 95.16\% within only 30 minutes on a single GPU. Code is available at https://github.com/chenxinrui-tsinghua/LinearPatch.
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language TransformersDachuan Shi, Chaofan Tao, Anyi Rao et al.
Recent vision-language models have achieved tremendous advances. However, their computational costs are also escalating dramatically, making model acceleration exceedingly critical. To pursue more efficient vision-language Transformers, this paper introduces Cross-Guided Ensemble of Tokens (CrossGET), a general acceleration framework for vision-language Transformers. This framework adaptively combines tokens in real-time during inference, significantly reducing computational costs while maintaining high performance. CrossGET features two primary innovations: 1) Cross-Guided Matching and Ensemble. CrossGET leverages cross-modal guided token matching and ensemble to effectively utilize cross-modal information, achieving wider applicability across both modality-independent models, e.g., CLIP, and modality-dependent ones, e.g., BLIP2. 2) Complete-Graph Soft Matching. CrossGET introduces an algorithm for the token-matching mechanism, ensuring reliable matching results while facilitating parallelizability and high efficiency. Extensive experiments have been conducted on various vision-language tasks, such as image-text retrieval, visual reasoning, image captioning, and visual question answering. The performance on both classic multimodal architectures and emerging multimodal LLMs demonstrates the framework's effectiveness and versatility. The code is available at https://github.com/sdc17/CrossGET.
Hot-Refresh Model Upgrades with Regression-Alleviating Compatible Training in Image RetrievalBinjie Zhang, Yixiao Ge, Yantao Shen et al.
The task of hot-refresh model upgrades of image retrieval systems plays an essential role in the industry but has never been investigated in academia before. Conventional cold-refresh model upgrades can only deploy new models after the gallery is overall backfilled, taking weeks or even months for massive data. In contrast, hot-refresh model upgrades deploy the new model immediately and then gradually improve the retrieval accuracy by backfilling the gallery on-the-fly. Compatible training has made it possible, however, the problem of model regression with negative flips poses a great challenge to the stable improvement of user experience. We argue that it is mainly due to the fact that new-to-old positive query-gallery pairs may show less similarity than new-to-new negative pairs. To solve the problem, we introduce a Regression-Alleviating Compatible Training (RACT) method to properly constrain the feature compatibility while reducing negative flips. The core is to encourage the new-to-old positive pairs to be more similar than both the new-to-old negative pairs and the new-to-new negative pairs. An efficient uncertainty-based backfilling strategy is further introduced to fasten accuracy improvements. Extensive experiments on large-scale retrieval benchmarks (e.g., Google Landmark) demonstrate that our RACT effectively alleviates the model regression for one more step towards seamless model upgrades. The code will be available at https://github.com/binjiezhang/RACT_ICLR2022.
HHF: Hashing-guided Hinge Function for Deep Hashing RetrievalChengyin Xu, Zenghao Chai, Zhengzhuo Xu et al.
Deep hashing has shown promising performance in large-scale image retrieval. However, latent codes extracted by Deep Neural Networks (DNNs) will inevitably lose semantic information during the binarization process, which damages the retrieval accuracy and makes it challenging. Although many existing approaches perform regularization to alleviate quantization errors, we figure out an incompatible conflict between metric learning and quantization learning. The metric loss penalizes the inter-class distances to push different classes unconstrained far away. Worse still, it tends to map the latent code deviate from ideal binarization point and generate severe ambiguity in the binarization process. Based on the minimum distance of the binary linear code, we creatively propose Hashing-guided Hinge Function (HHF) to avoid such conflict. In detail, the carefully-designed inflection point, which relies on the hash bit length and category numbers, is explicitly adopted to balance the metric term and quantization term. Such a modification prevents the network from falling into local metric optimal minima in deep hashing. Extensive experiments in CIFAR-10, CIFAR-100, ImageNet, and MS-COCO show that HHF consistently outperforms existing techniques, and is robust and flexible to transplant into other methods. Code is available at https://github.com/JerryXu0129/HHF.