h-index98
237papers
9,568citations
Novelty53%
AI Score63

237 Papers

CVAug 7, 2023Code
Dual Aggregation Transformer for Image Super-Resolution

Zheng Chen, Yulun Zhang, Jinjin Gu et al. · eth-zurich

Transformer has recently gained considerable popularity in low-level vision tasks, including image super-resolution (SR). These networks utilize self-attention along different dimensions, spatial or channel, and achieve impressive performance. This inspires us to combine the two dimensions in Transformer for a more powerful representation capability. Based on the above idea, we propose a novel Transformer model, Dual Aggregation Transformer (DAT), for image SR. Our DAT aggregates features across spatial and channel dimensions, in the inter-block and intra-block dual manner. Specifically, we alternately apply spatial and channel self-attention in consecutive Transformer blocks. The alternate strategy enables DAT to capture the global context and realize inter-block feature aggregation. Furthermore, we propose the adaptive interaction module (AIM) and the spatial-gate feed-forward network (SGFN) to achieve intra-block feature aggregation. AIM complements two self-attention mechanisms from corresponding dimensions. Meanwhile, SGFN introduces additional non-linear spatial information in the feed-forward network. Extensive experiments show that our DAT surpasses current methods. Code and models are obtainable at https://github.com/zhengchen1999/DAT.

CVMar 11, 2023Code
Recursive Generalization Transformer for Image Super-Resolution

Zheng Chen, Yulun Zhang, Jinjin Gu et al. · eth-zurich

Transformer architectures have exhibited remarkable performance in image super-resolution (SR). Since the quadratic computational complexity of the self-attention (SA) in Transformer, existing methods tend to adopt SA in a local region to reduce overheads. However, the local design restricts the global context exploitation, which is crucial for accurate image reconstruction. In this work, we propose the Recursive Generalization Transformer (RGT) for image SR, which can capture global spatial information and is suitable for high-resolution images. Specifically, we propose the recursive-generalization self-attention (RG-SA). It recursively aggregates input features into representative feature maps, and then utilizes cross-attention to extract global information. Meanwhile, the channel dimensions of attention matrices (query, key, and value) are further scaled to mitigate the redundancy in the channel domain. Furthermore, we combine the RG-SA with local self-attention to enhance the exploitation of the global context, and propose the hybrid adaptive integration (HAI) for module integration. The HAI allows the direct and effective fusion between features at different levels (local or global). Extensive experiments demonstrate that our RGT outperforms recent state-of-the-art methods quantitatively and qualitatively. Code and pre-trained models are available at https://github.com/zhengchen1999/RGT.

CVMar 27, 2023Code
Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective

Weixia Zhang, Guangtao Zhai, Ying Wei et al.

We aim at advancing blind image quality assessment (BIQA), which predicts the human perception of image quality without any reference information. We develop a general and automated multitask learning scheme for BIQA to exploit auxiliary knowledge from other tasks, in a way that the model parameter sharing and the loss weighting are determined automatically. Specifically, we first describe all candidate label combinations (from multiple tasks) using a textual template, and compute the joint probability from the cosine similarities of the visual-textual embeddings. Predictions of each task can be inferred from the joint distribution, and optimized by carefully designed loss functions. Through comprehensive experiments on learning three tasks - BIQA, scene classification, and distortion type identification, we verify that the proposed BIQA method 1) benefits from the scene classification and distortion type identification tasks and outperforms the state-of-the-art on multiple IQA datasets, 2) is more robust in the group maximum differentiation competition, and 3) realigns the quality annotations from different IQA datasets more effectively. The source code is available at https://github.com/zwx8981/LIQE.

CVMar 11, 2023Code
Xformer: Hybrid X-Shaped Transformer for Image Denoising

Jiale Zhang, Yulun Zhang, Jinjin Gu et al. · eth-zurich

In this paper, we present a hybrid X-shaped vision Transformer, named Xformer, which performs notably on image denoising tasks. We explore strengthening the global representation of tokens from different scopes. In detail, we adopt two types of Transformer blocks. The spatial-wise Transformer block performs fine-grained local patches interactions across tokens defined by spatial dimension. The channel-wise Transformer block performs direct global context interactions across tokens defined by channel dimension. Based on the concurrent network structure, we design two branches to conduct these two interaction fashions. Within each branch, we employ an encoder-decoder architecture to capture multi-scale features. Besides, we propose the Bidirectional Connection Unit (BCU) to couple the learned representations from these two branches while providing enhanced information fusion. The joint designs make our Xformer powerful to conduct global information modeling in both spatial and channel dimensions. Extensive experiments show that Xformer, under the comparable model complexity, achieves state-of-the-art performance on the synthetic and real-world image denoising tasks. We also provide code and models at https://github.com/gladzhang/Xformer.

CVMar 25, 2023Code
Active Finetuning: Exploiting Annotation Budget in the Pretraining-Finetuning Paradigm

Yichen Xie, Han Lu, Junchi Yan et al.

Given the large-scale data and the high annotation cost, pretraining-finetuning becomes a popular paradigm in multiple computer vision tasks. Previous research has covered both the unsupervised pretraining and supervised finetuning in this paradigm, while little attention is paid to exploiting the annotation budget for finetuning. To fill in this gap, we formally define this new active finetuning task focusing on the selection of samples for annotation in the pretraining-finetuning paradigm. We propose a novel method called ActiveFT for active finetuning task to select a subset of data distributing similarly with the entire unlabeled pool and maintaining enough diversity by optimizing a parametric model in the continuous space. We prove that the Earth Mover's distance between the distributions of the selected subset and the entire data pool is also reduced in this process. Extensive experiments show the leading performance and high efficiency of ActiveFT superior to baselines on both image classification and semantic segmentation. Our code is released at https://github.com/yichen928/ActiveFT.

IRApr 2, 2022Code
Modeling Dynamic User Preference via Dictionary Learning for Sequential Recommendation

Chao Chen, Dongsheng Li, Junchi Yan et al.

Capturing the dynamics in user preference is crucial to better predict user future behaviors because user preferences often drift over time. Many existing recommendation algorithms -- including both shallow and deep ones -- often model such dynamics independently, i.e., user static and dynamic preferences are not modeled under the same latent space, which makes it difficult to fuse them for recommendation. This paper considers the problem of embedding a user's sequential behavior into the latent space of user preferences, namely translating sequence to preference. To this end, we formulate the sequential recommendation task as a dictionary learning problem, which learns: 1) a shared dictionary matrix, each row of which represents a partial signal of user dynamic preferences shared across users; and 2) a posterior distribution estimator using a deep autoregressive model integrated with Gated Recurrent Unit (GRU), which can select related rows of the dictionary to represent a user's dynamic preferences conditioned on his/her past behaviors. Qualitative studies on the Netflix dataset demonstrate that the proposed method can capture the user preference drifts over time and quantitative studies on multiple real-world datasets demonstrate that the proposed method can achieve higher accuracy compared with state-of-the-art factorization and neural sequential recommendation methods. The code is available at https://github.com/cchao0116/S2PNM-TKDE2021.

AIJul 18, 2024Code
LinSATNet: The Positive Linear Satisfiability Neural Networks

Runzhong Wang, Yunhao Zhang, Ziao Guo et al.

Encoding constraints into neural networks is attractive. This paper studies how to introduce the popular positive linear satisfiability to neural networks. We propose the first differentiable satisfiability layer based on an extension of the classic Sinkhorn algorithm for jointly encoding multiple sets of marginal distributions. We further theoretically characterize the convergence property of the Sinkhorn algorithm for multiple marginals. In contrast to the sequential decision e.g.\ reinforcement learning-based solvers, we showcase our technique in solving constrained (specifically satisfiability) problems by one-shot neural networks, including i) a neural routing solver learned without supervision of optimal solutions; ii) a partial graph matching network handling graphs with unmatchable outliers on both sides; iii) a predictive network for financial portfolios with continuous constraints. To our knowledge, there exists no one-shot neural solver for these scenarios when they are formulated as satisfiability problems. Source code is available at https://github.com/Thinklab-SJTU/LinSATNet

AIApr 6, 2023
FengWu: Pushing the Skillful Global Medium-range Weather Forecast beyond 10 Days Lead

Kang Chen, Tao Han, Junchao Gong et al.

We present FengWu, an advanced data-driven global medium-range weather forecast system based on Artificial Intelligence (AI). Different from existing data-driven weather forecast methods, FengWu solves the medium-range forecast problem from a multi-modal and multi-task perspective. Specifically, a deep learning architecture equipped with model-specific encoder-decoders and cross-modal fusion Transformer is elaborately designed, which is learned under the supervision of an uncertainty loss to balance the optimization of different predictors in a region-adaptive manner. Besides this, a replay buffer mechanism is introduced to improve medium-range forecast performance. With 39-year data training based on the ERA5 reanalysis, FengWu is able to accurately reproduce the atmospheric dynamics and predict the future land and atmosphere states at 37 vertical levels on a 0.25° latitude-longitude resolution. Hindcasts of 6-hourly weather in 2018 based on ERA5 demonstrate that FengWu performs better than GraphCast in predicting 80\% of the 880 reported predictands, e.g., reducing the root mean square error (RMSE) of 10-day lead global z500 prediction from 733 to 651 $m^{2}/s^2$. In addition, the inference cost of each iteration is merely 600ms on NVIDIA Tesla A100 hardware. The results suggest that FengWu can significantly improve the forecast skill and extend the skillful global medium-range weather forecast out to 10.75 days lead (with ACC of z500 > 0.6) for the first time.

LGApr 28, 2022Code
DOTIN: Dropping Task-Irrelevant Nodes for GNNs

Shaofeng Zhang, Feng Zhu, Junchi Yan et al.

Scalability is an important consideration for deep graph neural networks. Inspired by the conventional pooling layers in CNNs, many recent graph learning approaches have introduced the pooling strategy to reduce the size of graphs for learning, such that the scalability and efficiency can be improved. However, these pooling-based methods are mainly tailored to a single graph-level task and pay more attention to local information, limiting their performance in multi-task settings which often require task-specific global information. In this paper, departure from these pooling-based efforts, we design a new approach called DOTIN (\underline{D}r\underline{o}pping \underline{T}ask-\underline{I}rrelevant \underline{N}odes) to reduce the size of graphs. Specifically, by introducing $K$ learnable virtual nodes to represent the graph embeddings targeted to $K$ different graph-level tasks, respectively, up to 90\% raw nodes with low attentiveness with an attention model -- a transformer in this paper, can be adaptively dropped without notable performance decreasing. Achieving almost the same accuracy, our method speeds up GAT by about 50\% on graph-level tasks including graph classification and graph edit distance (GED) with about 60\% less memory, on D\&D dataset. Code will be made publicly available in https://github.com/Sherrylone/DOTIN.

CVJul 6, 2022
Perceptual Quality Assessment of Omnidirectional Images

Huiyu Duan, Guangtao Zhai, Xiongkuo Min et al.

Omnidirectional images and videos can provide immersive experience of real-world scenes in Virtual Reality (VR) environment. We present a perceptual omnidirectional image quality assessment (IQA) study in this paper since it is extremely important to provide a good quality of experience under the VR environment. We first establish an omnidirectional IQA (OIQA) database, which includes 16 source images and 320 distorted images degraded by 4 commonly encountered distortion types, namely JPEG compression, JPEG2000 compression, Gaussian blur and Gaussian noise. Then a subjective quality evaluation study is conducted on the OIQA database in the VR environment. Considering that humans can only see a part of the scene at one movement in the VR environment, visual attention becomes extremely important. Thus we also track head and eye movement data during the quality rating experiments. The original and distorted omnidirectional images, subjective quality ratings, and the head and eye movement data together constitute the OIQA database. State-of-the-art full-reference (FR) IQA measures are tested on the OIQA database, and some new observations different from traditional IQA are made.

CVOct 13, 2022
Adv-Attribute: Inconspicuous and Transferable Adversarial Attack on Face Recognition

Shuai Jia, Bangjie Yin, Taiping Yao et al.

Deep learning models have shown their vulnerability when dealing with adversarial attacks. Existing attacks almost perform on low-level instances, such as pixels and super-pixels, and rarely exploit semantic clues. For face recognition attacks, existing methods typically generate the l_p-norm perturbations on pixels, however, resulting in low attack transferability and high vulnerability to denoising defense models. In this work, instead of performing perturbations on the low-level pixels, we propose to generate attacks through perturbing on the high-level semantics to improve attack transferability. Specifically, a unified flexible framework, Adversarial Attributes (Adv-Attribute), is designed to generate inconspicuous and transferable attacks on face recognition, which crafts the adversarial noise and adds it into different attributes based on the guidance of the difference in face recognition features from the target. Moreover, the importance-aware attribute selection and the multi-objective optimization strategy are introduced to further ensure the balance of stealthiness and attacking strength. Extensive experiments on the FFHQ and CelebA-HQ datasets show that the proposed Adv-Attribute method achieves the state-of-the-art attacking success rates while maintaining better visual effects against recent attack methods.

CVJul 4, 2022
A Survey on Label-efficient Deep Image Segmentation: Bridging the Gap between Weak Supervision and Dense Prediction

Wei Shen, Zelin Peng, Xuehui Wang et al.

The rapid development of deep learning has made a great progress in image segmentation, one of the fundamental tasks of computer vision. However, the current segmentation algorithms mostly rely on the availability of pixel-level annotations, which are often expensive, tedious, and laborious. To alleviate this burden, the past years have witnessed an increasing attention in building label-efficient, deep-learning-based image segmentation algorithms. This paper offers a comprehensive review on label-efficient image segmentation methods. To this end, we first develop a taxonomy to organize these methods according to the supervision provided by different types of weak labels (including no supervision, inexact supervision, incomplete supervision and inaccurate supervision) and supplemented by the types of segmentation problems (including semantic segmentation, instance segmentation and panoptic segmentation). Next, we summarize the existing label-efficient image segmentation methods from a unified perspective that discusses an important question: how to bridge the gap between weak supervision and dense prediction -- the current methods are mostly based on heuristic priors, such as cross-pixel similarity, cross-label constraint, cross-view consistency, and cross-image relation. Finally, we share our opinions about the future research directions for label-efficient deep image segmentation.

CVApr 11, 2022
Confusing Image Quality Assessment: Towards Better Augmented Reality Experience

Huiyu Duan, Xiongkuo Min, Yucheng Zhu et al.

With the development of multimedia technology, Augmented Reality (AR) has become a promising next-generation mobile platform. The primary value of AR is to promote the fusion of digital contents and real-world environments, however, studies on how this fusion will influence the Quality of Experience (QoE) of these two components are lacking. To achieve better QoE of AR, whose two layers are influenced by each other, it is important to evaluate its perceptual quality first. In this paper, we consider AR technology as the superimposition of virtual scenes and real scenes, and introduce visual confusion as its basic theory. A more general problem is first proposed, which is evaluating the perceptual quality of superimposed images, i.e., confusing image quality assessment. A ConFusing Image Quality Assessment (CFIQA) database is established, which includes 600 reference images and 300 distorted images generated by mixing reference images in pairs. Then a subjective quality perception study and an objective model evaluation experiment are conducted towards attaining a better understanding of how humans perceive the confusing images. An objective metric termed CFIQA is also proposed to better evaluate the confusing image quality. Moreover, an extended ARIQA study is further conducted based on the CFIQA study. We establish an ARIQA database to better simulate the real AR application scenarios, which contains 20 AR reference images, 20 background (BG) reference images, and 560 distorted images generated from AR and BG references, as well as the correspondingly collected subjective quality ratings. We also design three types of full-reference (FR) IQA metrics to study whether we should consider the visual confusion when designing corresponding IQA algorithms. An ARIQA metric is finally proposed for better evaluating the perceptual quality of AR images.

CVMar 18, 2022
Facial Geometric Detail Recovery via Implicit Representation

Xingyu Ren, Alexandros Lattas, Baris Gecer et al.

Learning a dense 3D model with fine-scale details from a single facial image is highly challenging and ill-posed. To address this problem, many approaches fit smooth geometries through facial prior while learning details as additional displacement maps or personalized basis. However, these techniques typically require vast datasets of paired multi-view data or 3D scans, whereas such datasets are scarce and expensive. To alleviate heavy data dependency, we present a robust texture-guided geometric detail recovery approach using only a single in-the-wild facial image. More specifically, our method combines high-quality texture completion with the powerful expressiveness of implicit surfaces. Initially, we inpaint occluded facial parts, generate complete textures, and build an accurate multi-view dataset of the same subject. In order to estimate the detailed geometry, we define an implicit signed distance function and employ a physically-based implicit renderer to reconstruct fine geometric details from the generated multi-view images. Our method not only recovers accurate facial details but also decomposes normals, albedos, and shading parts in a self-supervised way. Finally, we register the implicit shape details to a 3D Morphable Model template, which can be used in traditional modeling and rendering pipelines. Extensive experiments demonstrate that the proposed approach can reconstruct impressive facial details from a single image, especially when compared with state-of-the-art methods trained on large datasets.

CVMar 15, 2022
ActFormer: A GAN-based Transformer towards General Action-Conditioned 3D Human Motion Generation

Liang Xu, Ziyang Song, Dongliang Wang et al.

We present a GAN-based Transformer for general action-conditioned 3D human motion generation, including not only single-person actions but also multi-person interactive actions. Our approach consists of a powerful Action-conditioned motion TransFormer (ActFormer) under a GAN training scheme, equipped with a Gaussian Process latent prior. Such a design combines the strong spatio-temporal representation capacity of Transformer, superiority in generative modeling of GAN, and inherent temporal correlations from the latent prior. Furthermore, ActFormer can be naturally extended to multi-person motions by alternately modeling temporal correlations and human interactions with Transformer encoders. To further facilitate research on multi-person motion generation, we introduce a new synthetic dataset of complex multi-person combat behaviors. Extensive experiments on NTU-13, NTU RGB+D 120, BABEL and the proposed combat dataset show that our method can adapt to various human motion representations and achieve superior performance over the state-of-the-art methods on both single-person and multi-person motion generation tasks, demonstrating a promising step towards a general human motion generator.

CVOct 3, 2022
Perceptual Attacks of No-Reference Image Quality Models with Human-in-the-Loop

Weixia Zhang, Dingquan Li, Xiongkuo Min et al.

No-reference image quality assessment (NR-IQA) aims to quantify how humans perceive visual distortions of digital images without access to their undistorted references. NR-IQA models are extensively studied in computational vision, and are widely used for performance evaluation and perceptual optimization of man-made vision systems. Here we make one of the first attempts to examine the perceptual robustness of NR-IQA models. Under a Lagrangian formulation, we identify insightful connections of the proposed perceptual attack to previous beautiful ideas in computer vision and machine learning. We test one knowledge-driven and three data-driven NR-IQA methods under four full-reference IQA models (as approximations to human perception of just-noticeable differences). Through carefully designed psychophysical experiments, we find that all four NR-IQA models are vulnerable to the proposed perceptual attack. More interestingly, we observe that the generated counterexamples are not transferable, manifesting themselves as distinct design flows of respective NR-IQA methods.

CVMar 15, 2022
DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation

Yichao Yan, Zanwei Zhou, Zi Wang et al.

Conversation is an essential component of virtual avatar activities in the metaverse. With the development of natural language processing, textual and vocal conversation generation has achieved a significant breakthrough. However, face-to-face conversations account for the vast majority of daily conversations, while most existing methods focused on single-person talking head generation. In this work, we take a step further and consider generating realistic face-to-face conversation videos. Conversation generation is more challenging than single-person talking head generation, since it not only requires generating photo-realistic individual talking heads but also demands the listener to respond to the speaker. In this paper, we propose a novel unified framework based on neural radiance field (NeRF) to address this task. Specifically, we model both the speaker and listener with a NeRF framework, with different conditions to control individual expressions. The speaker is driven by the audio signal, while the response of the listener depends on both visual and acoustic information. In this way, face-to-face conversation videos are generated between human avatars, with all the interlocutors modeled within the same network. Moreover, to facilitate future research on this task, we collect a new human conversation dataset containing 34 clips of videos. Quantitative and qualitative experiments evaluate our method in different aspects, e.g., image quality, pose sequence trend, and naturalness of the rendering videos. Experimental results demonstrate that the avatars in the resulting videos are able to perform a realistic conversation, and maintain individual styles. All the code, data, and models will be made publicly available.

CVNov 24, 2023Code
Image Super-Resolution with Text Prompt Diffusion

Zheng Chen, Yulun Zhang, Jinjin Gu et al.

Image super-resolution (SR) methods typically model degradation to improve reconstruction accuracy in complex and unknown degradation scenarios. However, extracting degradation information from low-resolution images is challenging, which limits the model performance. To boost image SR performance, one feasible approach is to introduce additional priors. Inspired by advancements in multi-modal methods and text prompt image processing, we introduce text prompts to image SR to provide degradation priors. Specifically, we first design a text-image generation pipeline to integrate text into the SR dataset through the text degradation representation and degradation model. By adopting a discrete design, the text representation is flexible and user-friendly. Meanwhile, we propose the PromptSR to realize the text prompt SR. The PromptSR leverages the latest multi-modal large language model (MLLM) to generate prompts from low-resolution images. It also utilizes the pre-trained language model (e.g., T5 or CLIP) to enhance text comprehension. We train the PromptSR on the text-image dataset. Extensive experiments indicate that introducing text prompts into SR, yields impressive results on both synthetic and real-world images. Code: https://github.com/zhengchen1999/PromptSR.

CVNov 1, 2022Code
Self-supervised Character-to-Character Distillation for Text Recognition

Tongkun Guan, Wei Shen, Xue Yang et al.

When handling complicated text images (e.g., irregular structures, low resolution, heavy occlusion, and uneven illumination), existing supervised text recognition methods are data-hungry. Although these methods employ large-scale synthetic text images to reduce the dependence on annotated real images, the domain gap still limits the recognition performance. Therefore, exploring the robust text feature representations on unlabeled real images by self-supervised learning is a good solution. However, existing self-supervised text recognition methods conduct sequence-to-sequence representation learning by roughly splitting the visual features along the horizontal axis, which limits the flexibility of the augmentations, as large geometric-based augmentations may lead to sequence-to-sequence feature inconsistency. Motivated by this, we propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate general text representation learning. Specifically, we delineate the character structures of unlabeled real images by designing a self-supervised character segmentation module. Following this, CCD easily enriches the diversity of local characters while keeping their pairwise alignment under flexible augmentations, using the transformation matrix between two augmented views from images. Experiments demonstrate that CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution. Code is available at https://github.com/TongkunGuan/CCD.

CVMar 21, 2022
EAutoDet: Efficient Architecture Search for Object Detection

Xiaoxing Wang, Jiale Lin, Junchi Yan et al.

Training CNN for detection is time-consuming due to the large dataset and complex network modules, making it hard to search architectures on detection datasets directly, which usually requires vast search costs (usually tens and even hundreds of GPU-days). In contrast, this paper introduces an efficient framework, named EAutoDet, that can discover practical backbone and FPN architectures for object detection in 1.4 GPU-days. Specifically, we construct a supernet for both backbone and FPN modules and adopt the differentiable method. To reduce the GPU memory requirement and computational cost, we propose a kernel reusing technique by sharing the weights of candidate operations on one edge and consolidating them into one convolution. A dynamic channel refinement strategy is also introduced to search channel numbers. Extensive experiments show significant efficacy and efficiency of our method. In particular, the discovered architectures surpass state-of-the-art object detection NAS methods and achieve 40.1 mAP with 120 FPS and 49.2 mAP with 41.3 FPS on COCO test-dev set. We also transfer the discovered architectures to rotation detection task, which achieve 77.05 mAP$_{\text{50}}$ on DOTA-v1.0 test set with 21.1M parameters.

CVSep 30, 2023Code
Domain-Controlled Prompt Learning

Qinglong Cao, Zhengqin Xu, Yuntian Chen et al.

Large pre-trained vision-language models, such as CLIP, have shown remarkable generalization capabilities across various tasks when appropriate text prompts are provided. However, adapting these models to specific domains, like remote sensing images (RSIs), medical images, etc, remains unexplored and challenging. Existing prompt learning methods often lack domain-awareness or domain-transfer mechanisms, leading to suboptimal performance due to the misinterpretation of specific images in natural image patterns. To tackle this dilemma, we proposed a \textbf{Domain-Controlled Prompt Learning} for the specific domains. Specifically, the large-scale specific domain foundation model (LSDM) is first introduced to provide essential specific domain knowledge. Using lightweight neural networks, we transfer this knowledge into domain biases, which control both the visual and language branches to obtain domain-adaptive prompts in a directly incorporating manner. Simultaneously, to overcome the existing overfitting challenge, we propose a novel noisy-adding strategy, without extra trainable parameters, to help the model escape the suboptimal solution in a global domain oscillation manner. Experimental results show our method achieves state-of-the-art performance in specific domain image recognition datasets. Our code is available at https://github.com/caoql98/DCPL.

CVMar 29, 2022
Exploring Frequency Adversarial Attacks for Face Forgery Detection

Shuai Jia, Chao Ma, Taiping Yao et al.

Various facial manipulation techniques have drawn serious public concerns in morality, security, and privacy. Although existing face forgery classifiers achieve promising performance on detecting fake images, these methods are vulnerable to adversarial examples with injected imperceptible perturbations on the pixels. Meanwhile, many face forgery detectors always utilize the frequency diversity between real and fake faces as a crucial clue. In this paper, instead of injecting adversarial perturbations into the spatial domain, we propose a frequency adversarial attack method against face forgery detectors. Concretely, we apply discrete cosine transform (DCT) on the input images and introduce a fusion module to capture the salient region of adversary in the frequency domain. Compared with existing adversarial attacks (e.g. FGSM, PGD) in the spatial domain, our method is more imperceptible to human observers and does not degrade the visual quality of the original images. Moreover, inspired by the idea of meta-learning, we also propose a hybrid adversarial attack that performs attacks in both the spatial and frequency domains. Extensive experiments indicate that the proposed method fools not only the spatial-based detectors but also the state-of-the-art frequency-based detectors effectively. In addition, the proposed frequency attack enhances the transferability across face forgery detectors as black-box attacks.

LGMar 22, 2023
EasyDGL: Encode, Train and Interpret for Continuous-time Dynamic Graph Learning

Chao Chen, Haoyu Geng, Nianzu Yang et al.

Dynamic graphs arise in various real-world applications, and it is often welcomed to model the dynamics directly in continuous time domain for its flexibility. This paper aims to design an easy-to-use pipeline (termed as EasyDGL which is also due to its implementation by DGL toolkit) composed of three key modules with both strong fitting ability and interpretability. Specifically the proposed pipeline which involves encoding, training and interpreting: i) a temporal point process (TPP) modulated attention architecture to endow the continuous-time resolution with the coupled spatiotemporal dynamics of the observed graph with edge-addition events; ii) a principled loss composed of task-agnostic TPP posterior maximization based on observed events on the graph, and a task-aware loss with a masking strategy over dynamic graph, where the covered tasks include dynamic link prediction, dynamic node classification and node traffic forecasting; iii) interpretation of the model outputs (e.g., representations and predictions) with scalable perturbation-based quantitative analysis in the graph Fourier domain, which could more comprehensively reflect the behavior of the learned model. Extensive experimental results on public benchmarks show the superior performance of our EasyDGL for time-conditioned predictive tasks, and in particular demonstrate that EasyDGL can effectively quantify the predictive power of frequency content that a model learn from the evolving graph data.

CVMar 28, 2023
Head3D: Complete 3D Head Generation via Tri-plane Feature Distillation

Yuhao Cheng, Yichao Yan, Wenhan Zhu et al.

Head generation with diverse identities is an important task in computer vision and computer graphics, widely used in multimedia applications. However, current full head generation methods require a large number of 3D scans or multi-view images to train the model, resulting in expensive data acquisition cost. To address this issue, we propose Head3D, a method to generate full 3D heads with limited multi-view images. Specifically, our approach first extracts facial priors represented by tri-planes learned in EG3D, a 3D-aware generative model, and then proposes feature distillation to deliver the 3D frontal faces into complete heads without compromising head integrity. To mitigate the domain gap between the face and head models, we present dual-discriminators to guide the frontal and back head generation, respectively. Our model achieves cost-efficient and diverse complete head generation with photo-realistic renderings and high-quality geometry representations. Extensive experiments demonstrate the effectiveness of our proposed Head3D, both qualitatively and quantitatively.

CVJul 10, 2024Code
PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer

Tongkun Guan, Chengyu Lin, Wei Shen et al.

Handwritten Mathematical Expression Recognition (HMER) has wide applications in human-machine interaction scenarios, such as digitized education and automated offices. Recently, sequence-based models with encoder-decoder architectures have been commonly adopted to address this task by directly predicting LaTeX sequences of expression images. However, these methods only implicitly learn the syntax rules provided by LaTeX, which may fail to describe the position and hierarchical relationship between symbols due to complex structural relations and diverse handwriting styles. To overcome this challenge, we propose a position forest transformer (PosFormer) for HMER, which jointly optimizes two tasks: expression recognition and position recognition, to explicitly enable position-aware symbol feature representation learning. Specifically, we first design a position forest that models the mathematical expression as a forest structure and parses the relative position relationships between symbols. Without requiring extra annotations, each symbol is assigned a position identifier in the forest to denote its relative spatial position. Second, we propose an implicit attention correction module to accurately capture attention for HMER in the sequence-based decoder architecture. Extensive experiments validate the superiority of PosFormer, which consistently outperforms the state-of-the-art methods 2.03%/1.22%/2.00%, 1.83%, and 4.62% gains on the single-line CROHME 2014/2016/2019, multi-line M2E, and complex MNE datasets, respectively, with no additional latency or computational cost. Code is available at https://github.com/SJTU-DeepVisionLab/PosFormer.

CVSep 12, 2024Code
Open-Vocabulary Remote Sensing Image Semantic Segmentation

Qinglong Cao, Yuntian Chen, Chao Ma et al.

Open-vocabulary image semantic segmentation (OVS) seeks to segment images into semantic regions across an open set of categories. Existing OVS methods commonly depend on foundational vision-language models and utilize similarity computation to tackle OVS tasks. However, these approaches are predominantly tailored to natural images and struggle with the unique characteristics of remote sensing images, such as rapidly changing orientations and significant scale variations. These challenges complicate OVS tasks in earth vision, requiring specialized approaches. To tackle this dilemma, we propose the first OVS framework specifically designed for remote sensing imagery, drawing inspiration from the distinct remote sensing traits. Particularly, to address the varying orientations, we introduce a rotation-aggregative similarity computation module that generates orientation-adaptive similarity maps as initial semantic maps. These maps are subsequently refined at both spatial and categorical levels to produce more accurate semantic maps. Additionally, to manage significant scale changes, we integrate multi-scale image features into the upsampling process, resulting in the final scale-aware semantic masks. To advance OVS in earth vision and encourage reproducible research, we establish the first open-sourced OVS benchmark for remote sensing imagery, including four public remote sensing datasets. Extensive experiments on this benchmark demonstrate our proposed method achieves state-of-the-art performance. All codes and datasets are available at https://github.com/caoql98/OVRS.

LGSep 6, 2024Code
Learning to Solve Combinatorial Optimization under Positive Linear Constraints via Non-Autoregressive Neural Networks

Runzhong Wang, Yang Li, Junchi Yan et al.

Combinatorial optimization (CO) is the fundamental problem at the intersection of computer science, applied mathematics, etc. The inherent hardness in CO problems brings up challenge for solving CO exactly, making deep-neural-network-based solvers a research frontier. In this paper, we design a family of non-autoregressive neural networks to solve CO problems under positive linear constraints with the following merits. First, the positive linear constraint covers a wide range of CO problems, indicating that our approach breaks the generality bottleneck of existing non-autoregressive networks. Second, compared to existing autoregressive neural network solvers, our non-autoregressive networks have the advantages of higher efficiency and preserving permutation invariance. Third, our offline unsupervised learning has lower demand on high-quality labels, getting rid of the demand of optimal labels in supervised learning. Fourth, our online differentiable search method significantly improves the generalizability of our neural network solver to unseen problems. We validate the effectiveness of this framework in solving representative CO problems including facility location, max-set covering, and traveling salesman problem. Our non-autoregressive neural solvers are competitive to and can be even superior to state-of-the-art solvers such as SCIP and Gurobi, especially when both efficiency and efficacy are considered. Code is available at https://github.com/Thinklab-SJTU/NAR-CO-Solver

CVApr 19, 2023
HyperStyle3D: Text-Guided 3D Portrait Stylization via Hypernetworks

Zhuo Chen, Xudong Xu, Yichao Yan et al.

Portrait stylization is a long-standing task enabling extensive applications. Although 2D-based methods have made great progress in recent years, real-world applications such as metaverse and games often demand 3D content. On the other hand, the requirement of 3D data, which is costly to acquire, significantly impedes the development of 3D portrait stylization methods. In this paper, inspired by the success of 3D-aware GANs that bridge 2D and 3D domains with 3D fields as the intermediate representation for rendering 2D images, we propose a novel method, dubbed HyperStyle3D, based on 3D-aware GANs for 3D portrait stylization. At the core of our method is a hyper-network learned to manipulate the parameters of the generator in a single forward pass. It not only offers a strong capacity to handle multiple styles with a single model, but also enables flexible fine-grained stylization that affects only texture, shape, or local part of the portrait. While the use of 3D-aware GANs bypasses the requirement of 3D data, we further alleviate the necessity of style images with the CLIP model being the stylization guidance. We conduct an extensive set of experiments across the style, attribute, and shape, and meanwhile, measure the 3D consistency. These experiments demonstrate the superior capability of our HyperStyle3D model in rendering 3D-consistent images in diverse styles, deforming the face shape, and editing various attributes.

LGFeb 8, 2023
Graph Signal Sampling for Inductive One-Bit Matrix Completion: a Closed-form Solution

Chao Chen, Haoyu Geng, Gang Zeng et al.

Inductive one-bit matrix completion is motivated by modern applications such as recommender systems, where new users would appear at test stage with the ratings consisting of only ones and no zeros. We propose a unified graph signal sampling framework which enjoys the benefits of graph signal analysis and processing. The key idea is to transform each user's ratings on the items to a function (signal) on the vertices of an item-item graph, then learn structural graph properties to recover the function from its values on certain vertices -- the problem of graph signal sampling. We propose a class of regularization functionals that takes into account discrete random label noise in the graph vertex domain, then develop the GS-IMC approach which biases the reconstruction towards functions that vary little between adjacent vertices for noise reduction. Theoretical result shows that accurate reconstructions can be achieved under mild conditions. For the online setting, we develop a Bayesian extension, i.e., BGS-IMC which considers continuous random Gaussian noise in the graph Fourier domain and builds upon a prediction-correction update algorithm to obtain the unbiased and minimum-variance reconstruction. Both GS-IMC and BGS-IMC have closed-form solutions and thus are highly scalable in large data. Experiments show that our methods achieve state-of-the-art performance on public benchmarks.

79.5CVMay 20Code
VDFP: Video Deflickering with Flicker-banding Priors

Zhiyi Zhou, Libo Zhu, Zihan Zhou et al.

Capturing digital screens with smartphones frequently induces severe banding due to hardware synchronization mismatches. Existing video restoration methods struggle with these structured, periodic luminance fluctuations, often resulting in residual artifacts or over-smoothed textures. We firstly construct DeViD, a real-world dataset in various scenes to deal with the lack of available datasets.Then we propose VDFP (Video Deflickering with Flicker-banding Priors), a novel perception-guided generation framework. First, we introduce a Degradation Field Modeling Based on Rolling Shutter Mechanism (DFM) capable of synthesizing complex multi-banding scenarios. Second, we present a spatial-temporal continuous prior perception (CPP). Unlike traditional binary segmentation, this module is optimized via a Flicker-Aware Mean Squared Error (FA-MSE) to capture the luminance transitions. By zero-initializing an augmented input layer, our model preserves pre-trained generative priors as well as spatial-temporal prior perception. Extensive experiments demonstrate that VDFP significantly outperforms other methods, eliminating complex banding with high-fidelity spatial details and temporal consistency. Our dataset and code will be released at~ https://github.com/ZhiyiZZhou/VDFP.

CVJul 17, 2024
HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects

Xintao Lv, Liang Xu, Yichao Yan et al.

Generating human-object interactions (HOIs) is critical with the tremendous advances of digital avatars. Existing datasets are typically limited to humans interacting with a single object while neglecting the ubiquitous manipulation of multiple objects. Thus, we propose HIMO, a large-scale MoCap dataset of full-body human interacting with multiple objects, containing 3.3K 4D HOI sequences and 4.08M 3D HOI frames. We also annotate HIMO with detailed textual descriptions and temporal segments, benchmarking two novel tasks of HOI synthesis conditioned on either the whole text prompt or the segmented text prompts as fine-grained timeline control. To address these novel tasks, we propose a dual-branch conditional diffusion model with a mutual interaction module for HOI synthesis. Besides, an auto-regressive generation pipeline is also designed to obtain smooth transitions between HOI segments. Experimental results demonstrate the generalization ability to unseen object geometries and temporal compositions.

IVApr 5, 2023
DRAC: Diabetic Retinopathy Analysis Challenge with Ultra-Wide Optical Coherence Tomography Angiography Images

Bo Qian, Hao Chen, Xiangning Wang et al.

Computer-assisted automatic analysis of diabetic retinopathy (DR) is of great importance in reducing the risks of vision loss and even blindness. Ultra-wide optical coherence tomography angiography (UW-OCTA) is a non-invasive and safe imaging modality in DR diagnosis system, but there is a lack of publicly available benchmarks for model development and evaluation. To promote further research and scientific benchmarking for diabetic retinopathy analysis using UW-OCTA images, we organized a challenge named "DRAC - Diabetic Retinopathy Analysis Challenge" in conjunction with the 25th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2022). The challenge consists of three tasks: segmentation of DR lesions, image quality assessment and DR grading. The scientific community responded positively to the challenge, with 11, 12, and 13 teams from geographically diverse institutes submitting different solutions in these three tasks, respectively. This paper presents a summary and analysis of the top-performing solutions and results for each task of the challenge. The obtained results from top algorithms indicate the importance of data augmentation, model architecture and ensemble of networks in improving the performance of deep learning models. These findings have the potential to enable new developments in diabetic retinopathy analysis. The challenge remains open for post-challenge registrations and submissions for benchmarking future methodology developments.

CVApr 12, 2022
Continual Predictive Learning from Videos

Geng Chen, Wendong Zhang, Han Lu et al.

Predictive learning ideally builds the world model of physical processes in one or more given environments. Typical setups assume that we can collect data from all environments at all times. In practice, however, different prediction tasks may arrive sequentially so that the environments may change persistently throughout the training procedure. Can we develop predictive learning algorithms that can deal with more realistic, non-stationary physical environments? In this paper, we study a new continual learning problem in the context of video prediction, and observe that most existing methods suffer from severe catastrophic forgetting in this setup. To tackle this problem, we propose the continual predictive learning (CPL) approach, which learns a mixture world model via predictive experience replay and performs test-time adaptation with non-parametric task inference. We construct two new benchmarks based on RoboNet and KTH, in which different tasks correspond to different physical robotic environments or human actions. Our approach is shown to effectively mitigate forgetting and remarkably outperform the naïve combinations of previous art in video prediction and continual learning.

CVNov 30, 2025Code
MM-ACT: Learn from Multimodal Parallel Generation to Act

Haotian Liang, Xinyi Chen, Bin Wang et al.

A generalist robotic policy needs both semantic understanding for task planning and the ability to interact with the environment through predictive capabilities. To tackle this, we present MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in shared token space and performs generation across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and employs a one-step parallel decoding strategy for action generation to improve efficiency. We introduce Context-Shared Multimodal Learning, a unified training paradigm that supervises generation in all three modalities from a shared context, enhancing action generation through cross-modal learning. Experiments were conducted on the LIBERO simulation and Franka real-robot setups as well as RoboTwin2.0 to assess in-domain and out-of-domain performances respectively. Our approach achieves a success rate of 96.3% on LIBERO, 72.0% across three tasks of real Franka, and 52.38% across eight bimanual tasks of RoboTwin2.0 with an additional gain of 9.25% from cross-modal learning. We release our codes, models and data at https://github.com/HHYHRHY/MM-ACT.

CVAug 28, 2023
SAM-PARSER: Fine-tuning SAM Efficiently by Parameter Space Reconstruction

Zelin Peng, Zhengqin Xu, Zhilin Zeng et al.

Segment Anything Model (SAM) has received remarkable attention as it offers a powerful and versatile solution for object segmentation in images. However, fine-tuning SAM for downstream segmentation tasks under different scenarios remains a challenge, as the varied characteristics of different scenarios naturally requires diverse model parameter spaces. Most existing fine-tuning methods attempt to bridge the gaps among different scenarios by introducing a set of new parameters to modify SAM's original parameter space. Unlike these works, in this paper, we propose fine-tuning SAM efficiently by parameter space reconstruction (SAM-PARSER), which introduce nearly zero trainable parameters during fine-tuning. In SAM-PARSER, we assume that SAM's original parameter space is relatively complete, so that its bases are able to reconstruct the parameter space of a new scenario. We obtain the bases by matrix decomposition, and fine-tuning the coefficients to reconstruct the parameter space tailored to the new scenario by an optimal linear combination of the bases. Experimental results show that SAM-PARSER exhibits superior segmentation performance across various scenarios, while reducing the number of trainable parameters by $\approx 290$ times compared with current parameter-efficient fine-tuning methods.

CVApr 13, 2023
NeRFVS: Neural Radiance Fields for Free View Synthesis via Geometry Scaffolds

Chen Yang, Peihao Li, Zanwei Zhou et al.

We present NeRFVS, a novel neural radiance fields (NeRF) based method to enable free navigation in a room. NeRF achieves impressive performance in rendering images for novel views similar to the input views while suffering for novel views that are significantly different from the training views. To address this issue, we utilize the holistic priors, including pseudo depth maps and view coverage information, from neural reconstruction to guide the learning of implicit neural representations of 3D indoor scenes. Concretely, an off-the-shelf neural reconstruction method is leveraged to generate a geometry scaffold. Then, two loss functions based on the holistic priors are proposed to improve the learning of NeRF: 1) A robust depth loss that can tolerate the error of the pseudo depth map to guide the geometry learning of NeRF; 2) A variance loss to regularize the variance of implicit neural representations to reduce the geometry and color ambiguity in the learning procedure. These two loss functions are modulated during NeRF optimization according to the view coverage information to reduce the negative influence brought by the view coverage imbalance. Extensive results demonstrate that our NeRFVS outperforms state-of-the-art view synthesis methods quantitatively and qualitatively on indoor scenes, achieving high-fidelity free navigation results.

CVFeb 4Code
Light Up Your Face: A Physically Consistent Dataset and Diffusion Model for Face Fill-Light Enhancement

Jue Gong, Zihan Zhou, Jingkai Wang et al.

Face fill-light enhancement (FFE) brightens underexposed faces by adding virtual fill light while keeping the original scene illumination and background unchanged. Most face relighting methods aim to reshape overall lighting, which can suppress the input illumination or modify the entire scene, leading to foreground-background inconsistency and mismatching practical FFE needs. To support scalable learning, we introduce LightYourFace-160K (LYF-160K), a large-scale paired dataset built with a physically consistent renderer that injects a disk-shaped area fill light controlled by six disentangled factors, producing 160K before-and-after pairs. We first pretrain a physics-aware lighting prompt (PALP) that embeds the 6D parameters into conditioning tokens, using an auxiliary planar-light reconstruction objective. Building on a pretrained diffusion backbone, we then train a fill-light diffusion (FiLitDiff), an efficient one-step model conditioned on physically grounded lighting codes, enabling controllable and high-fidelity fill lighting at low computational cost. Experiments on held-out paired sets demonstrate strong perceptual quality and competitive full-reference metrics, while better preserving background illumination. The dataset and model will be at https://github.com/gobunu/Light-Up-Your-Face.

LGDec 6, 2022
A K-variate Time Series Is Worth K Words: Evolution of the Vanilla Transformer Architecture for Long-term Multivariate Time Series Forecasting

Zanwei Zhou, Ruizhe Zhong, Chen Yang et al.

Multivariate time series forecasting (MTSF) is a fundamental problem in numerous real-world applications. Recently, Transformer has become the de facto solution for MTSF, especially for the long-term cases. However, except for the one forward operation, the basic configurations in existing MTSF Transformer architectures were barely carefully verified. In this study, we point out that the current tokenization strategy in MTSF Transformer architectures ignores the token uniformity inductive bias of Transformers. Therefore, the vanilla MTSF transformer struggles to capture details in time series and presents inferior performance. Based on this observation, we make a series of evolution on the basic architecture of the vanilla MTSF transformer. We vary the flawed tokenization strategy, along with the decoder structure and embeddings. Surprisingly, the evolved simple transformer architecture is highly effective, which successfully avoids the over-smoothing phenomena in the vanilla MTSF transformer, achieves a more detailed and accurate prediction, and even substantially outperforms the state-of-the-art Transformers that are well-designed for MTSF.

CVJun 1, 2023Code
Reflection Invariance Learning for Few-shot Semantic Segmentation

Qinglong Cao, Yuntian Chen, Chao Ma et al.

Few-shot semantic segmentation (FSS) aims to segment objects of unseen classes in query images with only a few annotated support images. Existing FSS algorithms typically focus on mining category representations from the single-view support to match semantic objects of the single-view query. However, the limited annotated samples render the single-view matching struggle to perceive the reflection invariance of novel objects, which results in a restricted learning space for novel categories and further induces a biased segmentation with demoted parsing performance. To address this challenge, this paper proposes a fresh few-shot segmentation framework to mine the reflection invariance in a multi-view matching manner. Specifically, original and reflection support features from different perspectives with the same semantics are learnable fused to obtain the reflection invariance prototype with a stronger category representation ability. Simultaneously, aiming at providing better prior guidance, the Reflection Invariance Prior Mask Generation (RIPMG) module is proposed to integrate prior knowledge from different perspectives. Finally, segmentation predictions from varying views are complementarily merged in the Reflection Invariance Semantic Prediction (RISP) module to yield precise segmentation predictions. Extensive experiments on both PASCAL-$5^\textit{i}$ and COCO-$20^\textit{i}$ datasets demonstrate the effectiveness of our approach and show that our method could achieve state-of-the-art performance. Code is available at \url{https://anonymous.4open.science/r/RILFS-A4D1}

99.5CVMar 11Code
CodePercept: Code-Grounded Visual STEM Perception for MLLMs

Tongkun Guan, Zhibo Yang, Jianqiang Wan et al.

When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true lever limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium--executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work relying on problem-solving accuracy as a proxy that only measures problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing deterministic and verifiable assessment. Code is available at https://github.com/TongkunGuan/Qwen-CodePercept.

CVFeb 13Code
Thinking Like a Radiologist: A Dataset for Anatomy-Guided Interleaved Vision Language Reasoning in Chest X-ray Interpretation

Yichen Zhao, Zelin Peng, Piao Yang et al.

Radiological diagnosis is a perceptual process in which careful visual inspection and language reasoning are repeatedly interleaved. Most medical large vision language models (LVLMs) perform visual inspection only once and then rely on text-only chain-of-thought (CoT) reasoning, which operates purely in the linguistic space and is prone to hallucination. Recent methods attempt to mitigate this issue by introducing visually related coordinates, such as bounding boxes. However, these remain a pseudo-visual solution: coordinates are still text and fail to preserve rich visual details like texture and density. Motivated by the interleaved nature of radiological diagnosis, we introduce MMRad-IVL-22K, the first large-scale dataset designed for natively interleaved visual language reasoning in chest X-ray interpretation. MMRad-IVL-22K reflects a repeated cycle of reasoning and visual inspection workflow of radiologists, in which visual rationales complement textual descriptions and ground each step of the reasoning process. MMRad-IVL-22K comprises 21,994 diagnostic traces, enabling systematic scanning across 35 anatomical regions. Experimental results on advanced closed-source LVLMs demonstrate that report generation guided by multimodal CoT significantly outperforms that guided by text-only CoT in clinical accuracy and report quality (e.g., 6\% increase in the RadGraph metric), confirming that high-fidelity interleaved vision language evidence is a non-substitutable component of reliable medical AI. Furthermore, benchmarking across seven state-of-the-art open-source LVLMs demonstrates that models fine-tuned on MMRad-IVL-22K achieve superior reasoning consistency and report quality compared with both general-purpose and medical-specific LVLMs. The project page is available at https://github.com/qiuzyc/thinking_like_a_radiologist.

CVJan 15
Fine-Grained Human Pose Editing Assessment via Layer-Selective MLLMs

Ningyu Sun, Zhaolin Cai, Zitong Xu et al.

Text-guided human pose editing has gained significant traction in AIGC applications. However,it remains plagued by structural anomalies and generative artifacts. Existing evaluation metrics often isolate authenticity detection from quality assessment, failing to provide fine-grained insights into pose-specific inconsistencies. To address these limitations, we introduce HPE-Bench, a specialized benchmark comprising 1,700 standardized samples from 17 state-of-the-art editing models, offering both authenticity labels and multi-dimensional quality scores. Furthermore, we propose a unified framework based on layer-selective multimodal large language models (MLLMs). By employing contrastive LoRA tuning and a novel layer sensitivity analysis (LSA) mechanism, we identify the optimal feature layer for pose evaluation. Our framework achieves superior performance in both authenticity detection and multi-dimensional quality regression, effectively bridging the gap between forensic detection and quality assessment.

LGMay 27, 2022
Iso-Dream: Isolating and Leveraging Noncontrollable Visual Dynamics in World Models

Minting Pan, Xiangming Zhu, Yunbo Wang et al.

World models learn the consequences of actions in vision-based interactive systems. However, in practical scenarios such as autonomous driving, there commonly exists noncontrollable dynamics independent of the action signals, making it difficult to learn effective world models. To tackle this problem, we present a novel reinforcement learning approach named Iso-Dream, which improves the Dream-to-Control framework in two aspects. First, by optimizing the inverse dynamics, we encourage the world model to learn controllable and noncontrollable sources of spatiotemporal changes on isolated state transition branches. Second, we optimize the behavior of the agent on the decoupled latent imaginations of the world model. Specifically, to estimate state values, we roll-out the noncontrollable states into the future and associate them with the current controllable state. In this way, the isolation of dynamics sources can greatly benefit long-horizon decision-making of the agent, such as a self-driving car that can avoid potential risks by anticipating the movement of other vehicles. Experiments show that Iso-Dream is effective in decoupling the mixed dynamics and remarkably outperforms existing approaches in a wide range of visual control and prediction domains.

CVSep 26, 2023
Directional Texture Editing for 3D Models

Shengqi Liu, Zhuo Chen, Jingnan Gao et al.

Texture editing is a crucial task in 3D modeling that allows users to automatically manipulate the surface materials of 3D models. However, the inherent complexity of 3D models and the ambiguous text description lead to the challenge in this task. To address this challenge, we propose ITEM3D, a \textbf{T}exture \textbf{E}diting \textbf{M}odel designed for automatic \textbf{3D} object editing according to the text \textbf{I}nstructions. Leveraging the diffusion models and the differentiable rendering, ITEM3D takes the rendered images as the bridge of text and 3D representation, and further optimizes the disentangled texture and environment map. Previous methods adopted the absolute editing direction namely score distillation sampling (SDS) as the optimization objective, which unfortunately results in the noisy appearance and text inconsistency. To solve the problem caused by the ambiguous text, we introduce a relative editing direction, an optimization objective defined by the noise difference between the source and target texts, to release the semantic ambiguity between the texts and images. Additionally, we gradually adjust the direction during optimization to further address the unexpected deviation in the texture domain. Qualitative and quantitative experiments show that our ITEM3D outperforms the state-of-the-art methods on various 3D objects. We also perform text-guided relighting to show explicit control over lighting. Our project page: https://shengqiliu1.github.io/ITEM3D.

81.6LGMay 26
Unicorn: Scaling High-Dimensional Time Series Forecasting via Universal Correlation Modeling

Haochen Yuan, Yichen Song, Yunbo Wang et al.

Modern time series architectures face a fundamental trade-off: channel-independent models scale well with increasing data volume but ignore critical inter-channel dependencies, while channel-dependent models are expressive but remain ``dimension-bounded'', struggling to generalize across heterogeneous datasets.To bridge this gap, we introduce Unicorn (Universal Correlation Network), a framework for scalable, multi-dataset pretraining on high-dimensional time series. At the core of Unicorn is a latent prototype codebook that decouples correlation modeling from specific channel identities. By projecting heterogeneous channels into a shared latent space, UniCorN learns identity-agnostic, reusable interaction patterns that transfer across domains with diverse dimensionalities and semantics. Extensive experiments show that Unicorn significantly outperforms state-of-the-art forecasting architectures, particularly in few-shot transfer scenarios, offering a scalable path toward multivariate time series foundation models.

CVMar 7, 2022
Self-supervised Implicit Glyph Attention for Text Recognition

Tongkun Guan, Chaochen Gu, Jingzheng Tu et al.

The attention mechanism has become the \emph{de facto} module in scene text recognition (STR) methods, due to its capability of extracting character-level representations. These methods can be summarized into implicit attention based and supervised attention based, depended on how the attention is computed, i.e., implicit attention and supervised attention are learned from sequence-level text annotations and or character-level bounding box annotations, respectively. Implicit attention, as it may extract coarse or even incorrect spatial regions as character attention, is prone to suffering from an alignment-drifted issue. Supervised attention can alleviate the above issue, but it is character category-specific, which requires extra laborious character-level bounding box annotations and would be memory-intensive when handling languages with larger character categories. To address the aforementioned issues, we propose a novel attention mechanism for STR, self-supervised implicit glyph attention (SIGA). SIGA delineates the glyph structures of text images by jointly self-supervised text segmentation and implicit attention alignment, which serve as the supervision to improve attention correctness without extra character-level annotations. Experimental results demonstrate that SIGA performs consistently and significantly better than previous attention-based STR methods, in terms of both attention correctness and final recognition performance on publicly available context benchmarks and our contributed contextless benchmarks.

88.9CVMay 13Code
PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution

Zihang Xu, Xiaoyang Liu, Zheng Chen et al.

Text image super-resolution (Text-SR) requires more than visually plausible detail synthesis: slight errors in stroke topology may alter character identity and break readability. Existing methods improve text fidelity with stronger recognition-based or generative priors, yet they still face two unresolved challenges under severe degradation: the text condition extracted from low-quality inputs can itself be unreliable, and a plausible global prior does not fully determine fine-grained stroke boundaries. We present PRISM, a single-step diffusion-based Text-SR framework that addresses these two challenges through Flow-Matching Prior Rectification (FMPR) and a Structure-guided Uncertainty-aware Residual Encoder (SURE). FMPR constructs a privileged training-time prior from paired low-quality/high-quality latents and learns a flow matching that transports degraded embeddings toward this restoration-oriented prior space, yielding more accurate and reliable global text guidance. SURE further predicts uncertainty-aware structural residuals to selectively absorb reliable local boundary evidence while suppressing ambiguous stroke cues. Together, these components enable explicit global prior rectification and local structure refinement within a single diffusion restoration pass. Experiments on both synthetic and real-world benchmarks show that PRISM achieves state-of-the-art performance with millisecond-level inference. Our dataset and code will be available at https://github.com/faithxuz/PRISM.

LGMar 3, 2022
NeuroFluid: Fluid Dynamics Grounding with Particle-Driven Neural Radiance Fields

Shanyan Guan, Huayu Deng, Yunbo Wang et al.

Deep learning has shown great potential for modeling the physical dynamics of complex particle systems such as fluids. Existing approaches, however, require the supervision of consecutive particle properties, including positions and velocities. In this paper, we consider a partially observable scenario known as fluid dynamics grounding, that is, inferring the state transitions and interactions within the fluid particle systems from sequential visual observations of the fluid surface. We propose a differentiable two-stage network named NeuroFluid. Our approach consists of (i) a particle-driven neural renderer, which involves fluid physical properties into the volume rendering function, and (ii) a particle transition model optimized to reduce the differences between the rendered and the observed images. NeuroFluid provides the first solution to unsupervised learning of particle-based fluid dynamics by training these two models jointly. It is shown to reasonably estimate the underlying physics of fluids with different initial shapes, viscosity, and densities.

CVOct 16, 2023
Generalizable Person Search on Open-world User-Generated Video Content

Junjie Li, Guanshuo Wang, Yichao Yan et al.

Person search is a challenging task that involves detecting and retrieving individuals from a large set of un-cropped scene images. Existing person search applications are mostly trained and deployed in the same-origin scenarios. However, collecting and annotating training samples for each scene is often difficult due to the limitation of resources and the labor cost. Moreover, large-scale intra-domain data for training are generally not legally available for common developers, due to the regulation of privacy and public security. Leveraging easily accessible large-scale User Generated Video Contents (\emph{i.e.} UGC videos) to train person search models can fit the open-world distribution, but still suffering a performance gap from the domain difference to surveillance scenes. In this work, we explore enhancing the out-of-domain generalization capabilities of person search models, and propose a generalizable framework on both feature-level and data-level generalization to facilitate downstream tasks in arbitrary scenarios. Specifically, we focus on learning domain-invariant representations for both detection and ReID by introducing a multi-task prototype-based domain-specific batch normalization, and a channel-wise ID-relevant feature decorrelation strategy. We also identify and address typical sources of noise in open-world training frames, including inaccurate bounding boxes, the omission of identity labels, and the absence of cross-camera data. Our framework achieves promising performance on two challenging person search benchmarks without using any human annotation or samples from the target domain.

CVMar 12, 2023
Improving Masked Autoencoders by Learning Where to Mask

Haijian Chen, Wendong Zhang, Yunbo Wang et al.

Masked image modeling is a promising self-supervised learning method for visual data. It is typically built upon image patches with random masks, which largely ignores the variation of information density between them. The question is: Is there a better masking strategy than random sampling and how can we learn it? We empirically study this problem and initially find that introducing object-centric priors in mask sampling can significantly improve the learned representations. Inspired by this observation, we present AutoMAE, a fully differentiable framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process. In this way, our approach can adaptively find patches with higher information density for different images, and further strike a balance between the information gain obtained from image reconstruction and its practical training difficulty. In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.