CVJan 1Code
Towards Syn-to-Real IQA: A Novel Perspective on Reshaping Synthetic Data DistributionsAobo Li, Jinjian Wu, Yongxu Liu et al.
Blind Image Quality Assessment (BIQA) has advanced significantly through deep learning, but the scarcity of large-scale labeled datasets remains a challenge. While synthetic data offers a promising solution, models trained on existing synthetic datasets often show limited generalization ability. In this work, we make a key observation that representations learned from synthetic datasets often exhibit a discrete and clustered pattern that hinders regression performance: features of high-quality images cluster around reference images, while those of low-quality images cluster based on distortion types. Our analysis reveals that this issue stems from the distribution of synthetic data rather than model architecture. Consequently, we introduce a novel framework SynDR-IQA, which reshapes synthetic data distribution to enhance BIQA generalization. Based on theoretical derivations of sample diversity and redundancy's impact on generalization error, SynDR-IQA employs two strategies: distribution-aware diverse content upsampling, which enhances visual diversity while preserving content distribution, and density-aware redundant cluster downsampling, which balances samples by reducing the density of densely clustered areas. Extensive experiments across three cross-dataset settings (synthetic-to-authentic, synthetic-to-algorithmic, and synthetic-to-synthetic) demonstrate the effectiveness of our method. The code is available at https://github.com/Li-aobo/SynDR-IQA.
CVJul 25, 2022
Seeking Subjectivity in Visual Emotion Distribution LearningJingyuan Yang, Jie Li, Leida Li et al.
Visual Emotion Analysis (VEA), which aims to predict people's emotions towards different visual stimuli, has become an attractive research topic recently. Rather than a single label classification task, it is more rational to regard VEA as a Label Distribution Learning (LDL) problem by voting from different individuals. Existing methods often predict visual emotion distribution in a unified network, neglecting the inherent subjectivity in its crowd voting process. In psychology, the \textit{Object-Appraisal-Emotion} model has demonstrated that each individual's emotion is affected by his/her subjective appraisal, which is further formed by the affective memory. Inspired by this, we propose a novel \textit{Subjectivity Appraise-and-Match Network (SAMNet)} to investigate the subjectivity in visual emotion distribution. To depict the diversity in crowd voting process, we first propose the \textit{Subjectivity Appraising} with multiple branches, where each branch simulates the emotion evocation process of a specific individual. Specifically, we construct the affective memory with an attention-based mechanism to preserve each individual's unique emotional experience. A subjectivity loss is further proposed to guarantee the divergence between different individuals. Moreover, we propose the \textit{Subjectivity Matching} with a matching loss, aiming at assigning unordered emotion labels to ordered individual predictions in a one-to-one correspondence with the Hungarian algorithm. Extensive experiments and comparisons are conducted on public visual emotion distribution datasets, and the results demonstrate that the proposed SAMNet consistently outperforms the state-of-the-art methods. Ablation study verifies the effectiveness of our method and visualization proves its interpretability.
MMNov 22, 2022
Dehazed Image Quality Evaluation: From Partial Discrepancy to Blind PerceptionWei Zhou, Ruizeng Zhang, Leida Li et al.
Image dehazing aims to restore spatial details from hazy images. There have emerged a number of image dehazing algorithms, designed to increase the visibility of those hazy images. However, much less work has been focused on evaluating the visual quality of dehazed images. In this paper, we propose a Reduced-Reference dehazed image quality evaluation approach based on Partial Discrepancy (RRPD) and then extend it to a No-Reference quality assessment metric with Blind Perception (NRBP). Specifically, inspired by the hierarchical characteristics of the human perceiving dehazed images, we introduce three groups of features: luminance discrimination, color appearance, and overall naturalness. In the proposed RRPD, the combined distance between a set of sender and receiver features is adopted to quantify the perceptually dehazed image quality. By integrating global and local channels from dehazed images, the RRPD is converted to NRBP which does not rely on any information from the references. Extensive experiment results on several dehazed image quality databases demonstrate that our proposed methods outperform state-of-the-art full-reference, reduced-reference, and no-reference quality assessment models. Furthermore, we show that the proposed dehazed image quality evaluation methods can be effectively applied to tune parameters for potential image dehazing algorithms.
IVOct 9, 2022
HVS Revisited: A Comprehensive Video Quality Assessment FrameworkAo-Xiang Zhang, Yuan-Gen Wang, Weixuan Tang et al.
Video quality is a primary concern for video service providers. In recent years, the techniques of video quality assessment (VQA) based on deep convolutional neural networks (CNNs) have been developed rapidly. Although existing works attempt to introduce the knowledge of the human visual system (HVS) into VQA, there still exhibit limitations that prevent the full exploitation of HVS, including an incomplete model by few characteristics and insufficient connections among these characteristics. To overcome these limitations, this paper revisits HVS with five representative characteristics, and further reorganizes their connections. Based on the revisited HVS, a no-reference VQA framework called HVS-5M (NRVQA framework with five modules simulating HVS with five characteristics) is proposed. It works in a domain-fusion design paradigm with advanced network structures. On the side of the spatial domain, the visual saliency module applies SAMNet to obtain a saliency map. And then, the content-dependency and the edge masking modules respectively utilize ConvNeXt to extract the spatial features, which have been attentively weighted by the saliency map for the purpose of highlighting those regions that human beings may be interested in. On the other side of the temporal domain, to supplement the static spatial features, the motion perception module utilizes SlowFast to obtain the dynamic temporal features. Besides, the temporal hysteresis module applies TempHyst to simulate the memory mechanism of human beings, and comprehensively evaluates the quality score according to the fusion features from the spatial and temporal domains. Extensive experiments show that our HVS-5M outperforms the state-of-the-art VQA methods. Ablation studies are further conducted to verify the effectiveness of each module towards the proposed framework.
CVApr 15, 2024Code
AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics PerceptionYipo Huang, Xiangfei Sheng, Zhichao Yang et al.
The highly abstract nature of image aesthetics perception (IAP) poses significant challenge for current multimodal large language models (MLLMs). The lack of human-annotated multi-modality aesthetic data further exacerbates this dilemma, resulting in MLLMs falling short of aesthetics perception capabilities. To address the above challenge, we first introduce a comprehensively annotated Aesthetic Multi-Modality Instruction Tuning (AesMMIT) dataset, which serves as the footstone for building multi-modality aesthetics foundation models. Specifically, to align MLLMs with human aesthetics perception, we construct a corpus-rich aesthetic critique database with 21,904 diverse-sourced images and 88K human natural language feedbacks, which are collected via progressive questions, ranging from coarse-grained aesthetic grades to fine-grained aesthetic descriptions. To ensure that MLLMs can handle diverse queries, we further prompt GPT to refine the aesthetic critiques and assemble the large-scale aesthetic instruction tuning dataset, i.e. AesMMIT, which consists of 409K multi-typed instructions to activate stronger aesthetic capabilities. Based on the AesMMIT database, we fine-tune the open-sourced general foundation models, achieving multi-modality Aesthetic Expert models, dubbed AesExpert. Extensive experiments demonstrate that the proposed AesExpert models deliver significantly better aesthetic perception performances than the state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision. Project homepage: https://yipoh.github.io/aes-expert/.
CVJan 8
HyperAlign: Hyperbolic Entailment Cones for Adaptive Text-to-Image Alignment AssessmentWenzhi Chen, Bo Hu, Leida Li et al.
With the rapid development of text-to-image generation technology, accurately assessing the alignment between generated images and text prompts has become a critical challenge. Existing methods rely on Euclidean space metrics, neglecting the structured nature of semantic alignment, while lacking adaptive capabilities for different samples. To address these limitations, we propose HyperAlign, an adaptive text-to-image alignment assessment framework based on hyperbolic entailment geometry. First, we extract Euclidean features using CLIP and map them to hyperbolic space. Second, we design a dynamic-supervision entailment modeling mechanism that transforms discrete entailment logic into continuous geometric structure supervision. Finally, we propose an adaptive modulation regressor that utilizes hyperbolic geometric features to generate sample-level modulation parameters, adaptively calibrating Euclidean cosine similarity to predict the final score. HyperAlign achieves highly competitive performance on both single database evaluation and cross-database generalization tasks, fully validating the effectiveness of hyperbolic geometric modeling for image-text alignment assessment.
CVJan 12, 2024Code
Video Super-Resolution Transformer with Masked Inter&Intra-Frame AttentionXingyu Zhou, Leheng Zhang, Xiaorui Zhao et al.
Recently, Vision Transformer has achieved great success in recovering missing details in low-resolution sequences, i.e., the video super-resolution (VSR) task. Despite its superiority in VSR accuracy, the heavy computational burden as well as the large memory footprint hinder the deployment of Transformer-based VSR models on constrained devices. In this paper, we address the above issue by proposing a novel feature-level masked processing framework: VSR with Masked Intra and inter frame Attention (MIA-VSR). The core of MIA-VSR is leveraging feature-level temporal continuity between adjacent frames to reduce redundant computations and make more rational use of previously enhanced SR features. Concretely, we propose an intra-frame and inter-frame attention block which takes the respective roles of past features and input features into consideration and only exploits previously enhanced features to provide supplementary information. In addition, an adaptive block-wise mask prediction module is developed to skip unimportant computations according to feature similarity between adjacent frames. We conduct detailed ablation studies to validate our contributions and compare the proposed method with recent state-of-the-art VSR approaches. The experimental results demonstrate that MIA-VSR improves the memory and computation efficiency over state-of-the-art methods, without trading off PSNR accuracy. The code is available at https://github.com/LabShuHangGU/MIA-VSR.
CVDec 10, 2025
Content-Adaptive Image Retouching Guided by Attribute-Based Text RepresentationHancheng Zhu, Xinyu Liu, Rui Yao et al.
Image retouching has received significant attention due to its ability to achieve high-quality visual content. Existing approaches mainly rely on uniform pixel-wise color mapping across entire images, neglecting the inherent color variations induced by image content. This limitation hinders existing approaches from achieving adaptive retouching that accommodates both diverse color distributions and user-defined style preferences. To address these challenges, we propose a novel Content-Adaptive image retouching method guided by Attribute-based Text Representation (CA-ATP). Specifically, we propose a content-adaptive curve mapping module, which leverages a series of basis curves to establish multiple color mapping relationships and learns the corresponding weight maps, enabling content-aware color adjustments. The proposed module can capture color diversity within the image content, allowing similar color values to receive distinct transformations based on their spatial context. In addition, we propose an attribute text prediction module that generates text representations from multiple image attributes, which explicitly represent user-defined style preferences. These attribute-based text representations are subsequently integrated with visual features via a multimodal model, providing user-friendly guidance for image retouching. Extensive experiments on several public datasets demonstrate that our method achieves state-of-the-art performance.
CVMar 4
Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative RanksZhichao Yang, Jianjie Wang, Zhixianhe Zhang et al.
Image aesthetic assessment (IAA) has extensive applications in content creation, album management, and recommendation systems, etc. In such applications, it is commonly needed to pick out the most aesthetically pleasing image from a series of images with subtle aesthetic variations, a topic we refer to as fine-grained IAA. Unfortunately, state-of-the-art IAA models are typically designed for coarse-grained evaluation, where images with notable aesthetic differences are evaluated independently on an absolute scale. These models are inherently limited in discriminating fine-grained aesthetic differences. To address the dilemma, we contribute FGAesthetics, a fine-grained IAA database with 32,217 images organized into 10,028 series, which are sourced from diverse categories including Natural, AIGC, and Cropping. Annotations are collected via pairwise comparisons within each series. We also devise Series Refinement and Rank Calibration to ensure the reliability of data and labels. Based on FGAesthetics, we further propose FGAesQ, a novel IAA framework that learns discriminative aesthetic scores from relative ranks through Difference-preserved Tokenization (DiffToken), Comparative Text-assisted Alignment (CTAlign), and Rank-aware Regression (RankReg). FGAesQ enables accurate aesthetic assessment in fine-grained scenarios while still maintains competitive performance in coarse-grained evaluation. Extensive experiments and comparisons demonstrate the superiority of the proposed method.
CVDec 10, 2025
LongT2IBench: A Benchmark for Evaluating Long Text-to-Image Generation with Graph-structured AnnotationsZhichao Yang, Tianjiao Gu, Jianjie Wang et al.
The increasing popularity of long Text-to-Image (T2I) generation has created an urgent need for automatic and interpretable models that can evaluate the image-text alignment in long prompt scenarios. However, the existing T2I alignment benchmarks predominantly focus on short prompt scenarios and only provide MOS or Likert scale annotations. This inherent limitation hinders the development of long T2I evaluators, particularly in terms of the interpretability of alignment. In this study, we contribute LongT2IBench, which comprises 14K long text-image pairs accompanied by graph-structured human annotations. Given the detail-intensive nature of long prompts, we first design a Generate-Refine-Qualify annotation protocol to convert them into textual graph structures that encompass entities, attributes, and relations. Through this transformation, fine-grained alignment annotations are achieved based on these granular elements. Finally, the graph-structed annotations are converted into alignment scores and interpretations to facilitate the design of T2I evaluation models. Based on LongT2IBench, we further propose LongT2IExpert, a LongT2I evaluator that enables multi-modal large language models (MLLMs) to provide both quantitative scores and structured interpretations through an instruction-tuning process with Hierarchical Alignment Chain-of-Thought (CoT). Extensive experiments and comparisons demonstrate the superiority of the proposed LongT2IExpert in alignment evaluation and interpretation. Data and code have been released in https://welldky.github.io/LongT2IBench-Homepage/.
CVJan 16, 2024Code
AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics PerceptionYipo Huang, Quan Yuan, Xiangfei Sheng et al.
With collective endeavors, multimodal large language models (MLLMs) are undergoing a flourishing development. However, their performances on image aesthetics perception remain indeterminate, which is highly desired in real-world applications. An obvious obstacle lies in the absence of a specific benchmark to evaluate the effectiveness of MLLMs on aesthetic perception. This blind groping may impede the further development of more advanced MLLMs with aesthetic perception capacity. To address this dilemma, we propose AesBench, an expert benchmark aiming to comprehensively evaluate the aesthetic perception capacities of MLLMs through elaborate design across dual facets. (1) We construct an Expert-labeled Aesthetics Perception Database (EAPD), which features diversified image contents and high-quality annotations provided by professional aesthetic experts. (2) We propose a set of integrative criteria to measure the aesthetic perception abilities of MLLMs from four perspectives, including Perception (AesP), Empathy (AesE), Assessment (AesA) and Interpretation (AesI). Extensive experimental results underscore that the current MLLMs only possess rudimentary aesthetic perception ability, and there is still a significant gap between MLLMs and humans. We hope this work can inspire the community to engage in deeper explorations on the aesthetic potentials of MLLMs. Source data will be available at https://github.com/yipoh/AesBench.
CVAug 22, 2021Code
StarVQA: Space-Time Attention for Video Quality AssessmentFengchuang Xing, Yuan-Gen Wang, Hanpin Wang et al.
The attention mechanism is blooming in computer vision nowadays. However, its application to video quality assessment (VQA) has not been reported. Evaluating the quality of in-the-wild videos is challenging due to the unknown of pristine reference and shooting distortion. This paper presents a novel \underline{s}pace-\underline{t}ime \underline{a}ttention network fo\underline{r} the \underline{VQA} problem, named StarVQA. StarVQA builds a Transformer by alternately concatenating the divided space-time attention. To adapt the Transformer architecture for training, StarVQA designs a vectorized regression loss by encoding the mean opinion score (MOS) to the probability vector and embedding a special vectorized label token as the learnable variable. To capture the long-range spatiotemporal dependencies of a video sequence, StarVQA encodes the space-time position information of each patch to the input of the Transformer. Various experiments are conducted on the de-facto in-the-wild video datasets, including LIVE-VQC, KoNViD-1k, LSVQ, and LSVQ-1080p. Experimental results demonstrate the superiority of the proposed StarVQA over the state-of-the-art. Code and model will be available at: https://github.com/DVL/StarVQA.
CVSep 11, 2019Code
PDANet: Polarity-consistent Deep Attention Network for Fine-grained Visual Emotion RegressionSicheng Zhao, Zizhou Jia, Hui Chen et al.
Existing methods on visual emotion analysis mainly focus on coarse-grained emotion classification, i.e. assigning an image with a dominant discrete emotion category. However, these methods cannot well reflect the complexity and subtlety of emotions. In this paper, we study the fine-grained regression problem of visual emotions based on convolutional neural networks (CNNs). Specifically, we develop a Polarity-consistent Deep Attention Network (PDANet), a novel network architecture that integrates attention into a CNN with an emotion polarity constraint. First, we propose to incorporate both spatial and channel-wise attentions into a CNN for visual emotion regression, which jointly considers the local spatial connectivity patterns along each channel and the interdependency between different channels. Second, we design a novel regression loss, i.e. polarity-consistent regression (PCR) loss, based on the weakly supervised emotion polarity to guide the attention generation. By optimizing the PCR loss, PDANet can generate a polarity preserved attention map and thus improve the emotion regression performance. Extensive experiments are conducted on the IAPS, NAPS, and EMOTIC datasets, and the results demonstrate that the proposed PDANet outperforms the state-of-the-art approaches by a large margin for fine-grained visual emotion regression. Our source code is released at: https://github.com/ZizhouJia/PDANet.
CVMay 7, 2024
Bridging the Synthetic-to-Authentic Gap: Distortion-Guided Unsupervised Domain Adaptation for Blind Image Quality AssessmentAobo Li, Jinjian Wu, Yongxu Liu et al.
The annotation of blind image quality assessment (BIQA) is labor-intensive and time-consuming, especially for authentic images. Training on synthetic data is expected to be beneficial, but synthetically trained models often suffer from poor generalization in real domains due to domain gaps. In this work, we make a key observation that introducing more distortion types in the synthetic dataset may not improve or even be harmful to generalizing authentic image quality assessment. To solve this challenge, we propose distortion-guided unsupervised domain adaptation for BIQA (DGQA), a novel framework that leverages adaptive multi-domain selection via prior knowledge from distortion to match the data distribution between the source domains and the target domain, thereby reducing negative transfer from the outlier source domains. Extensive experiments on two cross-domain settings (synthetic distortion to authentic distortion and synthetic distortion to algorithmic distortion) have demonstrated the effectiveness of our proposed DGQA. Besides, DGQA is orthogonal to existing model-based BIQA methods, and can be used in combination with such models to improve performance with less training data.
CVFeb 22, 2024
Diffusion Model Based Visual Compensation Guidance and Visual Difference Analysis for No-Reference Image Quality AssessmentZhaoyang Wang, Bo Hu, Mingyang Zhang et al.
Existing free-energy guided No-Reference Image Quality Assessment (NR-IQA) methods still suffer from finding a balance between learning feature information at the pixel level of the image and capturing high-level feature information and the efficient utilization of the obtained high-level feature information remains a challenge. As a novel class of state-of-the-art (SOTA) generative model, the diffusion model exhibits the capability to model intricate relationships, enabling a comprehensive understanding of images and possessing a better learning of both high-level and low-level visual features. In view of these, we pioneer the exploration of the diffusion model into the domain of NR-IQA. Firstly, we devise a new diffusion restoration network that leverages the produced enhanced image and noise-containing images, incorporating nonlinear features obtained during the denoising process of the diffusion model, as high-level visual information. Secondly, two visual evaluation branches are designed to comprehensively analyze the obtained high-level feature information. These include the visual compensation guidance branch, grounded in the transformer architecture and noise embedding strategy, and the visual difference analysis branch, built on the ResNet architecture and the residual transposed attention block. Extensive experiments are conducted on seven public NR-IQA datasets, and the results demonstrate that the proposed model outperforms SOTA methods for NR-IQA.
CVNov 25, 2024
AI-Generated Image Quality Assessment Based on Task-Specific Prompt and Multi-Granularity SimilarityJili Xia, Lihuo He, Fei Gao et al.
Recently, AI-generated images (AIGIs) created by given prompts (initial prompts) have garnered widespread attention. Nevertheless, due to technical nonproficiency, they often suffer from poor perception quality and Text-to-Image misalignment. Therefore, assessing the perception quality and alignment quality of AIGIs is crucial to improving the generative model's performance. Existing assessment methods overly rely on the initial prompts in the task prompt design and use the same prompts to guide both perceptual and alignment quality evaluation, overlooking the distinctions between the two tasks. To address this limitation, we propose a novel quality assessment method for AIGIs named TSP-MGS, which designs task-specific prompts and measures multi-granularity similarity between AIGIs and the prompts. Specifically, task-specific prompts are first constructed to describe perception and alignment quality degrees separately, and the initial prompt is introduced for detailed quality perception. Then, the coarse-grained similarity between AIGIs and task-specific prompts is calculated, which facilitates holistic quality awareness. In addition, to improve the understanding of AIGI details, the fine-grained similarity between the image and the initial prompt is measured. Finally, precise quality prediction is acquired by integrating the multi-granularity similarities. Experiments on the commonly used AGIQA-1K and AGIQA-3K benchmarks demonstrate the superiority of the proposed TSP-MGS.
CVMar 4, 2025
Language-Guided Visual Perception Disentanglement for Image Quality Assessment and Conditional Image GenerationZhichao Yang, Leida Li, Pengfei Chen et al.
Contrastive vision-language models, such as CLIP, have demonstrated excellent zero-shot capability across semantic recognition tasks, mainly attributed to the training on a large-scale I&1T (one Image with one Text) dataset. This kind of multimodal representations often blend semantic and perceptual elements, placing a particular emphasis on semantics. However, this could be problematic for popular tasks like image quality assessment (IQA) and conditional image generation (CIG), which typically need to have fine control on perceptual and semantic features. Motivated by the above facts, this paper presents a new multimodal disentangled representation learning framework, which leverages disentangled text to guide image disentanglement. To this end, we first build an I&2T (one Image with a perceptual Text and a semantic Text) dataset, which consists of disentangled perceptual and semantic text descriptions for an image. Then, the disentangled text descriptions are utilized as supervisory signals to disentangle pure perceptual representations from CLIP's original `coarse' feature space, dubbed DeCLIP. Finally, the decoupled feature representations are used for both image quality assessment (technical quality and aesthetic quality) and conditional image generation. Extensive experiments and comparisons have demonstrated the advantages of the proposed method on the two popular tasks. The dataset, code, and model will be available.
CVJan 21, 2025
A Multi-annotated and Multi-modal Dataset for Wide-angle Video Quality AssessmentBo Hu, Wei Wang, Chunyi Li et al.
Wide-angle video is favored for its wide viewing angle and ability to capture a large area of scenery, making it an ideal choice for sports and adventure recording. However, wide-angle video is prone to deformation, exposure and other distortions, resulting in poor video quality and affecting the perception and experience, which may seriously hinder its application in fields such as competitive sports. Up to now, few explorations focus on the quality assessment issue of wide-angle video. This deficiency primarily stems from the absence of a specialized dataset for wide-angle videos. To bridge this gap, we construct the first Multi-annotated and multi-modal Wide-angle Video quality assessment (MWV) dataset. Then, the performances of state-of-the-art video quality methods on the MWV dataset are investigated by inter-dataset testing and intra-dataset testing. Experimental results show that these methods impose significant limitations on their applicability.
IVAug 25, 2025
TuningIQA: Fine-Grained Blind Image Quality Assessment for Livestreaming Camera TuningXiangfei Sheng, Zhichao Duan, Xiaofeng Pan et al.
Livestreaming has become increasingly prevalent in modern visual communication, where automatic camera quality tuning is essential for delivering superior user Quality of Experience (QoE). Such tuning requires accurate blind image quality assessment (BIQA) to guide parameter optimization decisions. Unfortunately, the existing BIQA models typically only predict an overall coarse-grained quality score, which cannot provide fine-grained perceptual guidance for precise camera parameter tuning. To bridge this gap, we first establish FGLive-10K, a comprehensive fine-grained BIQA database containing 10,185 high-resolution images captured under varying camera parameter configurations across diverse livestreaming scenarios. The dataset features 50,925 multi-attribute quality annotations and 19,234 fine-grained pairwise preference annotations. Based on FGLive-10K, we further develop TuningIQA, a fine-grained BIQA metric for livestreaming camera tuning, which integrates human-aware feature extraction and graph-based camera parameter fusion. Extensive experiments and comparisons demonstrate that TuningIQA significantly outperforms state-of-the-art BIQA methods in both score regression and fine-grained quality ranking, achieving superior performance when deployed for livestreaming camera tuning.
IVAug 20, 2025
Fine-grained Image Quality Assessment for Perceptual Image RestorationXiangfei Sheng, Xiaofeng Pan, Zhichao Yang et al.
Recent years have witnessed remarkable achievements in perceptual image restoration (IR), creating an urgent demand for accurate image quality assessment (IQA), which is essential for both performance comparison and algorithm optimization. Unfortunately, the existing IQA metrics exhibit inherent weakness for IR task, particularly when distinguishing fine-grained quality differences among restored images. To address this dilemma, we contribute the first-of-its-kind fine-grained image quality assessment dataset for image restoration, termed FGRestore, comprising 18,408 restored images across six common IR tasks. Beyond conventional scalar quality scores, FGRestore was also annotated with 30,886 fine-grained pairwise preferences. Based on FGRestore, a comprehensive benchmark was conducted on the existing IQA metrics, which reveal significant inconsistencies between score-based IQA evaluations and the fine-grained restoration quality. Motivated by these findings, we further propose FGResQ, a new IQA model specifically designed for image restoration, which features both coarse-grained score regression and fine-grained quality ranking. Extensive experiments and comparisons demonstrate that FGResQ significantly outperforms state-of-the-art IQA metrics. Codes and model weights have been released in https://sxfly99.github.io/FGResQ-Homepage.
CVJan 16, 2024
Deep Shape-Texture Statistics for Completely Blind Image Quality EvaluationYixuan Li, Peilin Chen, Hanwei Zhu et al.
Opinion-Unaware Blind Image Quality Assessment (OU-BIQA) models aim to predict image quality without training on reference images and subjective quality scores. Thereinto, image statistical comparison is a classic paradigm, while the performance is limited by the representation ability of visual descriptors. Deep features as visual descriptors have advanced IQA in recent research, but they are discovered to be highly texture-biased and lack of shape-bias. On this basis, we find out that image shape and texture cues respond differently towards distortions, and the absence of either one results in an incomplete image representation. Therefore, to formulate a well-round statistical description for images, we utilize the shapebiased and texture-biased deep features produced by Deep Neural Networks (DNNs) simultaneously. More specifically, we design a Shape-Texture Adaptive Fusion (STAF) module to merge shape and texture information, based on which we formulate qualityrelevant image statistics. The perceptual quality is quantified by the variant Mahalanobis Distance between the inner and outer Shape-Texture Statistics (DSTS), wherein the inner and outer statistics respectively describe the quality fingerprints of the distorted image and natural images. The proposed DSTS delicately utilizes shape-texture statistical relations between different data scales in the deep domain, and achieves state-of-the-art (SOTA) quality prediction performance on images with artificial and authentic distortions.
CVMar 31, 2022
Personalized Image Aesthetics Assessment with Rich AttributesYuzhe Yang, Liwu Xu, Leida Li et al.
Personalized image aesthetics assessment (PIAA) is challenging due to its highly subjective nature. People's aesthetic tastes depend on diversified factors, including image characteristics and subject characters. The existing PIAA databases are limited in terms of annotation diversity, especially the subject aspect, which can no longer meet the increasing demands of PIAA research. To solve the dilemma, we conduct so far, the most comprehensive subjective study of personalized image aesthetics and introduce a new Personalized image Aesthetics database with Rich Attributes (PARA), which consists of 31,220 images with annotations by 438 subjects. PARA features wealthy annotations, including 9 image-oriented objective attributes and 4 human-oriented subjective attributes. In addition, desensitized subject information, such as personality traits, is also provided to support study of PIAA and user portraits. A comprehensive analysis of the annotation data is provided and statistic study indicates that the aesthetic preferences can be mirrored by proposed subjective attributes. We also propose a conditional PIAA model by utilizing subject information as conditional prior. Experimental results indicate that the conditional PIAA model can outperform the control group, which is also the first attempt to demonstrate how image aesthetics and subject characters interact to produce the intricate personalized tastes on image aesthetics. We believe the database and the associated analysis would be useful for conducting next-generation PIAA study. The project page of PARA can be found at: https://cv-datasets.institutecv.com/#/data-sets.
CVDec 15, 2021
Robust Depth Completion with Uncertainty-Driven Loss FunctionsYufan Zhu, Weisheng Dong, Leida Li et al.
Recovering a dense depth image from sparse LiDAR scans is a challenging task. Despite the popularity of color-guided methods for sparse-to-dense depth completion, they treated pixels equally during optimization, ignoring the uneven distribution characteristics in the sparse depth map and the accumulated outliers in the synthesized ground truth. In this work, we introduce uncertainty-driven loss functions to improve the robustness of depth completion and handle the uncertainty in depth completion. Specifically, we propose an explicit uncertainty formulation for robust depth completion with Jeffrey's prior. A parametric uncertain-driven loss is introduced and translated to new loss functions that are robust to noisy or missing data. Meanwhile, we propose a multiscale joint prediction model that can simultaneously predict depth and uncertainty maps. The estimated uncertainty map is also used to perform adaptive prediction on the pixels with high uncertainty, leading to a residual map for refining the completion results. Our method has been tested on KITTI Depth Completion Benchmark and achieved the state-of-the-art robustness performance in terms of MAE, IMAE, and IRMSE metrics.
CVOct 24, 2021
SOLVER: Scene-Object Interrelated Visual Emotion Reasoning NetworkJingyuan Yang, Xinbo Gao, Leida Li et al.
Visual Emotion Analysis (VEA) aims at finding out how people feel emotionally towards different visual stimuli, which has attracted great attention recently with the prevalence of sharing images on social networks. Since human emotion involves a highly complex and abstract cognitive process, it is difficult to infer visual emotions directly from holistic or regional features in affective images. It has been demonstrated in psychology that visual emotions are evoked by the interactions between objects as well as the interactions between objects and scenes within an image. Inspired by this, we propose a novel Scene-Object interreLated Visual Emotion Reasoning network (SOLVER) to predict emotions from images. To mine the emotional relationships between distinct objects, we first build up an Emotion Graph based on semantic concepts and visual features. Then, we conduct reasoning on the Emotion Graph using Graph Convolutional Network (GCN), yielding emotion-enhanced object features. We also design a Scene-Object Fusion Module to integrate scenes and objects, which exploits scene features to guide the fusion process of object features with the proposed scene-based attention mechanism. Extensive experiments and comparisons are conducted on eight public visual emotion datasets, and the results demonstrate that the proposed SOLVER consistently outperforms the state-of-the-art methods by a large margin. Ablation studies verify the effectiveness of our method and visualizations prove its interpretability, which also bring new insight to explore the mysteries in VEA. Notably, we further discuss SOLVER on three other potential datasets with extended experiments, where we validate the robustness of our method and notice some limitations of it.
CVJun 23, 2021
A Circular-Structured Representation for Visual Emotion Distribution LearningJingyuan Yang, Jie Li, Leida Li et al.
Visual Emotion Analysis (VEA) has attracted increasing attention recently with the prevalence of sharing images on social networks. Since human emotions are ambiguous and subjective, it is more reasonable to address VEA in a label distribution learning (LDL) paradigm rather than a single-label classification task. Different from other LDL tasks, there exist intrinsic relationships between emotions and unique characteristics within them, as demonstrated in psychological theories. Inspired by this, we propose a well-grounded circular-structured representation to utilize the prior knowledge for visual emotion distribution learning. To be specific, we first construct an Emotion Circle to unify any emotional state within it. On the proposed Emotion Circle, each emotion distribution is represented with an emotion vector, which is defined with three attributes (i.e., emotion polarity, emotion type, emotion intensity) as well as two properties (i.e., similarity, additivity). Besides, we design a novel Progressive Circular (PC) loss to penalize the dissimilarities between predicted emotion vector and labeled one in a coarse-to-fine manner, which further boosts the learning process in an emotion-specific way. Extensive experiments and comparisons are conducted on public visual emotion distribution datasets, and the results demonstrate that the proposed method outperforms the state-of-the-art methods.
IVApr 6, 2021
Searching Efficient Model-guided Deep Network for Image DenoisingQian Ning, Weisheng Dong, Xin Li et al.
Neural architecture search (NAS) has recently reshaped our understanding on various vision tasks. Similar to the success of NAS in high-level vision tasks, it is possible to find a memory and computationally efficient solution via NAS with highly competent denoising performance. However, the optimization gap between the super-network and the sub-architectures has remained an open issue in both low-level and high-level vision. In this paper, we present a novel approach to filling in this gap by connecting model-guided design with NAS (MoD-NAS) and demonstrate its application into image denoising. Specifically, we propose to construct a new search space under model-guided framework and develop more stable and efficient differential search strategies. MoD-NAS employs a highly reusable width search strategy and a densely connected search block to automatically select the operations of each layer as well as network width and depth via gradient descent. During the search process, the proposed MoG-NAS is capable of avoiding mode collapse due to the smoother search space designed under the model-guided framework. Experimental results on several popular datasets show that our MoD-NAS has achieved even better PSNR performance than current state-of-the-art methods with fewer parameters, lower number of flops, and less amount of testing time.
CVSep 28, 2020
Cuid: A new study of perceived image quality and its subjective assessmentLucie Lévêque, Ji Yang, Xiaohan Yang et al.
Research on image quality assessment (IQA) remains limited mainly due to our incomplete knowledge about human visual perception. Existing IQA algorithms have been designed or trained with insufficient subjective data with a small degree of stimulus variability. This has led to challenges for those algorithms to handle complexity and diversity of real-world digital content. Perceptual evidence from human subjects serves as a grounding for the development of advanced IQA algorithms. It is thus critical to acquire reliable subjective data with controlled perception experiments that faithfully reflect human behavioural responses to distortions in visual signals. In this paper, we present a new study of image quality perception where subjective ratings were collected in a controlled lab environment. We investigate how quality perception is affected by a combination of different categories of images and different types and levels of distortions. The database will be made publicly available to facilitate calibration and validation of IQA algorithms.
CVSep 14, 2020
Accurate and Lightweight Image Super-Resolution with Model-Guided Deep Unfolding NetworkQian Ning, Weisheng Dong, Guangming Shi et al.
Deep neural networks (DNNs) based methods have achieved great success in single image super-resolution (SISR). However, existing state-of-the-art SISR techniques are designed like black boxes lacking transparency and interpretability. Moreover, the improvement in visual quality is often at the price of increased model complexity due to black-box design. In this paper, we present and advocate an explainable approach toward SISR named model-guided deep unfolding network (MoG-DUN). Targeting at breaking the coherence barrier, we opt to work with a well-established image prior named nonlocal auto-regressive model and use it to guide our DNN design. By integrating deep denoising and nonlocal regularization as trainable modules within a deep learning framework, we can unfold the iterative process of model-based SISR into a multi-stage concatenation of building blocks with three interconnected modules (denoising, nonlocal-AR, and reconstruction). The design of all three modules leverages the latest advances including dense/skip connections as well as fast nonlocal implementation. In addition to explainability, MoG-DUN is accurate (producing fewer aliasing artifacts), computationally efficient (with reduced model parameters), and versatile (capable of handling multiple degradations). The superiority of the proposed MoG-DUN method to existing state-of-the-art image SR methods including RCAN, SRMDNF, and SRFBN is substantiated by extensive experiments on several popular datasets and various degradation scenarios.
IVApr 11, 2020
MetaIQA: Deep Meta-learning for No-Reference Image Quality AssessmentHancheng Zhu, Leida Li, Jinjian Wu et al.
Recently, increasing interest has been drawn in exploiting deep convolutional neural networks (DCNNs) for no-reference image quality assessment (NR-IQA). Despite of the notable success achieved, there is a broad consensus that training DCNNs heavily relies on massive annotated data. Unfortunately, IQA is a typical small sample problem. Therefore, most of the existing DCNN-based IQA metrics operate based on pre-trained networks. However, these pre-trained networks are not designed for IQA task, leading to generalization problem when evaluating different types of distortions. With this motivation, this paper presents a no-reference IQA metric based on deep meta-learning. The underlying idea is to learn the meta-knowledge shared by human when evaluating the quality of images with various distortions, which can then be adapted to unknown distortions easily. Specifically, we first collect a number of NR-IQA tasks for different distortions. Then meta-learning is adopted to learn the prior knowledge shared by diversified distortions. Finally, the quality prior model is fine-tuned on a target NR-IQA task for quickly obtaining the quality model. Extensive experiments demonstrate that the proposed metric outperforms the state-of-the-arts by a large margin. Furthermore, the meta-model learned from synthetic distortions can also be easily generalized to authentic distortions, which is highly desired in real-world applications of IQA metrics.
CVJun 2, 2019
Incremental Few-Shot Learning for Pedestrian Attribute RecognitionLiuyu Xiang, Xiaoming Jin, Guiguang Ding et al.
Pedestrian attribute recognition has received increasing attention due to its important role in video surveillance applications. However, most existing methods are designed for a fixed set of attributes. They are unable to handle the incremental few-shot learning scenario, i.e. adapting a well-trained model to newly added attributes with scarce data, which commonly exists in the real world. In this work, we present a meta learning based method to address this issue. The core of our framework is a meta architecture capable of disentangling multiple attribute information and generalizing rapidly to new coming attributes. By conducting extensive experiments on the benchmark dataset PETA and RAP under the incremental few-shot setting, we show that our method is able to perform the task with competitive performances and low resource requirements.
CVMay 7, 2019
Semantic Adversarial Network for Zero-Shot Sketch-Based Image RetrievalXinxun Xu, Hao Wang, Leida Li et al.
Zero-shot sketch-based image retrieval (ZS-SBIR) is a specific cross-modal retrieval task for retrieving natural images with free-hand sketches under zero-shot scenario. Previous works mostly focus on modeling the correspondence between images and sketches or synthesizing image features with sketch features. However, both of them ignore the large intra-class variance of sketches, thus resulting in unsatisfactory retrieval performance. In this paper, we propose a novel end-to-end semantic adversarial approach for ZS-SBIR. Specifically, we devise a semantic adversarial module to maximize the consistency between learned semantic features and category-level word vectors. Moreover, to preserve the discriminability of synthesized features within each training category, a triplet loss is employed for the generative module. Additionally, the proposed model is trained in an end-to-end strategy to exploit better semantic features suitable for ZS-SBIR. Extensive experiments conducted on two large-scale popular datasets demonstrate that our proposed approach remarkably outperforms state-of-the-art approaches by more than 12\% on Sketchy dataset and about 3\% on TU-Berlin dataset in the retrieval.
CVAug 13, 2017
Image Quality Assessment Guided Deep Neural Networks TrainingZhuo Chen, Weisi Lin, Shiqi Wang et al.
For many computer vision problems, the deep neural networks are trained and validated based on the assumption that the input images are pristine (i.e., artifact-free). However, digital images are subject to a wide range of distortions in real application scenarios, while the practical issues regarding image quality in high level visual information understanding have been largely ignored. In this paper, in view of the fact that most widely deployed deep learning models are susceptible to various image distortions, the distorted images are involved for data augmentation in the deep neural network training process to learn a reliable model for practical applications. In particular, an image quality assessment based label smoothing method, which aims at regularizing the label distribution of training images, is further proposed to tune the objective functions in learning the neural network. Experimental results show that the proposed method is effective in dealing with both low and high quality images in the typical image classification task.