CV · Feb 22, 2023 · Code
Gap-closing Matters: Perceptual Quality Evaluation and Optimization of Low-Light Image Enhancement
Baoliang Chen, Lingyu Zhu, Hanwei Zhu et al.
There is a growing consensus in the research community that the optimization of low-light image enhancement approaches should be guided by the visual quality perceived by end users. Despite the substantial efforts invested in the design of low-light enhancement algorithms, there has been comparatively limited focus on assessing subjective and objective quality systematically. To mitigate this gap and provide a clear path towards optimizing low-light image enhancement for better visual quality, we propose a gap-closing framework. In particular, our gap-closing framework starts with the creation of a large-scale dataset for Subjective QUality Assessment of REconstructed LOw-Light Images (SQUARE-LOL). This database serves as the foundation for studying the quality of enhanced images and conducting a comprehensive subjective user study. Subsequently, we propose an objective quality assessment measure that plays a critical role in bridging the gap between visual quality and enhancement. Finally, we demonstrate that our proposed objective quality measure can be incorporated into the process of optimizing the learning of the enhancement model toward perceptual optimality. We validate the effectiveness of our proposed framework through both the accuracy of quality prediction and the perceptual quality of image enhancement. Our database and codes are publicly available at https://github.com/Baoliang93/IACA_For_Lowlight_IQA.
IV · Apr 14, 2023 · Code
Perceptual Quality Assessment of Face Video Compression: A Benchmark and An Effective Method
Yixuan Li, Bolin Chen, Baoliang Chen et al.
Recent years have witnessed an exponential increase in the demand for face video compression, and the success of artificial intelligence has expanded the boundaries beyond traditional hybrid video coding. Generative coding approaches have been identified as promising alternatives with reasonable perceptual rate-distortion trade-offs, leveraging the statistical priors of face videos. However, the great diversity of distortion types in spatial and temporal domains, ranging from the traditional hybrid coding frameworks to generative models, presents grand challenges in compressed face video quality assessment (VQA). In this paper, we introduce the large-scale Compressed Face Video Quality Assessment (CFVQA) database, which is the first attempt to systematically understand the perceptual quality and diversified compression distortions in face videos. The database contains 3,240 compressed face video clips in multiple compression levels, which are derived from 135 source videos with diversified content using six representative video codecs, including two traditional methods based on hybrid coding frameworks, two end-to-end methods, and two generative methods. In addition, a FAce VideO IntegeRity (FAVOR) index for face video compression was developed to measure the perceptual quality, considering the distinct content characteristics and temporal priors of the face videos. Experimental results exhibit its superior performance on the proposed CFVQA dataset. The benchmark is now made publicly available at: https://github.com/Yixuan423/Compressed-Face-Videos-Quality-Assessment.
CV · Nov 9, 2022 · Code
DeepDC: Deep Distance Correlation as a Perceptual Image Quality Evaluator
Hanwei Zhu, Baoliang Chen, Lingyu Zhu et al.
ImageNet pre-trained deep neural networks (DNNs) show notable transferability for building effective image quality assessment (IQA) models. Such a remarkable byproduct has often been identified as an emergent property in previous studies. In this work, we attribute such capability to the intrinsic texture-sensitive characteristic that classifies images using texture features. We fully exploit this characteristic to develop a novel full-reference IQA (FR-IQA) model based exclusively on pre-trained DNN features. Specifically, we compute the distance correlation, a highly promising yet relatively under-investigated statistic, between reference and distorted images in the deep feature domain. In addition, the distance correlation quantifies both linear and nonlinear feature relationships, which is far beyond the widely used first-order and second-order statistics in the feature space. We conduct comprehensive experiments to demonstrate the superiority of the proposed quality model on five standard IQA datasets, one perceptual similarity dataset, two texture similarity datasets, and one geometric transformation dataset. Moreover, we optimize the proposed model to generate a broad spectrum of texture patterns, by treating the model as the style loss function for neural style transfer (NST). Extensive experiments demonstrate that the proposed texture synthesis and NST methods achieve the best quantitative and qualitative results. We release our code at https://github.com/h4nwei/DeepDC.
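To make the statistic named in this abstract concrete, here is a minimal sketch of the sample distance correlation between two equal-length sequences of scalars (for example, one flattened feature channel from the reference image and the matching channel from the distorted image). The function names and the flattening assumption are illustrative and are not taken from the DeepDC repository.

```python
import math


def _centered_distances(values):
    """Pairwise |v_i - v_j| matrix, double-centered (row, column, grand mean)."""
    n = len(values)
    d = [[abs(values[i] - values[j]) for j in range(n)] for i in range(n)]
    row_means = [sum(row) / n for row in d]  # matrix is symmetric, so these are also column means
    grand_mean = sum(row_means) / n
    return [
        [d[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation in [0, 1] for two equal-length scalar sequences."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)
    dvar_x = sum(a[i][j] ** 2 for i in range(n) for j in range(n)) / (n * n)
    dvar_y = sum(b[i][j] ** 2 for i in range(n) for j in range(n)) / (n * n)
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0  # one input is constant, so no dependence can be measured
    return math.sqrt(max(dcov2, 0.0) / denom)


# A quadratic relation on symmetric inputs has zero Pearson correlation,
# yet the distance correlation stays clearly above zero.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
nonlinear = distance_correlation(x, [v * v for v in x])
```

Unlike a first- or second-order statistic, this quantity responds to the nonlinear relation in the last two lines, which is the property the abstract highlights.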
CV · Sep 21, 2022 · Code
Learning from Mixed Datasets: A Monotonic Image Quality Assessment Model
Zhaopeng Feng, Keyang Zhang, Shuyue Jia et al.
Deep learning based image quality assessment (IQA) models usually learn to predict image quality from a single dataset, leading the model to overfit specific scenes. To account for this, mixed-dataset training can be an effective way to enhance the generalization capability of the model. However, it is nontrivial to combine different IQA datasets, as their quality evaluation criteria, score ranges, viewing conditions, and subjects are usually not shared during the image quality annotation. In this paper, instead of aligning the annotations, we propose a monotonic neural network for IQA model learning with different datasets combined. In particular, our model consists of a dataset-shared quality regressor and several dataset-specific quality transformers. The quality regressor aims to obtain the perceptual qualities of each dataset while each quality transformer maps the perceptual qualities to the corresponding dataset annotations with their monotonicity maintained. The experimental results verify the effectiveness of the proposed learning strategy and our code is available at https://github.com/fzp0424/MonotonicIQA.
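One way to picture a dataset-specific quality transformer that preserves monotonicity is a sum of increasing sigmoids with positive weights. The sketch below is only an assumption about how such a mapping could be parameterized; `monotonic_transform`, its parameters, and the softplus reparameterization are hypothetical and are not taken from the MonotonicIQA code.

```python
import math


def softplus(v):
    """Numerically stable log(1 + exp(v)); the result is always positive."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def monotonic_transform(q, raw_weights, shifts, bias):
    """Map a shared quality score q onto one dataset's annotation scale.

    Every term is an increasing sigmoid of q multiplied by a positive weight
    (softplus of an unconstrained parameter), so the sum is increasing in q
    regardless of the parameter values.
    """
    total = bias
    for raw_w, shift in zip(raw_weights, shifts):
        total += softplus(raw_w) / (1.0 + math.exp(-(q - shift)))
    return total


# Two hypothetical datasets share q but use different annotation scales.
dataset_a = ([0.5, -1.0, 2.0], [-1.0, 0.0, 1.0], 1.0)
dataset_b = ([3.0, 3.0], [-0.5, 0.5], 0.0)
scores_a = [monotonic_transform(q, *dataset_a) for q in (-2.0, 0.0, 2.0)]
scores_b = [monotonic_transform(q, *dataset_b) for q in (-2.0, 0.0, 2.0)]
```

Because the ordering of q is preserved by construction, the two datasets can disagree on score range without disagreeing on ranking.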
CV · Sep 12, 2022 · Code
Deep Feature Statistics Mapping for Generalized Screen Content Image Quality Assessment
Baoliang Chen, Hanwei Zhu, Lingyu Zhu et al.
The statistical regularities of natural images, referred to as natural scene statistics, play an important role in no-reference image quality assessment. However, it has been widely acknowledged that screen content images (SCIs), which are typically computer generated, do not hold such statistics. Here we make the first attempt to learn the statistics of SCIs, based upon which the quality of SCIs can be effectively determined. The underlying mechanism of the proposed approach is based upon the mild assumption that the SCIs, which are not physically acquired, still obey certain statistics that could be understood in a learning fashion. We empirically show that the statistics deviation could be effectively leveraged in quality assessment, and the proposed method is superior when evaluated in different settings. Extensive experimental results demonstrate the Deep Feature Statistics based SCI Quality Assessment (DFSS-IQA) model delivers promising performance compared with existing NR-IQA models and shows a high generalization capability in the cross-dataset settings. The implementation of our method is publicly available at https://github.com/Baoliang93/DFSS-IQA.
CV · Sep 6, 2024 · Code
RCNet: Deep Recurrent Collaborative Network for Multi-View Low-Light Image Enhancement
Hao Luo, Baoliang Chen, Lingyu Zhu et al.
Scene observation from multiple perspectives would bring a more comprehensive visual experience. However, in the context of acquiring multiple views in the dark, the highly correlated views are seriously alienated, making it challenging to improve scene understanding with auxiliary views. Recent single image-based enhancement methods may not be able to provide consistently desirable restoration performance for all views because they ignore potential feature correspondence among different views. To alleviate this issue, we make the first attempt to investigate multi-view low-light image enhancement. First, we construct a new dataset called Multi-View Low-light Triplets (MVLT), including 1,860 pairs of triple images with large illumination ranges and wide noise distribution. Each triplet is equipped with three different viewpoints towards the same scene. Second, we propose a deep multi-view enhancement framework based on the Recurrent Collaborative Network (RCNet). Specifically, in order to benefit from similar texture correspondence across different views, we design the recurrent feature enhancement, alignment and fusion (ReEAF) module, in which intra-view feature enhancement (Intra-view EN) followed by inter-view feature alignment and fusion (Inter-view AF) is performed to model the intra-view and inter-view feature propagation sequentially via multi-view collaboration. In addition, two different modules from enhancement to alignment (E2A) and from alignment to enhancement (A2E) are developed to enable the interactions between Intra-view EN and Inter-view AF, which explicitly utilize attentive feature weighting and sampling for enhancement and alignment, respectively. Experimental results demonstrate that our RCNet significantly outperforms other state-of-the-art methods. All of our dataset, code, and model will be available at https://github.com/hluo29/RCNet.
CV · Aug 19, 2024 · Code
Sliced Maximal Information Coefficient: A Training-Free Approach for Image Quality Assessment Enhancement
Kang Xiao, Xu Wang, Yulin He et al.
Full-reference image quality assessment (FR-IQA) models generally operate by measuring the visual differences between a degraded image and its reference. However, existing FR-IQA models, including both the classical ones (e.g., PSNR and SSIM) and deep-learning based measures (e.g., LPIPS and DISTS), still exhibit limitations in capturing the full perception characteristics of the human visual system (HVS). In this paper, instead of designing a new FR-IQA measure, we aim to explore a generalized human visual attention estimation strategy to mimic the process of human quality rating and enhance existing IQA models. In particular, we model human attention generation by measuring the statistical dependency between the degraded image and the reference image. The dependency is captured in a training-free manner by our proposed sliced maximal information coefficient and exhibits surprising generalization in different IQA measures. Experimental results verify that the performance of existing IQA models can be consistently improved when our attention module is incorporated. The source code is available at https://github.com/KANGX99/SMIC.
CV · Jan 29 · Code
From Global to Granular: Revealing IQA Model Performance via Correlation Surface
Baoliang Chen, Danni Huang, Hanwei Zhu et al.
Evaluation of Image Quality Assessment (IQA) models has long been dominated by global correlation metrics, such as Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank-Order Correlation Coefficient (SRCC). While widely adopted, these metrics reduce performance to a single scalar, failing to capture how ranking consistency varies across the local quality spectrum. For example, two IQA models may achieve identical SRCC values, yet one ranks high-quality images (related to high Mean Opinion Score, MOS) more reliably, while the other better discriminates image pairs with small quality/MOS differences (related to |ΔMOS|). Such complementary behaviors are invisible under global metrics. Moreover, SRCC and PLCC are sensitive to test-sample quality distributions, yielding unstable comparisons across test sets. To address these limitations, we propose Granularity-Modulated Correlation (GMC), which provides a structured, fine-grained analysis of IQA performance. GMC includes: (1) a Granularity Modulator that applies Gaussian-weighted correlations conditioned on absolute MOS values and pairwise MOS differences (|ΔMOS|) to examine local performance variations, and (2) a Distribution Regulator that regularizes correlations to mitigate biases from non-uniform quality distributions. The resulting correlation surface maps correlation values as a joint function of MOS and |ΔMOS|, providing a 3D representation of IQA performance. Experiments on standard benchmarks show that GMC reveals performance characteristics invisible to scalar metrics, offering a more informative and reliable paradigm for analyzing, comparing, and deploying IQA models. Codes are available at https://github.com/Dniaaa/GMC.
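The idea of a Gaussian-weighted correlation conditioned on MOS can be sketched as a weighted Pearson coefficient whose weights peak at a chosen MOS value. This is a simplified illustration only: it conditions on absolute MOS alone, omits the |ΔMOS| axis and the Distribution Regulator, and the function name and parameters (`center`, `sigma`) are assumptions rather than the GMC implementation.

```python
import math


def gaussian_weighted_pearson(pred, mos, center, sigma):
    """Pearson correlation of pred vs. mos, with samples weighted by a
    Gaussian of width sigma centered on the MOS value `center`."""
    if len(pred) != len(mos) or len(pred) < 2:
        raise ValueError("pred and mos must have equal length >= 2")
    w = [math.exp(-0.5 * ((m - center) / sigma) ** 2) for m in mos]
    sw = sum(w)
    mean_p = sum(wi * p for wi, p in zip(w, pred)) / sw
    mean_m = sum(wi * m for wi, m in zip(w, mos)) / sw
    cov = sum(wi * (p - mean_p) * (m - mean_m) for wi, p, m in zip(w, pred, mos)) / sw
    var_p = sum(wi * (p - mean_p) ** 2 for wi, p in zip(w, pred)) / sw
    var_m = sum(wi * (m - mean_m) ** 2 for wi, m in zip(w, mos)) / sw
    return cov / math.sqrt(var_p * var_m)


# A toy model that orders high-MOS images correctly but swaps neighbors
# in the low-MOS range: the local correlation differs across the spectrum.
mos = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
pred = [2.0, 1.0, 4.0, 3.0, 5.0, 6.0, 7.0, 8.0]
local_low = gaussian_weighted_pearson(pred, mos, center=2.0, sigma=1.0)
local_high = gaussian_weighted_pearson(pred, mos, center=7.0, sigma=1.0)
```

Sweeping `center` over the MOS range gives one slice of the kind of surface the abstract describes; a global PLCC would collapse both regions into one number.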
IV · Aug 5, 2022
DeepWSD: Projecting Degradations in Perceptual Space to Wasserstein Distance in Deep Feature Space
Xingran Liao, Baoliang Chen, Hanwei Zhu et al.
Existing deep learning-based full-reference IQA (FR-IQA) models usually predict the image quality in a deterministic way by explicitly comparing the features, gauging how severely distorted an image is by how far the corresponding feature lies from the space of the reference images. Herein, we look at this problem from a different viewpoint and propose to model the quality degradation in perceptual space from a statistical distribution perspective. As such, the quality is measured based upon the Wasserstein distance in the deep feature domain. More specifically, the 1D Wasserstein distance at each stage of the pre-trained VGG network is measured, based on which the final quality score is computed. The deep Wasserstein distance (DeepWSD) performed on features from neural networks enjoys better interpretability of the quality contamination caused by various types of distortions and presents an advanced quality prediction capability. Extensive experiments and theoretical analysis show the superiority of the proposed DeepWSD in terms of both quality prediction and optimization.
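For two empirical distributions with the same number of samples, the 1D Wasserstein-1 distance reduces to the mean absolute difference between sorted samples. The sketch below shows that computation and a simple average over stages; treating each stage as a flat list of activations and averaging the per-stage distances are assumptions made for illustration, not the paper's exact pooling.

```python
def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size empirical samples:
    sort both and average the absolute differences of matched order statistics."""
    if len(a) != len(b) or not a:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def stagewise_distance(ref_stages, dist_stages):
    """Average the per-stage 1D distances (hypothetical pooling choice)."""
    if len(ref_stages) != len(dist_stages) or not ref_stages:
        raise ValueError("need the same non-zero number of stages")
    per_stage = [wasserstein_1d(r, d) for r, d in zip(ref_stages, dist_stages)]
    return sum(per_stage) / len(per_stage)


# Flattened activations from two hypothetical network stages.
reference = [[0.0, 1.0, 2.0], [5.0, 5.0]]
distorted = [[3.0, 1.0, 2.0], [7.0, 5.0]]
score = stagewise_distance(reference, distorted)
```

Sorting makes the measure depend on the distribution of feature values rather than on their spatial positions, which is the distributional viewpoint the abstract contrasts with point-wise feature comparison.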
CV · Nov 14, 2025 · Code
Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models
Jiaxi Huang, Dongxu Wu, Hanwei Zhu et al.
The rapid advancement of Multi-modal Large Language Models (MLLMs) has expanded their capabilities beyond high-level vision tasks. Nevertheless, their potential for Document Image Quality Assessment (DIQA) remains underexplored. To bridge this gap, we propose Q-Doc, a three-tiered evaluation framework for systematically probing DIQA capabilities of MLLMs at coarse, middle, and fine granularity levels. a) At the coarse level, we instruct MLLMs to assign quality scores to document images and analyze their correlation with Quality Annotations. b) At the middle level, we design distortion-type identification tasks, including single-choice and multi-choice tests for multi-distortion scenarios. c) At the fine level, we introduce distortion-severity assessment where MLLMs classify distortion intensity against human-annotated references. Our evaluation demonstrates that while MLLMs possess nascent DIQA abilities, they exhibit critical limitations: inconsistent scoring, distortion misidentification, and severity misjudgment. Significantly, we show that Chain-of-Thought (CoT) prompting substantially enhances performance across all levels. Our work provides a benchmark for DIQA capabilities in MLLMs, revealing pronounced deficiencies in their quality perception and promising pathways for enhancement. The benchmark and code are publicly available at: https://github.com/cydxf/Q-Doc.
CV · Sep 13, 2022
Just Noticeable Difference Modeling for Face Recognition System
Yu Tian, Zhangkai Ni, Baoliang Chen et al.
High-quality face images are required to guarantee the stability and reliability of automatic face recognition (FR) systems in surveillance and security scenarios. However, a massive amount of face data is usually compressed before being analyzed due to limitations on transmission or storage. The compressed images may lose the powerful identity information, resulting in the performance degradation of the FR system. Herein, we make the first attempt to study just noticeable difference (JND) for the FR system, which can be defined as the maximum distortion that the FR system cannot notice. More specifically, we establish a JND dataset including 3530 original images and 137,670 compressed images generated by advanced reference encoding/decoding software based on the Versatile Video Coding (VVC) standard (VTM-15.0). Subsequently, we develop a novel JND prediction model to directly infer JND images for the FR system. In particular, to maximize redundancy removal without impairing robust identity information, we apply the encoder with multiple feature extraction and attention-based feature decomposition modules to progressively decompose face features into two uncorrelated components, i.e., identity and residual features, via self-supervised learning. Then, the residual feature is fed into the decoder to generate the residual map. Finally, the predicted JND map is obtained by subtracting the residual map from the original image. Experimental results have demonstrated that the proposed model achieves higher accuracy of JND map prediction compared with the state-of-the-art JND models, and is capable of saving more bits while maintaining the performance of the FR system compared with VTM-15.0.
CV · Aug 22, 2024
Unrolled Decomposed Unpaired Learning for Controllable Low-Light Video Enhancement
Lingyu Zhu, Wenhan Yang, Baoliang Chen et al.
Obtaining pairs of low/normal-light videos with motion is more challenging than obtaining pairs of still images, which makes unpaired learning a critical technical route. This paper explores learning for low-light video enhancement without using paired ground truth. Compared to low-light image enhancement, enhancing low-light videos is more difficult due to the intertwined effects of noise, exposure, and contrast in the spatial domain, jointly with the need for temporal coherence. To address the above challenge, we propose the Unrolled Decomposed Unpaired Network (UDU-Net) for enhancing low-light videos by unrolling the optimization functions into a deep network to decompose the signal into spatial and temporal-related factors, which are updated iteratively. Firstly, we formulate low-light video enhancement as a Maximum A Posteriori estimation (MAP) problem with carefully designed spatial and temporal visual regularization. Then, via unrolling the problem, the optimization of the spatial and temporal constraints can be decomposed into different steps and updated in a stage-wise manner. From the spatial perspective, the designed Intra subnet leverages unpaired prior information from expert photo retouching to adjust the statistical distribution. Additionally, we introduce a novel mechanism that integrates human perception feedback to guide network optimization, suppressing over/under-exposure conditions. Meanwhile, to address the issue from the temporal perspective, the designed Inter subnet fully exploits temporal cues in progressive optimization, which helps achieve improved temporal consistency in enhancement results. Consequently, the proposed method achieves superior performance to state-of-the-art methods in video illumination, noise suppression, and temporal consistency across outdoor and indoor scenes.
CV · Feb 2, 2024 · Code
2AFC Prompting of Large Multimodal Models for Image Quality Assessment
Hanwei Zhu, Xiangjie Sui, Baoliang Chen et al.
While abundant research has been conducted on improving high-level visual understanding and reasoning capabilities of large multimodal models (LMMs), their visual quality assessment (IQA) ability has been relatively under-explored. Here we take initial steps towards this goal by employing the two-alternative forced choice (2AFC) prompting, as 2AFC is widely regarded as the most reliable way of collecting human opinions of visual quality. Subsequently, the global quality score of each image estimated by a particular LMM can be efficiently aggregated using the maximum a posteriori estimation. Meanwhile, we introduce three evaluation criteria: consistency, accuracy, and correlation, to provide comprehensive quantifications and deeper insights into the IQA capability of five LMMs. Extensive experiments show that existing LMMs exhibit remarkable IQA ability on coarse-grained quality comparison, but there is room for improvement on fine-grained quality discrimination. The proposed dataset sheds light on the future development of IQA models based on LMMs. The codes will be made publicly available at https://github.com/h4nwei/2AFC-LMMs.
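As a rough illustration of turning pairwise 2AFC preferences into global scores by maximum a posteriori estimation, the sketch below fits a Bradley-Terry-style logistic model with a zero-mean Gaussian prior using plain gradient ascent. The abstract does not specify the observation model or optimizer, so the logistic link, the prior variance, the step size, and the function name are all assumptions.

```python
import math


def map_quality_scores(pref, prior_var=10.0, lr=0.05, n_iter=2000):
    """MAP scores from a preference matrix.

    pref[i][j] is the fraction (or count) of comparisons in which image i
    was preferred over image j. Assumes P(i beats j) = sigmoid(s_i - s_j)
    and an independent N(0, prior_var) prior on every score.
    """
    n = len(pref)
    s = [0.0] * n
    for _ in range(n_iter):
        grad = [-si / prior_var for si in s]  # gradient of the log-prior
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                p_ij = 1.0 / (1.0 + math.exp(-(s[i] - s[j])))
                # gradient of the log-likelihood with respect to s[i]
                grad[i] += pref[i][j] * (1.0 - p_ij) - pref[j][i] * p_ij
        s = [si + lr * g for si, g in zip(s, grad)]
    return s


# Hypothetical preference fractions from an LMM for three images.
preferences = [
    [0.0, 0.9, 0.8],
    [0.1, 0.0, 0.7],
    [0.2, 0.3, 0.0],
]
scores = map_quality_scores(preferences)
```

The prior keeps the scores finite even when one image wins every comparison, which is a common reason to prefer MAP over plain maximum likelihood in this setting.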
CV · Dec 28, 2025
Plug In, Grade Right: Psychology-Inspired AGIQA
Zhicheng Liao, Baoliang Chen, Hanwei Zhu et al.
Existing AGIQA models typically estimate image quality by measuring and aggregating the similarities between image embeddings and text embeddings derived from multi-grade quality descriptions. Although effective, we observe that such similarity distributions across grades usually exhibit multimodal patterns. For instance, an image embedding may show high similarity to both "excellent" and "poor" grade descriptions while deviating from the "good" one. We refer to this phenomenon as "semantic drift", where semantic inconsistencies between text embeddings and their intended descriptions undermine the reliability of text-image shared-space learning. To mitigate this issue, we draw inspiration from psychometrics and propose an improved Graded Response Model (GRM) for AGIQA. The GRM is a classical assessment model that categorizes a subject's ability across grades using test items with various difficulty levels. This paradigm aligns remarkably well with human quality rating, where image quality can be interpreted as an image's ability to meet various quality grades. Building on this philosophy, we design a two-branch quality grading module: one branch estimates image ability while the other constructs multiple difficulty levels. To ensure monotonicity in difficulty levels, we further model difficulty generation in an arithmetic manner, which inherently enforces a unimodal and interpretable quality distribution. Our Arithmetic GRM based Quality Grading (AGQG) module enjoys a plug-and-play advantage, consistently improving performance when integrated into various state-of-the-art AGIQA frameworks. Moreover, it also generalizes effectively to both natural and screen content image quality assessment, revealing its potential as a key component in future IQA models.
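The classical Graded Response Model mentioned above can be written in a few lines: the probability of reaching at least grade k is a logistic function of ability minus the k-th difficulty, and grade probabilities are differences of adjacent cumulative terms. The sketch below uses difficulties in arithmetic progression to keep them ordered; it shows the textbook model only, and the function names, the `discrimination` parameter, and the example values are assumptions, not the AGQG module.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def arithmetic_difficulties(first, step, n_thresholds):
    """Difficulty levels b_k = first + k * step; step > 0 guarantees ordering."""
    if step <= 0:
        raise ValueError("step must be positive to keep difficulties increasing")
    return [first + k * step for k in range(n_thresholds)]


def grade_probabilities(ability, difficulties, discrimination=1.0):
    """Probabilities of grades 0..K under the graded response model."""
    # cum[k] = P(grade >= k); grade 0 is always reached, grade K+1 never.
    cum = [1.0]
    cum += [sigmoid(discrimination * (ability - b)) for b in difficulties]
    cum += [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


def expected_grade(probs):
    return sum(k * p for k, p in enumerate(probs))


# Five grades (e.g. bad .. excellent) need four thresholds.
thresholds = arithmetic_difficulties(first=-1.5, step=1.0, n_thresholds=4)
probs_mid = grade_probabilities(0.0, thresholds)
probs_high = grade_probabilities(2.0, thresholds)
```

Because the thresholds are increasing, every grade probability is non-negative and the expected grade rises with ability.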
CV · Mar 3
EduVQA: Benchmarking AI-Generated Video Quality Assessment for Education
Baoliang Chen, Xinlong Bu, Lingyu Zhu et al.
While AI-generated content (AIGC) models have achieved remarkable success in generating photorealistic videos, their potential to support visual, story-driven learning in education remains largely untapped. To close this gap, we present EduAIGV-1k, the first benchmark dataset and evaluation framework dedicated to assessing the quality of AI-generated videos (AIGVs) designed to teach foundational math concepts, such as numbers and geometry, to young learners. EduAIGV-1k contains 1,130 short videos produced by ten state-of-the-art text-to-video (T2V) models using 113 pedagogy-oriented prompts. Each video is accompanied by rich, fine-grained annotations along two complementary axes: (1) Perceptual quality, disentangled into spatial and temporal fidelity, and (2) Prompt alignment, labeled at the word-level and sentence-level to quantify the degree to which each mathematical concept in the prompt is accurately grounded in the generated video. These fine-grained annotations transform each video into a multi-dimensional, interpretable supervision signal, far beyond a single quality score. Leveraging this dense feedback, we introduce EduVQA for both perceptual and alignment quality assessment of AIGVs. In particular, we propose a Structured 2D Mixture-of-Experts (S2D-MoE) module, which enhances the dependency between overall quality and each sub-dimension through shared experts and a dynamic 2D gating matrix. Extensive experiments show our EduVQA consistently outperforms existing VQA baselines. Both our dataset and code will be publicly available.
CV · Nov 13, 2025
Beyond Cosine Similarity Magnitude-Aware CLIP for No-Reference Image Quality Assessment
Zhicheng Liao, Dongxu Wu, Zhenshan Shi et al.
Recent efforts have repurposed the Contrastive Language-Image Pre-training (CLIP) model for No-Reference Image Quality Assessment (NR-IQA) by measuring the cosine similarity between the image embedding and textual prompts such as "a good photo" or "a bad photo." However, this semantic similarity overlooks a critical yet underexplored cue: the magnitude of the CLIP image features, which we empirically find to exhibit a strong correlation with perceptual quality. In this work, we introduce a novel adaptive fusion framework that complements cosine similarity with a magnitude-aware quality cue. Specifically, we first extract the absolute CLIP image features and apply a Box-Cox transformation to statistically normalize the feature distribution and mitigate semantic sensitivity. The resulting scalar summary serves as a semantically-normalized auxiliary cue that complements cosine-based prompt matching. To integrate both cues effectively, we further design a confidence-guided fusion scheme that adaptively weighs each term according to its relative strength. Extensive experiments on multiple benchmark IQA datasets demonstrate that our method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines, without any task-specific training.
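The magnitude cue described above (absolute feature values, a Box-Cox transformation, then a scalar summary) can be sketched as follows. The choice of the mean as the summary, the default `lam`, and the small `eps` offset that keeps inputs strictly positive are assumptions for illustration; the confidence-guided fusion with the cosine term is not shown because the abstract does not specify it.

```python
import math


def box_cox(value, lam):
    """Box-Cox transform of a strictly positive value."""
    if value <= 0:
        raise ValueError("Box-Cox requires a strictly positive input")
    if lam == 0:
        return math.log(value)
    return (value ** lam - 1.0) / lam


def magnitude_cue(features, lam=0.5, eps=1e-6):
    """Scalar summary of feature magnitudes after a Box-Cox transform.

    eps is a hypothetical offset so that zero-valued features stay valid;
    the mean is an assumed pooling choice.
    """
    if not features:
        raise ValueError("features must be non-empty")
    transformed = [box_cox(abs(f) + eps, lam) for f in features]
    return sum(transformed) / len(transformed)


# Toy embeddings: the second has uniformly larger magnitudes.
weak = magnitude_cue([0.1, -0.2, 0.05, -0.1])
strong = magnitude_cue([0.4, -0.8, 0.2, -0.4])
```

Box-Cox is monotone in its input, so the ordering of magnitudes is kept while the skew of the distribution is reduced.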
CV · Nov 17, 2025 · Code
Simple Lines, Big Ideas: Towards Interpretable Assessment of Human Creativity from Drawings
Zihao Lin, Zhenshan Shi, Sasa Zhao et al.
Assessing human creativity through visual outputs, such as drawings, plays a critical role in fields including psychology, education, and cognitive science. However, current assessment practices still rely heavily on expert-based subjective scoring, which is both labor-intensive and inherently subjective. In this paper, we propose a data-driven framework for automatic and interpretable creativity assessment from drawings. Motivated by the cognitive evidence proposed in [6] that creativity can emerge from both what is drawn (content) and how it is drawn (style), we reinterpret the creativity score as a function of these two complementary dimensions. Specifically, we first augment an existing creativity-labeled dataset with additional annotations targeting content categories. Based on the enriched dataset, we further propose a conditional model predicting content, style, and ratings simultaneously. In particular, the conditional learning mechanism enables the model to adapt its visual feature extraction by dynamically tuning it to creativity-relevant signals conditioned on the drawing's stylistic and semantic cues. Experimental results demonstrate that our model achieves state-of-the-art performance compared to existing regression-based approaches and offers interpretable visualizations that align well with human judgments. The code and annotations will be made publicly available at https://github.com/WonderOfU9/CSCA_PRCV_2025
IV · Aug 9, 2021 · Code
No-Reference Image Quality Assessment by Hallucinating Pristine Features
Baoliang Chen, Lingyu Zhu, Chenqi Kong et al.
In this paper, we propose a no-reference (NR) image quality assessment (IQA) method via feature level pseudo-reference (PR) hallucination. The proposed quality assessment framework is grounded on the prior models of natural image statistical behaviors and rooted in the view that the perceptually meaningful features could be well exploited to characterize the visual quality. Herein, the PR features from the distorted images are learned by a mutual learning scheme with the pristine reference as the supervision, and the discriminative characteristics of PR features are further ensured with the triplet constraints. Given a distorted image for quality inference, the feature level disentanglement is performed with an invertible neural layer for final quality prediction, leading to the PR and the corresponding distortion features for comparison. The effectiveness of our proposed method is demonstrated on four popular IQA databases, and superior performance on cross-database evaluation also reveals the high generalization capability of our method. The implementation of our method is publicly available on https://github.com/Baoliang93/FPR.
CV · Dec 20, 2024
AI-generated Image Quality Assessment in Visual Communication
Yu Tian, Yixuan Li, Baoliang Chen et al.
Assessing the quality of artificial intelligence-generated images (AIGIs) plays a crucial role in their application in real-world scenarios. However, traditional image quality assessment (IQA) algorithms primarily focus on low-level visual perception, while existing IQA works on AIGIs overemphasize the generated content itself, neglecting its effectiveness in real-world applications. To bridge this gap, we propose AIGI-VC, a quality assessment database for AI-Generated Images in Visual Communication, which studies the communicability of AIGIs in the advertising field from the perspectives of information clarity and emotional interaction. The dataset consists of 2,500 images spanning 14 advertisement topics and 8 emotion types. It provides coarse-grained human preference annotations and fine-grained preference descriptions, benchmarking the abilities of IQA methods in preference prediction, interpretation, and reasoning. We conduct an empirical study of existing representative IQA methods and large multi-modal models on the AIGI-VC dataset, uncovering their strengths and weaknesses.
CV · Nov 19, 2024
Mitigating Perception Bias: A Training-Free Approach to Enhance LMM for Image Quality Assessment
Baoliang Chen, Siyi Pan, Dongxu Wu et al.
Despite the impressive performance of large multimodal models (LMMs) in high-level visual tasks, their capacity for image quality assessment (IQA) remains limited. One main reason is that LMMs are primarily trained for high-level tasks (e.g., image captioning), emphasizing unified image semantics extraction under varied quality. Such semantic-aware yet quality-insensitive perception bias inevitably leads to a heavy reliance on image semantics when those LMMs are asked to rate quality. In this paper, instead of costly retraining or tuning of an LMM, we propose a training-free debiasing framework, in which the image quality prediction is rectified by mitigating the bias caused by image semantics. Specifically, we first explore several semantic-preserving distortions that can significantly degrade image quality while maintaining identifiable semantics. By applying these specific distortions to the query or test images, we ensure that the degraded images are recognized as poor quality while their semantics are largely preserved. During quality inference, both a query image and its corresponding degraded version are fed to the LMM along with a prompt indicating that the query image quality should be inferred under the condition that the degraded one is deemed poor quality. This prior condition effectively aligns the LMM's quality perception, as all degraded images are consistently rated as poor quality, regardless of their semantic variance. Finally, the quality scores of the query image inferred under different prior conditions (degraded versions) are aggregated using a conditional probability model. Extensive experiments on various IQA datasets show that our debiasing framework could consistently enhance the LMM performance.
CV · Nov 24, 2025
Benchmarking Corruption Robustness of LVLMs: A Discriminative Benchmark and Robustness Alignment Metric
Xiangjie Sui, Songyang Li, Hanwei Zhu et al.
Despite the remarkable reasoning abilities of large vision-language models (LVLMs), their robustness under visual corruptions remains insufficiently studied. Existing evaluation paradigms exhibit two major limitations: 1) the dominance of low-discriminative samples in current datasets masks the real robustness gap between models; and 2) conventional accuracy-based metrics fail to capture the degradation of the underlying prediction structure. To bridge these gaps, we introduce Bench-C, a comprehensive benchmark emphasizing discriminative samples for assessing corruption robustness, where a selection strategy is proposed to jointly consider the prediction inconsistency under corruption and the semantic diversity. Furthermore, we propose the Robustness Alignment Score (RAS), a unified metric that measures degradation in logit-level prediction structure by considering the shifts in prediction uncertainty and calibration alignment. Comprehensive experiments and analysis reveal several interesting findings: 1) model behaviors exhibit distinct patterns under corruptions, such as erroneous confidence and hesitation; 2) although subtle corruption may lead to a slight accuracy gain, the overall prediction structure still degrades; 3) by decomposing corruption robustness into destructive and corrective components, the distinct failure and recovery patterns across models can be revealed.
CV Sep 30, 2025
AgenticIQA: An Agentic Framework for Adaptive and Interpretable Image Quality Assessment
Hanwei Zhu, Yu Tian, Keyan Ding et al.
Image quality assessment (IQA) is inherently complex, as it reflects both the quantification and interpretation of perceptual quality rooted in the human visual system. Conventional approaches typically rely on fixed models to output scalar scores, limiting their adaptability to diverse distortions, user-specific queries, and interpretability needs. Furthermore, scoring and interpretation are often treated as independent processes, despite their interdependence: interpretation identifies perceptual degradations, while scoring abstracts them into a compact metric. To address these limitations, we propose AgenticIQA, a modular agentic framework that integrates vision-language models (VLMs) with traditional IQA tools in a dynamic, query-aware manner. AgenticIQA decomposes IQA into four subtasks -- distortion detection, distortion analysis, tool selection, and tool execution -- coordinated by a planner, executor, and summarizer. The planner formulates task-specific strategies, the executor collects perceptual evidence via tool invocation, and the summarizer integrates this evidence to produce accurate scores with human-aligned explanations. To support training and evaluation, we introduce AgenticIQA-200K, a large-scale instruction dataset tailored for IQA agents, and AgenticIQA-Eval, the first benchmark for assessing the planning, execution, and summarization capabilities of VLM-based IQA agents. Extensive experiments across diverse IQA datasets demonstrate that AgenticIQA consistently surpasses strong baselines in both scoring accuracy and explanatory alignment.
LG Aug 7, 2025
Disentangling Bias by Modeling Intra- and Inter-modal Causal Attention for Multimodal Sentiment Analysis
Menghua Jiang, Yuxia Lin, Baoliang Chen et al.
Multimodal sentiment analysis (MSA) aims to understand human emotions by integrating information from multiple modalities, such as text, audio, and visual data. However, existing methods often suffer from spurious correlations both within and across modalities, leading models to rely on statistical shortcuts rather than true causal relationships, thereby undermining generalization. To mitigate this issue, we propose a Multi-relational Multimodal Causal Intervention (MMCI) model, which leverages the backdoor adjustment from causal theory to address the confounding effects of such shortcuts. Specifically, we first model the multimodal inputs as a multi-relational graph to explicitly capture intra- and inter-modal dependencies. Then, we apply an attention mechanism to separately estimate and disentangle the causal features and shortcut features corresponding to these intra- and inter-modal relations. Finally, by applying the backdoor adjustment, we stratify the shortcut features and dynamically combine them with the causal features to encourage MMCI to produce stable predictions under distribution shifts. Extensive experiments on several standard MSA datasets and out-of-distribution (OOD) test sets demonstrate that our method effectively suppresses biases and improves performance.
IV Feb 20, 2022
The Loop Game: Quality Assessment and Optimization for Low-Light Image Enhancement
Danni Huang, Lingyu Zhu, Zihao Lin et al.
There is an increasing consensus that the design and optimization of low-light image enhancement methods need to be fully driven by perceptual quality. While numerous approaches have been proposed to enhance low-light images, much less work has been dedicated to quality assessment and quality optimization of low-light enhancement. In this paper, to close the gap between enhancement and assessment, we propose a loop enhancement framework that produces a clear picture of how the enhancement of low-light images could be optimized towards better visual quality. In particular, we create a large-scale database for QUality assessment Of The Enhanced LOw-Light Image (QUOTE-LOL), which serves as the foundation for studying and developing objective quality assessment measures. The objective quality assessment measure plays a critical bridging role between visual quality and enhancement and is further incorporated into the optimization when learning the enhancement model towards perceptual optimality. Finally, we iteratively perform the enhancement and optimization tasks, enhancing the low-light images continuously. The superiority of the proposed scheme is validated on various low-light scenes.
CV Jan 28, 2022
Generalized Visual Quality Assessment of GAN-Generated Face Images
Yu Tian, Zhangkai Ni, Baoliang Chen et al.
Recent years have witnessed a dramatic increase in interest in face generation with generative adversarial networks (GANs). A number of successful GAN algorithms have been developed to produce vivid face images for different application scenarios. However, little work has been dedicated to automatic quality assessment of such GAN-generated face images (GFIs), and even less has been devoted to generalized and robust quality assessment of GFIs generated by unseen GAN models. Herein, we make the first attempt to study the subjective and objective quality of GFIs with the goal of generalized quality assessment. More specifically, we establish a large-scale database consisting of GFIs from four GAN algorithms, the pseudo labels from image quality assessment (IQA) measures, as well as the human opinion scores obtained via subjective testing. Subsequently, we develop a quality assessment model based on meta-learning that is able to deliver accurate quality predictions for GFIs from both available and unseen GAN algorithms. In particular, to learn shared knowledge from GFI pairs produced by a limited set of GAN algorithms, we develop the convolutional block attention (CBA) and facial attributes-based analysis (ABA) modules, ensuring that the learned knowledge tends to be consistent with human visual perception. Extensive experiments show that the proposed model achieves better performance than state-of-the-art IQA models and remains effective when evaluating GFIs from unseen GAN algorithms.
CV Jul 13, 2021
Detect and Locate: Exposing Face Manipulation by Semantic- and Noise-level Telltales
Chenqi Kong, Baoliang Chen, Haoliang Li et al.
The technological advancements of deep learning have enabled sophisticated face manipulation schemes, raising severe trust issues and security concerns in modern society. Generally speaking, detecting manipulated faces and locating the potentially altered regions are challenging tasks. Herein, we propose a conceptually simple but effective method to efficiently detect forged faces in an image while simultaneously locating the manipulated regions. The proposed scheme relies on a segmentation map that delivers meaningful high-level semantic information clues about the image. Furthermore, a noise map is estimated, playing a complementary role in capturing low-level clues and subsequently empowering decision-making. Finally, the features from these two modules are combined to distinguish fake faces. Extensive experiments show that the proposed model achieves state-of-the-art detection accuracy and remarkable localization performance.
CV Jan 25, 2021
Camera Invariant Feature Learning for Generalized Face Anti-spoofing
Baoliang Chen, Wenhan Yang, Haoliang Li et al.
There has been an increasing consensus in learning-based face anti-spoofing that the divergence in terms of camera models is causing a large domain gap in real application scenarios. We describe a framework that eliminates the influence of inherent variance from acquisition cameras at the feature level, leading to a generalized face spoofing detection model that is highly adaptive to different acquisition devices. In particular, the framework is composed of two branches. The first branch aims to learn the camera-invariant spoofing features via feature-level decomposition in the high-frequency domain. Motivated by the fact that the spoofing features exist not only in the high-frequency domain, the second branch further boosts the discrimination capability of the extracted spoofing features using an enhanced image obtained by recomposing the high-frequency and low-frequency information. Finally, the classification results of the two branches are fused together by a weighting strategy. Experiments show that the proposed method can achieve better performance in both intra-dataset and cross-dataset settings, demonstrating its high generalization capability in various application scenarios.
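The split into high-frequency and low-frequency information, and their recomposition into an enhanced image, can be sketched on a tiny grayscale image as follows. This is only an illustration under stated assumptions: the 3x3 box blur as the low-pass filter, the `gain` parameter, and all function names are hypothetical, and the abstract does not say which decomposition the framework actually uses.

```python
def box_blur(image):
    """3x3 mean filter with the window clipped at the borders (low-pass)."""
    h, w = len(image), len(image[0])
    blurred = []
    for r in range(h):
        row = []
        for c in range(w):
            window = [image[rr][cc]
                      for rr in range(max(0, r - 1), min(h, r + 2))
                      for cc in range(max(0, c - 1), min(w, c + 2))]
            row.append(sum(window) / len(window))
        blurred.append(row)
    return blurred

def decompose(image):
    """Return (low, high), where low is the blurred image and high = image - low."""
    low = box_blur(image)
    high = [[p - l for p, l in zip(img_row, low_row)]
            for img_row, low_row in zip(image, low)]
    return low, high

def recompose(low, high, gain=1.0):
    """Recombine the two bands; gain > 1 amplifies the high-frequency detail."""
    return [[l + gain * h for l, h in zip(low_row, high_row)]
            for low_row, high_row in zip(low, high)]

image = [[0.0, 0.0, 0.0],
         [0.0, 9.0, 0.0],
         [0.0, 0.0, 0.0]]
low, high = decompose(image)
restored = recompose(low, high)            # gain = 1 gives back the input
enhanced = recompose(low, high, gain=2.0)  # detail-boosted version
```

With `gain=1.0` the recomposition is lossless by construction, so any change in the enhanced image comes only from how the high-frequency band is reweighted.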
IV Dec 27, 2020
Learning Generalized Spatial-Temporal Deep Feature Representation for No-Reference Video Quality Assessment
Baoliang Chen, Lingyu Zhu, Guo Li et al.
In this work, we propose a no-reference video quality assessment method, aiming to achieve high generalization capability in cross-content, cross-resolution, and cross-frame-rate quality prediction. In particular, we evaluate the quality of a video by learning effective feature representations in the spatial-temporal domain. In the spatial domain, to tackle the resolution and content variations, we impose Gaussian distribution constraints on the quality features. The unified distribution can significantly reduce the domain gap between different video samples, resulting in a more generalized quality feature representation. Along the temporal dimension, inspired by the mechanism of visual perception, we propose a pyramid temporal aggregation module that involves short-term and long-term memory to aggregate the frame-level quality. Experiments show that our method outperforms the state-of-the-art methods in cross-dataset settings and achieves comparable performance in intra-dataset configurations, demonstrating the high generalization capability of the proposed method.
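As a rough illustration of aggregating frame-level quality at several temporal scales, the sketch below pools per-frame scores over short windows, longer windows, and the whole clip, then averages the per-level summaries. It is not the paper's pyramid temporal aggregation module: the window sizes, the choice of the worst window as each level's summary, and the function names are all assumptions made for this example.

```python
def pool_windows(scores, window):
    """Mean score of each consecutive, non-overlapping window of frames."""
    pooled = []
    for start in range(0, len(scores), window):
        chunk = scores[start:start + window]
        pooled.append(sum(chunk) / len(chunk))
    return pooled

def pyramid_aggregate(frame_scores, windows=(2, 4)):
    """Hypothetical multi-scale pooling of frame-level quality scores.

    Each short-term level is summarized by its worst window (an assumption:
    brief quality drops tend to dominate perceived quality), and the
    long-term level is the plain mean over the whole clip.
    """
    if not frame_scores:
        raise ValueError("frame_scores must not be empty")
    level_summaries = [min(pool_windows(frame_scores, w)) for w in windows]
    level_summaries.append(sum(frame_scores) / len(frame_scores))
    return sum(level_summaries) / len(level_summaries)

# A clip with a brief quality drop scores lower than its plain mean suggests.
scores = [4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 4.0, 4.0]
video_score = pyramid_aggregate(scores)
plain_mean = sum(scores) / len(scores)
```

The design choice worth noting is that a single global average hides short degradations, whereas combining several timescales lets a brief drop pull the video-level score down.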
CV Aug 19, 2020
No-reference Screen Content Image Quality Assessment with Unsupervised Domain Adaptation
Baoliang Chen, Haoliang Li, Hongfei Fan et al.
In this paper, we investigate the capability of transferring the quality of natural scene images to images that are not acquired by optical cameras (e.g., screen content images, SCIs), rooted in the widely accepted view that the human visual system has adapted and evolved through the perception of the natural environment. Here, we develop the first unsupervised domain adaptation based no-reference quality assessment method for SCIs, leveraging rich subjective ratings of natural images (NIs). In general, it is a non-trivial task to directly transfer the quality prediction model from NIs to a new type of content (i.e., SCIs) that holds dramatically different statistical characteristics. Inspired by the transferability of pair-wise relationships, the proposed quality measure operates based on the philosophy of improving the transferability and discriminability simultaneously. In particular, we introduce three types of losses which complementarily and explicitly regularize the feature space of ranking in a progressive manner. To enhance feature discriminability, we propose a center-based loss to rectify the classifier and improve its prediction capability not only for the source domain (NI) but also for the target domain (SCI). For feature discrepancy minimization, the maximum mean discrepancy (MMD) is imposed on the extracted ranking features of NIs and SCIs. Furthermore, to enhance the feature diversity, we introduce a correlation penalty between different feature dimensions, leading to features with lower rank and higher diversity. Experiments show that our method can achieve higher performance on different source-target settings based on a lightweight convolutional neural network. The proposed method also sheds light on learning quality assessment measures for unseen application-specific content without cumbersome and costly subjective evaluations.
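The maximum mean discrepancy mentioned above can be sketched with the standard biased estimator of squared MMD under a Gaussian (RBF) kernel. The abstract does not state which kernel or bandwidth the method uses, so the RBF kernel, the `bandwidth` value, and the function names here are assumptions; the toy feature vectors stand in for NI and SCI ranking features.

```python
import math

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel between two feature vectors of equal length."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of squared MMD: E[k(s,s')] + E[k(t,t')] - 2 E[k(s,t)]."""
    return (mean_kernel(source, source, bandwidth)
            + mean_kernel(target, target, bandwidth)
            - 2.0 * mean_kernel(source, target, bandwidth))

# Toy 2-D features: the target set is the source set shifted by (3, 3).
source_feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
target_feats = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]
gap = mmd_squared(source_feats, target_feats)
no_gap = mmd_squared(source_feats, source_feats)
```

Used as a training loss, a term like this is minimized so that the two feature distributions become harder to tell apart; the estimate is zero when both sets are identical and grows as they drift apart.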