CVNov 9, 2022Code
Exploring Video Quality Assessment on User Generated Contents from Aesthetic and Technical PerspectivesHaoning Wu, Erli Zhang, Liang Liao et al.
The rapid increase in user-generated-content (UGC) videos calls for the development of effective video quality assessment (VQA) algorithms. However, the objective of the UGC-VQA problem is still ambiguous and can be viewed from two perspectives: the technical perspective, measuring the perception of distortions; and the aesthetic perspective, which relates to preference and recommendation on contents. To understand how these two perspectives affect overall subjective opinions in UGC-VQA, we conduct a large-scale subjective study to collect human quality opinions on overall quality of videos as well as perceptions from aesthetic and technical perspectives. The collected Disentangled Video Quality Database (DIVIDE-3k) confirms that human quality opinions on UGC videos are universally and inevitably affected by both aesthetic and technical perspectives. In light of this, we propose the Disentangled Objective Video Quality Evaluator (DOVER) to learn the quality of UGC videos based on the two perspectives. The DOVER proves state-of-the-art performance in UGC-VQA under very high efficiency. With perspective opinions in DIVIDE-3k, we further propose DOVER++, the first approach to provide reliable clear-cut quality evaluations from a single aesthetic or technical perspective. Code at https://github.com/VQAssessment/DOVER.
CVJul 6, 2022Code
FAST-VQA: Efficient End-to-end Video Quality Assessment with Fragment SamplingHaoning Wu, Chaofeng Chen, Jingwen Hou et al.
Current deep video quality assessment (VQA) methods are usually with high computational costs when evaluating high-resolution videos. This cost hinders them from learning better video-quality-related representations via end-to-end training. Existing approaches typically consider naive sampling to reduce the computational cost, such as resizing and cropping. However, they obviously corrupt quality-related information in videos and are thus not optimal for learning good representations for VQA. Therefore, there is an eager need to design a new quality-retained sampling scheme for VQA. In this paper, we propose Grid Mini-patch Sampling (GMS), which allows consideration of local quality by sampling patches at their raw resolution and covers global quality with contextual relations via mini-patches sampled in uniform grids. These mini-patches are spliced and aligned temporally, named as fragments. We further build the Fragment Attention Network (FANet) specially designed to accommodate fragments as inputs. Consisting of fragments and FANet, the proposed FrAgment Sample Transformer for VQA (FAST-VQA) enables efficient end-to-end deep VQA and learns effective video-quality-related representations. It improves state-of-the-art accuracy by around 10% while reducing 99.5% FLOPs on 1080P high-resolution videos. The newly learned video-quality-related representations can also be transferred into smaller VQA datasets, boosting performance in these scenarios. Extensive experiments show that FAST-VQA has good performance on inputs of various resolutions while retaining high efficiency. We publish our code at https://github.com/timothyhtimothy/FAST-VQA.
CVAug 6, 2023Code
TOPIQ: A Top-down Approach from Semantics to Distortions for Image Quality AssessmentChaofeng Chen, Jiadi Mo, Jingwen Hou et al.
Image Quality Assessment (IQA) is a fundamental task in computer vision that has witnessed remarkable progress with deep neural networks. Inspired by the characteristics of the human visual system, existing methods typically use a combination of global and local representations (\ie, multi-scale features) to achieve superior performance. However, most of them adopt simple linear fusion of multi-scale features, and neglect their possibly complex relationship and interaction. In contrast, humans typically first form a global impression to locate important regions and then focus on local details in those regions. We therefore propose a top-down approach that uses high-level semantics to guide the IQA network to focus on semantically important local distortion regions, named as \emph{TOPIQ}. Our approach to IQA involves the design of a heuristic coarse-to-fine network (CFANet) that leverages multi-scale features and progressively propagates multi-level semantic information to low-level representations in a top-down manner. A key component of our approach is the proposed cross-scale attention mechanism, which calculates attention maps for lower level features guided by higher level features. This mechanism emphasizes active semantic regions for low-level distortions, thereby improving performance. CFANet can be used for both Full-Reference (FR) and No-Reference (NR) IQA. We use ResNet50 as its backbone and demonstrate that CFANet achieves better or competitive performance on most public FR and NR benchmarks compared with state-of-the-art methods based on vision transformers, while being much more efficient (with only ${\sim}13\%$ FLOPS of the current best FR method). Codes are released at \url{https://github.com/chaofengc/IQA-PyTorch}.
CVJun 7, 2023Code
AGIQA-3K: An Open Database for AI-Generated Image Quality AssessmentChunyi Li, Zicheng Zhang, Haoning Wu et al.
With the rapid advancements of the text-to-image generative model, AI-generated images (AGIs) have been widely applied to entertainment, education, social media, etc. However, considering the large quality variance among different AGIs, there is an urgent need for quality models that are consistent with human subjective ratings. To address this issue, we extensively consider various popular AGI models, generated AGI through different prompts and model parameters, and collected subjective scores at the perceptual quality and text-to-image alignment, thus building the most comprehensive AGI subjective quality database AGIQA-3K so far. Furthermore, we conduct a benchmark experiment on this database to evaluate the consistency between the current Image Quality Assessment (IQA) model and human perception, while proposing StairReward that significantly improves the assessment performance of subjective text-to-image alignment. We believe that the fine-grained subjective scores in AGIQA-3K will inspire subsequent AGI quality models to fit human subjective perception mechanisms at both perception and alignment levels and to optimize the generation result of future AGI models. The database is released on https://github.com/lcysyzxdxc/AGIQA-3k-Database.
CVMar 25, 2022Code
Adjacent Context Coordination Network for Salient Object Detection in Optical Remote Sensing ImagesGongyang Li, Zhi Liu, Dan Zeng et al.
Salient object detection (SOD) in optical remote sensing images (RSIs), or RSI-SOD, is an emerging topic in understanding optical RSIs. However, due to the difference between optical RSIs and natural scene images (NSIs), directly applying NSI-SOD methods to optical RSIs fails to achieve satisfactory results. In this paper, we propose a novel Adjacent Context Coordination Network (ACCoNet) to explore the coordination of adjacent features in an encoder-decoder architecture for RSI-SOD. Specifically, ACCoNet consists of three parts: an encoder, Adjacent Context Coordination Modules (ACCoMs), and a decoder. As the key component of ACCoNet, ACCoM activates the salient regions of output features of the encoder and transmits them to the decoder. ACCoM contains a local branch and two adjacent branches to coordinate the multi-level features simultaneously. The local branch highlights the salient regions in an adaptive way, while the adjacent branches introduce global information of adjacent levels to enhance salient regions. Additionally, to extend the capabilities of the classic decoder block (i.e., several cascaded convolutional layers), we extend it with two bifurcations and propose a Bifurcation-Aggregation Block to capture the contextual information in the decoder. Extensive experiments on two benchmark datasets demonstrate that the proposed ACCoNet outperforms 22 state-of-the-art methods under nine evaluation metrics, and runs up to 81 fps on a single NVIDIA Titan X GPU. The code and results of our method are available at https://github.com/MathLee/ACCoNet.
CVJan 7, 2023Code
Lightweight Salient Object Detection in Optical Remote-Sensing Images via Semantic Matching and Edge AlignmentGongyang Li, Zhi Liu, Xinpeng Zhang et al.
Recently, relying on convolutional neural networks (CNNs), many methods for salient object detection in optical remote sensing images (ORSI-SOD) are proposed. However, most methods ignore the huge parameters and computational cost brought by CNNs, and only a few pay attention to the portability and mobility. To facilitate practical applications, in this paper, we propose a novel lightweight network for ORSI-SOD based on semantic matching and edge alignment, termed SeaNet. Specifically, SeaNet includes a lightweight MobileNet-V2 for feature extraction, a dynamic semantic matching module (DSMM) for high-level features, an edge self-alignment module (ESAM) for low-level features, and a portable decoder for inference. First, the high-level features are compressed into semantic kernels. Then, semantic kernels are used to activate salient object locations in two groups of high-level features through dynamic convolution operations in DSMM. Meanwhile, in ESAM, cross-scale edge information extracted from two groups of low-level features is self-aligned through L2 loss and used for detail enhancement. Finally, starting from the highest-level features, the decoder infers salient objects based on the accurate locations and fine details contained in the outputs of the two modules. Extensive experiments on two public datasets demonstrate that our lightweight SeaNet not only outperforms most state-of-the-art lightweight methods but also yields comparable accuracy with state-of-the-art conventional methods, while having only 2.76M parameters and running with 1.7G FLOPs for 288x288 inputs. Our code and results are available at https://github.com/MathLee/SeaNet.
CVOct 11, 2022Code
Neighbourhood Representative Sampling for Efficient End-to-end Video Quality AssessmentHaoning Wu, Chaofeng Chen, Liang Liao et al.
The increased resolution of real-world videos presents a dilemma between efficiency and accuracy for deep Video Quality Assessment (VQA). On the one hand, keeping the original resolution will lead to unacceptable computational costs. On the other hand, existing practices, such as resizing and cropping, will change the quality of original videos due to the loss of details and contents, and are therefore harmful to quality assessment. With the obtained insight from the study of spatial-temporal redundancy in the human visual system and visual coding theory, we observe that quality information around a neighbourhood is typically similar, motivating us to investigate an effective quality-sensitive neighbourhood representatives scheme for VQA. In this work, we propose a unified scheme, spatial-temporal grid mini-cube sampling (St-GMS) to get a novel type of sample, named fragments. Full-resolution videos are first divided into mini-cubes with preset spatial-temporal grids, then the temporal-aligned quality representatives are sampled to compose the fragments that serve as inputs for VQA. In addition, we design the Fragment Attention Network (FANet), a network architecture tailored specifically for fragments. With fragments and FANet, the proposed efficient end-to-end FAST-VQA and FasterVQA achieve significantly better performance than existing approaches on all VQA benchmarks while requiring only 1/1612 FLOPs compared to the current state-of-the-art. Codes, models and demos are available at https://github.com/timothyhtimothy/FAST-VQA-and-FasterVQA.
CVJun 9, 2023Code
GMS-3DQA: Projection-based Grid Mini-patch Sampling for 3D Model Quality AssessmentZicheng Zhang, Wei Sun, Houning Wu et al.
Nowadays, most 3D model quality assessment (3DQA) methods have been aimed at improving performance. However, little attention has been paid to the computational cost and inference time required for practical applications. Model-based 3DQA methods extract features directly from the 3D models, which are characterized by their high degree of complexity. As a result, many researchers are inclined towards utilizing projection-based 3DQA methods. Nevertheless, previous projection-based 3DQA methods directly extract features from multi-projections to ensure quality prediction accuracy, which calls for more resource consumption and inevitably leads to inefficiency. Thus in this paper, we address this challenge by proposing a no-reference (NR) projection-based \textit{\underline{G}rid \underline{M}ini-patch \underline{S}ampling \underline{3D} Model \underline{Q}uality \underline{A}ssessment (GMS-3DQA)} method. The projection images are rendered from six perpendicular viewpoints of the 3D model to cover sufficient quality information. To reduce redundancy and inference resources, we propose a multi-projection grid mini-patch sampling strategy (MP-GMS), which samples grid mini-patches from the multi-projections and forms the sampled grid mini-patches into one quality mini-patch map (QMM). The Swin-Transformer tiny backbone is then used to extract quality-aware features from the QMMs. The experimental results show that the proposed GMS-3DQA outperforms existing state-of-the-art NR-3DQA methods on the point cloud quality assessment databases. The efficiency analysis reveals that the proposed GMS-3DQA requires far less computational resources and inference time than other 3DQA competitors. The code will be available at https://github.com/zzc-1998/GMS-3DQA.
CVDec 3, 2022Code
Learning Detail-Structure Alternative Optimization for Blind Super-ResolutionFeng Li, Yixuan Wu, Huihui Bai et al.
Existing convolutional neural networks (CNN) based image super-resolution (SR) methods have achieved impressive performance on bicubic kernel, which is not valid to handle unknown degradations in real-world applications. Recent blind SR methods suggest to reconstruct SR images relying on blur kernel estimation. However, their results still remain visible artifacts and detail distortion due to the estimation errors. To alleviate these problems, in this paper, we propose an effective and kernel-free network, namely DSSR, which enables recurrent detail-structure alternative optimization without blur kernel prior incorporation for blind SR. Specifically, in our DSSR, a detail-structure modulation module (DSMM) is built to exploit the interaction and collaboration of image details and structures. The DSMM consists of two components: a detail restoration unit (DRU) and a structure modulation unit (SMU). The former aims at regressing the intermediate HR detail reconstruction from LR structural contexts, and the latter performs structural contexts modulation conditioned on the learned detail maps at both HR and LR spaces. Besides, we use the output of DSMM as the hidden state and design our DSSR architecture from a recurrent convolutional neural network (RCNN) view. In this way, the network can alternatively optimize the image details and structural contexts, achieving co-optimization across time. Moreover, equipped with the recurrent connection, our DSSR allows low- and high-level feature representations complementary by observing previous HR details and contexts at every unrolling time. Extensive experiments on synthetic datasets and real-world images demonstrate that our method achieves the state-of-the-art against existing methods. The source code can be found at https://github.com/Arcananana/DSSR.
IVJul 6, 2023Code
Advancing Zero-Shot Digital Human Quality Assessment through Text-Prompted EvaluationZicheng Zhang, Wei Sun, Yingjie Zhou et al.
Digital humans have witnessed extensive applications in various domains, necessitating related quality assessment studies. However, there is a lack of comprehensive digital human quality assessment (DHQA) databases. To address this gap, we propose SJTU-H3D, a subjective quality assessment database specifically designed for full-body digital humans. It comprises 40 high-quality reference digital humans and 1,120 labeled distorted counterparts generated with seven types of distortions. The SJTU-H3D database can serve as a benchmark for DHQA research, allowing evaluation and refinement of processing algorithms. Further, we propose a zero-shot DHQA approach that focuses on no-reference (NR) scenarios to ensure generalization capabilities while mitigating database bias. Our method leverages semantic and distortion features extracted from projections, as well as geometry features derived from the mesh structure of digital humans. Specifically, we employ the Contrastive Language-Image Pre-training (CLIP) model to measure semantic affinity and incorporate the Naturalness Image Quality Evaluator (NIQE) model to capture low-level distortion information. Additionally, we utilize dihedral angles as geometry descriptors to extract mesh features. By aggregating these measures, we introduce the Digital Human Quality Index (DHQI), which demonstrates significant improvements in zero-shot performance. The DHQI can also serve as a robust baseline for DHQA tasks, facilitating advancements in the field. The database and the code are available at https://github.com/zzc-1998/SJTU-H3D.
CVJul 8, 2022Code
Exploring the Effectiveness of Video Perceptual Representation in Blind Video Quality AssessmentLiang Liao, Kangmin Xu, Haoning Wu et al.
With the rapid growth of in-the-wild videos taken by non-specialists, blind video quality assessment (VQA) has become a challenging and demanding problem. Although lots of efforts have been made to solve this problem, it remains unclear how the human visual system (HVS) relates to the temporal quality of videos. Meanwhile, recent work has found that the frames of natural video transformed into the perceptual domain of the HVS tend to form a straight trajectory of the representations. With the obtained insight that distortion impairs the perceived video quality and results in a curved trajectory of the perceptual representation, we propose a temporal perceptual quality index (TPQI) to measure the temporal distortion by describing the graphic morphology of the representation. Specifically, we first extract the video perceptual representations from the lateral geniculate nucleus (LGN) and primary visual area (V1) of the HVS, and then measure the straightness and compactness of their trajectories to quantify the degradation in naturalness and content continuity of video. Experiments show that the perceptual representation in the HVS is an effective way of predicting subjective temporal quality, and thus TPQI can, for the first time, achieve comparable performance to the spatial quality metric and be even more effective in assessing videos with large temporal variations. We further demonstrate that by combining with NIQE, a spatial quality metric, TPQI can achieve top performance over popular in-the-wild video datasets. More importantly, TPQI does not require any additional information beyond the video being evaluated and thus can be applied to any datasets without parameter tuning. Source code is available at https://github.com/UoLMM/TPQI-VQA.
CVAug 21, 2023Code
Efficient Joint Optimization of Layer-Adaptive Weight Pruning in Deep Neural NetworksKaixin Xu, Zhe Wang, Xue Geng et al.
In this paper, we propose a novel layer-adaptive weight-pruning approach for Deep Neural Networks (DNNs) that addresses the challenge of optimizing the output distortion minimization while adhering to a target pruning ratio constraint. Our approach takes into account the collective influence of all layers to design a layer-adaptive pruning scheme. We discover and utilize a very important additivity property of output distortion caused by pruning weights on multiple layers. This property enables us to formulate the pruning as a combinatorial optimization problem and efficiently solve it through dynamic programming. By decomposing the problem into sub-problems, we achieve linear time complexity, making our optimization algorithm fast and feasible to run on CPUs. Our extensive experiments demonstrate the superiority of our approach over existing methods on the ImageNet and CIFAR-10 datasets. On CIFAR-10, our method achieves remarkable improvements, outperforming others by up to 1.0% for ResNet-32, 0.5% for VGG-16, and 0.7% for DenseNet-121 in terms of top-1 accuracy. On ImageNet, we achieve up to 4.7% and 4.6% higher top-1 accuracy compared to other methods for VGG-16 and ResNet-50, respectively. These results highlight the effectiveness and practicality of our approach for enhancing DNN performance through layer-adaptive weight pruning. Code will be available on https://github.com/Akimoto-Cris/RD_VIT_PRUNE.
CVSep 30, 2024Code
Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMsZicheng Zhang, Ziheng Jia, Haoning Wu et al.
With the rising interest in research on Large Multi-modal Models (LMMs) for video understanding, many studies have emphasized general video comprehension capabilities, neglecting the systematic exploration into video quality understanding. To address this oversight, we introduce Q-Bench-Video in this paper, a new benchmark specifically designed to evaluate LMMs' proficiency in discerning video quality. a) To ensure video source diversity, Q-Bench-Video encompasses videos from natural scenes, AI-generated Content (AIGC), and Computer Graphics (CG). b) Building on the traditional multiple-choice questions format with the Yes-or-No and What-How categories, we include Open-ended questions to better evaluate complex scenarios. Additionally, we incorporate the video pair quality comparison question to enhance comprehensiveness. c) Beyond the traditional Technical, Aesthetic, and Temporal distortions, we have expanded our evaluation aspects to include the dimension of AIGC distortions, which addresses the increasing demand for video generation. Finally, we collect a total of 2,378 question-answer pairs and test them on 12 open-source & 5 proprietary LMMs. Our findings indicate that while LMMs have a foundational understanding of video quality, their performance remains incomplete and imprecise, with a notable discrepancy compared to human performance. Through Q-Bench-Video, we seek to catalyze community interest, stimulate further research, and unlock the untapped potential of LMMs to close the gap in video quality understanding.
CVJun 27, 2023Code
You Can Mask More For Extremely Low-Bitrate Image CompressionAnqi Li, Feng Li, Jiaxin Han et al.
Learned image compression (LIC) methods have experienced significant progress during recent years. However, these methods are primarily dedicated to optimizing the rate-distortion (R-D) performance at medium and high bitrates (> 0.1 bits per pixel (bpp)), while research on extremely low bitrates is limited. Besides, existing methods fail to explicitly explore the image structure and texture components crucial for image compression, treating them equally alongside uninformative components in networks. This can cause severe perceptual quality degradation, especially under low-bitrate scenarios. In this work, inspired by the success of pre-trained masked autoencoders (MAE) in many downstream tasks, we propose to rethink its mask sampling strategy from structure and texture perspectives for high redundancy reduction and discriminative feature representation, further unleashing the potential of LIC methods. Therefore, we present a dual-adaptive masking approach (DA-Mask) that samples visible patches based on the structure and texture distributions of original images. We combine DA-Mask and pre-trained MAE in masked image modeling (MIM) as an initial compressor that abstracts informative semantic context and texture representations. Such a pipeline can well cooperate with LIC networks to achieve further secondary compression while preserving promising reconstruction quality. Consequently, we propose a simple yet effective masked compression model (MCM), the first framework that unifies MIM and LIC end-to-end for extremely low-bitrate image compression. Extensive experiments have demonstrated that our approach outperforms recent state-of-the-art methods in R-D performance, visual quality, and downstream applications, at very low bitrates. Our code is available at https://github.com/lianqi1008/MCM.git.
CVAug 16, 2022Code
HVS-Inspired Signal Degradation Network for Just Noticeable Difference EstimationJian Jin, Yuan Xue, Xingxing Zhang et al.
Significant improvement has been made on just noticeable difference (JND) modelling due to the development of deep neural networks, especially for the recently developed unsupervised-JND generation models. However, they have a major drawback that the generated JND is assessed in the real-world signal domain instead of in the perceptual domain in the human brain. There is an obvious difference when JND is assessed in such two domains since the visual signal in the real world is encoded before it is delivered into the brain with the human visual system (HVS). Hence, we propose an HVS-inspired signal degradation network for JND estimation. To achieve this, we carefully analyze the HVS perceptual process in JND subjective viewing to obtain relevant insights, and then design an HVS-inspired signal degradation (HVS-SD) network to represent the signal degradation in the HVS. On the one hand, the well learnt HVS-SD enables us to assess the JND in the perceptual domain. On the other hand, it provides more accurate prior information for better guiding JND generation. Additionally, considering the requirement that reasonable JND should not lead to visual attention shifting, a visual attention loss is proposed to control JND generation. Experimental results demonstrate that the proposed method achieves the SOTA performance for accurately estimating the redundancy of the HVS. Source code will be available at https://github.com/jianjin008/HVS-SD-JND.
CVJul 24, 2024Code
Q-Ground: Image Quality Grounding with Large Multi-modality ModelsChaofeng Chen, Sensen Yang, Haoning Wu et al.
Recent advances of large multi-modality models (LMM) have greatly improved the ability of image quality assessment (IQA) method to evaluate and explain the quality of visual content. However, these advancements are mostly focused on overall quality assessment, and the detailed examination of local quality, which is crucial for comprehensive visual understanding, is still largely unexplored. In this work, we introduce Q-Ground, the first framework aimed at tackling fine-scale visual quality grounding by combining large multi-modality models with detailed visual quality analysis. Central to our contribution is the introduction of the QGround-100K dataset, a novel resource containing 100k triplets of (image, quality text, distortion segmentation) to facilitate deep investigations into visual quality. The dataset comprises two parts: one with human-labeled annotations for accurate quality assessment, and another labeled automatically by LMMs such as GPT4V, which helps improve the robustness of model training while also reducing the costs of data collection. With the QGround-100K dataset, we propose a LMM-based method equipped with multi-scale feature learning to learn models capable of performing both image quality answering and distortion segmentation based on text prompts. This dual-capability approach not only refines the model's understanding of region-aware image quality but also enables it to interactively respond to complex, text-based queries about image quality and specific distortions. Q-Ground takes a step towards sophisticated visual quality analysis in a finer scale, establishing a new benchmark for future research in the area. Codes and dataset are available at https://github.com/Q-Future/Q-Ground.
CVSep 25, 2023
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level VisionHaoning Wu, Zicheng Zhang, Erli Zhang et al.
The rapid evolution of Multi-modality Large Language Models (MLLMs) has catalyzed a shift in computer vision from specialized models to general-purpose foundation models. Nevertheless, there is still an inadequacy in assessing the abilities of MLLMs on low-level visual perception and understanding. To address this gap, we present Q-Bench, a holistic benchmark crafted to systematically evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment. a) To evaluate the low-level perception ability, we construct the LLVisionQA dataset, consisting of 2,990 diverse-sourced images, each equipped with a human-asked question focusing on its low-level attributes. We then measure the correctness of MLLMs on answering these questions. b) To examine the description ability of MLLMs on low-level information, we propose the LLDescribe dataset consisting of long expert-labelled golden low-level text descriptions on 499 images, and a GPT-involved comparison pipeline between outputs of MLLMs and the golden descriptions. c) Besides these two tasks, we further measure their visual quality assessment ability to align with human opinion scores. Specifically, we design a softmax-based strategy that enables MLLMs to predict quantifiable quality scores, and evaluate them on various existing image quality assessment (IQA) datasets. Our evaluation across the three abilities confirms that MLLMs possess preliminary low-level visual skills. However, these skills are still unstable and relatively imprecise, indicating the need for specific enhancements on MLLMs towards these abilities. We hope that our benchmark can encourage the research community to delve deeper to discover and enhance these untapped potentials of MLLMs. Project Page: https://q-future.github.io/Q-Bench.
CVJun 20, 2022
DisCoVQA: Temporal Distortion-Content Transformers for Video Quality AssessmentHaoning Wu, Chaofeng Chen, Liang Liao et al.
The temporal relationships between frames and their influences on video quality assessment (VQA) are still under-studied in existing works. These relationships lead to two important types of effects for video quality. Firstly, some temporal variations (such as shaking, flicker, and abrupt scene transitions) are causing temporal distortions and lead to extra quality degradations, while other variations (e.g. those related to meaningful happenings) do not. Secondly, the human visual system often has different attention to frames with different contents, resulting in their different importance to the overall video quality. Based on prominent time-series modeling ability of transformers, we propose a novel and effective transformer-based VQA method to tackle these two issues. To better differentiate temporal variations and thus capture the temporal distortions, we design a transformer-based Spatial-Temporal Distortion Extraction (STDE) module. To tackle with temporal quality attention, we propose the encoder-decoder-like temporal content transformer (TCT). We also introduce the temporal sampling on features to reduce the input length for the TCT, so as to improve the learning effectiveness and efficiency of this module. Consisting of the STDE and the TCT, the proposed Temporal Distortion-Content Transformers for Video Quality Assessment (DisCoVQA) reaches state-of-the-art performance on several VQA benchmarks without any extra pre-training datasets and up to 10% better generalization ability than existing methods. We also conduct extensive ablation experiments to prove the effectiveness of each part in our proposed model, and provide visualizations to prove that the proposed modules achieve our intention on modeling these temporal issues. We will publish our codes and pretrained weights later.
IVApr 14, 2023Code
Perceptual Quality Assessment of Face Video Compression: A Benchmark and An Effective MethodYixuan Li, Bolin Chen, Baoliang Chen et al.
Recent years have witnessed an exponential increase in the demand for face video compression, and the success of artificial intelligence has expanded the boundaries beyond traditional hybrid video coding. Generative coding approaches have been identified as promising alternatives with reasonable perceptual rate-distortion trade-offs, leveraging the statistical priors of face videos. However, the great diversity of distortion types in spatial and temporal domains, ranging from the traditional hybrid coding frameworks to generative models, present grand challenges in compressed face video quality assessment (VQA). In this paper, we introduce the large-scale Compressed Face Video Quality Assessment (CFVQA) database, which is the first attempt to systematically understand the perceptual quality and diversified compression distortions in face videos. The database contains 3,240 compressed face video clips in multiple compression levels, which are derived from 135 source videos with diversified content using six representative video codecs, including two traditional methods based on hybrid coding frameworks, two end-to-end methods, and two generative methods. In addition, a FAce VideO IntegeRity (FAVOR) index for face video compression was developed to measure the perceptual quality, considering the distinct content characteristics and temporal priors of the face videos. Experimental results exhibit its superior performance on the proposed CFVQA dataset. The benchmark is now made publicly available at: https://github.com/Yixuan423/Compressed-Face-Videos-Quality-Assessment.
CVNov 5, 2022
KSS-ICP: Point Cloud Registration based on Kendall Shape SpaceChenlei Lv, Weisi Lin, Baoquan Zhao
Point cloud registration is a popular topic which has been widely used in 3D model reconstruction, location, and retrieval. In this paper, we propose a new registration method, KSS-ICP, to address the rigid registration task in Kendall shape space (KSS) with Iterative Closest Point (ICP). The KSS is a quotient space that removes influences of translations, scales, and rotations for shape feature-based analysis. Such influences can be concluded as the similarity transformations that do not change the shape feature. The point cloud representation in KSS is invariant to similarity transformations. We utilize such property to design the KSS-ICP for point cloud registration. To tackle the difficulty to achieve the KSS representation in general, the proposed KSS-ICP formulates a practical solution that does not require complex feature analysis, data training, and optimization. With a simple implementation, KSS-ICP achieves more accurate registration from point clouds. It is robust to similarity transformation, non-uniform density, noise, and defective parts. Experiments show that KSS-ICP has better performance than the state of the art.
MMJun 2, 2022
Distilling Knowledge from Object Classification to Aesthetics AssessmentJingwen Hou, Henghui Ding, Weisi Lin et al.
In this work, we point out that the major dilemma of image aesthetics assessment (IAA) comes from the abstract nature of aesthetic labels. That is, a vast variety of distinct contents can correspond to the same aesthetic label. On the one hand, during inference, the IAA model is required to relate various distinct contents to the same aesthetic label. On the other hand, when training, it would be hard for the IAA model to learn to distinguish different contents merely with the supervision from aesthetic labels, since aesthetic labels are not directly related to any specific content. To deal with this dilemma, we propose to distill knowledge on semantic patterns for a vast variety of image contents from multiple pre-trained object classification (POC) models to an IAA model. Expecting the combination of multiple POC models can provide sufficient knowledge on various image contents, the IAA model can easier learn to relate various distinct contents to a limited number of aesthetic labels. By supervising an end-to-end single-backbone IAA model with the distilled knowledge, the performance of the IAA model is significantly improved by 4.8% in SRCC compared to the version trained only with ground-truth aesthetic labels. On specific categories of images, the SRCC improvement brought by the proposed method can achieve up to 7.2%. Peer comparison also shows that our method outperforms 10 previous IAA methods.
CVMay 23, 2022
SelfReformer: Self-Refined Network with Transformer for Salient Object DetectionYi Ke Yun, Weisi Lin
The global and local contexts significantly contribute to the integrity of predictions in Salient Object Detection (SOD). Unfortunately, existing methods still struggle to generate complete predictions with fine details. There are two major problems in conventional approaches: first, for global context, high-level CNN-based encoder features cannot effectively catch long-range dependencies, resulting in incomplete predictions. Second, downsampling the ground truth to fit the size of predictions will introduce inaccuracy as the ground truth details are lost during interpolation or pooling. Thus, in this work, we developed a Transformer-based network and framed a supervised task for a branch to learn the global context information explicitly. Besides, we adopt Pixel Shuffle from Super-Resolution (SR) to reshape the predictions back to the size of ground truth instead of the reverse. Thus details in the ground truth are untouched. In addition, we developed a two-stage Context Refinement Module (CRM) to fuse global context and automatically locate and refine the local details in the predictions. The proposed network can guide and correct itself based on the global and local context generated, thus is named, Self-Refined Transformer (SelfReformer). Extensive experiments and evaluation results on five benchmark datasets demonstrate the outstanding performance of the network, and we achieved the state-of-the-art.
CVNov 9, 2022Code
DeepDC: Deep Distance Correlation as a Perceptual Image Quality EvaluatorHanwei Zhu, Baoliang Chen, Lingyu Zhu et al.
ImageNet pre-trained deep neural networks (DNNs) show notable transferability for building effective image quality assessment (IQA) models. Such a remarkable byproduct has often been identified as an emergent property in previous studies. In this work, we attribute such capability to the intrinsic texture-sensitive characteristic that classifies images using texture features. We fully exploit this characteristic to develop a novel full-reference IQA (FR-IQA) model based exclusively on pre-trained DNN features. Specifically, we compute the distance correlation, a highly promising yet relatively under-investigated statistic, between reference and distorted images in the deep feature domain. In addition, the distance correlation quantifies both linear and nonlinear feature relationships, which is far beyond the widely used first-order and second-order statistics in the feature space. We conduct comprehensive experiments to demonstrate the superiority of the proposed quality model on five standard IQA datasets, one perceptual similarity dataset, two texture similarity datasets, and one geometric transformation dataset. Moreover, we optimize the proposed model to generate a broad spectrum of texture patterns, by treating the model as the style loss function for neural style transfer (NST). Extensive experiments demonstrate that the proposed texture synthesis and NST methods achieve the best quantitative and qualitative results. We release our code at https://github.com/h4nwei/DeepDC.
CVFeb 26, 2023
Exploring Opinion-unaware Video Quality Assessment with Semantic Affinity CriterionHaoning Wu, Liang Liao, Jingwen Hou et al.
Recent learning-based video quality assessment (VQA) algorithms are expensive to implement due to the cost of data collection of human quality opinions, and are less robust across various scenarios due to the biases of these opinions. This motivates our exploration on opinion-unaware (a.k.a zero-shot) VQA approaches. Existing approaches only considers low-level naturalness in spatial or temporal domain, without considering impacts from high-level semantics. In this work, we introduce an explicit semantic affinity index for opinion-unaware VQA using text-prompts in the contrastive language-image pre-training (CLIP) model. We also aggregate it with different traditional low-level naturalness indexes through gaussian normalization and sigmoid rescaling strategies. Composed of aggregated semantic and technical metrics, the proposed Blind Unified Opinion-Unaware Video Quality Index via Semantic and Technical Metric Aggregation (BUONA-VISTA) outperforms existing opinion-unaware VQA methods by at least 20% improvements, and is more robust than opinion-aware approaches.
CVNov 15, 2022
IntegratedPIFu: Integrated Pixel Aligned Implicit Function for Single-view Human ReconstructionKennard Yanting Chan, Guosheng Lin, Haiyu Zhao et al.
We propose IntegratedPIFu, a new pixel aligned implicit model that builds on the foundation set by PIFuHD. IntegratedPIFu shows how depth and human parsing information can be predicted and capitalised upon in a pixel-aligned implicit model. In addition, IntegratedPIFu introduces depth oriented sampling, a novel training scheme that improve any pixel aligned implicit model ability to reconstruct important human features without noisy artefacts. Lastly, IntegratedPIFu presents a new architecture that, despite using less model parameters than PIFuHD, is able to improves the structural correctness of reconstructed meshes. Our results show that IntegratedPIFu significantly outperforms existing state of the arts methods on single view human reconstruction. Our code has been made available online.
CVJun 17, 2022
Minimum Noticeable Difference based Adversarial Privacy Preserving Image GenerationWen Sun, Jian Jin, Weisi Lin
Deep learning models are found to be vulnerable to adversarial examples, as wrong predictions can be caused by small perturbation in input for deep learning models. Most of the existing works of adversarial image generation try to achieve attacks for most models, while few of them make efforts on guaranteeing the perceptual quality of the adversarial examples. High quality adversarial examples matter for many applications, especially for the privacy preserving. In this work, we develop a framework based on the Minimum Noticeable Difference (MND) concept to generate adversarial privacy preserving images that have minimum perceptual difference from the clean ones but are able to attack deep learning models. To achieve this, an adversarial loss is firstly proposed to make the deep learning models attacked by the adversarial images successfully. Then, a perceptual quality-preserving loss is developed by taking the magnitude of perturbation and perturbation-caused structural and gradient changes into account, which aims to preserve high perceptual quality for adversarial image generation. To the best of our knowledge, this is the first work on exploring quality-preserving adversarial image generation based on the MND concept for privacy preserving. To evaluate its performance in terms of perceptual quality, the deep models on image classification and face recognition are tested with the proposed method and several anchor methods in this work. Extensive experimental results demonstrate that the proposed MND framework is capable of generating adversarial images with remarkably improved performance metrics (e.g., PSNR, SSIM, and MOS) than that generated with the anchor methods.
CVDec 3, 2022
Bridging Component Learning with Degradation Modelling for Blind Image Super-ResolutionYixuan Wu, Feng Li, Huihui Bai et al.
Convolutional Neural Network (CNN)-based image super-resolution (SR) has exhibited impressive success on known degraded low-resolution (LR) images. However, this type of approach is hard to hold its performance in practical scenarios when the degradation process is unknown. Despite existing blind SR methods proposed to solve this problem using blur kernel estimation, the perceptual quality and reconstruction accuracy are still unsatisfactory. In this paper, we analyze the degradation of a high-resolution (HR) image from image intrinsic components according to a degradation-based formulation model. We propose a components decomposition and co-optimization network (CDCN) for blind SR. Firstly, CDCN decomposes the input LR image into structure and detail components in feature space. Then, the mutual collaboration block (MCB) is presented to exploit the relationship between both two components. In this way, the detail component can provide informative features to enrich the structural context and the structure component can carry structural context for better detail revealing via a mutual complementary manner. After that, we present a degradation-driven learning strategy to jointly supervise the HR image detail and structure restoration process. Finally, a multi-scale fusion module followed by an upsampling layer is designed to fuse the structure and detail features and perform SR reconstruction. Empowered by such degradation-based components decomposition, collaboration, and mutual optimization, we can bridge the correlation between component learning and degradation modelling for blind SR, thereby producing SR results with more accurate textures. Extensive experiments on both synthetic SR datasets and real-world images show that the proposed method achieves the state-of-the-art performance compared to existing methods.
MMAug 31, 2022
Blind Quality Assessment of 3D Dense Point Clouds with Structure Guided ResamplingWei Zhou, Qi Yang, Qiuping Jiang et al.
Objective quality assessment of 3D point clouds is essential for the development of immersive multimedia systems in real-world applications. Despite the success of perceptual quality evaluation for 2D images and videos, blind/no-reference metrics are still scarce for 3D point clouds with large-scale irregularly distributed 3D points. Therefore, in this paper, we propose an objective point cloud quality index with Structure Guided Resampling (SGR) to automatically evaluate the perceptually visual quality of 3D dense point clouds. The proposed SGR is a general-purpose blind quality assessment method without the assistance of any reference information. Specifically, considering that the human visual system (HVS) is highly sensitive to structure information, we first exploit the unique normal vectors of point clouds to execute regional pre-processing which consists of keypoint resampling and local region construction. Then, we extract three groups of quality-related features, including: 1) geometry density features; 2) color naturalness features; 3) angular consistency features. Both the cognitive peculiarities of the human brain and naturalness regularity are involved in the designed quality-aware features that can capture the most vital aspects of distorted 3D point clouds. Extensive experiments on several publicly available subjective point cloud quality databases validate that our proposed SGR can compete with state-of-the-art full-reference, reduced-reference, and no-reference quality assessment algorithms.
CVApr 28, 2023
Towards Robust Text-Prompted Semantic Criterion for In-the-Wild Video Quality AssessmentHaoning Wu, Liang Liao, Annan Wang et al.
The proliferation of videos collected during in-the-wild natural settings has pushed the development of effective Video Quality Assessment (VQA) methodologies. Contemporary supervised opinion-driven VQA strategies predominantly hinge on training from expensive human annotations for quality scores, which limited the scale and distribution of VQA datasets and consequently led to unsatisfactory generalization capacity of methods driven by these data. On the other hand, although several handcrafted zero-shot quality indices do not require training from human opinions, they are unable to account for the semantics of videos, rendering them ineffective in comprehending complex authentic distortions (e.g., white balance, exposure) and assessing the quality of semantic content within videos. To address these challenges, we introduce the text-prompted Semantic Affinity Quality Index (SAQI) and its localized version (SAQI-Local) using Contrastive Language-Image Pre-training (CLIP) to ascertain the affinity between textual prompts and visual features, facilitating a comprehensive examination of semantic quality concerns without the reliance on human quality annotations. By amalgamating SAQI with existing low-level metrics, we propose the unified Blind Video Quality Index (BVQI) and its improved version, BVQI-Local, which demonstrates unprecedented performance, surpassing existing zero-shot indices by at least 24\% on all datasets. Moreover, we devise an efficient fine-tuning scheme for BVQI-Local that jointly optimizes text prompts and final fusion weights, resulting in state-of-the-art performance and superior generalization ability in comparison to prevalent opinion-driven VQA methods. We conduct comprehensive analyses to investigate different quality concerns of distinct indices, demonstrating the effectiveness and rationality of our design.
IVSep 25, 2023Code
IBVC: Interpolation-driven B-frame Video CompressionChenming Xu, Meiqin Liu, Chao Yao et al.
Learned B-frame video compression aims to adopt bi-directional motion estimation and motion compensation (MEMC) coding for middle frame reconstruction. However, previous learned approaches often directly extend neural P-frame codecs to B-frame relying on bi-directional optical-flow estimation or video frame interpolation. They suffer from inaccurate quantized motions and inefficient motion compensation. To address these issues, we propose a simple yet effective structure called Interpolation-driven B-frame Video Compression (IBVC). Our approach only involves two major operations: video frame interpolation and artifact reduction compression. IBVC introduces a bit-rate free MEMC based on interpolation, which avoids optical-flow quantization and additional compression distortions. Later, to reduce duplicate bit-rate consumption and focus on unaligned artifacts, a residual guided masking encoder is deployed to adaptively select the meaningful contexts with interpolated multi-scale dependencies. In addition, a conditional spatio-temporal decoder is proposed to eliminate location errors and artifacts instead of using MEMC coding in other methods. The experimental results on B-frame coding demonstrate that IBVC has significant improvements compared to the relevant state-of-the-art methods. Meanwhile, our approach can save bit rates compared with the random access (RA) configuration of H.266 (VTM). The code will be available at https://github.com/ruhig6/IBVC.
CVNov 12, 2023
Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation ModelsHaoning Wu, Zicheng Zhang, Erli Zhang et al.
Multi-modality foundation models, as represented by GPT-4V, have brought a new paradigm for low-level visual perception and understanding tasks, that can respond to a broad range of natural human instructions in a model. While existing foundation models have shown exciting potentials on low-level visual tasks, their related abilities are still preliminary and need to be improved. In order to enhance these models, we conduct a large-scale subjective experiment collecting a vast number of real human feedbacks on low-level vision. Each feedback follows a pathway that starts with a detailed description on the low-level visual appearance (*e.g. clarity, color, brightness* of an image, and ends with an overall conclusion, with an average length of 45 words. The constructed **Q-Pathway** dataset includes 58K detailed human feedbacks on 18,973 images with diverse low-level appearance. Moreover, to enable foundation models to robustly respond to diverse types of questions, we design a GPT-participated conversion to process these feedbacks into diverse-format 200K instruction-response pairs. Experimental results indicate that the **Q-Instruct** consistently elevates low-level perception and understanding abilities across several foundational models. We anticipate that our datasets can pave the way for a future that general intelligence can perceive, understand low-level visual appearance and evaluate visual quality like a human. Our dataset, model zoo, and demo is published at: https://q-future.github.io/Q-Instruct.
CVOct 14, 2023Code
You Only Train Once: A Unified Framework for Both Full-Reference and No-Reference Image Quality AssessmentYi Ke Yun, Weisi Lin
Although recent efforts in image quality assessment (IQA) have achieved promising performance, there still exists a considerable gap compared to the human visual system (HVS). One significant disparity lies in humans' seamless transition between full reference (FR) and no reference (NR) tasks, whereas existing models are constrained to either FR or NR tasks. This disparity implies the necessity of designing two distinct systems, thereby greatly diminishing the model's versatility. Therefore, our focus lies in unifying FR and NR IQA under a single framework. Specifically, we first employ an encoder to extract multi-level features from input images. Then a Hierarchical Attention (HA) module is proposed as a universal adapter for both FR and NR inputs to model the spatial distortion at each encoder stage. Furthermore, considering that different distortions contaminate encoder stages and damage image semantic meaning differently, a Semantic Distortion Aware (SDA) module is proposed to examine feature correlations between shallow and deep layers of the encoder. By adopting HA and SDA, the proposed network can effectively perform both FR and NR IQA. When our proposed model is independently trained on NR or FR IQA tasks, it outperforms existing models and achieves state-of-the-art performance. Moreover, when trained jointly on NR and FR IQA tasks, it further enhances the performance of NR IQA while achieving on-par performance in the state-of-the-art FR IQA. You only train once to perform both IQA tasks. Code will be released at: https://github.com/BarCodeReader/YOTO.
CVAug 23, 2023
Local Distortion Aware Efficient Transformer Adaptation for Image Quality AssessmentKangmin Xu, Liang Liao, Jing Xiao et al.
Image Quality Assessment (IQA) constitutes a fundamental task within the field of computer vision, yet it remains an unresolved challenge, owing to the intricate distortion conditions, diverse image contents, and limited availability of data. Recently, the community has witnessed the emergence of numerous large-scale pretrained foundation models, which greatly benefit from dramatically increased data and parameter capacities. However, it remains an open problem whether the scaling law in high-level tasks is also applicable to IQA task which is closely related to low-level clues. In this paper, we demonstrate that with proper injection of local distortion features, a larger pretrained and fixed foundation model performs better in IQA tasks. Specifically, for the lack of local distortion structure and inductive bias of vision transformer (ViT), alongside the large-scale pretrained ViT, we use another pretrained convolution neural network (CNN), which is well known for capturing the local structure, to extract multi-scale image features. Further, we propose a local distortion extractor to obtain local distortion features from the pretrained CNN and a local distortion injector to inject the local distortion features into ViT. By only training the extractor and injector, our method can benefit from the rich knowledge in the powerful foundation models and achieve state-of-the-art performance on popular IQA datasets, indicating that IQA is not only a low-level problem but also benefits from stronger high-level features drawn from large-scale pretrained models.
CVDec 12, 2025Code
Using GUI Agent for Electronic Design AutomationChunyi Li, Longfei Li, Zicheng Zhang et al.
Graphical User Interface (GUI) agents adopt an end-to-end paradigm that maps a screenshot to an action sequence, thereby automating repetitive tasks in virtual environments. However, existing GUI agents are evaluated almost exclusively on commodity software such as Microsoft Word and Excel. Professional Computer-Aided Design (CAD) suites promise an order-of-magnitude higher economic return, yet remain the weakest performance domain for existing agents and are still far from replacing expert Electronic-Design-Automation (EDA) engineers. We therefore present the first systematic study that deploys GUI agents for EDA workflows. Our contributions are: (1) a large-scale dataset named GUI-EDA, including 5 CAD tools and 5 physical domains, comprising 2,000+ high-quality screenshot-answer-action pairs recorded by EDA scientists and engineers during real-world component design; (2) a comprehensive benchmark that evaluates 30+ mainstream GUI agents, demonstrating that EDA tasks constitute a major, unsolved challenge; and (3) an EDA-specialized metric named EDAgent, equipped with a reflection mechanism that achieves reliable performance on industrial CAD software and, for the first time, outperforms Ph.D. students majored in Electrical Engineering. This work extends GUI agents from generic office automation to specialized, high-value engineering domains and offers a new avenue for advancing EDA productivity. The dataset will be released at: https://github.com/aiben-ch/GUI-EDA.
IVMar 1, 2022
Full RGB Just Noticeable Difference (JND) ModellingJian Jin, Dong Yu, Weisi Lin et al.
Just Noticeable Difference (JND) has many applications in multimedia signal processing, especially for visual data processing up to date. It's generally defined as the minimum visual content changes that the human can perspective, which has been studied for decades. However, most of the existing methods only focus on the luminance component of JND modelling and simply regard chrominance components as scaled versions of luminance. In this paper, we propose a JND model to generate the JND by taking the characteristics of full RGB channels into account, termed as the RGB-JND. To this end, an RGB-JND-NET is proposed, where the visual content in full RGB channels is used to extract features for JND generation. To supervise the JND generation, an adaptive image quality assessment combination (AIC) is developed. Besides, the RDB-JND-NET also takes the visual attention into account by automatically mining the underlying relationship between visual attention and the JND, which is further used to constrain the JND spatial distribution. To the best of our knowledge, this is the first work on careful investigation of JND modelling for full-color space. Experimental results demonstrate that the RGB-JND-NET model outperforms the relevant state-of-the-art JND models. Besides, the JND of the red and blue channels are larger than that of the green one according to the experimental results of the proposed model, which demonstrates that more changes can be tolerated in the red and blue channels, in line with the well-known fact that the human visual system is more sensitive to the green channel in comparison with the red and blue ones.
CVAug 29, 2024Code
Revisiting 360 Depth Estimation with PanoGabor: A New Fusion PerspectiveZhijie Shen, Chunyu Lin, Lang Nie et al.
Depth estimation from a monocular 360 image is important to the perception of the entire 3D environment. However, the inherent distortion and large field of view (FoV) in 360 images pose great challenges for this task. To this end, existing mainstream solutions typically introduce additional perspective-based 360 representations ({e.g., Cubemap) to achieve effective feature extraction. Nevertheless, regardless of the introduced representations, they eventually need to be unified into the equirectangular projection (ERP) format for the subsequent depth estimation, which inevitably reintroduces the troublesome distortions. In this work, we propose an oriented distortion-aware Gabor Fusion framework (PGFuse) to address the above challenges. First, we introduce Gabor filters that analyze texture in the frequency domain, thereby extending the receptive fields and enhancing depth cues. To address the reintroduced distortions, we design a linear latitude-aware distortion representation method to generate customized, distortion-aware Gabor filters (PanoGabor filters). Furthermore, we design a channel-wise and spatial-wise unidirectional fusion module (CS-UFM) that integrates the proposed PanoGabor filters to unify other representations into the ERP format, delivering effective and distortion-free features. Considering the orientation sensitivity of the Gabor transform, we introduce a spherical gradient constraint to stabilize this sensitivity. Experimental results on three popular indoor 360 benchmarks demonstrate the superiority of the proposed PGFuse to existing state-of-the-art solutions. Code and models will be available at https://github.com/zhijieshen-bjtu/PGFuse
CVMay 20Code
Towards Large Model Feature CodingYouwei Pang, Changsheng Gao, Dong Liu et al.
Large models have delivered remarkable performance across a wide range of perception and generation tasks, yet practical deployment is increasingly constrained by computational and memory budgets, as well as privacy requirements. Split execution alleviates these constraints by partitioning computation across devices, but it inevitably introduces intensive transmission and storage of intermediate features. Unlike conventional feature coding for CNNs that typically targets homogeneous spatial activation maps, modern large models generate heterogeneous features with varying statistical distributions and compression tolerances, e.g., multi-level/multi-modal representations and autoregressive context caches. These characteristics necessitate treating large model feature coding (LaMoFC) as a fundamental system component and call for a systematic evaluation framework. In this paper, we present a comprehensive benchmark and evaluation framework for LaMoFC. We first build the feature dataset LaMoFCBench, covering diverse task requirements across 4 categories and 16 scenarios while integrating widelyadopted architectures and various split-computing settings. We then specify representative split points according to practical application scenarios to extract intermediate features, establishing a unified pipeline for fair and reproducible comparisons. Finally, we benchmark mainstream universal feature codecs, exposing the profound misalignment between existing coding paradigms and the heterogeneous nature of large model features. These findings reveal that LaMoFC demands a fundamental departure from existing paradigms, and LaMoFCBench provides the shared empirical foundation to drive this transition. The data and code will be available at https://github.com/lartpang/LaMoFCBench.
CVMar 18, 2023
Blind Multimodal Quality Assessment of Low-light ImagesMiaohui Wang, Zhuowei Xu, Mai Xu et al.
Blind image quality assessment (BIQA) aims at automatically and accurately forecasting objective scores for visual signals, which has been widely used to monitor product and service quality in low-light applications, covering smartphone photography, video surveillance, autonomous driving, etc. Recent developments in this field are dominated by unimodal solutions inconsistent with human subjective rating patterns, where human visual perception is simultaneously reflected by multiple sensory information. In this article, we present a unique blind multimodal quality assessment (BMQA) of low-light images from subjective evaluation to objective score. To investigate the multimodal mechanism, we first establish a multimodal low-light image quality (MLIQ) database with authentic low-light distortions, containing image-text modality pairs. Further, we specially design the key modules of BMQA, considering multimodal quality representation, latent feature alignment and fusion, and hybrid self-supervised and supervised learning. Extensive experiments show that our BMQA yields state-of-the-art accuracy on the proposed MLIQ benchmark database. In particular, we also build an independent single-image modality Dark-4K database, which is used to verify its applicability and generalization performance in mainstream unimodal applications. Qualitative and quantitative results on Dark-4K show that BMQA achieves superior performance to existing BIQA approaches as long as a pre-trained model is provided to generate text description. The proposed framework and two databases as well as the collected BIQA methods and evaluation metrics are made publicly available on here.
IVFeb 25, 2023
JND-Based Perceptual Optimization For Learned Image CompressionFeng Ding, Jian Jin, Lili Meng et al.
Recently, learned image compression schemes have achieved remarkable improvements in image fidelity (e.g., PSNR and MS-SSIM) compared to conventional hybrid image coding ones due to their high-efficiency non-linear transform, end-to-end optimization frameworks, etc. However, few of them take the Just Noticeable Difference (JND) characteristic of the Human Visual System (HVS) into account and optimize learned image compression towards perceptual quality. To address this issue, a JND-based perceptual quality loss is proposed. Considering that the amounts of distortion in the compressed image at different training epochs under different Quantization Parameters (QPs) are different, we develop a distortion-aware adjustor. After combining them together, we can better assign the distortion in the compressed image with the guidance of JND to preserve the high perceptual quality. All these designs enable the proposed method to be flexibly applied to various learned image compression schemes with high scalability and plug-and-play advantages. Experimental results on the Kodak dataset demonstrate that the proposed method has led to better perceptual quality than the baseline model under the same bit rate.
CVFeb 23, 2023
Evaluating the Efficacy of Skincare Product: A Realistic Short-Term Facial Pore SimulationLing Li, Bandara Dissanayake, Tatsuya Omotezako et al.
Simulating the effects of skincare products on face is a potential new way to communicate the efficacy of skincare products in skin diagnostics and product recommendations. Furthermore, such simulations enable one to anticipate his/her skin conditions and better manage skin health. However, there is a lack of effective simulations today. In this paper, we propose the first simulation model to reveal facial pore changes after using skincare products. Our simulation pipeline consists of 2 steps: training data establishment and facial pore simulation. To establish training data, we collect face images with various pore quality indexes from short-term (8-weeks) clinical studies. People often experience significant skin fluctuations (due to natural rhythms, external stressors, etc.,), which introduces large perturbations in clinical data. To address this problem, we propose a sliding window mechanism to clean data and select representative index(es) to represent facial pore changes. Facial pore simulation stage consists of 3 modules: UNet-based segmentation module to localize facial pores; regression module to predict time-dependent warping hyperparameters; and deformation module, taking warping hyperparameters and pore segmentation labels as inputs, to precisely deform pores accordingly. The proposed simulation is able to render realistic facial pore changes. And this work will pave the way for future research in facial skin simulation and skincare product developments.
GRMar 5, 2023
The First Comprehensive Dataset with Multiple Distortion Types for Visual Just-Noticeable DifferencesYaxuan Liu, Jian Jin, Yuan Xue et al.
Recently, with the development of deep learning, a number of Just Noticeable Difference (JND) datasets have been built for JND modeling. However, all the existing JND datasets only label the JND points based on the level of compression distortion. Hence, JND models learned from such datasets can only be used for image/video compression. As known, JND is a major characteristic of the human visual system (HVS), which reflects the maximum visual distortion that the HVS can tolerate. Hence, a generalized JND modeling should take more kinds of distortion types into account. To benefit JND modeling, this work establishes a generalized JND dataset with a coarse-to-fine JND selection, which contains 106 source images and 1,642 JND maps, covering 25 distortion types. To this end, we proposed a coarse JND candidate selection scheme to select the distorted images from the existing Image Quality Assessment (IQA) datasets as JND candidates instead of generating JND maps ourselves. Then, a fine JND selection is carried out on the JND candidates with a crowdsourced subjective assessment.
CVMar 4, 2023
MetaGrad: Adaptive Gradient Quantization with HypernetworksKaixin Xu, Alina Hui Xiu Lee, Ziyuan Zhao et al.
A popular track of network compression approach is Quantization aware Training (QAT), which accelerates the forward pass during the neural network training and inference. However, not much prior efforts have been made to quantize and accelerate the backward pass during training, even though that contributes around half of the training time. This can be partly attributed to the fact that errors of low-precision gradients during backward cannot be amortized by the training objective as in the QAT setting. In this work, we propose to solve this problem by incorporating the gradients into the computation graph of the next training iteration via a hypernetwork. Various experiments on CIFAR-10 dataset with different CNN network architectures demonstrate that our hypernetwork-based approach can effectively reduce the negative effect of gradient quantization noise and successfully quantizes the gradients to INT4 with only 0.64 accuracy drop for VGG-16 on CIFAR-10.
CVFeb 26Code
Scaling Audio-Visual Quality Assessment Dataset via CrowdsourcingRenyu Yang, Jian Jin, Lili Meng et al.
Audio-visual quality assessment (AVQA) research has been stalled by limitations of existing datasets: they are typically small in scale, with insufficient diversity in content and quality, and annotated only with overall scores. These shortcomings provide limited support for model development and multimodal perception research. We propose a practical approach for AVQA dataset construction. First, we design a crowdsourced subjective experiment framework for AVQA, breaks the constraints of in-lab settings and achieves reliable annotation across varied environments. Second, a systematic data preparation strategy is further employed to ensure broad coverage of both quality levels and semantic scenarios. Third, we extend the dataset with additional annotations, enabling research on multimodal perception mechanisms and their relation to content. Finally, we validate this approach through YT-NTU-AVQ, the largest and most diverse AVQA dataset to date, consisting of 1,620 user-generated audio and video (A/V) sequences. The dataset and platform code are available at https://github.com/renyu12/YT-NTU-AVQ
CVJul 18, 2023
Regression-free Blind Image Quality Assessment with Content-Distortion ConsistencyXiaoqi Wang, Jian Xiong, Hao Gao et al.
The optimization objective of regression-based blind image quality assessment (IQA) models is to minimize the mean prediction error across the training dataset, which can lead to biased parameter estimation due to potential training data biases. To mitigate this issue, we propose a regression-free framework for image quality evaluation, which is based upon retrieving locally similar instances by incorporating semantic and distortion feature spaces. The approach is motivated by the observation that the human visual system (HVS) exhibits analogous perceptual responses to semantically similar image contents impaired by identical distortions, which we term as content-distortion consistency. The proposed method constructs a hierarchical k-nearest neighbor (k-NN) algorithm for instance retrieval through two classification modules: semantic classification (SC) module and distortion classification (DC) module. Given a test image and an IQA database, the SC module retrieves multiple pristine images semantically similar to the test image. The DC module then retrieves instances based on distortion similarity from the distorted images that correspond to each retrieved pristine image. Finally, quality prediction is obtained by aggregating the subjective scores of the retrieved instances. Without training on subjective quality scores, the proposed regression-free method achieves competitive, even superior performance compared to state-of-the-art regression-based methods on authentic and synthetic distortion IQA benchmarks.
CVNov 27, 2023
Enhancing Diffusion Models with Text-Encoder Reinforcement LearningChaofeng Chen, Annan Wang, Haoning Wu et al.
Text-to-image diffusion models are typically trained to optimize the log-likelihood objective, which presents challenges in meeting specific requirements for downstream tasks, such as image aesthetics and image-text alignment. Recent research addresses this issue by refining the diffusion U-Net using human rewards through reinforcement learning or direct backpropagation. However, many of them overlook the importance of the text encoder, which is typically pretrained and fixed during training. In this paper, we demonstrate that by finetuning the text encoder through reinforcement learning, we can enhance the text-image alignment of the results, thereby improving the visual quality. Our primary motivation comes from the observation that the current text encoder is suboptimal, often requiring careful prompt adjustment. While fine-tuning the U-Net can partially improve performance, it remains suffering from the suboptimal text encoder. Therefore, we propose to use reinforcement learning with low-rank adaptation to finetune the text encoder based on task-specific rewards, referred as \textbf{TexForce}. We first show that finetuning the text encoder can improve the performance of diffusion models. Then, we illustrate that TexForce can be simply combined with existing U-Net finetuned models to get much better results without additional training. Finally, we showcase the adaptability of our method in diverse applications, including the generation of high-quality face and hand images.
CVJan 29Code
From Global to Granular: Revealing IQA Model Performance via Correlation SurfaceBaoliang Chen, Danni Huang, Hanwei Zhu et al.
Evaluation of Image Quality Assessment (IQA) models has long been dominated by global correlation metrics, such as Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank-Order Correlation Coefficient (SRCC). While widely adopted, these metrics reduce performance to a single scalar, failing to capture how ranking consistency varies across the local quality spectrum. For example, two IQA models may achieve identical SRCC values, yet one ranks high-quality images (related to high Mean Opinion Score, MOS) more reliably, while the other better discriminates image pairs with small quality/MOS differences (related to $|Δ$MOS$|$). Such complementary behaviors are invisible under global metrics. Moreover, SRCC and PLCC are sensitive to test-sample quality distributions, yielding unstable comparisons across test sets. To address these limitations, we propose \textbf{Granularity-Modulated Correlation (GMC)}, which provides a structured, fine-grained analysis of IQA performance. GMC includes: (1) a \textbf{Granularity Modulator} that applies Gaussian-weighted correlations conditioned on absolute MOS values and pairwise MOS differences ($|Δ$MOS$|$) to examine local performance variations, and (2) a \textbf{Distribution Regulator} that regularizes correlations to mitigate biases from non-uniform quality distributions. The resulting \textbf{correlation surface} maps correlation values as a joint function of MOS and $|Δ$MOS$|$, providing a 3D representation of IQA performance. Experiments on standard benchmarks show that GMC reveals performance characteristics invisible to scalar metrics, offering a more informative and reliable paradigm for analyzing, comparing, and deploying IQA models. Codes are available at https://github.com/Dniaaa/GMC.
CVJul 2, 2024Code
LPViT: Low-Power Semi-structured Pruning for Vision TransformersKaixin Xu, Zhe Wang, Chunyun Chen et al.
Vision transformers have emerged as a promising alternative to convolutional neural networks for various image analysis tasks, offering comparable or superior performance. However, one significant drawback of ViTs is their resource-intensive nature, leading to increased memory footprint, computation complexity, and power consumption. To democratize this high-performance technology and make it more environmentally friendly, it is essential to compress ViT models, reducing their resource requirements while maintaining high performance. In this paper, we introduce a new block-structured pruning to address the resource-intensive issue for ViTs, offering a balanced trade-off between accuracy and hardware acceleration. Unlike unstructured pruning or channel-wise structured pruning, block pruning leverages the block-wise structure of linear layers, resulting in more efficient matrix multiplications. To optimize this pruning scheme, our paper proposes a novel hardware-aware learning objective that simultaneously maximizes speedup and minimizes power consumption during inference, tailored to the block sparsity structure. This objective eliminates the need for empirical look-up tables and focuses solely on reducing parametrized layer connections. Moreover, our paper provides a lightweight algorithm to achieve post-training pruning for ViTs, utilizing second-order Taylor approximation and empirical optimization to solve the proposed hardware-aware objective. Extensive experiments on ImageNet are conducted across various ViT architectures, including DeiT-B and DeiT-S, demonstrating competitive performance with other pruning methods and achieving a remarkable balance between accuracy preservation and power savings. Especially, we achieve 3.93x speedup on dedicated hardware and GPUs respectively for DeiT-B, and a power reduction by 1.4x on GPUs. Code released to https://github.com/Akimoto-Cris/LPViT.
CVMar 4Code
QD-PCQA: Quality-Aware Domain Adaptation for Point Cloud Quality AssessmentGuohua Zhang, Jian Jin, Meiqin Liu et al.
No-Reference Point Cloud Quality Assessment (NR-PCQA) still struggles with generalization, primarily due to the scarcity of annotated point cloud datasets. Since the Human Visual System (HVS) drives perceptual quality assessment independently of media types, prior knowledge on quality learned from images can be repurposed for point clouds. This insight motivates adopting Unsupervised Domain Adaptation (UDA) to transfer quality-relevant priors from labeled images to unlabeled point clouds. However, existing UDA-based PCQA methods often overlook key characteristics of perceptual quality, such as sensitivity to quality ranking and quality-aware feature alignment, thereby limiting their effectiveness. To address these issues, we propose a novel Quality-aware Domain adaptation framework for PCQA, termed QD-PCQA. The framework comprises two main components: i) a Rank-weighted Conditional Alignment (RCA) strategy that aligns features under consistent quality levels and adaptively emphasizes misranked samples to reinforce perceptual quality ranking awareness; and ii) a Quality-guided Feature Augmentation (QFA) strategy, which includes quality-guided style mixup, multi-layer extension, and dual-domain augmentation modules to augment perceptual feature alignment. Extensive cross-domain experiments demonstrate that QD-PCQA significantly improves generalization in NR-PCQA tasks. The code is available at https://github.com/huhu-code/QD-PCQA.
CVSep 15, 2024
Explore the Hallucination on Low-level Perception for MLLMsYinan Sun, Zicheng Zhang, Haoning Wu et al.
The rapid development of Multi-modality Large Language Models (MLLMs) has significantly influenced various aspects of industry and daily life, showcasing impressive capabilities in visual perception and understanding. However, these models also exhibit hallucinations, which limit their reliability as AI systems, especially in tasks involving low-level visual perception and understanding. We believe that hallucinations stem from a lack of explicit self-awareness in these models, which directly impacts their overall performance. In this paper, we aim to define and evaluate the self-awareness of MLLMs in low-level visual perception and understanding tasks. To this end, we present QL-Bench, a benchmark settings to simulate human responses to low-level vision, investigating self-awareness in low-level visual perception through visual question answering related to low-level attributes such as clarity and lighting. Specifically, we construct the LLSAVisionQA dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features. Through the evaluation of 15 MLLMs, we demonstrate that while some models exhibit robust low-level visual capabilities, their self-awareness remains relatively underdeveloped. Notably, for the same model, simpler questions are often answered more accurately than complex ones. However, self-awareness appears to improve when addressing more challenging questions. We hope that our benchmark will motivate further research, particularly focused on enhancing the self-awareness of MLLMs in tasks involving low-level visual perception and understanding.
CVDec 28, 2023Code
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined LevelsHaoning Wu, Zicheng Zhang, Weixia Zhang et al.
The explosion of visual content available online underscores the requirement for an accurate machine assessor to robustly evaluate scores across diverse types of visual contents. While recent studies have demonstrated the exceptional potentials of large multi-modality models (LMMs) on a wide range of related fields, in this work, we explore how to teach them for visual rating aligned with human opinions. Observing that human raters only learn and judge discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating levels instead of scores. The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), as well as video quality assessment (VQA) tasks under the original LMM structure. With the syllabus, we further unify the three tasks into one model, termed the OneAlign. In our experiments, we demonstrate the advantage of the discrete-level-based syllabus over direct-score-based variants for LMMs. Our code and the pre-trained weights are released at https://github.com/Q-Future/Q-Align.