MM | May 14
Content-Adaptive Rate-Quality Curve Prediction Model in Media Processing System
Shibo Yin, Zhiyu Zhang, Peirong Ning et al.
In streaming media services, video transcoding is a common practice to alleviate bandwidth demands. Unfortunately, traditional methods that apply a uniform rate factor (RF) to all videos often result in significant inefficiencies. Content-adaptive encoding (CAE) techniques address this by dynamically adjusting encoding parameters based on video content characteristics. However, existing CAE methods are often tightly coupled with specific encoding strategies, which makes them inflexible. In this paper, we propose a model that predicts both the RF-quality and RF-bitrate curves, from which a complete bitrate-quality curve can be derived. This approach allows the encoding strategy to be adjusted flexibly without retraining the model. The model leverages codec features, content features, and anchor features to predict the bitrate-quality curve accurately. Additionally, we introduce an anchor suspension method to enhance prediction accuracy. Experiments confirm that the actual quality metric (VMAF) of the compressed video stays within 1 of the target, achieving an accuracy of 99.14%. By combining our quality improvement strategy with the rate-quality curve prediction model, we conducted online A/B tests and obtained a +0.107% improvement in both video views and video completions and a +0.064% improvement in app duration time. Our model has been deployed on the Xiaohongshu App.
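As a rough illustration of why predicting both curves decouples the model from the encoding strategy, the sketch below picks an RF for a target VMAF from two predicted curves sampled on a shared RF grid. It is not the authors' implementation: the function name, the sample numbers, the assumption that quality falls monotonically with RF, and the log-domain interpolation of bitrate are all assumptions made for this example.

```python
import numpy as np

def pick_rf_for_target_quality(rf_grid, vmaf_pred, bitrate_pred, target_vmaf):
    """Return (rf, bitrate) at which the predicted quality meets target_vmaf.

    Assumes predicted VMAF decreases monotonically as RF increases.
    """
    rf_grid = np.asarray(rf_grid, dtype=float)
    vmaf_pred = np.asarray(vmaf_pred, dtype=float)
    bitrate_pred = np.asarray(bitrate_pred, dtype=float)

    # np.interp needs increasing x-coordinates, so reverse the decreasing VMAF curve.
    rf = float(np.interp(target_vmaf, vmaf_pred[::-1], rf_grid[::-1]))
    # Interpolating log-bitrate is a modelling choice of this sketch, not of the paper.
    bitrate = float(np.exp(np.interp(rf, rf_grid, np.log(bitrate_pred))))
    return rf, bitrate

# Hypothetical predicted curves for one video (illustrative numbers only).
rf_grid = [18, 22, 26, 30, 34]
vmaf_pred = [97.0, 94.0, 90.0, 84.0, 76.0]
bitrate_pred = [4000.0, 2500.0, 1600.0, 1000.0, 640.0]  # kbps

# Pairing the two curves point by point gives a bitrate-quality curve.
bitrate_quality_curve = list(zip(bitrate_pred, vmaf_pred))

rf, kbps = pick_rf_for_target_quality(rf_grid, vmaf_pred, bitrate_pred, target_vmaf=92.0)
```

Changing the strategy (for example, a different target VMAF or a bitrate cap) only changes how the curves are queried; the predicted curves themselves stay the same.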
IV | Aug 8, 2024
SG-JND: Semantic-Guided Just Noticeable Distortion Predictor For Image Compression
Linhan Cao, Wei Sun, Xiongkuo Min et al.
Just noticeable distortion (JND), representing the threshold of distortion in an image that is minimally perceptible to the human visual system (HVS), is crucial for image compression algorithms to achieve a trade-off between transmission bit rate and image quality. However, traditional JND prediction methods rely only on pixel-level or sub-band-level features and cannot capture the impact of image content on JND. To bridge this gap, we propose a Semantic-Guided JND (SG-JND) network that leverages semantic information for JND prediction. In particular, SG-JND consists of three essential modules: the image preprocessing module extracts semantic-level patches from images, the feature extraction module extracts multi-layer features using cross-scale attention layers, and the JND prediction module regresses the extracted features into the final JND value. Experimental results show that SG-JND achieves state-of-the-art performance on two publicly available JND datasets, which demonstrates the effectiveness of SG-JND and highlights the significance of incorporating semantic information in JND assessment.
IV | Apr 17, 2024 | Code
NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results
Xin Li, Kun Yuan, Yajing Pei et al.
This paper reviews the NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment (S-UGC VQA), in which various excellent solutions were submitted and evaluated on KVQ, a dataset collected from a popular short-form video platform, i.e., the Kuaishou/Kwai Platform. The KVQ database is divided into three parts: 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition had 200 participants, and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performance for S-UGC VQA. The project can be found at https://github.com/lixinustc/KVQChallenge-CVPR-NTIRE2024.
IV | May 14, 2024 | Code
Enhancing Blind Video Quality Assessment with Rich Quality-aware Features
Wei Sun, Haoning Wu, Zicheng Zhang et al.
In this paper, we present a simple but effective method to enhance blind video quality assessment (BVQA) models for social media videos. Motivated by previous research that leverages pre-trained features extracted from various computer vision models as the feature representation for BVQA, we further explore rich quality-aware features from pre-trained blind image quality assessment (BIQA) and BVQA models as auxiliary features to help the BVQA model handle the complex distortions and diverse content of social media videos. Specifically, we use SimpleVQA, a BVQA model that consists of a trainable Swin Transformer-B and a fixed SlowFast, as our base model. The Swin Transformer-B and SlowFast components are responsible for extracting spatial and motion features, respectively. Then, we extract three kinds of features from Q-Align, LIQE, and FAST-VQA to capture frame-level quality-aware features, frame-level quality-aware along with scene-specific features, and spatiotemporal quality-aware features, respectively. After concatenating these features, we employ a multi-layer perceptron (MLP) network to regress them into quality scores. Experimental results demonstrate that the proposed model achieves the best performance on three public social media VQA datasets. Moreover, the proposed model won first place in the CVPR NTIRE 2024 Short-form UGC Video Quality Assessment Challenge. The code is available at https://github.com/sunwei925/RQ-VQA.git.
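The fusion step described above (concatenate base and auxiliary features, then regress with an MLP) can be sketched as follows. This is only a shape-level illustration with random, untrained weights: the feature dimensions, hidden size, and the helper name `mlp_regress` are hypothetical, and random arrays stand in for the outputs of the actual extractors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes; the abstract does not state the real dimensions.
feature_dims = {
    "base_spatial": 16,  # stand-in for Swin Transformer-B features
    "base_motion": 8,    # stand-in for SlowFast features
    "q_align": 4,        # stand-in for Q-Align features
    "liqe": 4,           # stand-in for LIQE features
    "fast_vqa": 4,       # stand-in for FAST-VQA features
}

def mlp_regress(x, w1, b1, w2, b2):
    """Two-layer perceptron with ReLU, mapping each feature vector to one score."""
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2

batch = 3  # three videos
features = [rng.standard_normal((batch, d)) for d in feature_dims.values()]
fused = np.concatenate(features, axis=1)  # one long feature vector per video

in_dim, hidden_dim = fused.shape[1], 32
w1 = rng.standard_normal((in_dim, hidden_dim)) * 0.1
b1 = np.zeros(hidden_dim)
w2 = rng.standard_normal((hidden_dim, 1)) * 0.1
b2 = np.zeros(1)

scores = mlp_regress(fused, w1, b1, w2, b2).squeeze(-1)  # one quality score per video
```

In a real system the weights would be learned against subjective quality labels; here they only show how the concatenated vector flows through the regressor.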
CV | Aug 20, 2024
Alignment-free Raw Video Demoireing
Shuning Xu, Xina Liu, Binbin Song et al.
Video demoireing aims to remove undesirable interference patterns that arise during the capture of screen content, restoring artifact-free frames while maintaining temporal consistency. Existing video demoireing methods typically utilize carefully designed alignment modules to estimate inter-frame motion for leveraging temporal information; however, these modules are often complex and computationally demanding. Meanwhile, recent works indicate that using raw data as input significantly enhances demoireing performance. Building on this insight, this paper introduces a novel alignment-free raw video demoireing network with frequency-assisted spatio-temporal Mamba (DemMamba). It incorporates sequentially arranged Spatial Mamba Blocks (SMB) and Temporal Mamba Blocks (TMB) to effectively model the inter- and intra-relationships in raw video demoireing. The SMB employs a multi-directional scanning mechanism coupled with a learnable frequency compressor to effectively differentiate interference patterns across various orientations and frequencies, resulting in reduced artifacts, sharper edges, and faithful texture reconstruction. Concurrently, the TMB enhances temporal consistency by performing bidirectional scanning across the temporal sequences and integrating channel attention techniques, facilitating improved temporal information fusion. Extensive experiments demonstrate that DemMamba surpasses state-of-the-art methods by 1.6 dB in PSNR, and also delivers a satisfactory visual experience.
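For readers who want to put the reported PSNR gain in context, the sketch below computes PSNR from its standard definition and converts a dB gain into the equivalent reduction in mean squared error. The function names and the toy frames are assumptions of this example and are not taken from the paper.

```python
import numpy as np

def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    reference = np.asarray(reference, dtype=np.float64)
    restored = np.asarray(restored, dtype=np.float64)
    mse = np.mean((reference - restored) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(max_value ** 2 / mse))

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a PSNR gain of gain_db at a fixed peak value."""
    return 10 ** (-gain_db / 10.0)

# Toy frames with values in [0, 1]: a constant error of 0.1 gives an MSE of 0.01.
clean = np.zeros((4, 4))
noisy = np.full((4, 4), 0.1)
toy_psnr = psnr(clean, noisy)

# A 1.6 dB gain corresponds to roughly a 31% lower MSE.
ratio = mse_ratio_for_gain(1.6)
```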