Shan Liu

CV
h-index98
76papers
3,250citations
Novelty46%
AI Score57

76 Papers

CVApr 13, 2022Code
Neural Texture Extraction and Distribution for Controllable Person Image Synthesis

Yurui Ren, Xiaoqing Fan, Ge Li et al.

We deal with the controllable person image synthesis task which aims to re-render a human from a reference image with explicit control over body pose and appearance. Observing that person images are highly structured, we propose to generate desired images by extracting and distributing semantic entities of reference images. To achieve this goal, a neural texture extraction and distribution operation based on double attention is described. This operation first extracts semantic neural textures from reference feature maps. Then, it distributes the extracted neural textures according to the spatial distributions learned from target poses. Our model is trained to predict human images in arbitrary poses, which encourages it to extract disentangled and expressive neural textures representing the appearance of different semantic entities. The disentangled representation further enables explicit appearance control. Neural textures of different reference images can be fused to control the appearance of the interested areas. Experimental comparisons show the superiority of the proposed model. Code is available at https://github.com/RenYurui/Neural-Texture-Extraction-Distribution.

CVOct 11, 2022Code
Robust Human Matting via Semantic Guidance

Xiangguang Chen, Ye Zhu, Yu Li et al.

Automatic human matting is highly desired for many real applications. We investigate recent human matting methods and show that common bad cases happen when semantic human segmentation fails. This indicates that semantic understanding is crucial for robust human matting. From this, we develop a fast yet accurate human matting framework, named Semantic Guided Human Matting (SGHM). It builds on a semantic human segmentation network and introduces a light-weight matting module with only marginal computational cost. Unlike previous works, our framework is data efficient, which requires a small amount of matting ground-truth to learn to estimate high quality object mattes. Our experiments show that trained with merely 200 matting images, our method can generalize well to real-world datasets, and outperform recent methods on multiple benchmarks, while remaining efficient. Considering the unbearable labeling cost of matting data and widely available segmentation data, our method becomes a practical and effective solution for the task of human matting. Source code is available at https://github.com/cxgincsu/SemanticGuidedHumanMatting.

CVJul 4, 2024Code
Perception-Guided Quality Metric of 3D Point Clouds Using Hybrid Strategy

Yujie Zhang, Qi Yang, Yiling Xu et al.

Full-reference point cloud quality assessment (FR-PCQA) aims to infer the quality of distorted point clouds with available references. Most of the existing FR-PCQA metrics ignore the fact that the human visual system (HVS) dynamically tackles visual information according to different distortion levels (i.e., distortion detection for high-quality samples and appearance perception for low-quality samples) and measure point cloud quality using unified features. To bridge the gap, in this paper, we propose a perception-guided hybrid metric (PHM) that adaptively leverages two visual strategies with respect to distortion degree to predict point cloud quality: to measure visible difference in high-quality samples, PHM takes into account the masking effect and employs texture complexity as an effective compensatory factor for absolute difference; on the other hand, PHM leverages spectral graph theory to evaluate appearance degradation in low-quality samples. Variations in geometric signals on graphs and changes in the spectral graph wavelet coefficients are utilized to characterize geometry and texture appearance degradation, respectively. Finally, the results obtained from the two components are combined in a non-linear method to produce an overall quality score of the tested point cloud. The results of the experiment on five independent databases show that PHM achieves state-of-the-art (SOTA) performance and offers significant performance improvement in multiple distortion environments. The code is publicly available at https://github.com/zhangyujie-1998/PHM.

CVJun 15, 2022
Coarse-to-fine Deep Video Coding with Hyperprior-guided Mode Prediction

Zhihao Hu, Guo Lu, Jinyang Guo et al.

The previous deep video compression approaches only use the single scale motion compensation strategy and rarely adopt the mode prediction technique from the traditional standards like H.264/H.265 for both motion and residual compression. In this work, we first propose a coarse-to-fine (C2F) deep video compression framework for better motion compensation, in which we perform motion estimation, compression and compensation twice in a coarse to fine manner. Our C2F framework can achieve better motion compensation results without significantly increasing bit costs. Observing hyperprior information (i.e., the mean and variance values) from the hyperprior networks contains discriminant statistical information of different patches, we also propose two efficient hyperprior-guided mode prediction methods. Specifically, using hyperprior information as the input, we propose two mode prediction networks to respectively predict the optimal block resolutions for better motion coding and decide whether to skip residual information from each block for better residual coding without introducing additional bit cost while bringing negligible extra computation cost. Comprehensive experimental results demonstrate our proposed C2F video compression framework equipped with the new hyperprior-guided mode prediction methods achieves the state-of-the-art performance on HEVC, UVG and MCL-JCV datasets.

CVFeb 22, 2023
S3I-PointHop: SO(3)-Invariant PointHop for 3D Point Cloud Classification

Pranav Kadam, Hardik Prajapati, Min Zhang et al.

Many point cloud classification methods are developed under the assumption that all point clouds in the dataset are well aligned with the canonical axes so that the 3D Cartesian point coordinates can be employed to learn features. When input point clouds are not aligned, the classification performance drops significantly. In this work, we focus on a mathematically transparent point cloud classification method called PointHop, analyze its reason for failure due to pose variations, and solve the problem by replacing its pose dependent modules with rotation invariant counterparts. The proposed method is named SO(3)-Invariant PointHop (or S3I-PointHop in short). We also significantly simplify the PointHop pipeline using only one single hop along with multiple spatial aggregation techniques. The idea of exploiting more spatial information is novel. Experiments on the ModelNet40 dataset demonstrate the superiority of S3I-PointHop over traditional PointHop-like methods.

CVMar 20, 2023
A Tiny Machine Learning Model for Point Cloud Object Classification

Min Zhang, Jintang Xue, Pranav Kadam et al.

The design of a tiny machine learning model, which can be deployed in mobile and edge devices, for point cloud object classification is investigated in this work. To achieve this objective, we replace the multi-scale representation of a point cloud object with a single-scale representation for complexity reduction, and exploit rich 3D geometric information of a point cloud object for performance improvement. The proposed solution is named Green-PointHop due to its low computational complexity. We evaluate the performance of Green-PointHop on ModelNet40 and ScanObjectNN two datasets. Green-PointHop has a model size of 64K parameters. It demands 2.3M floating-point operations (FLOPs) to classify a ModelNet40 object of 1024 down-sampled points. Its classification performance gaps against the state-of-the-art DGCNN method are 3% and 7% for ModelNet40 and ScanObjectNN, respectively. On the other hand, the model size and inference complexity of DGCNN are 42X and 1203X of those of Green-PointHop, respectively.

CVJun 1, 2023
Reconstruction Distortion of Learned Image Compression with Imperceptible Perturbations

Yang Sui, Zhuohang Li, Ding Ding et al.

Learned Image Compression (LIC) has recently become the trending technique for image transmission due to its notable performance. Despite its popularity, the robustness of LIC with respect to the quality of image reconstruction remains under-explored. In this paper, we introduce an imperceptible attack approach designed to effectively degrade the reconstruction quality of LIC, resulting in the reconstructed image being severely disrupted by noise where any object in the reconstructed images is virtually impossible. More specifically, we generate adversarial examples by introducing a Frobenius norm-based loss function to maximize the discrepancy between original images and reconstructed adversarial examples. Further, leveraging the insensitivity of high-frequency components to human vision, we introduce Imperceptibility Constraint (IC) to ensure that the perturbations remain inconspicuous. Experiments conducted on the Kodak dataset using various LIC models demonstrate effectiveness. In addition, we provide several findings and suggestions for designing future defenses.

CVOct 29, 2022
GPA-Net:No-Reference Point Cloud Quality Assessment with Multi-task Graph Convolutional Network

Ziyu Shan, Qi Yang, Rui Ye et al.

With the rapid development of 3D vision, point cloud has become an increasingly popular 3D visual media content. Due to the irregular structure, point cloud has posed novel challenges to the related research, such as compression, transmission, rendering and quality assessment. In these latest researches, point cloud quality assessment (PCQA) has attracted wide attention due to its significant role in guiding practical applications, especially in many cases where the reference point cloud is unavailable. However, current no-reference metrics which based on prevalent deep neural network have apparent disadvantages. For example, to adapt to the irregular structure of point cloud, they require preprocessing such as voxelization and projection that introduce extra distortions, and the applied grid-kernel networks, such as Convolutional Neural Networks, fail to extract effective distortion-related features. Besides, they rarely consider the various distortion patterns and the philosophy that PCQA should exhibit shifting, scaling, and rotational invariance. In this paper, we propose a novel no-reference PCQA metric named the Graph convolutional PCQA network (GPA-Net). To extract effective features for PCQA, we propose a new graph convolution kernel, i.e., GPAConv, which attentively captures the perturbation of structure and texture. Then, we propose the multi-task framework consisting of one main task (quality regression) and two auxiliary tasks (distortion type and degree predictions). Finally, we propose a coordinate normalization module to stabilize the results of GPAConv under shift, scale and rotation transformations. Experimental results on two independent databases show that GPA-Net achieves the best performance compared to the state-of-the-art no-reference PCQA metrics, even better than some full-reference metrics in some cases.

CVAug 3, 2023
TDMD: A Database for Dynamic Color Mesh Subjective and Objective Quality Explorations

Qi Yang, Joel Jung, Timon Deschamps et al.

Dynamic colored meshes (DCM) are widely used in various applications; however, these meshes may undergo different processes, such as compression or transmission, which can distort them and degrade their quality. To facilitate the development of objective metrics for DCMs and study the influence of typical distortions on their perception, we create the Tencent - dynamic colored mesh database (TDMD) containing eight reference DCM objects with six typical distortions. Using processed video sequences (PVS) derived from the DCM, we have conducted a large-scale subjective experiment that resulted in 303 distorted DCM samples with mean opinion scores, making the TDMD the largest available DCM database to our knowledge. This database enabled us to study the impact of different types of distortion on human perception and offer recommendations for DCM compression and related tasks. Additionally, we have evaluated three types of state-of-the-art objective metrics on the TDMD, including image-based, point-based, and video-based metrics, on the TDMD. Our experimental results highlight the strengths and weaknesses of each metric, and we provide suggestions about the selection of metrics in practical DCM applications. The TDMD will be made publicly available at the following location: https://multimedia.tencent.com/resources/tdmd.

IVAug 17, 2023
Dynamic Kernel-Based Adaptive Spatial Aggregation for Learned Image Compression

Huairui Wang, Nianxiang Fu, Zhenzhong Chen et al.

Learned image compression methods have shown superior rate-distortion performance and remarkable potential compared to traditional compression methods. Most existing learned approaches use stacked convolution or window-based self-attention for transform coding, which aggregate spatial information in a fixed range. In this paper, we focus on extending spatial aggregation capability and propose a dynamic kernel-based transform coding. The proposed adaptive aggregation generates kernel offsets to capture valid information in the content-conditioned range to help transform. With the adaptive aggregation strategy and the sharing weights mechanism, our method can achieve promising transform capability with acceptable model complexity. Besides, according to the recent progress of entropy model, we define a generalized coarse-to-fine entropy model, considering the coarse global context, the channel-wise, and the spatial context. Based on it, we introduce dynamic kernel in hyper-prior to generate more expressive global context. Furthermore, we propose an asymmetric spatial-channel entropy model according to the investigation of the spatial characteristics of the grouped latents. The asymmetric entropy model aims to reduce statistical redundancy while maintaining coding efficiency. Experimental results demonstrate that our method achieves superior rate-distortion performance on three benchmarks compared to the state-of-the-art learning-based methods.

MEOct 30, 2022
Changes from Classical Statistics to Modern Statistics and Data Science

Kai Zhang, Shan Liu, Momiao Xiong

A coordinate system is a foundation for every quantitative science, engineering, and medicine. Classical physics and statistics are based on the Cartesian coordinate system. The classical probability and hypothesis testing theory can only be applied to Euclidean data. However, modern data in the real world are from natural language processing, mathematical formulas, social networks, transportation and sensor networks, computer visions, automations, and biomedical measurements. The Euclidean assumption is not appropriate for non Euclidean data. This perspective addresses the urgent need to overcome those fundamental limitations and encourages extensions of classical probability theory and hypothesis testing , diffusion models and stochastic differential equations from Euclidean space to non Euclidean space. Artificial intelligence such as natural language processing, computer vision, graphical neural networks, manifold regression and inference theory, manifold learning, graph neural networks, compositional diffusion models for automatically compositional generations of concepts and demystifying machine learning systems, has been rapidly developed. Differential manifold theory is the mathematic foundations of deep learning and data science as well. We urgently need to shift the paradigm for data analysis from the classical Euclidean data analysis to both Euclidean and non Euclidean data analysis and develop more and more innovative methods for describing, estimating and inferring non Euclidean geometries of modern real datasets. A general framework for integrated analysis of both Euclidean and non Euclidean data, composite AI, decision intelligence and edge AI provide powerful innovative ideas and strategies for fundamentally advancing AI. We are expected to marry statistics with AI, develop a unified theory of modern statistics and drive next generation of AI and data science.

CVSep 27, 2023
SJTU-TMQA: A quality assessment database for static mesh with texture map

Bingyang Cui, Qi Yang, Kaifa Yang et al.

In recent years, static meshes with texture maps have become one of the most prevalent digital representations of 3D shapes in various applications, such as animation, gaming, medical imaging, and cultural heritage applications. However, little research has been done on the quality assessment of textured meshes, which hinders the development of quality-oriented applications, such as mesh compression and enhancement. In this paper, we create a large-scale textured mesh quality assessment database, namely SJTU-TMQA, which includes 21 reference meshes and 945 distorted samples. The meshes are rendered into processed video sequences and then conduct subjective experiments to obtain mean opinion scores (MOS). The diversity of content and accuracy of MOS has been shown to validate its heterogeneity and reliability. The impact of various types of distortion on human perception is demonstrated. 13 state-of-the-art objective metrics are evaluated on SJTU-TMQA. The results report the highest correlation of around 0.6, indicating the need for more effective objective metrics. The SJTU-TMQA is available at https://ccccby.github.io

CVSep 30, 2022
Point Cloud Quality Assessment using 3D Saliency Maps

Zhengyu Wang, Yujie Zhang, Qi Yang et al.

Point cloud quality assessment (PCQA) has become an appealing research field in recent days. Considering the importance of saliency detection in quality assessment, we propose an effective full-reference PCQA metric which makes the first attempt to utilize the saliency information to facilitate quality prediction, called point cloud quality assessment using 3D saliency maps (PQSM). Specifically, we first propose a projection-based point cloud saliency map generation method, in which depth information is introduced to better reflect the geometric characteristics of point clouds. Then, we construct point cloud local neighborhoods to derive three structural descriptors to indicate the geometry, color and saliency discrepancies. Finally, a saliency-based pooling strategy is proposed to generate the final quality score. Extensive experiments are performed on four independent PCQA databases. The results demonstrate that the proposed PQSM shows competitive performances compared to multiple state-of-the-art PCQA metrics.

CVAug 3, 2023
TSMD: A Database for Static Color Mesh Quality Assessment Study

Qi Yang, Joel Jung, Haiqiang Wang et al.

Static meshes with texture map are widely used in modern industrial and manufacturing sectors, attracting considerable attention in the mesh compression community due to its huge amount of data. To facilitate the study of static mesh compression algorithm and objective quality metric, we create the Tencent - Static Mesh Dataset (TSMD) containing 42 reference meshes with rich visual characteristics. 210 distorted samples are generated by the lossy compression scheme developed for the Call for Proposals on polygonal static mesh coding, released on June 23 by the Alliance for Open Media Volumetric Visual Media group. Using processed video sequences, a large-scale, crowdsourcing-based, subjective experiment was conducted to collect subjective scores from 74 viewers. The dataset undergoes analysis to validate its sample diversity and Mean Opinion Scores (MOS) accuracy, establishing its heterogeneous nature and reliability. State-of-the-art objective metrics are evaluated on the new dataset. Pearson and Spearman correlations around 0.75 are reported, deviating from results typically observed on less heterogeneous datasets, demonstrating the need for further development of more robust metrics. The TSMD, including meshes, PVSs, bitstreams, and MOS, is made publicly available at the following location: https://multimedia.tencent.com/resources/tsmd.

CVAug 9, 2023
GeodesicPSIM: Predicting the Quality of Static Mesh with Texture Map via Geodesic Patch Similarity

Qi Yang, Joel Jung, Xiaozhong Xu et al.

Static meshes with texture maps have attracted considerable attention in both industrial manufacturing and academic research, leading to an urgent requirement for effective and robust objective quality evaluation. However, current model-based static mesh quality metrics have obvious limitations: most of them only consider geometry information, while color information is ignored, and they have strict constraints for the meshes' geometrical topology. Other metrics, such as image-based and point-based metrics, are easily influenced by the prepossessing algorithms, e.g., projection and sampling, hampering their ability to perform at their best. In this paper, we propose Geodesic Patch Similarity (GeodesicPSIM), a novel model-based metric to accurately predict human perception quality for static meshes. After selecting a group keypoints, 1-hop geodesic patches are constructed based on both the reference and distorted meshes cleaned by an effective mesh cleaning algorithm. A two-step patch cropping algorithm and a patch texture mapping module refine the size of 1-hop geodesic patches and build the relationship between the mesh geometry and color information, resulting in the generation of 1-hop textured geodesic patches. Three types of features are extracted to quantify the distortion: patch color smoothness, patch discrete mean curvature, and patch pixel color average and variance. To the best of our knowledge, GeodesicPSIM is the first model-based metric especially designed for static meshes with texture maps. GeodesicPSIM provides state-of-the-art performance in comparison with image-based, point-based, and video-based metrics on a newly created and challenging database. We also prove the robustness of GeodesicPSIM by introducing different settings of hyperparameters. Ablation studies also exhibit the effectiveness of three proposed features and the patch cropping algorithm.

CVJul 18, 2022
Learning Knowledge Representation with Meta Knowledge Distillation for Single Image Super-Resolution

Han Zhu, Zhenzhong Chen, Shan Liu

Knowledge distillation (KD), which can efficiently transfer knowledge from a cumbersome network (teacher) to a compact network (student), has demonstrated its advantages in some computer vision applications. The representation of knowledge is vital for knowledge transferring and student learning, which is generally defined in hand-crafted manners or uses the intermediate features directly. In this paper, we propose a model-agnostic meta knowledge distillation method under the teacher-student architecture for the single image super-resolution task. It provides a more flexible and accurate way to help the teachers transmit knowledge in accordance with the abilities of students via knowledge representation networks (KRNets) with learnable parameters. In order to improve the perception ability of knowledge representation to students' requirements, we propose to solve the transformation process from intermediate outputs to transferred knowledge by employing the student features and the correlation between teacher and student in the KRNets. Specifically, the texture-aware dynamic kernels are generated and then extract texture features to be improved and the corresponding teacher guidance so as to decompose the distillation problem into texture-wise supervision for further promoting the recovery quality of high-frequency details. In addition, the KRNets are optimized in a meta-learning manner to ensure the knowledge transferring and the student learning are beneficial to improving the reconstructed quality of the student. Experiments conducted on various single image super-resolution datasets demonstrate that our proposed method outperforms existing defined knowledge representation related distillation methods, and can help super-resolution algorithms achieve better reconstruction quality without introducing any inference complexity.

IVNov 29, 2023
Corner-to-Center Long-range Context Model for Efficient Learned Image Compression

Yang Sui, Ding Ding, Xiang Pan et al.

In the framework of learned image compression, the context model plays a pivotal role in capturing the dependencies among latent representations. To reduce the decoding time resulting from the serial autoregressive context model, the parallel context model has been proposed as an alternative that necessitates only two passes during the decoding phase, thus facilitating efficient image compression in real-world scenarios. However, performance degradation occurs due to its incomplete casual context. To tackle this issue, we conduct an in-depth analysis of the performance degradation observed in existing parallel context models, focusing on two aspects: the Quantity and Quality of information utilized for context prediction and decoding. Based on such analysis, we propose the \textbf{Corner-to-Center transformer-based Context Model (C$^3$M)} designed to enhance context and latent predictions and improve rate-distortion performance. Specifically, we leverage the logarithmic-based prediction order to predict more context features from corner to center progressively. In addition, to enlarge the receptive field in the analysis and synthesis transformation, we use the Long-range Crossing Attention Module (LCAM) in the encoder/decoder to capture the long-range semantic information by assigning the different window shapes in different channels. Extensive experimental evaluations show that the proposed method is effective and outperforms the state-of-the-art parallel methods. Finally, according to the subjective analysis, we suggest that improving the detailed representation in transformer-based image compression is a promising direction to be explored.

IVJul 8, 2022
FAIVConf: Face enhancement for AI-based Video Conference with Low Bit-rate

Zhengang Li, Sheng Lin, Shan Liu et al.

Recently, high-quality video conferencing with fewer transmission bits has become a very hot and challenging problem. We propose FAIVConf, a specially designed video compression framework for video conferencing, based on the effective neural human face generation techniques. FAIVConf brings together several designs to improve the system robustness in real video conference scenarios: face-swapping to avoid artifacts in background animation; facial blurring to decrease transmission bit-rate and maintain the quality of extracted facial landmarks; and dynamic source update for face view interpolation to accommodate a large range of head poses. Our method achieves a significant bit-rate reduction in the video conference and gives much better visual quality under the same bit-rate compared with H.264 and H.265 coding schemes.

CVFeb 27, 2023
PointFlowHop: Green and Interpretable Scene Flow Estimation from Consecutive Point Clouds

Pranav Kadam, Jiahao Gu, Shan Liu et al.

An efficient 3D scene flow estimation method called PointFlowHop is proposed in this work. PointFlowHop takes two consecutive point clouds and determines the 3D flow vectors for every point in the first point cloud. PointFlowHop decomposes the scene flow estimation task into a set of subtasks, including ego-motion compensation, object association and object-wise motion estimation. It follows the green learning (GL) pipeline and adopts the feedforward data processing path. As a result, its underlying mechanism is more transparent than deep-learning (DL) solutions based on end-to-end optimization of network parameters. We conduct experiments on the stereoKITTI and the Argoverse LiDAR point cloud datasets and demonstrate that PointFlowHop outperforms deep-learning methods with a small model size and less training time. Furthermore, we compare the Floating Point Operations (FLOPs) required by PointFlowHop and other learning-based methods in inference, and show its big savings in computational complexity.

CLJul 20, 2023
Layer-wise Representation Fusion for Compositional Generalization

Yafang Zheng, Lei Lin, Shuangtao Li et al.

Existing neural models are demonstrated to struggle with compositional generalization (CG), i.e., the ability to systematically generalize to unseen compositions of seen components. A key reason for failure on CG is that the syntactic and semantic representations of sequences in both the uppermost layer of the encoder and decoder are entangled. However, previous work concentrates on separating the learning of syntax and semantics instead of exploring the reasons behind the representation entanglement (RE) problem to solve it. We explain why it exists by analyzing the representation evolving mechanism from the bottom to the top of the Transformer layers. We find that the ``shallow'' residual connections within each layer fail to fuse previous layers' information effectively, leading to information forgetting between layers and further the RE problems. Inspired by this, we propose LRF, a novel \textbf{L}ayer-wise \textbf{R}epresentation \textbf{F}usion framework for CG, which learns to fuse previous layers' information back into the encoding and decoding process effectively through introducing a \emph{fuse-attention module} at each encoder and decoder layer. LRF achieves promising results on two realistic benchmarks, empirically demonstrating the effectiveness of our proposal.

IVApr 17, 2024Code
NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

Xin Li, Kun Yuan, Yajing Pei et al.

This paper reviews the NTIRE 2024 Challenge on Shortform UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions are submitted and evaluated on the collected dataset KVQ from popular short-form video platform, i.e., Kuaishou/Kwai Platform. The KVQ database is divided into three parts, including 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition had 200 participants and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performances for S-UGC VQA. The project can be found at https://github.com/lixinustc/KVQChallenge-CVPR-NTIRE2024.

IVAug 13, 2024
BVI-UGC: A Video Quality Database for User-Generated Content Transcoding

Zihao Qi, Chen Feng, Fan Zhang et al.

In recent years, user-generated content (UGC) has become one of the major video types consumed via streaming networks. Numerous research contributions have focused on assessing its visual quality through subjective tests and objective modeling. In most cases, objective assessments are based on a no-reference scenario, where the corresponding reference content is assumed not to be available. However, full-reference video quality assessment is also important for UGC in the delivery pipeline, particularly associated with the video transcoding process. In this context, we present a new UGC video quality database, BVI-UGC, for user-generated content transcoding, which contains 60 (non-pristine) reference videos and 1,080 test sequences. In this work, we simulated the creation of non-pristine reference sequences (with a wide range of compression distortions), typical of content uploaded to UGC platforms for transcoding. A comprehensive crowdsourced subjective study was then conducted involving more than 3,500 human participants. Based on this collected subjective data, we benchmarked the performance of 10 full-reference and 11 no-reference quality metrics. Our results demonstrate the poor performance (SROCC values are lower than 0.6) of these metrics in predicting the perceptual quality of UGC in two different scenarios (with or without a reference).

IVApr 17, 2025Code
NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results

Xin Li, Kun Yuan, Bingchen Li et al.

This paper presents a review for the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating reliance on model ensembles, redundant weights, and other computationally expensive components in the previous IQA/VQA competitions. Track 2 introduces a new short-form UGC dataset tailored for single image super-resolution, i.e., the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, which are split into training, validation, and test sets using a ratio of 8:1:1. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. This challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, significantly contributing to the progress of short-form UGC VQA and image superresolution. The project is publicly available at https://github.com/lixinustc/KVQE- ChallengeCVPR-NTIRE2025.

99.0AIMay 15
PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

Jingxuan Wei, Xi Bai, Shan Liu et al.

Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.

LGMar 1, 2024Code
Graph Construction with Flexible Nodes for Traffic Demand Prediction

Jinyan Hou, Shan Liu, Ya Zhang et al.

Graph neural networks (GNNs) have been widely applied in traffic demand prediction, and transportation modes can be divided into station-based mode and free-floating traffic mode. Existing research in traffic graph construction primarily relies on map matching to construct graphs based on the road network. However, the complexity and inhomogeneity of data distribution in free-floating traffic demand forecasting make road network matching inflexible. To tackle these challenges, this paper introduces a novel graph construction method tailored to free-floating traffic mode. We propose a novel density-based clustering algorithm (HDPC-L) to determine the flexible positioning of nodes in the graph, overcoming the computational bottlenecks of traditional clustering algorithms and enabling effective handling of large-scale datasets. Furthermore, we extract valuable information from ridership data to initialize the edge weights of GNNs. Comprehensive experiments on two real-world datasets, the Shenzhen bike-sharing dataset and the Haikou ride-hailing dataset, show that the method significantly improves the performance of the model. On average, our models show an improvement in accuracy of around 25\% and 19.5\% on the two datasets. Additionally, it significantly enhances computational efficiency, reducing training time by approximately 12% and 32.5% on the two datasets. We make our code available at https://github.com/houjinyan/HDPC-L-ODInit.

CLMay 20, 2023Code
Learning to Compose Representations of Different Encoder Layers towards Improving Compositional Generalization

Lei Lin, Shuangtao Li, Yafang Zheng et al.

Recent studies have shown that sequence-to-sequence (seq2seq) models struggle with compositional generalization (CG), i.e., the ability to systematically generalize to unseen compositions of seen components. There is mounting evidence that one of the reasons hindering CG is the representation of the encoder uppermost layer is entangled, i.e., the syntactic and semantic representations of sequences are entangled. However, we consider that the previously identified representation entanglement problem is not comprehensive enough. Additionally, we hypothesize that the source keys and values representations passing into different decoder layers are also entangled. Starting from this intuition, we propose \textsc{CompoSition} (\textbf{Compo}se \textbf{S}yntactic and Semant\textbf{i}c Representa\textbf{tion}s), an extension to seq2seq models which learns to compose representations of different encoder layers dynamically for different tasks, since recent studies reveal that the bottom layers of the Transformer encoder contain more syntactic information and the top ones contain more semantic information. Specifically, we introduce a \textit{composed layer} between the encoder and decoder to compose different encoder layers' representations to generate specific keys and values passing into different decoder layers. \textsc{CompoSition} achieves competitive results on two comprehensive and realistic benchmarks, which empirically demonstrates the effectiveness of our proposal. Codes are available at~\url{https://github.com/thinkaboutzero/COMPOSITION}.

CVFeb 12, 2022Code
OctAttention: Octree-Based Large-Scale Contexts Model for Point Cloud Compression

Chunyang Fu, Ge Li, Rui Song et al.

In point cloud compression, sufficient contexts are significant for modeling the point cloud distribution. However, the contexts gathered by the previous voxel-based methods decrease when handling sparse point clouds. To address this problem, we propose a multiple-contexts deep learning framework called OctAttention employing the octree structure, a memory-efficient representation for point clouds. Our approach encodes octree symbol sequences in a lossless way by gathering the information of sibling and ancestor nodes. Expressly, we first represent point clouds with octree to reduce spatial redundancy, which is robust for point clouds with different resolutions. We then design a conditional entropy model with a large receptive field that models the sibling and ancestor contexts to exploit the strong dependency among the neighboring nodes and employ an attention mechanism to emphasize the correlated nodes in the context. Furthermore, we introduce a mask operation during training and testing to make a trade-off between encoding time and performance. Compared to the previous state-of-the-art works, our approach obtains a 10%-35% BD-Rate gain on the LiDAR benchmark (e.g. SemanticKITTI) and object point cloud dataset (e.g. MPEG 8i, MVUB), and saves 95% coding time compared to the voxel-based baseline. The code is available at https://github.com/zb12138/OctAttention.

CVSep 17, 2021Code
PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering

Yurui Ren, Ge Li, Yuanqi Chen et al.

Generating portrait images by controlling the motions of existing faces is an important task of great consequence to social media industries. For easy use and intuitive control, semantically meaningful and fully disentangled parameters should be used as modifications. However, many existing techniques do not provide such fine-grained controls or use indirect editing methods i.e. mimic motions of other individuals. In this paper, a Portrait Image Neural Renderer (PIRenderer) is proposed to control the face motions with the parameters of three-dimensional morphable face models (3DMMs). The proposed model can generate photo-realistic portrait images with accurate movements according to intuitive modifications. Experiments on both direct and indirect editing tasks demonstrate the superiority of this model. Meanwhile, we further extend this model to tackle the audio-driven facial reenactment task by extracting sequential motions from audio inputs. We show that our model can generate coherent videos with convincing movements from only a single reference image and a driving audio stream. Our source code is available at https://github.com/RenYurui/PIRender.

CVDec 10, 2020Code
SSD-GAN: Measuring the Realness in the Spatial and Spectral Domains

Yuanqi Chen, Ge Li, Cece Jin et al.

This paper observes that there is an issue of high frequencies missing in the discriminator of standard GAN, and we reveal it stems from downsampling layers employed in the network architecture. This issue makes the generator lack the incentive from the discriminator to learn high-frequency content of data, resulting in a significant spectrum discrepancy between generated images and real images. Since the Fourier transform is a bijective mapping, we argue that reducing this spectrum discrepancy would boost the performance of GANs. To this end, we introduce SSD-GAN, an enhancement of GANs to alleviate the spectral information loss in the discriminator. Specifically, we propose to embed a frequency-aware classifier into the discriminator to measure the realness of the input in both the spatial and spectral domains. With the enhanced discriminator, the generator of SSD-GAN is encouraged to learn high-frequency content of real data and generate exact details. The proposed method is general and can be easily integrated into most existing GANs framework without excessive cost. The effectiveness of SSD-GAN is validated on various network architectures, objective functions, and datasets. Code will be available at https://github.com/cyq373/SSD-GAN.

CVAug 27, 2020Code
Deep Spatial Transformation for Pose-Guided Person Image Generation and Animation

Yurui Ren, Ge Li, Shan Liu et al.

Pose-guided person image generation and animation aim to transform a source person image to target poses. These tasks require spatial manipulation of source data. However, Convolutional Neural Networks are limited by the lack of ability to spatially transform the inputs. In this paper, we propose a differentiable global-flow local-attention framework to reassemble the inputs at the feature level. This framework first estimates global flow fields between sources and targets. Then, corresponding local source feature patches are sampled with content-aware local attention coefficients. We show that our framework can spatially transform the inputs in an efficient manner. Meanwhile, we further model the temporal consistency for the person image animation task to generate coherent videos. The experiment results of both image generation and animation tasks demonstrate the superiority of our model. Besides, additional results of novel view synthesis and face image animation show that our model is applicable to other tasks requiring spatial transformation. The source code of our project is available at https://github.com/RenYurui/Global-Flow-Local-Attention.

CVMar 15, 2024
Contrastive Pre-Training with Multi-View Fusion for No-Reference Point Cloud Quality Assessment

Ziyu Shan, Yujie Zhang, Qi Yang et al.

No-reference point cloud quality assessment (NR-PCQA) aims to automatically evaluate the perceptual quality of distorted point clouds without available reference, which have achieved tremendous improvements due to the utilization of deep neural networks. However, learning-based NR-PCQA methods suffer from the scarcity of labeled data and usually perform suboptimally in terms of generalization. To solve the problem, we propose a novel contrastive pre-training framework tailored for PCQA (CoPA), which enables the pre-trained model to learn quality-aware representations from unlabeled data. To obtain anchors in the representation space, we project point clouds with different distortions into images and randomly mix their local patches to form mixed images with multiple distortions. Utilizing the generated anchors, we constrain the pre-training process via a quality-aware contrastive loss following the philosophy that perceptual quality is closely related to both content and distortion. Furthermore, in the model fine-tuning stage, we propose a semantic-guided multi-view fusion module to effectively integrate the features of projected images from multiple perspectives. Extensive experiments show that our method outperforms the state-of-the-art PCQA methods on popular benchmarks. Further investigations demonstrate that CoPA can also benefit existing learning-based PCQA models.

CVApr 24, 2024
AIS 2024 Challenge on Video Quality Assessment of User-Generated Content: Methods and Results

Marcos V. Conde, Saman Zadtootaghaj, Nabajeet Barman et al.

This paper reviews the AIS 2024 Video Quality Assessment (VQA) Challenge, focused on User-Generated Content (UGC). The aim of this challenge is to gather deep learning-based methods capable of estimating the perceptual quality of UGC videos. The user-generated videos from the YouTube UGC Dataset include diverse content (sports, games, lyrics, anime, etc.), quality and resolutions. The proposed methods must process 30 FHD frames under 1 second. In the challenge, a total of 102 participants registered, and 15 submitted code and models. The performance of the top-5 submissions is reviewed and provided here as a survey of diverse deep models for efficient video quality assessment of user-generated content.

CVMar 15, 2024
PAME: Self-Supervised Masked Autoencoder for No-Reference Point Cloud Quality Assessment

Ziyu Shan, Yujie Zhang, Qi Yang et al.

No-reference point cloud quality assessment (NR-PCQA) aims to automatically predict the perceptual quality of point clouds without reference, which has achieved remarkable performance due to the utilization of deep learning-based models. However, these data-driven models suffer from the scarcity of labeled data and perform unsatisfactorily in cross-dataset evaluations. To address this problem, we propose a self-supervised pre-training framework using masked autoencoders (PAME) to help the model learn useful representations without labels. Specifically, after projecting point clouds into images, our PAME employs dual-branch autoencoders, reconstructing masked patches from distorted images into the original patches within reference and distorted images. In this manner, the two branches can separately learn content-aware features and distortion-aware features from the projected images. Furthermore, in the model fine-tuning stage, the learned content-aware features serve as a guide to fuse the point cloud quality features extracted from different perspectives. Extensive experiments show that our method outperforms the state-of-the-art NR-PCQA methods on popular benchmarks in terms of prediction accuracy and generalizability.

IVApr 28, 2024
Joint Reference Frame Synthesis and Post Filter Enhancement for Versatile Video Coding

Weijie Bao, Yuantong Zhang, Jianghao Jia et al.

This paper presents the joint reference frame synthesis (RFS) and post-processing filter enhancement (PFE) for Versatile Video Coding (VVC), aiming to explore the combination of different neural network-based video coding (NNVC) tools to better utilize the hierarchical bi-directional coding structure of VVC. Both RFS and PFE utilize the Space-Time Enhancement Network (STENet), which receives two input frames with artifacts and produces two enhanced frames with suppressed artifacts, along with an intermediate synthesized frame. STENet comprises two pipelines, the synthesis pipeline and the enhancement pipeline, tailored for different purposes. During RFS, two reconstructed frames are sent into STENet's synthesis pipeline to synthesize a virtual reference frame, similar to the current to-be-coded frame. The synthesized frame serves as an additional reference frame inserted into the reference picture list (RPL). During PFE, two reconstructed frames are fed into STENet's enhancement pipeline to alleviate their artifacts and distortions, resulting in enhanced frames with reduced artifacts and distortions. To reduce inference complexity, we propose joint inference of RFS and PFE (JISE), achieved through a single execution of STENet. Integrated into the VVC reference software VTM-15.0, RFS, PFE, and JISE are coordinated within a novel Space-Time Enhancement Window (STEW) under Random Access (RA) configuration. The proposed method could achieve -7.34%/-17.21%/-16.65% PSNR-based BD-rate on average for three components under RA configuration.

CVMay 27, 2025
Instance Data Condensation for Image Super-Resolution

Tianhao Peng, Ho Man Kwan, Yuxuan Jiang et al.

Deep learning based image Super-Resolution (ISR) relies on large training datasets to optimize model generalization; this requires substantial computational and storage resources during training. While dataset condensation has shown potential in improving data efficiency and privacy for high-level computer vision tasks, it has not yet been fully exploited for ISR. In this paper, we propose a novel Instance Data Condensation (IDC) framework specifically for ISR, which achieves instance-level data condensation through Random Local Fourier Feature Extraction and Multi-level Feature Distribution Matching. This aims to optimize feature distributions at both global and local levels and obtain high-quality synthesized training content with fine detail. This framework has been utilized to condense the most commonly used training dataset for ISR, DIV2K, with a 10% condensation rate. The resulting synthetic dataset offers comparable or (in certain cases) even better performance compared to the original full dataset and excellent training stability when used to train various popular ISR models. To the best of our knowledge, this is the first time that a condensed/synthetic dataset (with a 10% data volume) has demonstrated such performance. The source code and the synthetic dataset have been made available at https://github.com/.

LGJan 27
Tracking Drift: Variation-Aware Entropy Scheduling for Non-Stationary Reinforcement Learning

Tongxi Wang, Zhuoyang Xia, Xinran Chen et al.

Real-world reinforcement learning often faces environment drift, but most existing methods rely on static entropy coefficients/target entropy, causing over-exploration during stable periods and under-exploration after drift (thus slow recovery), and leaving unanswered the principled question of how exploration intensity should scale with drift magnitude. We prove that entropy scheduling under non-stationarity can be reduced to a one-dimensional, round-by-round trade-off, faster tracking of the optimal solution after drift vs. avoiding gratuitous randomness when the environment is stable, so exploration strength can be driven by measurable online drift signals. Building on this, we propose AES (Adaptive Entropy Scheduling), which adaptively adjusts the entropy coefficient/temperature online using observable drift proxies during training, requiring almost no structural changes and incurring minimal overhead. Across 4 algorithm variants, 12 tasks, and 4 drift modes, AES significantly reduces the fraction of performance degradation caused by drift and accelerates recovery after abrupt changes.

AINov 19, 2025
SOLID: a Framework of Synergizing Optimization and LLMs for Intelligent Decision-Making

Yinsheng Wang, Tario G You, Léonard Boussioux et al.

This paper introduces SOLID (Synergizing Optimization and Large Language Models for Intelligent Decision-Making), a novel framework that integrates mathematical optimization with the contextual capabilities of large language models (LLMs). SOLID facilitates iterative collaboration between optimization and LLMs agents through dual prices and deviation penalties. This interaction improves the quality of the decisions while maintaining modularity and data privacy. The framework retains theoretical convergence guarantees under convexity assumptions, providing insight into the design of LLMs prompt. To evaluate SOLID, we applied it to a stock portfolio investment case with historical prices and financial news as inputs. Empirical results demonstrate convergence under various scenarios and indicate improved annualized returns compared to a baseline optimizer-only method, validating the synergy of the two agents. SOLID offers a promising framework for advancing automated and intelligent decision-making across diverse domains.

IVOct 13, 2025
An Overview of the JPEG AI Learning-Based Image Coding Standard

Semih Esenlik, Yaojun Wu, Zhaobin Zhang et al.

JPEG AI is an emerging learning-based image coding standard developed by Joint Photographic Experts Group (JPEG). The scope of the JPEG AI is the creation of a practical learning-based image coding standard offering a single-stream, compact compressed domain representation, targeting both human visualization and machine consumption. Scheduled for completion in early 2025, the first version of JPEG AI focuses on human vision tasks, demonstrating significant BD-rate reductions compared to existing standards, in terms of MS-SSIM, FSIM, VIF, VMAF, PSNR-HVS, IW-SSIM and NLPD quality metrics. Designed to ensure broad interoperability, JPEG AI incorporates various design features to support deployment across diverse devices and applications. This paper provides an overview of the technical features and characteristics of the JPEG AI standard.

CVJun 15, 2025
Cross-architecture universal feature coding via distribution alignment

Changsheng Gao, Shan Liu, Feng Wu et al.

Feature coding has become increasingly important in scenarios where semantic representations rather than raw pixels are transmitted and stored. However, most existing methods are architecture-specific, targeting either CNNs or Transformers. This design limits their applicability in real-world scenarios where features from both architectures coexist. To address this gap, we introduce a new research problem: cross-architecture universal feature coding (CAUFC), which seeks to build a unified codec that can effectively compress features from heterogeneous architectures. To tackle this challenge, we propose a two-step distribution alignment method. First, we design the format alignment method that unifies CNN and Transformer features into a consistent 2D token format. Second, we propose the feature value alignment method that harmonizes statistical distributions via truncation and normalization. As a first attempt to study CAUFC, we evaluate our method on the image classification task. Experimental results demonstrate that our method achieves superior rate-accuracy trade-offs compared to the architecture-specific baseline. This work marks an initial step toward universal feature compression across heterogeneous model architectures.

CVJan 6, 2024
Transferable Learned Image Compression-Resistant Adversarial Perturbations

Yang Sui, Zhuohang Li, Ding Ding et al.

Adversarial attacks can readily disrupt the image classification system, revealing the vulnerability of DNN-based recognition tasks. While existing adversarial perturbations are primarily applied to uncompressed images or compressed images by the traditional image compression method, i.e., JPEG, limited studies have investigated the robustness of models for image classification in the context of DNN-based image compression. With the rapid evolution of advanced image compression, DNN-based learned image compression has emerged as the promising approach for transmitting images in many security-critical applications, such as cloud-based face recognition and autonomous driving, due to its superior performance over traditional compression. Therefore, there is a pressing need to fully investigate the robustness of a classification system post-processed by learned image compression. To bridge this research gap, we explore the adversarial attack on a new pipeline that targets image classification models that utilize learned image compressors as pre-processing modules. Furthermore, to enhance the transferability of perturbations across various quality levels and architectures of learned image compression models, we introduce a saliency score-based sampling method to enable the fast generation of transferable perturbation. Extensive experiments with popular attack methods demonstrate the enhanced transferability of our proposed method when attacking images that have been post-processed with different learned image compression models.

CVDec 11, 2023
Point Cloud Semantic Segmentation with Sparse and Inhomogeneous Annotations

Zhiyi Pan, Nan Zhang, Wei Gao et al.

Utilizing uniformly distributed sparse annotations, weakly supervised learning alleviates the heavy reliance on fine-grained annotations in point cloud semantic segmentation tasks. However, few works discuss the inhomogeneity of sparse annotations, albeit it is common in real-world scenarios. Therefore, this work introduces the probability density function into the gradient sampling approximation method to qualitatively analyze the impact of annotation sparsity and inhomogeneity under weakly supervised learning. Based on our analysis, we propose an Adaptive Annotation Distribution Network (AADNet) capable of robust learning on arbitrarily distributed sparse annotations. Specifically, we propose a label-aware point cloud downsampling strategy to increase the proportion of annotations involved in the training stage. Furthermore, we design the multiplicative dynamic entropy as the gradient calibration function to mitigate the gradient bias caused by non-uniformly distributed sparse annotations and explicitly reduce the epistemic uncertainty. Without any prior restrictions and additional information, our proposed method achieves comprehensive performance improvements at multiple label rates and different annotation distributions.

CVMay 16, 2023
PanelNet: Understanding 360 Indoor Environment via Panel Representation

Haozheng Yu, Lu He, Bing Jian et al.

Indoor 360 panoramas have two essential properties. (1) The panoramas are continuous and seamless in the horizontal direction. (2) Gravity plays an important role in indoor environment design. By leveraging these properties, we present PanelNet, a framework that understands indoor environments using a novel panel representation of 360 images. We represent an equirectangular projection (ERP) as consecutive vertical panels with corresponding 3D panel geometry. To reduce the negative impact of panoramic distortion, we incorporate a panel geometry embedding network that encodes both the local and global geometric features of a panel. To capture the geometric context in room design, we introduce Local2Global Transformer, which aggregates local information within a panel and panel-wise global context. It greatly improves the model performance with low training overhead. Our method outperforms existing methods on indoor 360 depth estimation and shows competitive results against state-of-the-art approaches on the task of indoor layout estimation and semantic segmentation.

CVFeb 16, 2022
PCRP: Unsupervised Point Cloud Object Retrieval and Pose Estimation

Pranav Kadam, Qingyang Zhou, Shan Liu et al.

An unsupervised point cloud object retrieval and pose estimation method, called PCRP, is proposed in this work. It is assumed that there exists a gallery point cloud set that contains point cloud objects with given pose orientation information. PCRP attempts to register the unknown point cloud object with those in the gallery set so as to achieve content-based object retrieval and pose estimation jointly, where the point cloud registration task is built upon an enhanced version of the unsupervised R-PointHop method. Experiments on the ModelNet40 dataset demonstrate the superior performance of PCRP in comparison with traditional and learning based methods.

CVDec 8, 2021
GreenPCO: An Unsupervised Lightweight Point Cloud Odometry Method

Pranav Kadam, Min Zhang, Jiahao Gu et al.

Visual odometry aims to track the incremental motion of an object using the information captured by visual sensors. In this work, we study the point cloud odometry problem, where only the point cloud scans obtained by the LiDAR (Light Detection And Ranging) are used to estimate object's motion trajectory. A lightweight point cloud odometry solution is proposed and named the green point cloud odometry (GreenPCO) method. GreenPCO is an unsupervised learning method that predicts object motion by matching features of consecutive point cloud scans. It consists of three steps. First, a geometry-aware point sampling scheme is used to select discriminant points from the large point cloud. Second, the view is partitioned into four regions surrounding the object, and the PointHop++ method is used to extract point features. Third, point correspondences are established to estimate object motion between two consecutive scans. Experiments on the KITTI dataset are conducted to demonstrate the effectiveness of the GreenPCO method. It is observed that GreenPCO outperforms benchmarking deep learning methods in accuracy while it has a significantly smaller model size and less training time.

IVNov 16, 2021
Online Meta Adaptation for Variable-Rate Learned Image Compression

Wei Jiang, Wei Wang, Songnan Li et al.

This work addresses two major issues of end-to-end learned image compression (LIC) based on deep neural networks: variable-rate learning where separate networks are required to generate compressed images with varying qualities, and the train-test mismatch between differentiable approximate quantization and true hard quantization. We introduce an online meta-learning (OML) setting for LIC, which combines ideas from meta learning and online learning in the conditional variational auto-encoder (CVAE) framework. By treating the conditional variables as meta parameters and treating the generated conditional features as meta priors, the desired reconstruction can be controlled by the meta parameters to accommodate compression with variable qualities. The online learning framework is used to update the meta parameters so that the conditional reconstruction is adaptively tuned for the current image. Through the OML mechanism, the meta parameters can be effectively updated through SGD. The conditional reconstruction is directly based on the quantized latent representation in the decoder network, and therefore helps to bridge the gap between the training estimation and true quantized latent distribution. Experiments demonstrate that our OML approach can be flexibly applied to different state-of-the-art LIC methods to achieve additional performance improvements with little computation and transmission overhead.

CVSep 24, 2021
GSIP: Green Semantic Segmentation of Large-Scale Indoor Point Clouds

Min Zhang, Pranav Kadam, Shan Liu et al.

An efficient solution to semantic segmentation of large-scale indoor scene point clouds is proposed in this work. It is named GSIP (Green Segmentation of Indoor Point clouds) and its performance is evaluated on a representative large-scale benchmark -- the Stanford 3D Indoor Segmentation (S3DIS) dataset. GSIP has two novel components: 1) a room-style data pre-processing method that selects a proper subset of points for further processing, and 2) a new feature extractor which is extended from PointHop. For the former, sampled points of each room form an input unit. For the latter, the weaknesses of PointHop's feature extraction when extending it to large-scale point clouds are identified and fixed with a simpler processing pipeline. As compared with PointNet, which is a pioneering deep-learning-based solution, GSIP is green since it has significantly lower computational complexity and a much smaller model size. Furthermore, experiments show that GSIP outperforms PointNet in segmentation performance for the S3DIS dataset.

CVAug 4, 2021
Combining Attention with Flow for Person Image Synthesis

Yurui Ren, Yubo Wu, Thomas H. Li et al.

Pose-guided person image synthesis aims to synthesize person images by transforming reference images into target poses. In this paper, we observe that the commonly used spatial transformation blocks have complementary advantages. We propose a novel model by combining the attention operation with the flow-based operation. Our model not only takes the advantage of the attention operation to generate accurate target structures but also uses the flow-based operation to sample realistic source textures. Both objective and subjective experiments demonstrate the superiority of our model. Meanwhile, comprehensive ablation studies verify our hypotheses and show the efficacy of the proposed modules. Besides, additional experiments on the portrait image editing task demonstrate the versatility of the proposed combination.

MMJul 29, 2021
Video-based Point Cloud Compression Artifact Removal

Anique Akhtar, Wen Gao, Li Li et al.

Photo-realistic point cloud capture and transmission are the fundamental enablers for immersive visual communication. The coding process of dynamic point clouds, especially video-based point cloud compression (V-PCC) developed by the MPEG standardization group, is now delivering state-of-the-art performance in compression efficiency. V-PCC is based on the projection of the point cloud patches to 2D planes and encoding the sequence as 2D texture and geometry patch sequences. However, the resulting quantization errors from coding can introduce compression artifacts, which can be very unpleasant for the quality of experience (QoE). In this work, we developed a novel out-of-the-loop point cloud geometry artifact removal solution that can significantly improve reconstruction quality without additional bandwidth cost. Our novel framework consists of a point cloud sampling scheme, an artifact removal network, and an aggregation scheme. The point cloud sampling scheme employs a cube-based neighborhood patch extraction to divide the point cloud into patches. The geometry artifact removal network then processes these patches to obtain artifact-removed patches. The artifact-removed patches are then merged together using an aggregation scheme to obtain the final artifact-removed point cloud. We employ 3D deep convolutional feature learning for geometry artifact removal that jointly recovers both the quantization direction and the quantization noise level by exploiting projection and quantization prior. The simulation results demonstrate that the proposed method is highly effective and can considerably improve the quality of the reconstructed point cloud.

LGJun 15, 2021
Efficient Micro-Structured Weight Unification and Pruning for Neural Network Compression

Sheng Lin, Wei Jiang, Wei Wang et al.

Compressing Deep Neural Network (DNN) models to alleviate the storage and computation requirements is essential for practical applications, especially for resource limited devices. Although capable of reducing a reasonable amount of model parameters, previous unstructured or structured weight pruning methods can hardly truly accelerate inference, either due to the poor hardware compatibility of the unstructured sparsity or due to the low sparse rate of the structurally pruned network. Aiming at reducing both storage and computation, as well as preserving the original task performance, we propose a generalized weight unification framework at a hardware compatible micro-structured level to achieve high amount of compression and acceleration. Weight coefficients of a selected micro-structured block are unified to reduce the storage and computation of the block without changing the neuron connections, which turns to a micro-structured pruning special case when all unified coefficients are set to zero, where neuron connections (hence storage and computation) are completely removed. In addition, we developed an effective training framework based on the alternating direction method of multipliers (ADMM), which converts our complex constrained optimization into separately solvable subproblems. Through iteratively optimizing the subproblems, the desired micro-structure can be ensured with high compression ratio and low performance degradation. We extensively evaluated our method using a variety of benchmark models and datasets for different applications. Experimental results demonstrate state-of-the-art performance.

CVMay 26, 2021
Recent Standard Development Activities on Video Coding for Machines

Wen Gao, Shan Liu, Xiaozhong Xu et al.

In recent years, video data has dominated internet traffic and becomes one of the major data formats. With the emerging 5G and internet of things (IoT) technologies, more and more videos are generated by edge devices, sent across networks, and consumed by machines. The volume of video consumed by machine is exceeding the volume of video consumed by humans. Machine vision tasks include object detection, segmentation, tracking, and other machine-based applications, which are quite different from those for human consumption. On the other hand, due to large volumes of video data, it is essential to compress video before transmission. Thus, efficient video coding for machines (VCM) has become an important topic in academia and industry. In July 2019, the international standardization organization, i.e., MPEG, created an Ad-Hoc group named VCM to study the requirements for potential standardization work. In this paper, we will address the recent development activities in the MPEG VCM group. Specifically, we will first provide an overview of the MPEG VCM group including use cases, requirements, processing pipelines, plan for potential VCM standards, followed by the evaluation framework including machine-vision tasks, dataset, evaluation metrics, and anchor generation. We then introduce technology solutions proposed so far and discuss the recent responses to the Call for Evidence issued by MPEG VCM group.