Shengxi Li

CV
h-index24
26papers
323citations
Novelty51%
AI Score58

26 Papers

94.9CVJun 4
CoFi-UCGen: Coarse-to-Fine Unsupervised Conditional Generation without Label Priors

Shengxi Li, Zhaokun Hu, Ce Zheng et al.

Unsupervised conditional image generation (UCGen) aims to control generation without relying on manually annotated labels, yet remains challenging due to unstructured semantic representations across granularities. To address this, we propose a novel coarse-to-fine UCGen framework (CoFi-UCGen) that explicitly disentangles global semantics from fine-grained variations, which to the best of our knowledge, sets out the first successful attempt for both coarse- and fine-grained conditional generation without any labels. More specifically, we first propose the adversarial semantic reciprocal learning theory to ensure the semantic consistency and completeness between images and latent spaces. Based on the consistency, we propose the bit-codes to learn a structured coarse-grained latent space, and further prove distinct global semantics inherent from our bit-codes while preserving independent noise sampling for generation. Building upon these bit-codes, we establish a fine-grained semantic basis and introduce a hierarchical modulation mechanism in diffusion models, by enabling layer-wise injection from coarse conditions to progressively control fine-grained attributes during generation. Extensive experiments demonstrate that without any label priors or pre-trained feature extractors, our CoFi-UCGen consistently outperforms existing UCGen methods in terms of image quality, semantic consistency, and control accuracy, verifying the effectiveness of explicit coarse-to-fine semantic decomposition for the challenging UCGen task.

73.9CVApr 19
Low Light Image Enhancement Challenge at NTIRE 2026

George Ciubotariu, Sharif S M A, Abdur Rehman et al.

This paper presents a comprehensive review of the NTIRE 2026 Low Light Image Enhancement Challenge, highlighting the proposed solutions and final results. The objective of this challenge is to identify effective networks capable of producing clearer and visually compelling images in diverse and challenging conditions by learning representative visual cues with the purpose of restoring information loss due to low-contrast and noisy images. A total of 195 participants registered for the first track and 153 for the second track of the competition, and 22 teams ultimately submitted valid entries. This paper thoroughly evaluates the state-of-the-art advances in (joint denoising and) low-light image enhancement, showcasing the significant progress in the field, while leveraging samples of our novel dataset.

58.1CVApr 12
NTIRE 2026 Challenge on Short-form UGC Video Restoration in the Wild with Generative Models: Datasets, Methods and Results

Xin Li, Jiachao Gong, Xijun Wang et al.

This paper presents an overview of the NTIRE 2026 Challenge on Short-form UGC Video Restoration in the Wild with Generative Models. This challenge utilizes a new short-form UGC (S-UGC) video restoration benchmark, termed KwaiVIR, which is contributed by USTC and Kuaishou Technology. It contains both synthetically distorted videos and real-world short-form UGC videos in the wild. For this edition, the released data include 200 synthetic training videos, 48 wild training videos, 11 validation videos, and 20 testing videos. The primary goal of this challenge is to establish a strong and practical benchmark for restoring short-form UGC videos under complex real-world degradations, especially in the emerging paradigm of generative-model-based S-UGC video restoration. This challenge has two tracks: (i) the primary track is a subjective track, where the evaluation is based on a user study; (ii) the second track is an objective track. These two tracks enable a comprehensive assessment of restoration quality. In total, 95 teams have registered for this competition. And 12 teams submitted valid final solutions and fact sheets for the testing phase. The submitted methods achieved strong performance on the KwaiVIR benchmark, demonstrating encouraging progress in short-form UGC video restoration in the wild.

CVOct 16, 2022
Demystifying CNNs for Images by Matched Filters

Shengxi Li, Xinyi Zhao, Ljubisa Stankovic et al.

The success of convolution neural networks (CNN) has been revolutionising the way we approach and use intelligent machines in the Big Data era. Despite success, CNNs have been consistently put under scrutiny owing to their \textit{black-box} nature, an \textit{ad hoc} manner of their construction, together with the lack of theoretical support and physical meanings of their operation. This has been prohibitive to both the quantitative and qualitative understanding of CNNs, and their application in more sensitive areas such as AI for health. We set out to address these issues, and in this way demystify the operation of CNNs, by employing the perspective of matched filtering. We first illuminate that the convolution operation, the very core of CNNs, represents a matched filter which aims to identify the presence of features in input data. This then serves as a vehicle to interpret the convolution-activation-pooling chain in CNNs under the theoretical umbrella of matched filtering, a common operation in signal processing. We further provide extensive examples and experiments to illustrate this connection, whereby the learning in CNNs is shown to also perform matched filtering, which further sheds light onto physical meaning of learnt parameters and layers. It is our hope that this material will provide new insights into the understanding, constructing and analysing of CNNs, as well as paving the way for developing new methods and architectures of CNNs.

CVJul 16, 2024
QVD: Post-training Quantization for Video Diffusion Models

Shilong Tian, Hong Chen, Chengtao Lv et al.

Recently, video diffusion models (VDMs) have garnered significant attention due to their notable advancements in generating coherent and realistic video content. However, processing multiple frame features concurrently, coupled with the considerable model size, results in high latency and extensive memory consumption, hindering their broader application. Post-training quantization (PTQ) is an effective technique to reduce memory footprint and improve computational efficiency. Unlike image diffusion, we observe that the temporal features, which are integrated into all frame features, exhibit pronounced skewness. Furthermore, we investigate significant inter-channel disparities and asymmetries in the activation of video diffusion models, resulting in low coverage of quantization levels by individual channels and increasing the challenge of quantization. To address these issues, we introduce the first PTQ strategy tailored for video diffusion models, dubbed QVD. Specifically, we propose the High Temporal Discriminability Quantization (HTDQ) method, designed for temporal features, which retains the high discriminability of quantized features, providing precise temporal guidance for all video frames. In addition, we present the Scattered Channel Range Integration (SCRI) method which aims to improve the coverage of quantization levels across individual channels. Experimental validations across various models, datasets, and bit-width settings demonstrate the effectiveness of our QVD in terms of diverse metrics. In particular, we achieve near-lossless performance degradation on W8A8, outperforming the current methods by 205.12 in FVD.

CVNov 12, 2025Code
Machines Serve Human: A Novel Variable Human-machine Collaborative Compression Framework

Zifu Zhang, Shengxi Li, Xiancheng Sun et al.

Human-machine collaborative compression has been receiving increasing research efforts for reducing image/video data, serving as the basis for both human perception and machine intelligence. Existing collaborative methods are dominantly built upon the de facto human-vision compression pipeline, witnessing deficiency on complexity and bit-rates when aggregating the machine-vision compression. Indeed, machine vision solely focuses on the core regions within the image/video, requiring much less information compared with the compressed information for human vision. In this paper, we thus set out the first successful attempt by a novel collaborative compression method based on the machine-vision-oriented compression, instead of human-vision pipeline. In other words, machine vision serves as the basis for human vision within collaborative compression. A plug-and-play variable bit-rate strategy is also developed for machine vision tasks. Then, we propose to progressively aggregate the semantics from the machine-vision compression, whilst seamlessly tailing the diffusion prior to restore high-fidelity details for human vision, thus named as diffusion-prior based feature compression for human and machine visions (Diff-FCHM). Experimental results verify the consistently superior performances of our Diff-FCHM, on both machine-vision and human-vision compression with remarkable margins. Our code will be released upon acceptance.

CVJan 26
REMAC: Reference-Based Martian Asymmetrical Image Compression

Qing Ding, Mai Xu, Shengxi Li et al.

To expedite space exploration on Mars, it is indispensable to develop an efficient Martian image compression method for transmitting images through the constrained Mars-to-Earth communication channel. Although the existing learned compression methods have achieved promising results for natural images from earth, there remain two critical issues that hinder their effectiveness for Martian image compression: 1) They overlook the highly-limited computational resources on Mars; 2) They do not utilize the strong \textit{inter-image} similarities across Martian images to advance image compression performance. Motivated by our empirical analysis of the strong \textit{intra-} and \textit{inter-image} similarities from the perspective of texture, color, and semantics, we propose a reference-based Martian asymmetrical image compression (REMAC) approach, which shifts computational complexity from the encoder to the resource-rich decoder and simultaneously improves compression performance. To leverage \textit{inter-image} similarities, we propose a reference-guided entropy module and a ref-decoder that utilize useful information from reference images, reducing redundant operations at the encoder and achieving superior compression performance. To exploit \textit{intra-image} similarities, the ref-decoder adopts a deep, multi-scale architecture with enlarged receptive field size to model long-range spatial dependencies. Additionally, we develop a latent feature recycling mechanism to further alleviate the extreme computational constraints on Mars. Experimental results show that REMAC reduces encoder complexity by 43.51\% compared to the state-of-the-art method, while achieving a BD-PSNR gain of 0.2664 dB.

CVNov 11, 2025
Burst Image Quality Assessment: A New Benchmark and Unified Framework for Multiple Downstream Tasks

Xiaoye Liang, Lai Jiang, Minglang Qiao et al.

In recent years, the development of burst imaging technology has improved the capture and processing capabilities of visual data, enabling a wide range of applications. However, the redundancy in burst images leads to the increased storage and transmission demands, as well as reduced efficiency of downstream tasks. To address this, we propose a new task of Burst Image Quality Assessment (BuIQA), to evaluate the task-driven quality of each frame within a burst sequence, providing reasonable cues for burst image selection. Specifically, we establish the first benchmark dataset for BuIQA, consisting of $7,346$ burst sequences with $45,827$ images and $191,572$ annotated quality scores for multiple downstream scenarios. Inspired by the data analysis, a unified BuIQA framework is proposed to achieve an efficient adaption for BuIQA under diverse downstream scenarios. Specifically, a task-driven prompt generation network is developed with heterogeneous knowledge distillation, to learn the priors of the downstream task. Then, the task-aware quality assessment network is introduced to assess the burst image quality based on the task prompt. Extensive experiments across 10 downstream scenarios demonstrate the impressive BuIQA performance of the proposed approach, outperforming the state-of-the-art. Furthermore, it can achieve $0.33$ dB PSNR improvement in the downstream tasks of denoising and super-resolution, by applying our approach to select the high-quality burst frames.

76.9CVMay 17
Degradation Frequency Curve: An Explicit Frequency-Quantified Representation for All-in-One Image Restoration

Xinghua Huang, Zhixiong Yang, Chen Wu et al.

A fundamental difficulty in all-in-one blind image restoration is that degradation is usually treated as an implicit factor hidden in degraded-to-clean mapping, rather than as an explicit object that can be measured and manipulated. This limitation becomes more pronounced under mixed, compound, or unseen degradation conditions, where degradation effects are hard to assign to predefined labels or task-specific parameters. We propose the Degradation Frequency Curve (DFC), a structured spectral representation that quantifies degradation responses by measuring band-wise residual-to-degraded energy ratios in the frequency domain. DFC converts visually entangled and hard-to-describe degradation effects into a measurable degradation coordinate space. Moreover, DFC can be adaptively decomposed into band-wise spectral tokens, allowing local degradation responses to be represented as reusable restoration priors. Based on this representation, we develop the DFC-guided Image Restorer (DFC-IR), a token-conditioned multi-scale framework that progressively estimates DFCs from intermediate restorations and uses the resulting spectral tokens to guide degradation-aware restoration in a coarse-to-fine manner. Extensive experiments on standard, composite, unseen, and real-world degradation benchmarks show that DFC provides an effective representation basis for all-in-one restoration, leading to state-of-the-art performance and improved generalization under complex degradation profiles.

CVDec 24, 2025
Benchmarking and Enhancing VLM for Compressed Image Understanding

Zifu Zhang, Tongda Xu, Siqi Li et al.

With the rapid development of Vision-Language Models (VLMs) and the growing demand for their applications, efficient compression of the image inputs has become increasingly important. Existing VLMs predominantly digest and understand high-bitrate compressed images, while their ability to interpret low-bitrate compressed images has yet to be explored by far. In this paper, we introduce the first comprehensive benchmark to evaluate the ability of VLM against compressed images, varying existing widely used image codecs and diverse set of tasks, encompassing over one million compressed images in our benchmark. Next, we analyse the source of performance gap, by categorising the gap from a) the information loss during compression and b) generalisation failure of VLM. We visualize these gaps with concrete examples and identify that for compressed images, only the generalization gap can be mitigated. Finally, we propose a universal VLM adaptor to enhance model performance on images compressed by existing codecs. Consequently, we demonstrate that a single adaptor can improve VLM performance across images with varying codecs and bitrates by 10%-30%. We believe that our benchmark and enhancement method provide valuable insights and contribute toward bridging the gap between VLMs and compressed images.

LGSep 19, 2024
CF-GO-Net: A Universal Distribution Learner via Characteristic Function Networks with Graph Optimizers

Zeyang Yu, Shengxi Li, Danilo Mandic

Generative models aim to learn the distribution of datasets, such as images, so as to be able to generate samples that statistically resemble real data. However, learning the underlying probability distribution can be very challenging and intractable. To this end, we introduce an approach which employs the characteristic function (CF), a probabilistic descriptor that directly corresponds to the distribution. However, unlike the probability density function (pdf), the characteristic function not only always exists, but also provides an additional degree of freedom, hence enhances flexibility in learning distributions. This removes the critical dependence on pdf-based assumptions, which limit the applicability of traditional methods. While several works have attempted to use CF in generative modeling, they often impose strong constraints on the training process. In contrast, our approach calculates the distance between query points in the CF domain, which is an unconstrained and well defined problem. Next, to deal with the sampling strategy, which is crucial to model performance, we propose a graph neural network (GNN)-based optimizer for the sampling process, which identifies regions where the difference between CFs is most significant. In addition, our method allows the use of a pre-trained model, such as a well-trained autoencoder, and is capable of learning directly in its feature space, without modifying its parameters. This offers a flexible and robust approach to generative modeling, not only provides broader applicability and improved performance, but also equips any latent space world with the ability to become a generative model.

CVFeb 24, 2025Code
Continuous Patch Stitching for Block-wise Image Compression

Zifu Zhang, Shengxi Li, Henan Liu et al.

Most recently, learned image compression methods have outpaced traditional hand-crafted standard codecs. However, their inference typically requires to input the whole image at the cost of heavy computing resources, especially for high-resolution image compression; otherwise, the block artefact can exist when compressed by blocks within existing learned image compression methods. To address this issue, we propose a novel continuous patch stitching (CPS) framework for block-wise image compression that is able to achieve seamlessly patch stitching and mathematically eliminate block artefact, thus capable of significantly reducing the required computing resources when compressing images. More specifically, the proposed CPS framework is achieved by padding-free operations throughout, with a newly established parallel overlapping stitching strategy to provide a general upper bound for ensuring the continuity. Upon this, we further propose functional residual blocks with even-sized kernels to achieve down-sampling and up-sampling, together with bottleneck residual blocks retaining feature size to increase network depth. Experimental results demonstrate that our CPS framework achieves the state-of-the-art performance against existing baselines, whilst requiring less than half of computing resources of existing models. Our code shall be released upon acceptance.

IVJun 13, 2024Code
Blind Super-Resolution via Meta-learning and Markov Chain Monte Carlo Simulation

Jingyuan Xia, Zhixiong Yang, Shengxi Li et al.

Learning-based approaches have witnessed great successes in blind single image super-resolution (SISR) tasks, however, handcrafted kernel priors and learning based kernel priors are typically required. In this paper, we propose a Meta-learning and Markov Chain Monte Carlo (MCMC) based SISR approach to learn kernel priors from organized randomness. In concrete, a lightweight network is adopted as kernel generator, and is optimized via learning from the MCMC simulation on random Gaussian distributions. This procedure provides an approximation for the rational blur kernel, and introduces a network-level Langevin dynamics into SISR optimization processes, which contributes to preventing bad local optimal solutions for kernel estimation. Meanwhile, a meta-learning-based alternating optimization procedure is proposed to optimize the kernel generator and image restorer, respectively. In contrast to the conventional alternating minimization strategy, a meta-learning-based framework is applied to learn an adaptive optimization strategy, which is less-greedy and results in better convergence performance. These two procedures are iteratively processed in a plug-and-play fashion, for the first time, realizing a learning-based but plug-and-play blind SISR solution in unsupervised inference. Extensive simulations demonstrate the superior performance and generalization ability of the proposed approach when comparing with state-of-the-arts on synthesis and real-world datasets. The code is available at https://github.com/XYLGroup/MLMC.

CVFeb 24, 2025Code
Hierarchical Semantic Compression for Consistent Image Semantic Restoration

Shengxi Li, Zifu Zhang, Mai Xu et al.

The emerging semantic compression has been receiving increasing research efforts most recently, capable of achieving high fidelity restoration during compression, even at extremely low bitrates. However, existing semantic compression methods typically combine standard pipelines with either pre-defined or high-dimensional semantics, thus suffering from deficiency in compression. To address this issue, we propose a novel hierarchical semantic compression (HSC) framework that purely operates within intrinsic semantic spaces from generative models, which is able to achieve efficient compression for consistent semantic restoration. More specifically, we first analyse the entropy models for the semantic compression, which motivates us to employ a hierarchical architecture based on a newly developed general inversion encoder. Then, we propose the feature compression network (FCN) and semantic compression network (SCN), such that the middle-level semantic feature and core semantics are hierarchically compressed to restore both accuracy and consistency of image semantics, via an entropy model progressively shared by channel-wise context. Experimental results demonstrate that the proposed HSC framework achieves the state-of-the-art performance on subjective quality and consistency for human vision, together with superior performances on machine vision tasks given compressed bitstreams. This essentially coincides with human visual system in understanding images, thus providing a new framework for future image/video compression paradigms. Our code shall be released upon acceptance.

CVNov 18, 2021Code
Blind VQA on 360° Video via Progressively Learning from Pixels, Frames and Video

Li Yang, Mai Xu, Shengxi Li et al.

Blind visual quality assessment (BVQA) on 360{\textdegree} video plays a key role in optimizing immersive multimedia systems. When assessing the quality of 360{\textdegree} video, human tends to perceive its quality degradation from the viewport-based spatial distortion of each spherical frame to motion artifact across adjacent frames, ending with the video-level quality score, i.e., a progressive quality assessment paradigm. However, the existing BVQA approaches for 360{\textdegree} video neglect this paradigm. In this paper, we take into account the progressive paradigm of human perception towards spherical video quality, and thus propose a novel BVQA approach (namely ProVQA) for 360{\textdegree} video via progressively learning from pixels, frames and video. Corresponding to the progressive learning of pixels, frames and video, three sub-nets are designed in our ProVQA approach, i.e., the spherical perception aware quality prediction (SPAQ), motion perception aware quality prediction (MPAQ) and multi-frame temporal non-local (MFTN) sub-nets. The SPAQ sub-net first models the spatial quality degradation based on spherical perception mechanism of human. Then, by exploiting motion cues across adjacent frames, the MPAQ sub-net properly incorporates motion contextual information for quality assessment on 360{\textdegree} video. Finally, the MFTN sub-net aggregates multi-frame quality degradation to yield the final quality score, via exploring long-term quality correlation from multiple frames. The experiments validate that our approach significantly advances the state-of-the-art BVQA performance on 360{\textdegree} video over two datasets, the code of which has been public in \url{https://github.com/yanglixiaoshen/ProVQA.}

CVNov 3, 2025
Luminance-Aware Statistical Quantization: Unsupervised Hierarchical Learning for Illumination Enhancement

Derong Kong, Zhixiong Yang, Shengxi Li et al.

Low-light image enhancement (LLIE) faces persistent challenges in balancing reconstruction fidelity with cross-scenario generalization. While existing methods predominantly focus on deterministic pixel-level mappings between paired low/normal-light images, they often neglect the continuous physical process of luminance transitions in real-world environments, leading to performance drop when normal-light references are unavailable. Inspired by empirical analysis of natural luminance dynamics revealing power-law distributed intensity transitions, this paper introduces Luminance-Aware Statistical Quantification (LASQ), a novel framework that reformulates LLIE as a statistical sampling process over hierarchical luminance distributions. Our LASQ re-conceptualizes luminance transition as a power-law distribution in intensity coordinate space that can be approximated by stratified power functions, therefore, replacing deterministic mappings with probabilistic sampling over continuous luminance layers. A diffusion forward process is designed to autonomously discover optimal transition paths between luminance layers, achieving unsupervised distribution emulation without normal-light references. In this way, it considerably improves the performance in practical situations, enabling more adaptable and versatile light restoration. This framework is also readily applicable to cases with normal-light references, where it achieves superior performance on domain-specific datasets alongside better generalization-ability across non-reference datasets.

CVDec 20, 2024
InstructOCR: Instruction Boosting Scene Text Spotting

Chen Duan, Qianyi Jiang, Pei Fu et al.

In the field of scene text spotting, previous OCR methods primarily relied on image encoders and pre-trained text information, but they often overlooked the advantages of incorporating human language instructions. To address this gap, we propose InstructOCR, an innovative instruction-based scene text spotting model that leverages human language instructions to enhance the understanding of text within images. Our framework employs both text and image encoders during training and inference, along with instructions meticulously designed based on text attributes. This approach enables the model to interpret text more accurately and flexibly. Extensive experiments demonstrate the effectiveness of our model and we achieve state-of-the-art results on widely used benchmarks. Furthermore, the proposed framework can be seamlessly applied to scene text VQA tasks. By leveraging instruction strategies during pre-training, the performance on downstream VQA tasks can be significantly improved, with a 2.6% increase on the TextVQA dataset and a 2.1% increase on the ST-VQA dataset. These experimental results provide insights into the benefits of incorporating human language instructions for OCR-related tasks.

CVFeb 27, 2024
Enhancing Quality of Compressed Images by Mitigating Enhancement Bias Towards Compression Domain

Qunliang Xing, Mai Xu, Shengxi Li et al.

Existing quality enhancement methods for compressed images focus on aligning the enhancement domain with the raw domain to yield realistic images. However, these methods exhibit a pervasive enhancement bias towards the compression domain, inadvertently regarding it as more realistic than the raw domain. This bias makes enhanced images closely resemble their compressed counterparts, thus degrading their perceptual quality. In this paper, we propose a simple yet effective method to mitigate this bias and enhance the quality of compressed images. Our method employs a conditional discriminator with the compressed image as a key condition, and then incorporates a domain-divergence regularization to actively distance the enhancement domain from the compression domain. Through this dual strategy, our method enables the discrimination against the compression domain, and brings the enhancement domain closer to the raw domain. Comprehensive quality evaluations confirm the superiority of our method over other state-of-the-art methods without incurring inference overheads.

MLMar 14, 2021
Von Mises-Fisher Elliptical Distribution

Shengxi Li, Danilo Mandic

A large class of modern probabilistic learning systems assumes symmetric distributions, however, real-world data tend to obey skewed distributions and are thus not always adequately modelled through symmetric distributions. To address this issue, elliptical distributions are increasingly used to generalise symmetric distributions, and further improvements to skewed elliptical distributions have recently attracted much attention. However, existing approaches are either hard to estimate or have complicated and abstract representations. To this end, we propose to employ the von-Mises-Fisher (vMF) distribution to obtain an explicit and simple probability representation of the skewed elliptical distribution. This is shown not only to allow us to deal with non-symmetric learning systems, but also to provide a physically meaningful way of generalising skewed distributions. For rigour, our extension is proved to share important and desirable properties with its symmetric counterpart. We also demonstrate that the proposed vMF distribution is both easy to generate and stable to estimate, both theoretically and through examples.

LGSep 9, 2020
Meta-learning based Alternating Minimization Algorithm for Non-convex Optimization

Jingyuan Xia, Shengxi Li, Jun-Jie Huang et al.

In this paper, we propose a novel solution for non-convex problems of multiple variables, especially for those typically solved by an alternating minimization (AM) strategy that splits the original optimization problem into a set of sub-problems corresponding to each variable, and then iteratively optimize each sub-problem using a fixed updating rule. However, due to the intrinsic non-convexity of the original optimization problem, the optimization can usually be trapped into spurious local minimum even when each sub-problem can be optimally solved at each iteration. Meanwhile, learning-based approaches, such as deep unfolding algorithms, are highly limited by the lack of labelled data and restricted explainability. To tackle these issues, we propose a meta-learning based alternating minimization (MLAM) method, which aims to minimize a partial of the global losses over iterations instead of carrying minimization on each sub-problem, and it tends to learn an adaptive strategy to replace the handcrafted counterpart resulting in advance on superior performance. Meanwhile, the proposed MLAM still maintains the original algorithmic principle, which contributes to a better interpretability. We evaluate the proposed method on two representative problems, namely, bi-linear inverse problem: matrix completion, and non-linear problem: Gaussian mixture models. The experimental results validate that our proposed approach outperforms AM-based methods in standard settings, and is able to achieve effective optimization in challenging cases while other comparing methods would typically fail.

LGJun 15, 2020
Reciprocal Adversarial Learning via Characteristic Functions

Shengxi Li, Zeyang Yu, Min Xiang et al.

Generative adversarial nets (GANs) have become a preferred tool for tasks involving complicated distributions. To stabilise the training and reduce the mode collapse of GANs, one of their main variants employs the integral probability metric (IPM) as the loss function. This provides extensive IPM-GANs with theoretical support for basically comparing moments in an embedded domain of the \textit{critic}. We generalise this by comparing the distributions rather than their moments via a powerful tool, i.e., the characteristic function (CF), which uniquely and universally comprising all the information about a distribution. For rigour, we first establish the physical meaning of the phase and amplitude in CF, and show that this provides a feasible way of balancing the accuracy and diversity of generation. We then develop an efficient sampling strategy to calculate the CFs. Within this framework, we further prove an equivalence between the embedded and data domains when a reciprocal exists, where we naturally develop the GAN in an auto-encoder structure, in a way of comparing everything in the embedded space (a semantically meaningful manifold). This efficient structure uses only two modules, together with a simple training strategy, to achieve bi-directionally generating clear images, which is referred to as the reciprocal CF GAN (RCF-GAN). Experimental results demonstrate the superior performances of the proposed RCF-GAN in terms of both generation and reconstruction.

ITJan 2, 2020
Graph Signal Processing -- Part III: Machine Learning on Graphs, from Graph Topology to Applications

Ljubisa Stankovic, Danilo Mandic, Milos Dakovic et al.

Many modern data analytics applications on graphs operate on domains where graph topology is not known a priori, and hence its determination becomes part of the problem definition, rather than serving as prior knowledge which aids the problem solution. Part III of this monograph starts by addressing ways to learn graph topology, from the case where the physics of the problem already suggest a possible topology, through to most general cases where the graph topology is learned from the data. A particular emphasis is on graph topology definition based on the correlation and precision matrices of the observed data, combined with additional prior knowledge and structural conditions, such as the smoothness or sparsity of graph connections. For learning sparse graphs (with small number of edges), the least absolute shrinkage and selection operator, known as LASSO is employed, along with its graph specific variant, graphical LASSO. For completeness, both variants of LASSO are derived in an intuitive way, and explained. An in-depth elaboration of the graph topology learning paradigm is provided through several examples on physically well defined graphs, such as electric circuits, linear heat transfer, social and computer networks, and spring-mass systems. As many graph neural networks (GNN) and convolutional graph networks (GCN) are emerging, we have also reviewed the main trends in GNNs and GCNs, from the perspective of graph signal filtering. Tensor representation of lattice-structured graphs is next considered, and it is shown that tensors (multidimensional data arrays) are a special class of graph signals, whereby the graph vertices reside on a high-dimensional regular lattice structure. This part of monograph concludes with two emerging applications in financial data processing and underground transportation networks modeling.

LGJun 9, 2019
Solving general elliptical mixture models through an approximate Wasserstein manifold

Shengxi Li, Zeyang Yu, Min Xiang et al.

We address the estimation problem for general finite mixture models, with a particular focus on the elliptical mixture models (EMMs). Compared to the widely adopted Kullback-Leibler divergence, we show that the Wasserstein distance provides a more desirable optimisation space. We thus provide a stable solution to the EMMs that is both robust to initialisations and reaches a superior optimum by adaptively optimising along a manifold of an approximate Wasserstein distance. To this end, we first provide a unifying account of computable and identifiable EMMs, which serves as a basis to rigorously address the underpinning optimisation problem. Due to a probability constraint, solving this problem is extremely cumbersome and unstable, especially under the Wasserstein distance. To relieve this issue, we introduce an efficient optimisation method on a statistical manifold defined under an approximate Wasserstein distance, which allows for explicit metrics and computable operations, thus significantly stabilising and improving the EMM estimation. We further propose an adaptive method to accelerate the convergence. Experimental results demonstrate the excellent performance of the proposed EMM solver.

NEMar 5, 2019
Widely Linear Complex-valued Autoencoder: Dealing with Noncircularity in Generative-Discriminative Models

Zeyang Yu, Shengxi Li, Danilo Mandic

We propose a new structure for the complex-valued autoencoder by introducing additional degrees of freedom into its design through a widely linear (WL) transform. The corresponding widely linear backpropagation algorithm is also developed using the $\mathbb{CR}$ calculus, to unify the gradient calculation of the cost function and the underlying WL model. More specifically, all the existing complex-valued autoencoders employ the strictly linear transform, which is optimal only when the complex-valued outputs of each network layer are independent of the conjugate of the inputs. In addition, the widely linear model which underpins our work allows us to consider all the second-order statistics of inputs. This provides more freedom in the design and enhanced optimization opportunities, as compared to the state-of-the-art. Furthermore, we show that the most widely adopted cost function, i.e., the mean squared error, is not best suited for the complex domain, as it is a real quantity with a single degree of freedom, while both the phase and the amplitude information need to be optimized. To resolve this issue, we design a new cost function, which is capable of controlling the balance between the phase and the amplitude contribution to the solution. The experimental results verify the superior performance of the proposed autoencoder together with the new cost function, especially for the imaging scenarios where the phase preserves extensive information on edges and shapes.

LGMay 21, 2018
A universal framework for learning the elliptical mixture model

Shengxi Li, Zeyang Yu, Danilo Mandic

Mixture modelling using elliptical distributions promises enhanced robustness, flexibility and stability over the widely employed Gaussian mixture model (GMM). However, existing studies based on the elliptical mixture model (EMM) are restricted to several specific types of elliptical probability density functions, which are not supported by general solutions or systematic analysis frameworks; this significantly limits the rigour and the power of EMMs in applications. To this end, we propose a novel general framework for estimating and analysing the EMMs, achieved through Riemannian manifold optimisation. First, we investigate the relationships between Riemannian manifolds and elliptical distributions, and the so established connection between the original manifold and a reformulated one indicates a mismatch between those manifolds, the major cause of failure of the existing optimisation for solving general EMMs. We next propose a universal solver which is based on the optimisation of a re-designed cost and prove the existence of the same optimum as in the original problem; this is achieved in a simple, fast and stable way. We further calculate the influence functions of the EMM as theoretical bounds to quantify robustness to outliers. Comprehensive numerical results demonstrate the ability of the proposed framework to accommodate EMMs with different properties of individual functions in a stable way and with fast convergence speed. Finally, the enhanced robustness and flexibility of the proposed framework over the standard GMM are demonstrated both analytically and through comprehensive simulations.

MMOct 27, 2017
Watching Videos with Certain and Constant Quality: PID-based Quality Control Method

Yuhang Song, Mai Xu, Shengxi Li

In video coding, compressed videos with certain and constant quality can ensure quality of experience (QoE). To this end, we propose in this paper a novel PID-based quality control (PQC) method for video coding. Specifically, a formulation is modelled to control quality of video coding with two objectives: minimizing control error and quality fluctuation. Then, we apply the Laplace domain analysis to model the relationship between quantization parameter (QP) and control error in this formulation. Given the relationship between QP and control error, we propose a solution to the PQC formulation, such that videos can be compressed at certain and constant quality. Finally, experimental results show that our PQC method is effective in both control accuracy and quality fluctuation.