Duolikun Danier

CV
Semantic Scholar Profile
h-index50
18papers
287citations
Novelty46%
AI Score54

18 Papers

IVMar 16, 2023Code
LDMVFI: Video Frame Interpolation with Latent Diffusion Models

Duolikun Danier, Fan Zhang, David Bull

Existing works on video frame interpolation (VFI) mostly employ deep neural networks that are trained by minimizing the L1, L2, or deep feature space distance (e.g. VGG loss) between their outputs and ground-truth frames. However, recent works have shown that these metrics are poor indicators of perceptual VFI quality. Towards developing perceptually-oriented VFI methods, in this work we propose latent diffusion model-based VFI, LDMVFI. This approaches the VFI problem from a generative perspective by formulating it as a conditional generation problem. As the first effort to address VFI using latent diffusion models, we rigorously benchmark our method on common test sets used in the existing VFI literature. Our quantitative experiments and user study indicate that LDMVFI is able to interpolate video content with favorable perceptual quality compared to the state of the art, even in the high-resolution regime. Our code is available at https://github.com/danier97/LDMVFI.

LGOct 31, 2023Code
Score Normalization for a Faster Diffusion Exponential Integrator Sampler

Guoxuan Xia, Duolikun Danier, Ayan Das et al. · cambridge

Recently, Zhang et al. have proposed the Diffusion Exponential Integrator Sampler (DEIS) for fast generation of samples from Diffusion Models. It leverages the semi-linear nature of the probability flow ordinary differential equation (ODE) in order to greatly reduce integration error and improve generation quality at low numbers of function evaluations (NFEs). Key to this approach is the score function reparameterisation, which reduces the integration error incurred from using a fixed score function estimate over each integration step. The original authors use the default parameterisation used by models trained for noise prediction -- multiply the score by the standard deviation of the conditional forward noising distribution. We find that although the mean absolute value of this score parameterisation is close to constant for a large portion of the reverse sampling process, it changes rapidly at the end of sampling. As a simple fix, we propose to instead reparameterise the score (at inference) by dividing it by the average absolute value of previous score estimates at that time step collected from offline high NFE generations. We find that our score normalisation (DEIS-SN) consistently improves FID compared to vanilla DEIS, showing an improvement at 10 NFEs from 6.44 to 5.57 on CIFAR-10 and from 5.9 to 4.95 on LSUN-Church 64x64. Our code is available at https://github.com/mtkresearch/Diffusion-DEIS-SN

IVOct 3, 2022Code
BVI-VFI: A Video Quality Database for Video Frame Interpolation

Duolikun Danier, Fan Zhang, David Bull

Video frame interpolation (VFI) is a fundamental research topic in video processing, which is currently attracting increased attention across the research community. While the development of more advanced VFI algorithms has been extensively researched, there remains little understanding of how humans perceive the quality of interpolated content and how well existing objective quality assessment methods perform when measuring the perceived quality. In order to narrow this research gap, we have developed a new video quality database named BVI-VFI, which contains 540 distorted sequences generated by applying five commonly used VFI algorithms to 36 diverse source videos with various spatial resolutions and frame rates. We collected more than 10,800 quality ratings for these videos through a large scale subjective study involving 189 human subjects. Based on the collected subjective scores, we further analysed the influence of VFI algorithms and frame rates on the perceptual quality of interpolated videos. Moreover, we benchmarked the performance of 33 classic and state-of-the-art objective image/video quality metrics on the new database, and demonstrated the urgent requirement for more accurate bespoke quality assessment methods for VFI. To facilitate further research in this area, we have made BVI-VFI publicly available at https://github.com/danier97/BVI-VFI-database.

IVJul 17, 2022
FloLPIPS: A Bespoke Video Quality Metric for Frame Interpoation

Duolikun Danier, Fan Zhang, David Bull

Video frame interpolation (VFI) serves as a useful tool for many video processing applications. Recently, it has also been applied in the video compression domain for enhancing both conventional video codecs and learning-based compression architectures. While there has been an increased focus on the development of enhanced frame interpolation algorithms in recent years, the perceptual quality assessment of interpolated content remains an open field of research. In this paper, we present a bespoke full reference video quality metric for VFI, FloLPIPS, that builds on the popular perceptual image quality metric, LPIPS, which captures the perceptual degradation in extracted image feature space. In order to enhance the performance of LPIPS for evaluating interpolated content, we re-designed its spatial feature aggregation step by using the temporal distortion (through comparing optical flows) to weight the feature difference maps. Evaluated on the BVI-VFI database, which contains 180 test sequences with various frame interpolation artefacts, FloLPIPS shows superior correlation performance (with statistical significance) with subjective ground truth over 12 popular quality assessors. To facilitate further research in VFI quality assessment, our code is publicly available at https://danier97.github.io/FloLPIPS.

AIFeb 12Code
SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation

Chengxi Zeng, Yuxuan Jiang, Ge Gao et al.

Vision-language segmentation models such as SAM3 enable flexible, prompt-driven visual grounding, but inherit large, general-purpose text encoders originally designed for open-ended language understanding. In practice, segmentation prompts are short, structured, and semantically constrained, leading to substantial over-provisioning in text encoder capacity and persistent computational and memory overhead. In this paper, we perform a large-scale anatomical analysis of text prompting in vision-language segmentation, covering 404,796 real prompts across multiple benchmarks. Our analysis reveals severe redundancy: most context windows are underutilized, vocabulary usage is highly sparse, and text embeddings lie on low-dimensional manifold despite high-dimensional representations. Motivated by these findings, we propose SAM3-LiteText, a lightweight text encoding framework that replaces the original SAM3 text encoder with a compact MobileCLIP student that is optimized by knowledge distillation. Extensive experiments on image and video segmentation benchmarks show that SAM3-LiteText reduces text encoder parameters by up to 88%, substantially reducing static memory footprint, while maintaining segmentation performance comparable to the original model. Code: https://github.com/SimonZeng7108/efficientsam3/tree/sam3_litetext.

RONov 13, 2025
Attentive Feature Aggregation or: How Policies Learn to Stop Worrying about Robustness and Attend to Task-Relevant Visual Cues

Nikolaos Tsagkas, Andreas Sochopoulos, Duolikun Danier et al.

The adoption of pre-trained visual representations (PVRs), leveraging features from large-scale vision models, has become a popular paradigm for training visuomotor policies. However, these powerful representations can encode a broad range of task-irrelevant scene information, making the resulting trained policies vulnerable to out-of-domain visual changes and distractors. In this work we address visuomotor policy feature pooling as a solution to the observed lack of robustness in perturbed scenes. We achieve this via Attentive Feature Aggregation (AFA), a lightweight, trainable pooling mechanism that learns to naturally attend to task-relevant visual cues, ignoring even semantically rich scene distractors. Through extensive experiments in both simulation and the real world, we demonstrate that policies trained with AFA significantly outperform standard pooling approaches in the presence of visual perturbations, without requiring expensive dataset augmentation or fine-tuning of the PVR. Our findings show that ignoring extraneous visual information is a crucial step towards deploying robust and generalisable visuomotor policies. Project Page: tsagkas.github.io/afa

CVDec 17, 2025
MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors

Zhipeng Du, Duolikun Danier, Jan Eric Lenssen et al.

In this paper, we focus on online zero-shot monocular 3D instance segmentation, a novel practical setting where existing approaches fail to perform because they rely on posed RGB-D sequences. To overcome this limitation, we leverage CUT3R, a recent Reconstructive Foundation Model (RFM), to provide reliable geometric priors from a single RGB stream. We propose MoonSeg3R, which introduces three key components: (1) a self-supervised query refinement module with spatial-semantic distillation that transforms segmentation masks from 2D visual foundation models (VFMs) into discriminative 3D queries; (2) a 3D query index memory that provides temporal consistency by retrieving contextual queries; and (3) a state-distribution token from CUT3R that acts as a mask identity descriptor to strengthen cross-frame fusion. Experiments on ScanNet200 and SceneNN show that MoonSeg3R is the first method to enable online monocular 3D segmentation and achieves performance competitive with state-of-the-art RGB-D-based systems. Code and models will be released.

CVNov 10, 2025
GFix: Perceptually Enhanced Gaussian Splatting Video Compression

Siyue Teng, Ge Gao, Duolikun Danier et al.

3D Gaussian Splatting (3DGS) enhances 3D scene reconstruction through explicit representation and fast rendering, demonstrating potential benefits for various low-level vision tasks, including video compression. However, existing 3DGS-based video codecs generally exhibit more noticeable visual artifacts and relatively low compression ratios. In this paper, we specifically target the perceptual enhancement of 3DGS-based video compression, based on the assumption that artifacts from 3DGS rendering and quantization resemble noisy latents sampled during diffusion training. Building on this premise, we propose a content-adaptive framework, GFix, comprising a streamlined, single-step diffusion model that serves as an off-the-shelf neural enhancer. Moreover, to increase compression efficiency, We propose a modulated LoRA scheme that freezes the low-rank decompositions and modulates the intermediate hidden states, thereby achieving efficient adaptation of the diffusion backbone with highly compressible updates. Experimental results show that GFix delivers strong perceptual quality enhancement, outperforming GSVC with up to 72.1% BD-rate savings in LPIPS and 21.4% in FID.

CVNov 26, 2024
DepthCues: Evaluating Monocular Depth Perception in Large Vision Models

Duolikun Danier, Mehmet Aygün, Changjian Li et al.

Large-scale pre-trained vision models are becoming increasingly prevalent, offering expressive and generalizable visual representations that benefit various downstream tasks. Recent studies on the emergent properties of these models have revealed their high-level geometric understanding, in particular in the context of depth perception. However, it remains unclear how depth perception arises in these models without explicit depth supervision provided during pre-training. To investigate this, we examine whether the monocular depth cues, similar to those used by the human visual system, emerge in these models. We introduce a new benchmark, DepthCues, designed to evaluate depth cue understanding, and present findings across 20 diverse and representative pre-trained vision models. Our analysis shows that human-like depth cues emerge in more recent larger models. We also explore enhancing depth perception in large vision models by fine-tuning on DepthCues, and find that even without dense depth supervision, this improves depth estimation. To support further research, our benchmark and evaluation code will be made publicly available for studying depth perception in vision models.

CVDec 14, 2023
BVI-Artefact: An Artefact Detection Benchmark Dataset for Streamed Videos

Chen Feng, Duolikun Danier, Fan Zhang et al.

Professionally generated content (PGC) streamed online can contain visual artefacts that degrade the quality of user experience. These artefacts arise from different stages of the streaming pipeline, including acquisition, post-production, compression, and transmission. To better guide streaming experience enhancement, it is important to detect specific artefacts at the user end in the absence of a pristine reference. In this work, we address the lack of a comprehensive benchmark for artefact detection within streamed PGC, via the creation and validation of a large database, BVI-Artefact. Considering the ten most relevant artefact types encountered in video streaming, we collected and generated 480 video sequences, each containing various artefacts with associated binary artefact labels. Based on this new database, existing artefact detection methods are benchmarked, with results showing the challenging nature of this tasks and indicating the requirement of more reliable artefact detection methods. To facilitate further research in this area, we have made BVI-Artifact publicly available at https://chenfeng-bristol.github.io/BVI-Artefact/

ROFeb 5, 2025
The Temporal Trap: Entanglement in Pre-Trained Visual Representations for Visuomotor Policy Learning

Nikolaos Tsagkas, Andreas Sochopoulos, Duolikun Danier et al.

The integration of pre-trained visual representations (PVRs) has significantly advanced visuomotor policy learning. However, effectively leveraging these models remains a challenge. We identify temporal entanglement as a critical, inherent issue when using these time-invariant models in sequential decision-making tasks. This entanglement arises because PVRs, optimised for static image understanding, struggle to represent the temporal dependencies crucial for visuomotor control. In this work, we quantify the impact of temporal entanglement, demonstrating a strong correlation between a policy's success rate and the ability of its latent space to capture task-progression cues. Based on these insights, we propose a simple, yet effective disentanglement baseline designed to mitigate temporal entanglement. Our empirical results show that traditional methods aimed at enriching features with temporal components are insufficient on their own, highlighting the necessity of explicitly addressing temporal disentanglement for robust visuomotor policy learning.

IVMay 14, 2024
RMT-BVQA: Recurrent Memory Transformer-based Blind Video Quality Assessment for Enhanced Video Content

Tianhao Peng, Chen Feng, Duolikun Danier et al.

With recent advances in deep learning, numerous algorithms have been developed to enhance video quality, reduce visual artifacts, and improve perceptual quality. However, little research has been reported on the quality assessment of enhanced content - the evaluation of enhancement methods is often based on quality metrics that were designed for compression applications. In this paper, we propose a novel blind deep video quality assessment (VQA) method specifically for enhanced video content. It employs a new Recurrent Memory Transformer (RMT) based network architecture to obtain video quality representations, which is optimized through a novel content-quality-aware contrastive learning strategy based on a new database containing 13K training patches with enhanced content. The extracted quality representations are then combined through linear regression to generate video-level quality indices. The proposed method, RMT-BVQA, has been evaluated on the VDPVE (VQA Dataset for Perceptual Video Enhancement) database through a five-fold cross validation. The results show its superior correlation performance when compared to ten existing no-reference quality metrics.

IVDec 14, 2023
RankDVQA-mini: Knowledge Distillation-Driven Deep Video Quality Assessment

Chen Feng, Duolikun Danier, Haoran Wang et al.

Deep learning-based video quality assessment (deep VQA) has demonstrated significant potential in surpassing conventional metrics, with promising improvements in terms of correlation with human perception. However, the practical deployment of such deep VQA models is often limited due to their high computational complexity and large memory requirements. To address this issue, we aim to significantly reduce the model size and runtime of one of the state-of-the-art deep VQA methods, RankDVQA, by employing a two-phase workflow that integrates pruning-driven model compression with multi-level knowledge distillation. The resulting lightweight full reference quality metric, RankDVQA-mini, requires less than 10% of the model parameters compared to its full version (14% in terms of FLOPs), while still retaining a quality prediction performance that is superior to most existing deep VQA methods. The source code of the RankDVQA-mini has been released at https://chenfeng-bristol.github.io/RankDVQA-mini/ for public evaluation.

CVNov 24, 2025
View-Consistent Diffusion Representations for 3D-Consistent Video Generation

Duolikun Danier, Ge Gao, Steven McDonagh et al.

Video generation models have made significant progress in generating realistic content, enabling applications in simulation, gaming, and film making. However, current generated videos still contain visual artifacts arising from 3D inconsistencies, e.g., objects and structures deforming under changes in camera pose, which can undermine user experience and simulation fidelity. Motivated by recent findings on representation alignment for diffusion models, we hypothesize that improving the multi-view consistency of video diffusion representations will yield more 3D-consistent video generation. Through detailed analysis on multiple recent camera-controlled video diffusion models we reveal strong correlations between 3D-consistent representations and videos. We also propose ViCoDR, a new approach for improving the 3D consistency of video models by learning multi-view consistent diffusion representations. We evaluate ViCoDR on camera controlled image-to-video, text-to-video, and multi-view generation models, demonstrating significant improvements in the 3D consistency of the generated videos. Project page: https://danier97.github.io/ViCoDR.

CVFeb 15, 2022
Enhancing Deformable Convolution based Video Frame Interpolation with Coarse-to-fine 3D CNN

Duolikun Danier, Fan Zhang, David Bull

This paper presents a new deformable convolution-based video frame interpolation (VFI) method, using a coarse to fine 3D CNN to enhance the multi-flow prediction. This model first extracts spatio-temporal features at multiple scales using a 3D CNN, and estimates multi-flows using these features in a coarse-to-fine manner. The estimated multi-flows are then used to warp the original input frames as well as context maps, and the warped results are fused by a synthesis network to produce the final output. This VFI approach has been fully evaluated against 12 state-of-the-art VFI methods on three commonly used test databases. The results evidently show the effectiveness of the proposed method, which offers superior interpolation performance over other state of the art algorithms, with PSNR gains up to 0.19dB.

IVFeb 15, 2022
A Subjective Quality Study for Video Frame Interpolation

Duolikun Danier, Fan Zhang, David Bull

Video frame interpolation (VFI) is one of the fundamental research areas in video processing and there has been extensive research on novel and enhanced interpolation algorithms. The same is not true for quality assessment of the interpolated content. In this paper, we describe a subjective quality study for VFI based on a newly developed video database, BVI-VFI. BVI-VFI contains 36 reference sequences at three different frame rates and 180 distorted videos generated using five conventional and learning based VFI algorithms. Subjective opinion scores have been collected from 60 human participants, and then employed to evaluate eight popular quality metrics, including PSNR, SSIM and LPIPS which are all commonly used for assessing VFI methods. The results indicate that none of these metrics provide acceptable correlation with the perceived quality on interpolated content, with the best-performing metric, LPIPS, offering a SROCC value below 0.6. Our findings show that there is an urgent need to develop a bespoke perceptual quality metric for VFI. The BVI-VFI dataset is publicly available and can be accessed at https://danier97.github.io/BVI-VFI/.

CVNov 30, 2021
ST-MFNet: A Spatio-Temporal Multi-Flow Network for Frame Interpolation

Duolikun Danier, Fan Zhang, David Bull

Video frame interpolation (VFI) is currently a very active research topic, with applications spanning computer vision, post production and video encoding. VFI can be extremely challenging, particularly in sequences containing large motions, occlusions or dynamic textures, where existing approaches fail to offer perceptually robust interpolation performance. In this context, we present a novel deep learning based VFI method, ST-MFNet, based on a Spatio-Temporal Multi-Flow architecture. ST-MFNet employs a new multi-scale multi-flow predictor to estimate many-to-one intermediate flows, which are combined with conventional one-to-one optical flows to capture both large and complex motions. In order to enhance interpolation performance for various textures, a 3D CNN is also employed to model the content dynamics over an extended temporal window. Moreover, ST-MFNet has been trained within an ST-GAN framework, which was originally developed for texture synthesis, with the aim of further improving perceptual interpolation quality. Our approach has been comprehensively evaluated -- compared with fourteen state-of-the-art VFI algorithms -- clearly demonstrating that ST-MFNet consistently outperforms these benchmarks on varied and representative test datasets, with significant gains up to 1.09dB in PSNR for cases including large motions and dynamic textures. Project page: https://danielism97.github.io/ST-MFNet.

IVFeb 26, 2021
Texture-aware Video Frame Interpolation

Duolikun Danier, David Bull

Temporal interpolation has the potential to be a powerful tool for video compression. Existing methods for frame interpolation do not discriminate between video textures and generally invoke a single general model capable of interpolating a wide range of video content. However, past work on video texture analysis and synthesis has shown that different textures exhibit vastly different motion characteristics and they can be divided into three classes (static, dynamic continuous and dynamic discrete). In this work, we study the impact of video textures on video frame interpolation, and propose a novel framework where, given an interpolation algorithm, separate models are trained on different textures. Our study shows that video texture has significant impact on the performance of frame interpolation models and it is beneficial to have separate models specifically adapted to these texture classes, instead of training a single model that tries to learn generic motion. Our results demonstrate that models fine-tuned using our framework achieve, on average, a 0.3dB gain in PSNR on the test set used.