Dong Liu

CV
h-index: 98
196 papers
16,995 citations
Novelty: 50%
AI Score: 62

196 Papers

68.0 RO Jun 3 Code
OLIVE: Online Low-Rank Incremental Learning for Efficient Adaptive Exoskeletons

Dong Liu, Yanxuan Yu, Ben Lengerich et al.

Wearable exoskeleton systems hold promise for restoring mobility in individuals with physical impairments, yet most existing controllers rely on static gait policies that lack the ability to adapt to dynamic real-world environments or individual user characteristics. We present OLIVE (Online Low-rank Incremental Learning for Efficient Adaptive Exoskeletons), a parameter-efficient online adaptation framework that continuously personalizes exoskeleton control during deployment. OLIVE decomposes the adaptive component of the control policy into a low-rank residual form $\Delta W = A_t B_t^\top$ with rank $r \ll \min(d,k)$, reducing online update cost from $\mathcal{O}(dk)$ to $\mathcal{O}(r(d+k))$ while preserving the stability of a pretrained base controller $W_0$. Parameters are updated via a reward-shaped policy gradient driven purely by on-body sensor feedback (EMG, IMU, vibration), eliminating dependence on offline reference trajectories. A gating mechanism modulates the strength of personalization based on contextual state, and a dynamic rank scheduler adapts the update dimensionality to terrain complexity (allocating minimal capacity on simple flat terrain and expanding to higher-rank updates on demanding uneven surfaces), enabling robust performance across diverse activities: flat walking, stair navigation, slopes, and uneven terrain. Experiments on the wearable platform demonstrate that OLIVE achieves +13, +22, and +15 percentage-point improvements in gait smoothness, effort reduction, and motion stability over the strongest baseline, converging within about 1,800 walking steps at 7.4 ms end-to-end latency. Our code implementation is available at https://github.com/FastLM/OLIVE.
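The cost reduction claimed for the low-rank residual can be illustrated with a small sketch. This is not the authors' implementation: the function names and the plain-list matrix layout are assumptions made for illustration, and only the forward pass is shown (the reward-shaped policy-gradient update, gating, and rank scheduling are omitted).

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def low_rank_forward(W0, A, B, x):
    """Compute (W0 + A B^T) x without forming the d-by-k product A B^T.

    W0: d x k frozen base weights, A: d x r, B: k x r, x: length-k input.
    (Hypothetical helper; names are illustrative only.)
    """
    base = matvec(W0, x)  # output of the frozen base controller
    r = len(A[0])
    # z = B^T x, a length-r vector
    z = [sum(B[j][c] * x[j] for j in range(len(x))) for c in range(r)]
    residual = matvec(A, z)  # A z, a length-d vector
    return [b + v for b, v in zip(base, residual)]


def adaptive_param_counts(d, k, r):
    """Return (full, low_rank) counts of adaptable parameters."""
    return d * k, r * (d + k)


W0 = [[1, 0], [0, 1], [1, 1]]   # d = 3, k = 2
A = [[1], [2], [3]]             # d x r with r = 1
B = [[1], [1]]                  # k x r
print(low_rank_forward(W0, A, B, [1, 2]))
print(adaptive_param_counts(64, 32, 4))
```

With d = 64, k = 32, and r = 4, the adaptable parameters drop from 2048 to 384, which is the d·k versus r(d+k) gap the abstract refers to.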

CV Apr 26, 2023 Code
Customized Segment Anything Model for Medical Image Segmentation

Kaidong Zhang, Dong Liu

We propose SAMed, a general solution for medical image segmentation. Different from the previous methods, SAMed is built upon the large-scale image segmentation model, Segment Anything Model (SAM), to explore the new research paradigm of customizing large-scale models for medical image segmentation. SAMed applies the low-rank-based (LoRA) finetuning strategy to the SAM image encoder and finetunes it together with the prompt encoder and the mask decoder on labeled medical image segmentation datasets. We also observe the warmup finetuning strategy and the AdamW optimizer lead SAMed to successful convergence and lower loss. Different from SAM, SAMed could perform semantic segmentation on medical images. Our trained SAMed model achieves 81.88 DSC and 20.64 HD on the Synapse multi-organ segmentation dataset, which is on par with the state-of-the-art methods. We conduct extensive experiments to validate the effectiveness of our design. Since SAMed only updates a small fraction of the SAM parameters, its deployment cost and storage cost are quite marginal in practical usage. The code of SAMed is available at https://github.com/hitachinsk/SAMed.
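For readers unfamiliar with the DSC figure quoted above, the following is a minimal sketch of the Dice similarity coefficient on binary masks. The flat 0/1 list representation and the function name are assumptions for illustration; the reported 81.88 is presumably this quantity expressed as a percentage and averaged over organs, which the sketch does not reproduce.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks (flat 0/1 lists)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * intersection / total


print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))
```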

CV May 8, 2022 Code
Recurrent Dynamic Embedding for Video Object Segmentation

Mingxing Li, Li Hu, Zhiwei Xiong et al.

Space-time memory (STM) based video object segmentation (VOS) networks usually enlarge the memory bank every few frames, which yields excellent performance. However, 1) hardware cannot withstand the ever-increasing memory requirements as the video length increases, and 2) storing a large amount of information inevitably introduces noise, which hinders reading the most important information from the memory bank. In this paper, we propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant size. Specifically, we explicitly generate and update RDE by the proposed Spatio-temporal Aggregation Module (SAM), which exploits the cue of historical information. To avoid error accumulation owing to the recurrent usage of SAM, we propose an unbiased guidance loss during the training stage, which makes SAM more robust in long videos. Moreover, the predicted masks in the memory bank are inaccurate due to the inaccurate network inference, which affects the segmentation of the query frame. To address this problem, we design a novel self-correction strategy so that the network can repair the embeddings of masks with different qualities in the memory bank. Extensive experiments show our method achieves the best tradeoff between performance and speed. Code is available at https://github.com/Limingxing00/RDE-VOS-CVPR2022.

CV Aug 14, 2022 Code
Flow-Guided Transformer for Video Inpainting

Kaidong Zhang, Jingjing Fu, Dong Liu

We propose a flow-guided transformer, which innovatively leverages the motion discrepancy exposed by optical flows to instruct the attention retrieval in the transformer for high-fidelity video inpainting. More specifically, we design a novel flow completion network to complete the corrupted flows by exploiting the relevant flow features in a local temporal window. With the completed flows, we propagate the content across video frames, and adopt the flow-guided transformer to synthesize the remaining corrupted regions. We decouple transformers along the temporal and spatial dimensions, so that we can easily integrate the locally relevant completed flows to instruct spatial attention only. Furthermore, we design a flow-reweight module to precisely control the impact of completed flows on each spatial transformer. For the sake of efficiency, we introduce a window partition strategy to both spatial and temporal transformers. In particular, in the spatial transformer, we design a dual perspective spatial MHSA, which integrates the global tokens into the window-based attention. Extensive experiments demonstrate the effectiveness of the proposed method qualitatively and quantitatively. Codes are available at https://github.com/hitachinsk/FGT.

CV Mar 5, 2023
PyramidFlow: High-Resolution Defect Contrastive Localization using Pyramid Normalizing Flow

Jiarui Lei, Xiaobo Hu, Yue Wang et al.

During industrial processing, unforeseen defects may arise in products due to uncontrollable factors. Although unsupervised methods have been successful in defect localization, the usual use of pre-trained models results in low-resolution outputs, which damages visual performance. To address this issue, we propose PyramidFlow, the first fully normalizing flow method without pre-trained models that enables high-resolution defect localization. Specifically, we propose a latent template-based defect contrastive localization paradigm to reduce intra-class variance, as the pre-trained models do. In addition, PyramidFlow utilizes pyramid-like normalizing flows for multi-scale fusing and volume normalization to help generalization. Our comprehensive studies on MVTecAD demonstrate the proposed method outperforms the comparable algorithms that do not use external priors, even achieving state-of-the-art performance in more challenging BTAD scenarios.

IV Mar 11, 2022
aiWave: Volumetric Image Compression with 3-D Trained Affine Wavelet-like Transform

Dongmei Xue, Haichuan Ma, Li Li et al.

Volumetric image compression has become an urgent task to effectively transmit and store images produced in biological research and clinical practice. At present, the most commonly used volumetric image compression methods are based on wavelet transform, such as JP3D. However, JP3D employs an ideal, separable, global, and fixed wavelet basis to convert input images from the pixel domain to the frequency domain, which seriously limits its performance. In this paper, we first design a 3-D trained wavelet-like transform to enable signal-dependent and non-separable transform. Then, an affine wavelet basis is introduced to capture the various local correlations in different regions of volumetric images. Furthermore, we embed the proposed wavelet-like transform into an end-to-end compression framework called aiWave to enable an adaptive compression scheme for various datasets. Finally, we introduce weight sharing strategies for the affine wavelet-like transform according to the volumetric data characteristics in the axial direction to reduce the number of parameters. The experimental results show that: 1) when combining our trained 3-D affine wavelet-like transform with a simple factorized entropy module, aiWave performs better than JP3D and is comparable in terms of encoding and decoding complexities; 2) when adding a context module to further remove signal redundancy, aiWave can achieve a much better performance than HEVC.

99.8 RO Apr 13 Code
RoboCOIN: An Open-Sourced Bimanual Robotic Data Collection for Integrated Manipulation

Shihan Wu, Xuecheng Liu, Shaoxuan Xie et al.

Despite the critical role of bimanual manipulation in endowing robots with human-like dexterity, large-scale and diverse datasets remain scarce due to the significant hardware heterogeneity across bimanual robotic platforms. To bridge this gap, we introduce RoboCOIN, a large-scale multi-embodiment bimanual manipulation dataset comprising over 180,000 demonstrations collected from 15 distinct robotic platforms. Spanning 16 diverse environments (including residential, commercial, and industrial settings), the dataset features 421 bimanual tasks systematically categorized by 39 bimanual collaboration actions and 432 objects. A key innovation of our work is the hierarchical capability pyramid, which provides granular annotations ranging from trajectory-level concepts to segment-level subtasks and frame-level kinematics. Furthermore, we present CoRobot, an efficient data processing pipeline powered by the Robot Trajectory Markup Language (RTML), designed to facilitate quality assessment, automated annotation, and unified multi-embodiment and data management. Extensive experiments demonstrate the effectiveness of RoboCOIN in enhancing the performance of various bimanual manipulation models across a wide spectrum of robotic embodiments. The entire dataset and codebase are fully open-sourced, providing a valuable resource for advancing research in bimanual and multi-embodiment manipulation.

CV Jul 11, 2023
Offline and Online Optical Flow Enhancement for Deep Video Compression

Chuanbo Tang, Xihua Sheng, Zhuoyuan Li et al.

Video compression relies heavily on exploiting the temporal redundancy between video frames, which is usually achieved by estimating and using the motion information. The motion information is represented as optical flows in most of the existing deep video compression networks. Indeed, these networks often adopt pre-trained optical flow estimation networks for motion estimation. The optical flows, however, may be less suitable for video compression due to the following two factors. First, the optical flow estimation networks were trained to perform inter-frame prediction as accurately as possible, but the optical flows themselves may cost too many bits to encode. Second, the optical flow estimation networks were trained on synthetic data, and may not generalize well enough to real-world videos. We address the twofold limitations by enhancing the optical flows in two stages: offline and online. In the offline stage, we fine-tune a trained optical flow estimation network with the motion information provided by a traditional (non-deep) video compression scheme, e.g. H.266/VVC, as we believe the motion information of H.266/VVC achieves a better rate-distortion trade-off. In the online stage, we further optimize the latent features of the optical flows with a gradient descent-based algorithm for the video to be compressed, so as to enhance the adaptivity of the optical flows. We conduct experiments on a state-of-the-art deep video compression scheme, DCVC. Experimental results demonstrate that the proposed offline and online enhancement together achieves on average 12.8% bitrate saving on the tested videos, without increasing the model or computational complexity of the decoder side.

IV Jun 19, 2023
VNVC: A Versatile Neural Video Coding Framework for Efficient Human-Machine Vision

Xihua Sheng, Li Li, Dong Liu et al.

Almost all digital videos are coded into compact representations before being transmitted. Such compact representations need to be decoded back to pixels before being displayed to humans and - as usual - before being enhanced/analyzed by machine vision algorithms. Intuitively, it is more efficient to enhance/analyze the coded representations directly without decoding them into pixels. Therefore, we propose a versatile neural video coding (VNVC) framework, which targets learning compact representations to support both reconstruction and direct enhancement/analysis, thereby being versatile for both human and machine vision. Our VNVC framework has a feature-based compression loop. In the loop, one frame is encoded into compact representations and decoded to an intermediate feature that is obtained before performing reconstruction. The intermediate feature can be used as reference in motion compensation and motion estimation through feature-based temporal context mining and cross-domain motion encoder-decoder to compress the following frames. The intermediate feature is directly fed into video reconstruction, video enhancement, and video analysis networks to evaluate its effectiveness. The evaluation shows that our framework with the intermediate feature achieves high compression efficiency for video reconstruction and satisfactory task performances with lower complexities.

CV Aug 6, 2023
Learning Fine-Grained Features for Pixel-wise Video Correspondences

Rui Li, Shenglong Zhou, Dong Liu

Video analysis tasks rely heavily on identifying the pixels from different frames that correspond to the same visual target. To tackle this problem, recent studies have advocated feature learning methods that aim to learn distinctive representations to match the pixels, especially in a self-supervised fashion. Unfortunately, these methods have difficulties for tiny or even single-pixel visual targets. Pixel-wise video correspondences were traditionally related to optical flows, which however lead to deterministic correspondences and lack robustness on real-world videos. We address the problem of learning features for establishing pixel-wise correspondences. Motivated by optical flows as well as the self-supervised feature learning, we propose to use not only labeled synthetic videos but also unlabeled real-world videos for learning fine-grained representations in a holistic framework. We adopt an adversarial learning scheme to enhance the generalization ability of the learned features. Moreover, we design a coarse-to-fine framework to pursue high computational efficiency. Our experimental results on a series of correspondence-based tasks demonstrate that the proposed method outperforms state-of-the-art rivals in both accuracy and efficiency.

CV Mar 17, 2022
Neural Compression-Based Feature Learning for Video Restoration

Cong Huang, Jiahao Li, Bin Li et al.

Efficiently utilizing temporal features is crucial, yet challenging, for video restoration. The temporal features usually contain various noisy and uncorrelated information, and they may interfere with the restoration of the current frame. This paper proposes learning noise-robust feature representations to help video restoration. We are inspired by the observation that a neural codec is a natural denoiser. In a neural codec, the noisy and uncorrelated contents, which are hard to predict but cost many bits, are more inclined to be discarded for bitrate saving. Therefore, we design a neural compression module to filter the noise and keep the most useful information in features for video restoration. To achieve robustness to noise, our compression module adopts a spatial channel-wise quantization mechanism to adaptively determine the quantization step size for each position in the latent. Experiments show that our method can significantly boost the performance on video denoising, where we obtain 0.13 dB improvement over BasicVSR++ with only 0.23x FLOPs. Meanwhile, our method also obtains SOTA results on video deraining and dehazing.
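The idea of a position-dependent quantization step can be sketched with plain uniform quantization. This is only an illustration under assumptions: in the paper the step sizes are determined adaptively by the network, whereas here they are hand-picked, and the function name is hypothetical.

```python
def quantize(latent, steps):
    """Uniformly quantize each latent value with its own step size.

    A larger step discards more detail at that position.
    Note: Python's round() resolves exact ties to the nearest even integer.
    """
    return [round(v / s) * s for v, s in zip(latent, steps)]


# The same latent value survives, is coarsened, or is zeroed out
# depending on the step size assigned to its position.
print(quantize([0.3, 0.3, 0.3], [0.25, 0.5, 1.0]))
```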

IV Jun 7, 2023 Code
A Dataset for Deep Learning-based Bone Structure Analyses in Total Hip Arthroplasty

Kaidong Zhang, Ziyang Gan, Dong Liu et al.

Total hip arthroplasty (THA) is a widely used surgical procedure in orthopedics. For THA, it is of clinical significance to analyze the bone structure from the CT images, especially to observe the structure of the acetabulum and femoral head, before the surgical procedure. For such bone structure analyses, deep learning technologies are promising but require high-quality labeled data for the learning, while the data labeling is costly. We address this issue and propose an efficient data annotation pipeline for producing a deep learning-oriented dataset. Our pipeline consists of non-learning-based bone extraction (BE) and acetabulum and femoral head segmentation (AFS), followed by active-learning-based annotation refinement (AAR). For BE we use the classic graph-cut algorithm. For AFS we propose an improved algorithm, including femoral head boundary localization using first-order and second-order gradient regularization, line-based non-maximum suppression, and anatomy prior-based femoral head extraction. For AAR, we refine the algorithm-produced pseudo labels with the help of trained deep models: we measure the uncertainty based on the disagreement between the original pseudo labels and the deep model predictions, and then select the samples with the largest uncertainty for manual labeling. Using the proposed pipeline, we construct a large-scale bone structure analyses dataset from more than 300 clinical and diverse CT scans. We perform careful manual labeling for the test set of our data. We then benchmark multiple state-of-the-art deep learning-based methods of medical image segmentation using the training and test sets of our data. The extensive experimental results validate the efficacy of the proposed data annotation pipeline. The dataset, related codes and models will be publicly available at https://github.com/hitachinsk/THA.
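The annotation-refinement step (rank samples by disagreement between pseudo labels and model predictions, then send the most uncertain ones for manual labeling) can be sketched as follows. The disagreement measure used here, the fraction of differing voxels, is an assumption; the abstract does not specify the exact measure, and all names are hypothetical.

```python
def disagreement(pseudo, predicted):
    """Fraction of voxels where the pseudo label and the model prediction differ."""
    return sum(a != b for a, b in zip(pseudo, predicted)) / len(pseudo)


def select_for_manual_labeling(pseudo_labels, predictions, budget):
    """Return the ids of the `budget` samples with the largest disagreement."""
    scores = {
        sample_id: disagreement(pseudo_labels[sample_id], predictions[sample_id])
        for sample_id in pseudo_labels
    }
    return sorted(scores, key=scores.get, reverse=True)[:budget]


pseudo = {"a": [1, 1, 0, 0], "b": [1, 0, 0, 0], "c": [0, 0, 0, 0]}
model = {"a": [1, 1, 0, 0], "b": [0, 1, 1, 0], "c": [0, 0, 0, 1]}
print(select_for_manual_labeling(pseudo, model, budget=2))
```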

CV Sep 16, 2022
Spatial-then-Temporal Self-Supervised Learning for Video Correspondence

Rui Li, Dong Liu

In low-level video analyses, effective representations are important to derive the correspondences between video frames. These representations have been learned in a self-supervised fashion from unlabeled images or videos, using carefully designed pretext tasks in some recent studies. However, the previous work concentrates on either spatial-discriminative features or temporal-repetitive features, with little attention to the synergy between spatial and temporal cues. To address this issue, we propose a spatial-then-temporal self-supervised learning method. Specifically, we firstly extract spatial features from unlabeled images via contrastive learning, and secondly enhance the features by exploiting the temporal cues in unlabeled videos via reconstructive learning. In the second step, we design a global correlation distillation loss to ensure the learning not to forget the spatial cues, and a local correlation distillation loss to combat the temporal discontinuity that harms the reconstruction. The proposed method outperforms the state-of-the-art self-supervised methods, as established by the experimental results on a series of correspondence-based video analysis tasks. Also, we performed ablation studies to verify the effectiveness of the two-step design as well as the distillation losses.

CV Jan 24, 2023
Exploiting Optical Flow Guidance for Transformer-Based Video Inpainting

Kaidong Zhang, Jialun Peng, Jingjing Fu et al.

Transformers have been widely used for video processing owing to the multi-head self attention (MHSA) mechanism. However, the MHSA mechanism encounters an intrinsic difficulty for video inpainting, since the features associated with the corrupted regions are degraded and incur inaccurate self attention. This problem, termed query degradation, may be mitigated by first completing optical flows and then using the flows to guide the self attention, which was verified in our previous work, the flow-guided transformer (FGT). We further exploit the flow guidance and propose FGT++ to pursue more effective and efficient video inpainting. First, we design a lightweight flow completion network by using local aggregation and edge loss. Second, to address the query degradation, we propose a flow guidance feature integration module, which uses the motion discrepancy to enhance the features, together with a flow-guided feature propagation module that warps the features according to the flows. Third, we decouple the transformer along the temporal and spatial dimensions, where flows are used to select the tokens through a temporally deformable MHSA mechanism, and global tokens are combined with the inner-window local tokens through a dual perspective MHSA mechanism. Experiments show that FGT++ outperforms the existing video inpainting networks qualitatively and quantitatively.

81.7 CV May 20 Code
Towards Large Model Feature Coding

Youwei Pang, Changsheng Gao, Dong Liu et al.

Large models have delivered remarkable performance across a wide range of perception and generation tasks, yet practical deployment is increasingly constrained by computational and memory budgets, as well as privacy requirements. Split execution alleviates these constraints by partitioning computation across devices, but it inevitably introduces intensive transmission and storage of intermediate features. Unlike conventional feature coding for CNNs that typically targets homogeneous spatial activation maps, modern large models generate heterogeneous features with varying statistical distributions and compression tolerances, e.g., multi-level/multi-modal representations and autoregressive context caches. These characteristics necessitate treating large model feature coding (LaMoFC) as a fundamental system component and call for a systematic evaluation framework. In this paper, we present a comprehensive benchmark and evaluation framework for LaMoFC. We first build the feature dataset LaMoFCBench, covering diverse task requirements across 4 categories and 16 scenarios while integrating widely adopted architectures and various split-computing settings. We then specify representative split points according to practical application scenarios to extract intermediate features, establishing a unified pipeline for fair and reproducible comparisons. Finally, we benchmark mainstream universal feature codecs, exposing the profound misalignment between existing coding paradigms and the heterogeneous nature of large model features. These findings reveal that LaMoFC demands a fundamental departure from existing paradigms, and LaMoFCBench provides the shared empirical foundation to drive this transition. The data and code will be available at https://github.com/lartpang/LaMoFCBench.

98.0 CL Apr 27 Code
Thoughts-as-Planning: Latent World Models for Chain-of-Thoughts Optimization via Reinforcement Planning

Dong Liu, Yanxuan Yu, Ying Nian Wu

The success of large language models (LLMs) across diverse NLP tasks has elevated the importance of reasoning chain optimization as a critical step in aligning model behavior with task objectives. Existing reasoning chain tuning methods often rely on black-box heuristics or gradient-free search, which lack interpretability, generalization, and sample efficiency. In this work, we introduce Thoughts-as-Planning, a novel framework that formalizes reasoning chain optimization as a sequential decision-making process over a latent semantic space. We model the LLM as a partially observable environment and learn a latent world model that simulates the effect of reasoning chain edits on downstream outputs. A proximity-preserving embedding space is constructed to encode reasoning chain-response dynamics, enabling planning via gradient descent or reinforcement learning. Our method supports multi-scale abstraction, allowing reasoning chain edits at token, segment, and instruction levels to be integrated into a unified planner. Through extensive experiments on language understanding and generation tasks, we demonstrate that Thoughts-as-Planning outperforms state-of-the-art reasoning chain tuning baselines in efficiency, robustness, and generalization, while offering interpretability through its structured planning trajectory. Our code is available at https://github.com/FastLM/Thoughts-as-Planning.

CV Nov 30, 2025 Code
MM-ACT: Learn from Multimodal Parallel Generation to Act

Haotian Liang, Xinyi Chen, Bin Wang et al.

A generalist robotic policy needs both semantic understanding for task planning and the ability to interact with the environment through predictive capabilities. To tackle this, we present MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in shared token space and performs generation across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and employs a one-step parallel decoding strategy for action generation to improve efficiency. We introduce Context-Shared Multimodal Learning, a unified training paradigm that supervises generation in all three modalities from a shared context, enhancing action generation through cross-modal learning. Experiments were conducted on the LIBERO simulation and Franka real-robot setups as well as RoboTwin2.0 to assess in-domain and out-of-domain performances respectively. Our approach achieves a success rate of 96.3% on LIBERO, 72.0% across three tasks of real Franka, and 52.38% across eight bimanual tasks of RoboTwin2.0 with an additional gain of 9.25% from cross-modal learning. We release our codes, models and data at https://github.com/HHYHRHY/MM-ACT.

DC Sep 3, 2024 Code
Designing Large Foundation Models for Efficient Training and Inference: A Survey

Dong Liu, Yanxuan Yu, Yite Wang et al.

This paper focuses on modern efficient training and inference technologies on foundation models and illustrates them from two perspectives: model and system design. Model and System Design optimize LLM training and inference from different aspects to save computational resources, making LLMs more efficient, affordable, and more accessible. The paper list repository is available at https://github.com/NoakLiu/Efficient-Foundation-Models-Survey.

IV Sep 13, 2024
USTC-TD: A Test Dataset and Benchmark for Image and Video Coding in 2020s

Zhuoyuan Li, Junqi Liao, Chuanbo Tang et al.

Image/video coding has been a remarkable research area for both academia and industry for many years. Testing datasets, especially high-quality image/video datasets are desirable for the justified evaluation of coding-related research, practical applications, and standardization activities. We put forward a test dataset namely USTC-TD, which has been successfully adopted in the practical end-to-end image/video coding challenge of the IEEE International Conference on Visual Communications and Image Processing (VCIP) in 2022 and 2023. USTC-TD contains 40 images at 4K spatial resolution and 10 video sequences at 1080p spatial resolution, featuring various content due to the diverse environmental factors (e.g. scene type, texture, motion, view) and the designed imaging factors (e.g. illumination, lens, shadow). We quantitatively evaluate USTC-TD on different image/video features (spatial, temporal, color, lightness), and compare it with the previous image/video test datasets, which verifies its excellent compensation for the shortcomings of existing datasets. We also evaluate both classic standardized and recently learned image/video coding schemes on USTC-TD using objective quality metrics (PSNR, MS-SSIM, VMAF) and subjective quality metric (MOS), providing an extensive benchmark for these evaluated schemes. Based on the characteristics and specific design of the proposed test dataset, we analyze the benchmark performance and shed light on the future research and development of image/video coding. All the data are released online: https://esakak.github.io/USTC-TD.
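Of the objective metrics listed above, PSNR is simple enough to sketch directly. The following is a minimal version for 8-bit samples given as flat lists; the function name and the flat-list layout are assumptions for illustration, and MS-SSIM and VMAF are not covered.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized sample lists."""
    if len(reference) != len(distorted):
        raise ValueError("inputs must have the same number of samples")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


print(psnr([100, 100], [110, 90]))
```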

33.0 CV May 27
Bound-Constrained Sparse Representation for Electrical Impedance Tomography

Chun Zhang, Dong Liu

This study proposes a bound-constrained sparse representation (BC-SR) framework for electrical impedance tomography (EIT), aimed at improving conductivity estimation without explicit regularization. BC-SR adopts a representation-driven strategy, generating conductivity from low-dimensional latent variables via an implicit composite parameterization. Structural priors are embedded using a truncated graph-Laplacian basis, while a bound-preserving nonlinear mapping enforces admissible conductivity ranges and improves conditioning through implicit gradient modulation. The approach ensures robust convergence, even under noisy or incomplete data. Extensive validation on 2D/3D simulations, tank experiments, and in-vivo lung data shows that BC-SR improves physical consistency and structural fidelity, offering enhanced robustness compared to traditional methods. Additionally, BC-SR enables 3D time-difference EIT reconstruction, offering improved spatial resolution and a more coherent representation of 3D conductivity distributions, particularly for in-vivo lung data. This suggests potential for improved performance in EIT, particularly in clinical applications for respiratory monitoring.

CV Aug 4, 2023
DTF-Net: Category-Level Pose Estimation and Shape Reconstruction via Deformable Template Field

Haowen Wang, Zhipeng Fan, Zhen Zhao et al.

Estimating 6D poses and reconstructing 3D shapes of objects in open-world scenes from RGB-depth image pairs is challenging. Many existing methods rely on learning geometric features that correspond to specific templates while disregarding shape variations and pose differences among objects in the same category. As a result, these methods underperform when handling unseen object instances in complex environments. In contrast, other approaches aim to achieve category-level estimation and reconstruction by leveraging normalized geometric structure priors, but the static prior-based reconstruction struggles with substantial intra-class variations. To solve these problems, we propose the DTF-Net, a novel framework for pose estimation and shape reconstruction based on implicit neural fields of object categories. In DTF-Net, we design a deformable template field to represent the general category-wise shape latent features and intra-category geometric deformation features. The field establishes continuous shape correspondences, deforming the category template into arbitrary observed instances to accomplish shape reconstruction. We introduce a pose regression module that shares the deformation features and template codes from the fields to estimate the accurate 6D pose of each object in the scene. We integrate a multi-modal representation extraction module to extract object features and semantic masks, enabling end-to-end inference. Moreover, during training, we implement a shape-invariant training strategy and a viewpoint sampling method to further enhance the model's capability to extract object pose features. Extensive experiments on the REAL275 and CAMERA25 datasets demonstrate the superiority of DTF-Net in both synthetic and real scenes. Furthermore, we show that DTF-Net effectively supports grasping tasks with a real robot arm.

78.0 IV May 15 Code
TVRN: Invertible Neural Networks for Compression-Aware Temporal Video Rescaling

Xinmin Feng, Li Li, Dong Liu et al.

To fit diverse display and bandwidth constraints, high-frame-rate videos are temporally downscaled to low-frame-rate (LFR) and later upscaled, requiring joint optimization for effective frame-rate rescaling. However, existing methods typically link the two operations via training objectives, without fully exploiting their reciprocal nature, which may cause high-frequency information loss. Moreover, they overlook the impact of lossy codecs on LFR videos, limiting real-world applicability. In this work, we propose an end-to-end framework for compression-aware frame-rate rescaling, named TVRN. To regularize high-frequency information lost during frame-rate downscaling, TVRN adopts an invertible architecture that combines a Multi-Input Multi-Output Temporal Wavelet Transform with a high-frequency reconstruction module. To enable end-to-end training through non-differentiable lossy codecs, we design a surrogate network that approximates their gradients. Finally, to improve robustness under various compression levels, we extend TVRN to an asymmetric architecture by incorporating compression-aware features learned via a learning-to-rank strategy. Extensive experiments show that TVRN outperforms existing methods in reconstruction quality under industrial video compression settings. Source code is publicly available at https://github.com/fengxinmin/TVRN_public.
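Why an invertible temporal transform helps can be seen with a fixed Haar transform along the time axis. This is a stand-in chosen for illustration, not the paper's Multi-Input Multi-Output Temporal Wavelet Transform: the low band plays the role of the low-frame-rate video, and the high band is the information that is lost unless it is kept or reconstructed. Frames are flat lists of floats and all names are hypothetical.

```python
def temporal_haar_forward(frames):
    """Split an even-length frame sequence into low (mean) and high (difference) bands."""
    low, high = [], []
    for a, b in zip(frames[0::2], frames[1::2]):
        low.append([(x + y) / 2 for x, y in zip(a, b)])
        high.append([x - y for x, y in zip(a, b)])
    return low, high


def temporal_haar_inverse(low, high):
    """Recover the original frames from the low and high bands."""
    frames = []
    for mean, diff in zip(low, high):
        frames.append([m + d / 2 for m, d in zip(mean, diff)])  # first frame of the pair
        frames.append([m - d / 2 for m, d in zip(mean, diff)])  # second frame of the pair
    return frames


video = [[1.0, 2.0], [3.0, 6.0], [5.0, 5.0], [7.0, 1.0]]
low, high = temporal_haar_forward(video)
print(low)                                  # half the frame rate
print(temporal_haar_inverse(low, high))     # exact reconstruction
```

If the high band is replaced by zeros, the inverse returns each averaged frame twice, which is the high-frequency loss the abstract describes.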

CVJun 1, 2023
Towards Interactive Image Inpainting via Sketch Refinement

Chang Liu, Shunxin Xu, Jialun Peng et al.

One tough problem in image inpainting is restoring complex structures in the corrupted regions. This motivates interactive image inpainting, which leverages additional hints, e.g., sketches, to assist the inpainting process. Sketches are simple and intuitive for end users, but their free-form nature introduces considerable randomness. Such randomness may confuse the inpainting models and incur severe artifacts in completed images. To address this problem, we propose a two-stage image inpainting method termed SketchRefiner. In the first stage, we propose using a cross-correlation loss function to robustly calibrate and refine the user-provided sketches in a coarse-to-fine fashion. In the second stage, we learn to extract informative features from the abstracted sketches in the feature space and modulate the inpainting process. We also propose an algorithm to simulate real sketches automatically and build a test protocol with different applications. Experimental results on public datasets demonstrate that SketchRefiner effectively utilizes sketch information and eliminates the artifacts due to the free-form sketches. Our method consistently outperforms the state-of-the-art ones both qualitatively and quantitatively, while revealing great potential in real-world applications. Our code and dataset are available.

39.7CLMar 28Code
Towards Hyper-Efficient RAG Systems in VecDBs: Distributed Parallel Multi-Resolution Vector Search

Dong Liu, Yanxuan Yu

Retrieval-Augmented Generation (RAG) systems have become a dominant approach to augment large language models (LLMs) with external knowledge. However, existing vector database (VecDB) retrieval pipelines rely on flat or single-resolution indexing structures, which cannot adapt to the varying semantic granularity required by diverse user queries. This limitation leads to suboptimal trade-offs between retrieval speed and contextual relevance. To address this, we propose Semantic Pyramid Indexing (SPI), a novel multi-resolution vector indexing framework that introduces query-adaptive resolution control for RAG in VecDBs. Unlike existing hierarchical methods that require offline tuning or separate model training, SPI constructs a semantic pyramid over document embeddings and dynamically selects the optimal resolution level per query through a lightweight classifier. This adaptive approach enables progressive retrieval from coarse-to-fine representations, significantly accelerating search while maintaining semantic coverage. We implement SPI as a plugin for both FAISS and Qdrant backends and evaluate it across multiple RAG tasks including MS MARCO, Natural Questions, and multimodal retrieval benchmarks. SPI achieves up to 5.7× retrieval speedup and 1.8× memory efficiency gain while improving end-to-end QA F1 scores by up to 2.5 points compared to strong baselines. Our theoretical analysis provides guarantees on retrieval quality and latency bounds, while extensive ablation studies validate the contribution of each component. The framework's compatibility with existing VecDB infrastructures makes it readily deployable in production RAG systems. Code is available at https://github.com/FastLM/SPI_VecDB.

CVApr 11, 2023
Mask-Based Modeling for Neural Radiance Fields

Ganlin Yang, Guoqiang Wei, Zhizheng Zhang et al.

Most Neural Radiance Fields (NeRFs) exhibit limited generalization capabilities, which restrict their applicability in representing multiple scenes using a single model. To address this problem, existing generalizable NeRF methods simply condition the model on image features. These methods still struggle to learn precise global representations over diverse scenes since they lack an effective mechanism for interacting among different points and views. In this work, we unveil that 3D implicit representation learning can be significantly improved by mask-based modeling. Specifically, we propose masked ray and view modeling for generalizable NeRF (MRVM-NeRF), which is a self-supervised pretraining target to predict complete scene representations from partially masked features along each ray. With this pretraining target, MRVM-NeRF enables better use of correlations across different points and views as the geometry priors, which thereby strengthens the capability of capturing intricate details within the scenes and boosts the generalization capability across different scenes. Extensive experiments demonstrate the effectiveness of our proposed MRVM-NeRF on both synthetic and real-world datasets, qualitatively and quantitatively. Besides, we also conduct experiments to show the compatibility of our proposed method with various backbones and its superiority under few-shot cases.

CVAug 16, 2024
Bi-Directional Deep Contextual Video Compression

Xihua Sheng, Li Li, Dong Liu et al.

Deep video compression has made remarkable progress in recent years, with the majority of advancements concentrated on P-frame coding. Although efforts to enhance B-frame coding are ongoing, their compression performance is still far behind that of traditional bi-directional video codecs. In this paper, we introduce a bi-directional deep contextual video compression scheme tailored for B-frames, termed DCVC-B, to improve the compression performance of deep B-frame coding. Our scheme has three key innovations. First, we develop a bi-directional motion difference context propagation method for effective motion difference coding, which significantly reduces the bit cost of bi-directional motions. Second, we propose a bi-directional contextual compression model and a corresponding bi-directional temporal entropy model to make better use of the multi-scale temporal contexts. Third, we propose a hierarchical quality structure-based training strategy, leading to effective bit allocation across large groups of pictures (GOP). Experimental results show that DCVC-B achieves an average reduction of 26.6% in BD-Rate compared to the reference software for H.265/HEVC under random access conditions. Remarkably, it surpasses the performance of the H.266/VVC reference software on certain test datasets under the same configuration. We anticipate that our work can provide valuable insights and bring deep B-frame coding to the next level.

IVJul 16, 2024
Uniformly Accelerated Motion Model for Inter Prediction

Zhuoyuan Li, Yao Li, Chuanbo Tang et al.

Inter prediction is a key technology to reduce the temporal redundancy in video coding. In natural videos, there are usually multiple moving objects with variable velocity, resulting in complex motion fields that are difficult to represent compactly. In Versatile Video Coding (VVC), existing inter prediction methods usually assume uniform speed motion between consecutive frames and use linear models for motion estimation (ME) and motion compensation (MC), which may not handle the complex motion fields of the real world well. To address these issues, we introduce a uniformly accelerated motion model (UAMM) to exploit motion-related elements (velocity, acceleration) of moving objects between the video frames, and further combine them to assist the inter prediction methods in handling variable motion in the temporal domain. Specifically, we first present the theory of UAMM. Second, based on that, we propose the UAMM-based parameter derivation and extrapolation schemes in the coding process. Third, we integrate the UAMM into existing inter prediction modes (Merge, MMVD, CIIP) to achieve higher prediction accuracy. The proposed method is implemented in the VVC reference software, VTM version 12.0. Experimental results show that the proposed method achieves up to 0.38% and on average 0.13% BD-rate reduction compared to the VTM anchor, under the Low-delay P configuration, with a slight increase of time complexity on the encoding/decoding side.
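The difference between a linear and a uniformly accelerated motion assumption can be shown with basic kinematics. The sketch below is not the VTM implementation described in the abstract: the function names, the one-dimensional position, and the unit frame interval are assumptions for illustration. It derives velocity and acceleration from an object's positions in three equally spaced frames, then extrapolates to the next frame.

```python
def derive_motion(p0, p1, p2, dt=1.0):
    """Estimate velocity and acceleration at the latest frame (p2).

    p0, p1, p2 are positions in three equally spaced frames, oldest first.
    """
    accel = (p2 - 2 * p1 + p0) / dt ** 2         # second finite difference
    velocity = (p2 - p1) / dt + 0.5 * accel * dt  # mid-interval velocity advanced to p2
    return velocity, accel


def extrapolate_accelerated(p2, velocity, accel, t=1.0):
    """Predict the position t time units after p2 under constant acceleration."""
    return p2 + velocity * t + 0.5 * accel * t * t


def extrapolate_linear(p1, p2):
    """Predict the next position assuming uniform speed between frames."""
    return p2 + (p2 - p1)


# Positions generated by x(t) = 2t + 0.5t^2 at t = 0, 1, 2 (true x(3) = 10.5).
p0, p1, p2 = 0.0, 2.5, 6.0
v, a = derive_motion(p0, p1, p2)
pred_uamm = extrapolate_accelerated(p2, v, a)
pred_linear = extrapolate_linear(p1, p2)
```

For this accelerating object the constant-acceleration prediction lands on the true position, while the uniform-speed prediction falls short by one unit.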

CVJul 22, 2023
On the Effectiveness of Spectral Discriminators for Perceptual Quality Improvement

Xin Luo, Yunan Zhu, Shunxin Xu et al.

Several recent studies advocate the use of spectral discriminators, which evaluate the Fourier spectra of images for generative modeling. However, the effectiveness of the spectral discriminators is not well interpreted yet. We tackle this issue by examining the spectral discriminators in the context of perceptual image super-resolution (i.e., GAN-based SR), as SR image quality is susceptible to spectral changes. Our analyses reveal that the spectral discriminator indeed performs better than the ordinary (a.k.a. spatial) discriminator in identifying the differences in the high-frequency range; however, the spatial discriminator holds an advantage in the low-frequency range. Thus, we suggest that the spectral and spatial discriminators shall be used simultaneously. Moreover, we improve the spectral discriminators by first calculating the patch-wise Fourier spectrum and then aggregating the spectra by Transformer. We verify the effectiveness of the proposed method twofold. On the one hand, thanks to the additional spectral discriminator, our obtained SR images have their spectra better aligned to those of the real images, which leads to a better PD tradeoff. On the other hand, our ensembled discriminator predicts the perceptual quality more accurately, as evidenced in the no-reference image quality assessment task.

CVJul 28, 2024
NVC-1B: A Large Neural Video Coding Model

Xihua Sheng, Chuanbo Tang, Li Li et al.

Emerging large models have achieved notable progress in the fields of natural language processing and computer vision. However, large models for neural video coding are still unexplored. In this paper, we explore how to build a large neural video coding model. Based on a small baseline model, we gradually scale up the model sizes of its different coding parts, including the motion encoder-decoder, motion entropy model, contextual encoder-decoder, contextual entropy model, and temporal context mining module, and analyze the influence of model sizes on video compression performance. Then, we explore using different architectures, including CNN, mixed CNN-Transformer, and Transformer architectures, to implement the neural video coding model and analyze the influence of model architectures on video compression performance. Based on our exploration results, we design the first neural video coding model with more than 1 billion parameters -- NVC-1B. Experimental results show that our proposed large model achieves a significant video compression performance improvement over the small baseline model, and represents the state-of-the-art compression efficiency. We anticipate that large models may bring video coding technologies to the next level.

CVSep 29, 2023
On Uniform Scalar Quantization for Learned Image Compression

Haotian Zhang, Li Li, Dong Liu

Learned image compression poses a unique challenge when incorporating non-differentiable quantization into the gradient-based training of the networks. Several quantization surrogates have been proposed to enable the training, but they were not systematically justified from a theoretical perspective. We fill this gap by contrasting uniform scalar quantization, the most widely used category with rounding being its simplest case, and its training surrogates. In principle, we find two factors crucial: one is the discrepancy between the surrogate and rounding, leading to train-test mismatch; the other is gradient estimation risk due to the surrogate, which consists of bias and variance of the gradient estimation. Our analyses and simulations imply that there is a tradeoff between the train-test mismatch and the gradient estimation risk, and the tradeoff varies across different network structures. Motivated by these analyses, we present a method based on stochastic uniform annealing, which has an adjustable temperature coefficient to control the tradeoff. Moreover, our analyses point us to two subtle tricks: one is to set an appropriate lower bound for the variance parameter of the estimated quantized latent distribution, which effectively reduces the train-test mismatch; the other is to use zero-center quantization with partial stop-gradient, which reduces the gradient estimation variance and thus stabilizes the training. Our method with the tricks is verified to outperform the existing practices of quantization surrogates on a variety of representative image compression networks.
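The train-test mismatch mentioned in this abstract can be made concrete with a small numerical sketch. The code below does not implement the paper's stochastic uniform annealing; it only contrasts hard rounding with additive uniform noise, a widely used training surrogate. The function names and the latent range are assumptions for illustration.

```python
import math
import random


def hard_round(y):
    """Test-time quantizer: round to the nearest integer (halves round up)."""
    return math.floor(y + 0.5)


def noise_surrogate(y, rng):
    """Training-time surrogate: add uniform noise drawn from [-0.5, 0.5]."""
    return y + rng.uniform(-0.5, 0.5)


def mean_abs_mismatch(latents, rng):
    """Average |surrogate - rounding| over a batch of latent values."""
    total = sum(abs(noise_surrogate(y, rng) - hard_round(y)) for y in latents)
    return total / len(latents)


rng = random.Random(0)                                   # fixed seed for repeatability
latents = [rng.uniform(-4.0, 4.0) for _ in range(1000)]  # hypothetical latent values
mismatch = mean_abs_mismatch(latents, rng)
print(f"mean |surrogate - round| = {mismatch:.3f}")
```

The surrogate stays within 0.5 of the latent, as does rounding, but the two can disagree by up to 1.0, so a network trained on the noisy values sees inputs that differ from what it receives at test time.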

IVJul 15, 2024
In-Loop Filtering via Trained Look-Up Tables

Zhuoyuan Li, Jiacheng Li, Yao Li et al.

In-loop filtering (ILF) is a key technology for removing artifacts in image/video coding standards. Recently, neural network-based in-loop filtering methods have achieved remarkable coding gains beyond the capability of advanced video coding standards, making them a powerful coding tool candidate for future video coding standards. However, deep neural networks bring heavy time and computational complexity and demand high-performance hardware, which makes them challenging to apply in general coding scenarios. To address this limitation, inspired by explorations in image restoration, we propose an efficient and practical in-loop filtering scheme by adopting the Look-up Table (LUT). We train the DNN of in-loop filtering within a fixed filtering reference range, and cache the output values of the DNN into a LUT by traversing all possible inputs. At testing time in the coding process, the filtered pixel is generated by locating input pixels (to-be-filtered pixel with reference pixels) and interpolating cached filtered pixel values. To further enable a large filtering reference range with the limited storage cost of the LUT, we introduce an enhanced indexing mechanism in the filtering process, and a clipping/finetuning mechanism in the training. The proposed method is implemented in the Versatile Video Coding (VVC) reference software, VTM-11.0. Experimental results show that the ultrafast, very fast, and fast modes of the proposed method achieve average BD-rate reductions of 0.13%/0.34%/0.51% and 0.10%/0.27%/0.39% under the all intra (AI) and random access (RA) configurations, respectively. Notably, our method has low time and computational complexity: only 101%/102%-104%/108% time increase with 0.13-0.93 kMACs/pixel, and only 164-1148 KB storage cost for a single model. Our solution may shed light on the journey of practical neural network-based coding tool evolution.
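The cache-then-interpolate idea in this abstract can be sketched in one dimension. This is a simplification under stated assumptions: the paper indexes the LUT with the to-be-filtered pixel together with its reference pixels, whereas the sketch uses a single 8-bit pixel, and `stand_in_filter` is a hypothetical placeholder for a trained DNN. The sampling step of 16 is also an assumption.

```python
STEP = 16  # sampling interval of the cached inputs (assumed value)


def stand_in_filter(x):
    """Hypothetical stand-in for a trained filter: pull values toward mid-gray."""
    return 0.9 * x + 0.1 * 128.0


def build_lut(step=STEP):
    """Cache the filter output at sampled inputs 0, step, ..., 256."""
    return [stand_in_filter(x) for x in range(0, 256 + step, step)]


def lut_filter(lut, pixel, step=STEP):
    """Filter an 8-bit pixel by linear interpolation between cached outputs."""
    idx, rem = divmod(pixel, step)
    lower, upper = lut[idx], lut[idx + 1]
    return lower + (upper - lower) * rem / step


lut = build_lut()
filtered = [lut_filter(lut, p) for p in (0, 100, 255)]
```

Sampling every 16 levels shrinks the table from 256 entries to 17. Because the stand-in filter is affine, interpolation reproduces it exactly here; for a nonlinear trained filter the interpolated output would only approximate the network.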

28.8NAApr 15
Randomized Neural Networks for Integro-Differential Equations with Application to Neutron Transport

Haoning Dang, Fei Wang, Yifan Chen et al.

Integro-differential equations arise in a wide range of applications, including transport, kinetic theory, radiative transfer, and multiphysics modeling, where nonlocal integral operators couple the solution across phase space. Such nonlocality often introduces dense coupling blocks in deterministic discretizations, leading to increased computational cost and memory usage, while physics-informed neural networks may suffer from expensive nonconvex training and sensitivity to hyperparameter choices. In this work, we present randomized neural networks (RaNNs) as a mesh-free collocation framework for linear integro-differential equations. Because the RaNN approximation is intrinsically dense through globally supported random features, the nonlocal integral operator does not introduce an additional loss of sparsity, while the approximate solution can still be represented with relatively few trainable degrees of freedom. By randomly fixing the hidden-layer parameters and solving only for the linear output weights, the training procedure reduces to a convex least-squares problem in the output coefficients, enabling stable and efficient optimization. As a representative application, we apply the proposed framework to the steady neutron transport equation, a high-dimensional linear integro-differential model featuring scattering integrals and diverse boundary conditions. Extensive numerical experiments demonstrate that, in the reported test settings, the RaNN approach achieves competitive accuracy while incurring substantially lower training cost than the selected neural and deterministic baselines, highlighting RaNNs as a robust and efficient alternative for the numerical simulation of nonlocal linear operators.

CVJan 6, 2025Code
Visual Large Language Models for Generalized and Specialized Applications

Yifan Li, Zhixin Lai, Wentao Bao et al.

Visual-language models (VLMs) have emerged as a powerful tool for learning a unified embedding space for vision and language. Inspired by large language models, which have demonstrated strong reasoning and multi-task capabilities, visual large language models (VLLMs) are gaining increasing attention for building general-purpose VLMs. Despite the significant progress made in VLLMs, the related literature remains limited, particularly from a comprehensive application perspective, encompassing generalized and specialized applications across vision (image, video, depth), action, and language modalities. In this survey, we focus on the diverse applications of VLLMs, examining their usage scenarios, identifying ethical considerations and challenges, and discussing future directions for their development. By synthesizing this material, we aim to provide a comprehensive guide that will pave the way for future innovations and broader applications of VLLMs. The paper list repository is available at https://github.com/JackYFL/awesome-VLLMs.

CVJan 29, 2024Code
Spatial Decomposition and Temporal Fusion based Inter Prediction for Learned Video Compression

Xihua Sheng, Li Li, Dong Liu et al.

Video compression performance is closely related to the accuracy of inter prediction. It tends to be difficult to obtain accurate inter prediction for the local video regions with inconsistent motion and occlusion. Traditional video coding standards propose various technologies to handle motion inconsistency and occlusion, such as recursive partitions, geometric partitions, and long-term references. However, existing learned video compression schemes focus on obtaining an overall minimized prediction error averaged over all regions while ignoring the motion inconsistency and occlusion in local regions. In this paper, we propose a spatial decomposition and temporal fusion based inter prediction for learned video compression. To handle motion inconsistency, we propose to decompose the video into structure and detail (SDD) components first. Then we perform SDD-based motion estimation and SDD-based temporal context mining for the structure and detail components to generate short-term temporal contexts. To handle occlusion, we propose to propagate long-term temporal contexts by recurrently accumulating the temporal information of each historical reference feature and fuse them with short-term temporal contexts. With the SDD-based motion model and long short-term temporal contexts fusion, our proposed learned video codec can obtain more accurate inter prediction. Comprehensive experimental results demonstrate that our codec outperforms the reference software of H.266/VVC on all common test datasets for both PSNR and MS-SSIM.

CVOct 16, 2023
Towards Open-World Co-Salient Object Detection with Generative Uncertainty-aware Group Selective Exchange-Masking

Yang Wu, Shenglong Hu, Huihui Song et al.

The traditional definition of the co-salient object detection (CoSOD) task is to segment the common salient objects in a group of relevant images. This definition is based on an assumption of group consensus consistency that is not always reasonable in the open-world setting, which results in robustness issues when the model deals with irrelevant images in the input image group under open-world scenarios. To tackle this problem, we introduce a group selective exchange-masking (GSEM) approach for enhancing the robustness of the CoSOD model. GSEM takes two groups of images as input, each containing different types of salient objects. Based on the mixed metric we designed, GSEM selects a subset of images from each group using a novel learning-based strategy, and the selected images are then exchanged. To simultaneously consider the uncertainty introduced by irrelevant images and the consensus features of the remaining relevant images in the group, we designed a latent variable generator branch and a CoSOD transformer branch. The former is composed of a vector quantised-variational autoencoder to generate stochastic global variables that model uncertainty. The latter is designed to capture correlation-based local features that include group consensus. Finally, the outputs of the two branches are merged and passed to a transformer-based decoder to generate robust predictions. Since there are currently no benchmark datasets specifically designed for open-world scenarios, we constructed three open-world benchmark datasets, namely OWCoSal, OWCoSOD, and OWCoCA, based on existing datasets. By breaking the group-consistency assumption, these datasets provide effective simulations of real-world scenarios and can better evaluate the robustness and practicality of models.

CVMar 19, 2025Code
Exploiting Diffusion Prior for Real-World Image Dehazing with Unpaired Training

Yunwei Lan, Zhigao Cui, Chang Liu et al.

Unpaired training has been verified as one of the most effective paradigms for real scene dehazing by learning from unpaired real-world hazy and clear images. Although numerous methods have been proposed, current methods demonstrate limited generalization for various real scenes due to limited feature representation and insufficient use of real-world prior. Inspired by the strong generative capabilities of diffusion models in producing both hazy and clear images, we exploit diffusion prior for real-world image dehazing, and propose an unpaired framework named Diff-Dehazer. Specifically, we leverage diffusion prior as bijective mapping learners within the CycleGAN, a classic unpaired learning framework. Considering that physical priors contain pivotal statistical information about real-world data, we further extract real-world knowledge by integrating physical priors into our framework. Furthermore, we introduce a new perspective for adequately leveraging the representation ability of diffusion models by removing degradation in image and text modalities, so as to improve the dehazing effect. Extensive experiments on multiple real-world datasets demonstrate the superior performance of our method. Our code is available at https://github.com/ywxjm/Diff-Dehazer.

SYFeb 25
Two-Stage Active Distribution Network Voltage Control via LLM-RL Collaboration: A Hybrid Knowledge-Data-Driven Approach

Xu Yang, Chenhui Lin, Xiang Ma et al.

The growing integration of distributed photovoltaics (PVs) into active distribution networks (ADNs) has exacerbated operational challenges, making it imperative to coordinate diverse equipment to mitigate voltage violations and enhance power quality. Although existing data-driven approaches have demonstrated effectiveness in the voltage control problem, they often require extensive trial-and-error exploration and struggle to incorporate heterogeneous information, such as day-ahead forecasts and semantic-based grid codes. Considering the operational scenarios and requirements in real-world ADNs, in this paper, we propose a hybrid knowledge-data-driven approach that leverages dynamic collaboration between a large language model (LLM) agent and a reinforcement learning (RL) agent to achieve two-stage voltage control. In the day-ahead stage, the LLM agent receives coarse region-level forecasts and generates scheduling strategies for on-load tap changer (OLTC) and shunt capacitors (SCs) to regulate the overall voltage profile. Then in the intra-day stage, based on accurate node-level measurements, the RL agent refines terminal voltages by deriving reactive power generation strategies for PV inverters. On top of the LLM-RL collaboration framework, we further propose a self-evolution mechanism for the LLM agent and a pretrain-finetune pipeline for the RL agent, effectively enhancing and coordinating the policies for both agents. The proposed approach not only aligns more closely with practical operational characteristics but also effectively utilizes the inherent knowledge and reasoning capabilities of the LLM agent, significantly improving training efficiency and voltage control performance. Comprehensive comparisons and ablation studies demonstrate the effectiveness of the proposed method.

CVJul 9, 2024
Decomposition Betters Tracking Everything Everywhere

Rui Li, Dong Liu

Recent studies on motion estimation have advocated an optimized motion representation that is globally consistent across the entire video, preferably for every pixel. This is challenging as a uniform representation may not account for the complex and diverse motion and appearance of natural videos. We address this problem and propose a new test-time optimization method, named DecoMotion, for estimating per-pixel and long-range motion. DecoMotion explicitly decomposes video content into static scenes and dynamic objects, each of which is represented by a quasi-3D canonical volume. DecoMotion separately coordinates the transformations between local and canonical spaces, facilitating an affine transformation for the static scene that corresponds to camera motion. For the dynamic volume, DecoMotion leverages discriminative and temporally consistent features to rectify the non-rigid transformation. The two volumes are finally fused to fully represent motion and appearance. This divide-and-conquer strategy leads to more robust tracking through occlusions and deformations while also yielding decomposed appearances. We conduct evaluations on the TAP-Vid benchmark. The results demonstrate that our method boosts the point-tracking accuracy by a large margin and performs on par with some state-of-the-art dedicated point-tracking solutions.

88.0LGMar 24
MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning

Dong Liu, Yanxuan Yu, Ben Lengerich et al.

As long-context language modeling becomes increasingly important, the cost of maintaining and attending to large Key/Value (KV) caches grows rapidly, becoming a major bottleneck in both training and inference. While prior works such as Multi-Query Attention (MQA) and Multi-Latent Attention (MLA) reduce memory by sharing or compressing KV features, they often trade off representation quality or incur runtime overhead. We propose Memory-Keyed Attention (MKA), a hierarchical attention mechanism that integrates multi-level KV caches (local, session, and long-term) and learns to route attention across them dynamically. We further introduce Route-Fused MKA (FastMKA), a broadcast-routed variant that fuses memory sources before attention computation for improved efficiency. Experiments on different sequence lengths show that FastMKA achieves a favorable accuracy-efficiency trade-off: comparable perplexity to MLA while achieving up to 5x faster training throughput and 1.8x lower evaluation latency. These results highlight MKA as a practical and extensible framework for efficient long-context attention.

13.4SYMar 27
A data-driven approach for topology correction in low voltage distribution networks with PVs

Dong Liu, Sander Timmerman, Yu Xiang et al.

Most existing phase balancing and topology reconfiguration problems are formulated as mixed-integer optimization problems that depend on network topologies \cite{10098964,11017695,10571996}. However, these topologies are often inaccurate and outdated for distribution system operators (DSOs) due to missing records, topology maintenance and reconfiguration, such as congestion management \cite{vanin2024phase}. Thus, the topology of the low-voltage distribution network (LVDN) needs to be checked and corrected when it is outdated. The increasing uncertainty of distributed energy resources (DERs), including household photovoltaic (PV), heat pumps, etc., impacts the frequency of topology reconfiguration and challenges the correction of the low-voltage distribution network topology \cite{10026490, 10347462, 10475702}. Moreover, the available smart meter (SM) datasets are often limited due to privacy concerns and random communication channel failure, challenging the topology correction \cite{9696306, costa2022identification, dande2025consumer}. Synthetic European networks and benchmark models presented in \cite{birchfield2016grid,2020Non} are benchmarks for research but insufficient to represent the diversity of European LVDNs for practical use by DSOs (e.g., state estimation). Thus, practical topology identification and correction approaches are required for real-time topology updating for active management of LVDNs.

CLJan 22, 2025Code
Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning

Bohao Yang, Yingji Zhang, Dong Liu et al.

Recent large language models (LLMs) have advanced table understanding capabilities but rely on converting tables into text sequences. While multimodal large language models (MLLMs) enable direct visual processing, they face limitations in handling scientific tables due to fixed input image resolutions and insufficient numerical reasoning capabilities. We present a comprehensive framework for multimodal scientific table understanding and reasoning with dynamic input image resolutions. Our framework consists of three key components: (1) MMSci-Pre, a domain-specific table structure learning dataset of 52K scientific table structure recognition samples, (2) MMSci-Ins, an instruction tuning dataset with 12K samples across three table-based tasks, and (3) MMSci-Eval, a benchmark with 3,114 testing samples specifically designed to evaluate numerical reasoning capabilities. Extensive experiments demonstrate that our domain-specific approach with 52K scientific table images achieves superior performance compared to 150K general-domain tables, highlighting the importance of data quality over quantity. Our proposed table-based MLLMs with dynamic input resolutions show significant improvements in both general table understanding and numerical reasoning capabilities, with strong generalisation to held-out datasets. Our code and data are publicly available at https://github.com/Bernard-Yang/MMSci_Table.

45.2IVApr 30
A Proof-of-Concept Study of Multitask Learning for Cranial Synthetic CT Generation Across Heterogeneous MRI Field Strengths

Zhuoyao Xin, Yiren Zhang, Christopher Wu et al.

Accurate synthesis of computed tomography (CT) images from magnetic resonance imaging (MRI) is clinically valuable for cranial applications such as attenuation correction, radiotherapy planning, and image-guided interventions. However, heterogeneity across MRI field strengths and acquisition protocols limits the generalizability of existing methods. In this study, we formulate cranial CT synthesis as a modular, structurally coupled problem and propose a deep learning framework to improve robustness across heterogeneous MRI conditions. The model is designed to adapt to variations in field strength and imaging protocols while preserving anatomical consistency. Experiments on multi-site datasets demonstrate improved performance and generalization compared with conventional approaches. The proposed method enables reliable CT synthesis across heterogeneous MRI settings, supporting broader clinical translation.

IVMay 20, 2025Code
Neural Video Compression with Context Modulation

Chuanbo Tang, Zhuoyuan Li, Yifan Bian et al.

Efficient video coding is highly dependent on exploiting the temporal redundancy, which is usually achieved by extracting and leveraging the temporal context in the emerging conditional coding-based neural video codec (NVC). Although the latest NVC has achieved remarkable progress in improving the compression performance, the inherent temporal context propagation mechanism lacks the ability to sufficiently leverage the reference information, limiting further improvement. In this paper, we address the limitation by modulating the temporal context with the reference frame in two steps. Specifically, we first propose the flow orientation to mine the inter-correlation between the reference frame and prediction frame for generating the additional oriented temporal context. Moreover, we introduce the context compensation to leverage the oriented context to modulate the propagated temporal context generated from the propagated reference feature. Through the synergy mechanism and decoupling loss supervision, the irrelevant propagated information can be effectively eliminated to ensure better context modeling. Experimental results demonstrate that our codec achieves on average 22.7% bitrate reduction over the advanced traditional video codec H.266/VVC, and offers an average 10.1% bitrate saving over the previous state-of-the-art NVC DCVC-FM. The code is available at https://github.com/Austin4USTC/DCMVC.

IVSep 17, 2024
Few-Shot Domain Adaptation for Learned Image Compression

Tianyu Zhang, Haotian Zhang, Yuqi Li et al.

Learned image compression (LIC) has achieved state-of-the-art rate-distortion performance, deemed promising for next-generation image compression techniques. However, pre-trained LIC models usually suffer from significant performance degradation when applied to out-of-training-domain images, implying their poor generalization capabilities. To tackle this problem, we propose a few-shot domain adaptation method for LIC by integrating plug-and-play adapters into pre-trained models. Drawing inspiration from the analogy between latent channels and frequency components, we examine domain gaps in LIC and observe that out-of-training-domain images disrupt pre-trained channel-wise decomposition. Consequently, we introduce a method for channel-wise re-allocation using convolution-based adapters and low-rank adapters, which are lightweight and compatible with mainstream LIC schemes. Extensive experiments across multiple domains and multiple representative LIC schemes demonstrate that our method significantly enhances pre-trained models, achieving comparable performance to H.266/VVC intra coding with merely 25 target-domain samples. Additionally, our method matches the performance of full-model fine-tuning while transmitting fewer than 2% of the parameters.
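To make the parameter saving of a low-rank adapter concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the layer shape (512 x 512), the rank (4), and all function names are assumptions chosen for illustration, and the abstract's convolution-based adapters are not modeled.

```python
# Illustrative sketch only: layer sizes, rank, and names are assumptions,
# not values taken from the paper.

def full_param_count(d, k):
    """Parameters in a dense d x k weight matrix."""
    return d * k


def lowrank_param_count(d, k, r):
    """Parameters in a rank-r adapter A (d x r) and B (k x r)."""
    return r * (d + k)


def apply_lowrank_update(W, A, B):
    """Return W + A @ B^T using plain nested lists.

    W is d x k, A is d x r, B is k x r. The pretrained W is left untouched;
    only A and B would need to be stored or transmitted.
    """
    d, k, r = len(W), len(W[0]), len(A[0])
    return [
        [W[i][j] + sum(A[i][t] * B[j][t] for t in range(r)) for j in range(k)]
        for i in range(d)
    ]


# Hypothetical 512 x 512 layer adapted with rank 4.
ratio = lowrank_param_count(512, 512, 4) / full_param_count(512, 512)

# Tiny worked example: identity weights plus a rank-1 update.
adapted = apply_lowrank_update([[1, 0], [0, 1]], [[1], [2]], [[3], [4]])
```

Under these assumed sizes the adapter holds 4,096 parameters against 262,144 for the dense layer, which is about 1.6% of the total. The actual fraction in any real model depends on its layer shapes and the chosen rank.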

78.8LGApr 24Code
Accelerating Frequency Domain Diffusion Models with Error-Feedback Event-Driven Caching

Dong Liu, Haisheng Wang, Yanxuan Yu

Diffusion models achieve remarkable success in time series generation. However, slow inference limits their practical deployment. We propose E$^2$-CRF (Error-Feedback Event-Driven Cumulative Residual Feature caching) to accelerate frequency domain diffusion models. Our method exploits two structural properties: (1) spectral localization, where signal energy concentrates in low frequencies, and (2) mirror symmetry, which halves the effective frequency dimension. E$^2$-CRF uses a closed-loop error-feedback system that adaptively caches transformer KV features across diffusion steps. We trigger recomputation using event-driven residual dynamics instead of fixed schedules. Our method selectively recomputes high-energy or rapidly-changing tokens while reusing cached features for stable high-frequency components. E$^2$-CRF achieves a ~2.2x speedup while maintaining sample quality. We demonstrate effectiveness on 5 datasets. Our caching strategy naturally aligns with the diffusion process's structure-to-detail progression. We include sufficient-condition error and complexity bounds under standard regularity assumptions (Appendix), alongside empirical validation. Our code is available at https://github.com/NoakLiu/FastFourierDiffusion and is also integrated in https://github.com/NoakLiu/FastCache-xDiT.
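Two ideas in this abstract can be illustrated with a short pure-Python sketch. The first part checks the mirror (conjugate) symmetry of the DFT of a real signal, which is a standard property and the reason only about half of the frequency bins are independent. The second part is a hypothetical threshold-triggered cache: the class name `EventDrivenCache`, the max-absolute-drift trigger, and the threshold value are assumptions made for illustration and are not taken from the E$^2$-CRF code.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real or complex list."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def independent_bins(n):
    """Number of non-redundant frequency bins for a real signal of length n."""
    return n // 2 + 1


class EventDrivenCache:
    """Hypothetical sketch: reuse a cached result until the input drifts.

    The drift is measured against the input used at the last recomputation,
    so small changes accumulate until they cross the threshold.
    """

    def __init__(self, fn, threshold):
        self.fn = fn
        self.threshold = threshold
        self.ref_input = None
        self.cached = None
        self.recomputes = 0

    def __call__(self, x):
        drifted = self.ref_input is None or (
            max(abs(a - b) for a, b in zip(x, self.ref_input)) > self.threshold
        )
        if drifted:
            # Event fired: refresh the reference input and the cached output.
            self.ref_input = list(x)
            self.cached = self.fn(x)
            self.recomputes += 1
        return self.cached


# Mirror symmetry: for real input, X[k] equals the conjugate of X[n - k].
signal = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, 2.5, 1.5]
spectrum = dft(signal)

# Event-driven reuse: only inputs that drift by more than 0.5 trigger work.
cache = EventDrivenCache(lambda v: [2 * a for a in v], threshold=0.5)
outputs = [cache([v]) for v in (0.0, 0.1, 0.3, 0.7, 0.8)]
```

In this toy run the expensive function is evaluated twice for five inputs. A fixed schedule would instead recompute at predetermined steps regardless of how much the input changed.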

CVFeb 14, 2025Code
Conditional Latent Coding with Learnable Synthesized Reference for Deep Image Compression

Siqi Wu, Yinda Chen, Dong Liu et al.

In this paper, we study how to synthesize a dynamic reference from an external dictionary to perform conditional coding of the input image in the latent domain and how to learn the conditional latent synthesis and coding modules in an end-to-end manner. Our approach begins by constructing a universal image feature dictionary using a multi-stage approach involving modified spatial pyramid pooling, dimension reduction, and multi-scale feature clustering. For each input image, we learn to synthesize a conditioning latent by selecting and synthesizing relevant features from the dictionary, which significantly enhances the model's capability in capturing and exploring image source correlation. This conditional latent synthesis involves a correlation-based feature matching and alignment strategy, comprising a Conditional Latent Matching (CLM) module and a Conditional Latent Synthesis (CLS) module. The synthesized latent is then used to guide the encoding process, allowing for more efficient compression by exploiting the correlation between the input image and the reference dictionary. According to our theoretical analysis, the proposed conditional latent coding (CLC) method is robust to perturbations in the external dictionary samples and the selected conditioning latent, with an error bound that scales logarithmically with the dictionary size, ensuring stability even with large and diverse dictionaries. Experimental results on benchmark datasets show that our new method improves the coding performance by a large margin (up to 1.2 dB) with a very small overhead of approximately 0.5% bits per pixel. Our code is publicly available at https://github.com/ydchen0806/CLC.

IVSep 6, 2024
Diff-INR: Generative Regularization for Electrical Impedance Tomography

Bowen Tong, Junwu Wang, Dong Liu

Electrical Impedance Tomography (EIT) is a non-invasive imaging technique that reconstructs conductivity distributions within a body from boundary measurements. However, EIT reconstruction is an ill-posed nonlinear inverse problem, which makes accurate reconstruction difficult. To tackle this, we propose Diff-INR, a novel method that combines generative regularization with Implicit Neural Representations (INR) through a diffusion model. Diff-INR introduces geometric priors to guide the reconstruction, effectively addressing the shortcomings of traditional regularization methods. By integrating a pre-trained diffusion regularizer with INR, our approach achieves state-of-the-art reconstruction accuracy in both simulation and experimental data. The method demonstrates robust performance across various mesh densities and hyperparameter settings, highlighting its flexibility and efficiency. This advancement represents a significant improvement in managing the ill-posed nature of EIT. Furthermore, the method's principles are applicable to other imaging modalities facing similar challenges with ill-posed inverse problems.

IVJul 25, 2025Code
Learned Image Compression with Hierarchical Progressive Context Modeling

Yuqi Li, Haotian Zhang, Li Li et al.

Context modeling is essential in learned image compression for accurately estimating the distribution of latents. While recent advanced methods have expanded context modeling capacity, they still struggle to efficiently exploit long-range dependency and diverse context information across different coding steps. In this paper, we introduce a novel Hierarchical Progressive Context Model (HPCM) for more efficient context information acquisition. Specifically, HPCM employs a hierarchical coding schedule to sequentially model the contextual dependencies among latents at multiple scales, which enables more efficient long-range context modeling. Furthermore, we propose a progressive context fusion mechanism that incorporates contextual information from previous coding steps into the current step, effectively exploiting diverse contextual information. Experimental results demonstrate that our method achieves state-of-the-art rate-distortion performance and strikes a better balance between compression performance and computational complexity. The code is available at https://github.com/lyq133/LIC-HPCM.

IVMay 8, 2025Code
Augmented Deep Contexts for Spatially Embedded Video Coding

Yifan Bian, Chuanbo Tang, Li Li et al.

Most Neural Video Codecs (NVCs) only employ temporal references to generate temporal-only contexts and latent priors. These temporal-only NVCs fail to handle large motions or emerging objects due to limited contexts and misaligned latent priors. To address these limitations, we propose a Spatially Embedded Video Codec (SEVC), in which the low-resolution video is compressed for spatial references. First, SEVC leverages both spatial and temporal references to generate augmented motion vectors and hybrid spatial-temporal contexts. Second, to address the misalignment issue in the latent prior and enrich the prior information, we introduce a spatial-guided latent prior augmented by multiple temporal latent representations. Finally, we design a joint spatial-temporal optimization to learn quality-adaptive bit allocation for spatial references, further boosting rate-distortion performance. Experimental results show that SEVC effectively alleviates the limitations in handling large motions or emerging objects, and also reduces bitrate by 11.9% more than the previous state-of-the-art NVC while providing an additional low-resolution bitstream. Our code and model are available at https://github.com/EsakaK/SEVC.

DCAug 2, 2025Code
PiKV: KV Cache Management System for Mixture of Experts

Dong Liu, Yanxuan Yu, Ben Lengerich et al.

As large language models continue to scale up in both size and context length, the memory and communication cost of key-value (KV) cache storage has become a major bottleneck in multi-GPU and multi-node inference. While MoE-based architectures sparsify computation across experts, the corresponding KV caches remain dense and globally synchronized, resulting in significant overhead. We introduce PiKV, a parallel and distributed KV cache serving framework tailored for MoE architectures. PiKV leverages expert-sharded KV storage to partition caches across GPUs, PiKV Routing to reduce token-to-KV access, and PiKV Scheduling to adaptively retain query-relevant entries. To further reduce memory usage, PiKV integrates PiKV Compression modules into the caching pipeline for acceleration. PiKV is publicly available as an open-source software library at https://github.com/NoakLiu/PiKV. Experimental details are recorded at https://github.com/NoakLiu/PiKV/blob/main/downstream_tasks/README.md. PiKV is also integrated with NVIDIA kvpress for acceleration; for details, see https://github.com/NoakLiu/PiKVpress. PiKV is still a living project, aiming to become a comprehensive KV cache management system for MoE architectures.
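The idea of partitioning KV entries by expert and retaining only the most relevant ones can be sketched in a few lines of pure Python. This is a toy illustration, not PiKV's API: the class `ExpertShardedKV`, the modulo placement of experts on shards, and the score-based eviction rule are all assumptions introduced here.

```python
from collections import defaultdict


class ExpertShardedKV:
    """Hypothetical sketch of an expert-sharded KV store with bounded retention.

    Each shard stands in for one device. Experts are placed on shards by a
    simple modulo rule, and each expert keeps at most `capacity_per_expert`
    entries, evicting the lowest-scored one when full.
    """

    def __init__(self, num_shards, capacity_per_expert):
        self.num_shards = num_shards
        self.capacity = capacity_per_expert
        self.shards = [defaultdict(list) for _ in range(num_shards)]

    def shard_of(self, expert_id):
        """Assumed placement rule: expert id modulo number of shards."""
        return expert_id % self.num_shards

    def put(self, expert_id, key, value, score):
        """Store an entry with a relevance score, evicting the weakest if full."""
        entries = self.shards[self.shard_of(expert_id)][expert_id]
        entries.append((score, key, value))
        if len(entries) > self.capacity:
            entries.remove(min(entries, key=lambda e: e[0]))

    def get(self, expert_id):
        """Return the retained (key, value) pairs for one expert."""
        entries = self.shards[self.shard_of(expert_id)][expert_id]
        return [(k, v) for _, k, v in entries]


store = ExpertShardedKV(num_shards=2, capacity_per_expert=2)
store.put(3, "k1", "v1", score=0.9)
store.put(3, "k2", "v2", score=0.1)
store.put(3, "k3", "v3", score=0.5)  # evicts the entry scored 0.1
```

A lookup for one expert touches only the shard that owns it, which is the property that avoids global synchronization in this toy model. A real system would also need to handle routing, compression, and cross-device communication, none of which are modeled here.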