CVMar 13, 2023Code
Efficient Semantic Segmentation by Altering Resolutions for Compressed VideosYubin Hu, Yuze He, Yanghao Li et al.
Video semantic segmentation (VSS) is a computationally expensive task due to the per-frame prediction for videos of high frame rates. In recent work, compact models or adaptive network strategies have been proposed for efficient VSS. However, they did not consider a crucial factor that affects the computational cost from the input side: the input resolution. In this paper, we propose an altering resolution framework called AR-Seg for compressed videos to achieve efficient VSS. AR-Seg aims to reduce the computational cost by using low resolution for non-keyframes. To prevent the performance degradation caused by downsampling, we design a Cross Resolution Feature Fusion (CReFF) module, and supervise it with a novel Feature Similarity Training (FST) strategy. Specifically, CReFF first makes use of motion vectors stored in a compressed video to warp features from high-resolution keyframes to low-resolution non-keyframes for better spatial alignment, and then selectively aggregates the warped features with local attention mechanism. Furthermore, the proposed FST supervises the aggregated features with high-resolution features through an explicit similarity loss and an implicit constraint from the shared decoding layer. Extensive experiments on CamVid and Cityscapes show that AR-Seg achieves state-of-the-art performance and is compatible with different segmentation backbones. On CamVid, AR-Seg saves 67% computational cost (measured in GFLOPs) with the PSPNet18 backbone while maintaining high segmentation accuracy. Code: https://github.com/THU-LYJ-Lab/AR-Seg.
CVSep 14, 2023
Judging a video by its bitstream coverYuxing Han, Yunan Ding, Jiangtao Wen et al.
Classifying videos into distinct categories, such as Sport and Music Video, is crucial for multimedia understanding and retrieval, especially in an age where an immense volume of video content is constantly being generated. Traditional methods require video decompression to extract pixel-level features like color, texture, and motion, thereby increasing computational and storage demands. Moreover, these methods often suffer from performance degradation in low-quality videos. We present a novel approach that examines only the post-compression bitstream of a video to perform classification, eliminating the need for bitstream. We validate our approach using a custom-built data set comprising over 29,000 YouTube video clips, totaling 6,000 hours and spanning 11 distinct categories. Our preliminary evaluations indicate precision, accuracy, and recall rates well over 80%. The algorithm operates approximately 15,000 times faster than real-time for 30fps videos, outperforming traditional Dynamic Time Warping (DTW) algorithm by six orders of magnitude.
LGMay 31, 2025Code
It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMsJun Wu, Yirong Xiong, Jiangtao Wen et al.
Despite rapid advancements in the research and deployment of large language models (LLMs), the statistical distribution of model parameters, as well as their influence on initialization, training dynamics, and downstream efficiency, has received surprisingly little attention. A recent work introduced BackSlash, a training-time compression algorithm. It first demonstrated that pre-trained LLM parameters follow generalized Gaussian distributions (GGDs) better. By optimizing GG priors during training, BackSlash can reduce parameters by up to 90\% with minimal performance loss. Building on this foundational insight, we propose a unified, end-to-end framework for LLM optimization based on the GG model. Our contributions are threefold: (1) GG-based initialization scheme that aligns with the statistical structure of trained models, resulting in faster convergence and improved accuracy; (2) DeepShape, a post-training regularization method that reshapes weight distributions to match a GG profile, improving compressibility with minimized degradation in performance; and (3) RF8, a compact and hardware-efficient 8-bit floating-point format designed for GG-distributed-initialized BackSlash training, enabling low-cost inference without compromising accuracy. Experiments across diverse model architectures show that our framework consistently yields smaller and faster models that match or outperform standard training baselines. By grounding LLM development in principled statistical modeling, this work forges a new path toward efficient, scalable, and hardware-aware AI systems. The code is available on our project page: https://huggingface.co/spaces/shifeng3711/gg_prior.
LGApr 23, 2025
BackSlash: Rate Constrained Optimized Training of Large Language ModelsJun Wu, Jiangtao Wen, Yuxing Han
The rapid advancement of large-language models (LLMs) has driven extensive research into parameter compression after training has been completed, yet compression during the training phase remains largely unexplored. In this work, we introduce Rate-Constrained Training (BackSlash), a novel training-time compression approach based on rate-distortion optimization (RDO). BackSlash enables a flexible trade-off between model accuracy and complexity, significantly reducing parameter redundancy while preserving performance. Experiments in various architectures and tasks demonstrate that BackSlash can reduce memory usage by 60% - 90% without accuracy loss and provides significant compression gain compared to compression after training. Moreover, BackSlash proves to be highly versatile: it enhances generalization with small Lagrange multipliers, improves model robustness to pruning (maintaining accuracy even at 80% pruning rates), and enables network simplification for accelerated inference on edge devices.
ROSep 20, 2025
ReSeFlow: Rectifying SE(3)-Equivariant Policy Learning FlowsZhitao Wang, Yanke Wang, Jiangtao Wen et al.
Robotic manipulation in unstructured environments requires the generation of robust and long-horizon trajectory-level policy with conditions of perceptual observations and benefits from the advantages of SE(3)-equivariant diffusion models that are data-efficient. However, these models suffer from the inference time costs. Inspired by the inference efficiency of rectified flows, we introduce the rectification to the SE(3)-diffusion models and propose the ReSeFlow, i.e., Rectifying SE(3)-Equivariant Policy Learning Flows, providing fast, geodesic-consistent, least-computational policy generation. Crucially, both components employ SE(3)-equivariant networks to preserve rotational and translational symmetry, enabling robust generalization under rigid-body motions. With the verification on the simulated benchmarks, we find that the proposed ReSeFlow with only one inference step can achieve better performance with lower geodesic distance than the baseline methods, achieving up to a 48.5% error reduction on the painting task and a 21.9% reduction on the rotating triangle task compared to the baseline's 100-step inference. This method takes advantages of both SE(3) equivariance and rectified flow and puts it forward for the real-world application of generative policy learning models with the data and inference efficiency.
CVAug 8, 2025
Efficient Bayer-Domain Video Computer Vision with Fast Motion Estimation and Learned Perception ResidualHaichao Wang, Jiangtao Wen, Yuxing Han
Video computer vision systems face substantial computational burdens arising from two fundamental challenges: eliminating unnecessary processing and reducing temporal redundancy in back-end inference while maintaining accuracy with minimal extra computation. To address these issues, we propose an efficient video computer vision framework that jointly optimizes both the front end and back end of the pipeline. On the front end, we remove the traditional image signal processor (ISP) and feed Bayer raw measurements directly into Bayer-domain vision models, avoiding costly human-oriented ISP operations. On the back end, we introduce a fast and highly parallel motion estimation algorithm that extracts inter-frame temporal correspondence to avoid redundant computation. To mitigate artifacts caused by motion inaccuracies, we further employ lightweight perception residual networks that directly learn perception-level residuals and refine the propagated features. Experiments across multiple models and tasks demonstrate that our system achieves substantial acceleration with only minor performance degradation.
CVApr 16, 2025
Flow Intelligence: Robust Feature Matching via Temporal Signature CorrelationJie Wang, Chen Ye Gan, Caoqi Wei et al.
Feature matching across video streams remains a cornerstone challenge in computer vision. Increasingly, robust multimodal matching has garnered interest in robotics, surveillance, remote sensing, and medical imaging. While traditional rely on detecting and matching spatial features, they break down when faced with noisy, misaligned, or cross-modal data. Recent deep learning methods have improved robustness through learned representations, but remain constrained by their dependence on extensive training data and computational demands. We present Flow Intelligence, a paradigm-shifting approach that moves beyond spatial features by focusing on temporal motion patterns exclusively. Instead of detecting traditional keypoints, our method extracts motion signatures from pixel blocks across consecutive frames and extract temporal motion signatures between videos. These motion-based descriptors achieve natural invariance to translation, rotation, and scale variations while remaining robust across different imaging modalities. This novel approach also requires no pretraining data, eliminates the need for spatial feature detection, enables cross-modal matching using only temporal motion, and it outperforms existing methods in challenging scenarios where traditional approaches fail. By leveraging motion rather than appearance, Flow Intelligence enables robust, real-time video feature matching in diverse environments.
CVJan 25, 2025
Vision without Images: End-to-End Computer Vision from Single Compressive MeasurementsFengpu Pan, Heting Gao, Jiangtao Wen et al.
Snapshot Compressed Imaging (SCI) offers high-speed, low-bandwidth, and energy-efficient image acquisition, but remains challenged by low-light and low signal-to-noise ratio (SNR) conditions. Moreover, practical hardware constraints in high-resolution sensors limit the use of large frame-sized masks, necessitating smaller, hardware-friendly designs. In this work, we present a novel SCI-based computer vision framework using pseudo-random binary masks of only 8$\times$8 in size for physically feasible implementations. At its core is CompDAE, a Compressive Denoising Autoencoder built on the STFormer architecture, designed to perform downstream tasks--such as edge detection and depth estimation--directly from noisy compressive raw pixel measurements without image reconstruction. CompDAE incorporates a rate-constrained training strategy inspired by BackSlash to promote compact, compressible models. A shared encoder paired with lightweight task-specific decoders enables a unified multi-task platform. Extensive experiments across multiple datasets demonstrate that CompDAE achieves state-of-the-art performance with significantly lower complexity, especially under ultra-low-light conditions where traditional CMOS and SCI pipelines fail.
CVJan 25, 2025
Leveraging Motion Estimation for Efficient Bayer-Domain Computer VisionHaichao Wang, Xinyue Xi, Jiangtao Wen et al.
Existing computer vision processing pipeline acquires visual information using an image sensor that captures pixel information in the Bayer pattern. The raw sensor data are then processed using an image signal processor (ISP) that first converts Bayer pixel data to RGB on a pixel by pixel basis, followed by video convolutional network (VCN) processing on a frame by frame basis. Both ISP and VCN are computationally expensive with high power consumption and latency. In this paper, we propose a novel framework that eliminates the ISP and leverages motion estimation to accelerate video vision tasks directly in the Bayer domain. We introduce Motion Estimation-based Video Convolution (MEVC), which integrates sliding-window motion estimation into each convolutional layer, enabling prediction and residual-based refinement that reduces redundant computations across frames. This design bridges the structural gap between block-based motion estimation and spatial convolution, enabling accurate, low-cost processing. Our end-to-end pipeline supports raw Bayer input and achieves over 70\% reduction in FLOPs with minimal accuracy degradation across video semantic segmentation, depth estimation, and object detection benchmarks, using both synthetic Bayer-converted and real Bayer video datasets. This framework generalizes across convolution-based models and marks the first effective reuse of motion estimation for accelerating video computer vision directly from raw sensor data.
CVMar 13, 2024
Leveraging Compressed Frame Sizes For Ultra-Fast Video ClassificationYuxing Han, Yunan Ding, Chen Ye Gan et al.
Classifying videos into distinct categories, such as Sport and Music Video, is crucial for multimedia understanding and retrieval, especially when an immense volume of video content is being constantly generated. Traditional methods require video decompression to extract pixel-level features like color, texture, and motion, thereby increasing computational and storage demands. Moreover, these methods often suffer from performance degradation in low-quality videos. We present a novel approach that examines only the post-compression bitstream of a video to perform classification, eliminating the need for bitstream decoding. To validate our approach, we built a comprehensive data set comprising over 29,000 YouTube video clips, totaling 6,000 hours and spanning 11 distinct categories. Our evaluations indicate precision, accuracy, and recall rates consistently above 80%, many exceeding 90%, and some reaching 99%. The algorithm operates approximately 15,000 times faster than real-time for 30fps videos, outperforming traditional Dynamic Time Warping (DTW) algorithm by seven orders of magnitude.
DLJan 9, 2022
Phocus: Picking Valuable Research from a Sea of CitationsXinrong Zhang, Zihou Ren, Xi Li et al.
The deluge of new papers has significantly blocked the development of academics, which is mainly caused by author-level and publication-level evaluation metrics that only focus on quantity. Those metrics have resulted in several severe problems that trouble scholars focusing on the important research direction for a long time and even promote an impetuous academic atmosphere. To solve those problems, we propose Phocus, a novel academic evaluation mechanism for authors and papers. Phocus analyzes the sentence containing a citation and its contexts to predict the sentiment towards the corresponding reference. Combining others factors, Phocus classifies citations coarsely, ranks all references within a paper, and utilizes the results of the classifier and the ranking model to get the local influential factor of a reference to the citing paper. The global influential factor of the reference to the citing paper is the product of the local influential factor and the total influential factor of the citing paper. Consequently, an author's academic influential factor is the sum of his contributions to each paper he co-authors.
CVMar 10, 2021
Novel tile segmentation scheme for omnidirectional videoJisheng Li, Ziyu Wen, Sihan Li et al.
Regular omnidirectional video encoding technics use map projection to flatten a scene from a spherical shape into one or several 2D shapes. Common projection methods including equirectangular and cubic projection have varying levels of interpolation that create a large number of non-information-carrying pixels that lead to wasted bitrate. In this paper, we propose a tile based omnidirectional video segmentation scheme which can save up to 28% of pixel area and 20% of BD-rate averagely compared to the traditional equirectangular projection based approach.
IVMar 10, 2021
Learning to Estimate Kernel Scale and Orientation of Defocus Blur with Asymmetric Coded ApertureJisheng Li, Qi Dai, Jiangtao Wen
Consistent in-focus input imagery is an essential precondition for machine vision systems to perceive the dynamic environment. A defocus blur severely degrades the performance of vision systems. To tackle this problem, we propose a deep-learning-based framework estimating the kernel scale and orientation of the defocus blur to adjust lens focus rapidly. Our pipeline utilizes 3D ConvNet for a variable number of input hypotheses to select the optimal slice from the input stack. We use random shuffle and Gumbel-softmax to improve network performance. We also propose to generate synthetic defocused images with various asymmetric coded apertures to facilitate training. Experiments are conducted to demonstrate the effectiveness of our framework.
CVMar 10, 2021
Learning to compose 6-DoF omnidirectional videos using multi-sphere imagesJisheng Li, Yuze He, Yubin Hu et al.
Omnidirectional video is an essential component of Virtual Reality. Although various methods have been proposed to generate content that can be viewed with six degrees of freedom (6-DoF), existing systems usually involve complex depth estimation, image in-painting or stitching pre-processing. In this paper, we propose a system that uses a 3D ConvNet to generate a multi-sphere images (MSI) representation that can be experienced in 6-DoF VR. The system utilizes conventional omnidirectional VR camera footage directly without the need for a depth map or segmentation mask, thereby significantly simplifying the overall complexity of the 6-DoF omnidirectional video composition. By using a newly designed weighted sphere sweep volume (WSSV) fusing technique, our approach is compatible with most panoramic VR camera setups. A ground truth generation approach for high-quality artifact-free 6-DoF contents is proposed and can be used by the research and development community for 6-DoF content generation.
CVAug 8, 2020
Online Multi-modal Person Search in VideosJiangyue Xia, Anyi Rao, Qingqiu Huang et al.
The task of searching certain people in videos has seen increasing potential in real-world applications, such as video organization and editing. Most existing approaches are devised to work in an offline manner, where identities can only be inferred after an entire video is examined. This working manner precludes such methods from being applied to online services or those applications that require real-time responses. In this paper, we propose an online person search framework, which can recognize people in a video on the fly. This framework maintains a multimodal memory bank at its heart as the basis for person recognition, and updates it dynamically with a policy obtained by reinforcement learning. Our experiments on a large movie dataset show that the proposed method is effective, not only achieving remarkable improvements over online schemes but also outperforming offline methods.
CVJul 7, 2020
Learning Model-Blind Temporal Denoisers without Ground TruthsYanghao Li, Bichuan Guo, Jiangtao Wen et al.
Denoisers trained with synthetic data often fail to cope with the diversity of unknown noises, giving way to methods that can adapt to existing noise without knowing its ground truth. Previous image-based method leads to noise overfitting if directly applied to video denoisers, and has inadequate temporal information management especially in terms of occlusion and lighting variation, which considerably hinders its denoising performance. In this paper, we propose a general framework for video denoising networks that successfully addresses these challenges. A novel twin sampler assembles training data by decoupling inputs from targets without altering semantics, which not only effectively solves the noise overfitting problem, but also generates better occlusion masks efficiently by checking optical flow consistency. An online denoising scheme and a warping loss regularizer are employed for better temporal alignment. Lighting variation is quantified based on the local similarity of aligned frames. Our method consistently outperforms the prior art by 0.6-3.2dB PSNR on multiple noises, datasets and network architectures. State-of-the-art results on reducing model-blind video noises are achieved. Extensive ablation studies are conducted to demonstrate the significance of each technical components.
MMNov 2, 2019
A Generalized Rate-Distortion-$λ$ Model Based HEVC Rate Control AlgorithmMinhao Tang, Jiangtao Wen, Yuxing Han
The High Efficiency Video Coding (HEVC/H.265) standard doubles the compression efficiency of the widely used H.264/AVC standard. For practical applications, rate control (RC) algorithms for HEVC need to be developed. Based on the R-Q, R-$ρ$ or R-$λ$ models, rate control algorithms aim at encoding a video clip/segment to a target bit rate accurately with high video quality after compression. Among the various models used by HEVC rate control algorithms, the R-$λ$ model performs the best in both coding efficiency and rate control accuracy. However, compared with encoding with a fixed quantization parameter (QP), even the best rate control algorithm [1] still under-performs when comparing the video quality achieved at identical average bit rates. In this paper, we propose a novel generalized rate-distortion-$λ$ (R-D-$λ$) model for the relationship between rate (R), distortion (D) and the Lagrangian multiplier ($λ$) in rate-distortion (RD) optimized encoding. In addition to the well designed hierarchical initialization and coefficient update scheme, a new model based rate allocation scheme composed of amortization, smooth window and consistency control is proposed for a better rate allocation. Experimental results implementing the proposed algorithm in the HEVC reference software HM-16.9 show that the proposed rate control algorithm is able to achieve an average of BDBR saving of 6.09%, 3.15% and 4.03% for random access (RA), low delay P (LDP) and low delay B (LDB) configurations respectively as compared with the R-$λ$ model based RC algorithm [1] implemented in HM. The proposed algorithm also outperforms the state-of-the-art algorithms, while rate control accuracy and encoding speed are hardly impacted.
CLAug 19, 2019
Long and Diverse Text Generation with Planning-based Hierarchical Variational ModelZhihong Shao, Minlie Huang, Jiangtao Wen et al.
Existing neural methods for data-to-text generation are still struggling to produce long and diverse texts: they are insufficient to model input data dynamically during generation, to capture inter-sentence coherence, or to generate diversified expressions. To address these issues, we propose a Planning-based Hierarchical Variational Model (PHVM). Our model first plans a sequence of groups (each group is a subset of input items to be covered by a sentence) and then realizes each sentence conditioned on the planning result and the previously generated context, thereby decomposing long text generation into dependent sentence generation sub-tasks. To capture expression diversity, we devise a hierarchical latent structure where a global planning latent variable models the diversity of reasonable planning and a sequence of local latent variables controls sentence realization. Experiments show that our model outperforms state-of-the-art baselines in long and diverse text generation.
MMAug 2, 2018
Two-pass Light Field Image Compression for Spatial Quality and Angular ConsistencyBichuan Guo, Jiangtao Wen, Yuxing Han
The quality assessment of light field images presents new challenges to conventional compression methods, as the spatial quality is affected by the optical distortion of capturing devices, and the angular consistency affects the performance of dynamic rendering applications. In this paper, we propose a two-pass encoding system for pseudo-temporal sequence based light field image compression with a novel frame level bit allocation framework that optimizes spatial quality and angular consistency simultaneously. Frame level rate-distortion models are estimated during the first pass, and the second pass performs the actual encoding with optimized bit allocations given by a two-step convex programming. The proposed framework supports various encoder configurations. Experimental results show that comparing to the anchor HM 16.16 (HEVC reference software), the proposed two-pass encoding system on average achieves 11.2% to 11.9% BD-rate reductions for the all-intra configuration, 15.8% to 32.7% BD-rate reductions for the random-access configuration, and 12.1% to 15.7% BD-rate reductions for the low-delay configuration. The resulting bit errors are limited, and the total time cost is less than twice of the one-pass anchor. Comparing with our earlier low-delay configuration based method, the proposed system improves BD-rate reduction by 3.1% to 8.3%, reduces the bit errors by more than 60%, and achieves more than 12x speed up.
MMJul 14, 2018
Fast Block Structure Determination in AV1-based Multiple Resolutions Video EncodingBichuan Guo, Yuxing Han, Jiangtao Wen
The widely used adaptive HTTP streaming requires an efficient algorithm to encode the same video to different resolutions. In this paper, we propose a fast block structure determination algorithm based on the AV1 codec that accelerates high resolution encoding, which is the bottle-neck of multiple resolutions encoding. The block structure similarity across resolutions is modeled by the fineness of frame detail and scale of object motions, this enables us to accelerate high resolution encoding based on low resolution encoding results. The average depth of a block's co-located neighborhood is used to decide early termination in the RDO process. Encoding results show that our proposed algorithm reduces encoding time by 30.1%-36.8%, while keeping BD-rate low at 0.71%-1.04%. Comparing to the state-of-the-art, our method halves performance loss without sacrificing time savings.
MMJul 14, 2018
Convex Optimization Based Bit Allocation for Light Field Compression under Weighting and Consistency ConstraintsBichuan Guo, Yuxing Han, Jiangtao Wen
Compared with conventional image and video, light field images introduce the weight channel, as well as the visual consistency of rendered view, information that has to be taken into account when compressing the pseudo-temporal-sequence (PTS) created from light field images. In this paper, we propose a novel frame level bit allocation framework for PTS coding. A joint model that measures weighted distortion and visual consistency, combined with an iterative encoding system, yields the optimal bit allocation for each frame by solving a convex optimization problem. Experimental results show that the proposed framework is effective in producing desired distortion distribution based on weights, and achieves up to 24.7% BD-rate reduction comparing to the default rate control algorithm.
MMJul 14, 2018
A Bayesian Approach to Block Structure Inference in AV1-based Multi-rate Video EncodingBichuan Guo, Xinyao Chen, Jiawen Gu et al.
Due to differences in frame structure, existing multi-rate video encoding algorithms cannot be directly adapted to encoders utilizing special reference frames such as AV1 without introducing substantial rate-distortion loss. To tackle this problem, we propose a novel bayesian block structure inference model inspired by a modification to an HEVC-based algorithm. It estimates the posterior probabilistic distributions of block partitioning, and adapts early terminations in the RDO procedure accordingly. Experimental results show that the proposed method provides flexibility for controlling the tradeoff between speed and coding efficiency, and can achieve an average time saving of 36.1% (up to 50.6%) with negligible bitrate cost.