CVDec 10, 2025
Efficient Feature Compression for Machines with Global Statistics PreservationMd Eimran Hossain Eimon, Hyomin Choi, Fabien Racapé et al.
The split-inference paradigm divides an artificial intelligence (AI) model into two parts. This necessitates the transfer of intermediate feature data between the two halves. Here, effective compression of the feature data becomes vital. In this paper, we employ Z-score normalization to efficiently recover the compressed feature data at the decoder side. To examine the efficacy of our method, the proposed method is integrated into the latest Feature Coding for Machines (FCM) codec standard under development by the Moving Picture Experts Group (MPEG). Our method supersedes the existing scaling method used by the current standard under development. It both reduces the overhead bits and improves the end-task accuracy. To further reduce the overhead in certain circumstances, we also propose a simplified method. Experiments show that using our proposed method shows 17.09% reduction in bitrate on average across different tasks and up to 65.69% for object tracking without sacrificing the task accuracy.
IVAug 4, 2022
Scalable Video Coding for Humans and MachinesHyomin Choi, Ivan V. Bajić
Video content is watched not only by humans, but increasingly also by machines. For example, machine learning models analyze surveillance video for security and traffic monitoring, search through YouTube videos for inappropriate content, and so on. In this paper, we propose a scalable video coding framework that supports machine vision (specifically, object detection) through its base layer bitstream and human vision via its enhancement layer bitstream. The proposed framework includes components from both conventional and Deep Neural Network (DNN)-based video coding. The results show that on object detection, the proposed framework achieves 13-19% bit savings compared to state-of-the-art video codecs, while remaining competitive in terms of MS-SSIM on the human vision task.
IVMay 4, 2022
Joint Image Compression and Denoising via Latent-Space ScalabilitySaeed Ranjbar Alvar, Mateen Ulhaq, Hyomin Choi et al.
When it comes to image compression in digital cameras, denoising is traditionally performed prior to compression. However, there are applications where image noise may be necessary to demonstrate the trustworthiness of the image, such as court evidence and image forensics. This means that noise itself needs to be coded, in addition to the clean image itself. In this paper, we present a learning-based image compression framework where image denoising and compression are performed jointly. The latent space of the image codec is organized in a scalable manner such that the clean image can be decoded from a subset of the latent space (the base layer), while the noisy image is decoded from the full latent space at a higher rate. Using a subset of the latent space for the denoised image allows denoising to be carried out at a lower rate. Besides providing a scalable representation of the noisy input image, performing denoising jointly with compression makes intuitive sense because noise is hard to compress; hence, compressibility is one of the criteria that may help distinguish noise from the signal. The proposed codec is compared against established compression and denoising benchmarks, and the experiments reveal considerable bitrate savings compared to a cascade combination of a state-of-the-art codec and a state-of-the-art denoiser.
IVFeb 29, 2024Code
Variable-Rate Learned Image Compression with Multi-Objective Optimization and Quantization-Reconstruction OffsetsFatih Kamisli, Fabien Racape, Hyomin Choi
Achieving successful variable bitrate compression with computationally simple algorithms from a single end-to-end learned image or video compression model remains a challenge. Many approaches have been proposed, including conditional auto-encoders, channel-adaptive gains for the latent tensor or uniformly quantizing all elements of the latent tensor. This paper follows the traditional approach to vary a single quantization step size to perform uniform quantization of all latent tensor elements. However, three modifications are proposed to improve the variable rate compression performance. First, multi objective optimization is used for (post) training. Second, a quantization-reconstruction offset is introduced into the quantization operation. Third, variable rate quantization is used also for the hyper latent. All these modifications can be made on a pre-trained single-rate compression model by performing post training. The algorithms are implemented into three well-known image compression models and the achieved variable rate compression results indicate negligible or minimal compression performance loss compared to training multiple models. (Codes will be shared at https://github.com/InterDigitalInc/CompressAI)
CVSep 25, 2025Code
CompressAI-Vision: Open-source software to evaluate compression methods for computer vision tasksHyomin Choi, Heeji Han, Chris Rosewarne et al.
With the increasing use of neural network (NN)-based computer vision applications that process image and video data as input, interest has emerged in video compression technology optimized for computer vision tasks. In fact, given the variety of vision tasks, associated NN models and datasets, a consolidated platform is needed as a common ground to implement and evaluate compression methods optimized for downstream vision tasks. CompressAI-Vision is introduced as a comprehensive evaluation platform where new coding tools compete to efficiently compress the input of vision network while retaining task accuracy in the context of two different inference scenarios: "remote" and "split" inferencing. Our study showcases various use cases of the evaluation platform incorporated with standard codecs (under development) by examining the compression gain on several datasets in terms of bit-rate versus task accuracy. This evaluation platform has been developed as open-source software and is adopted by the Moving Pictures Experts Group (MPEG) for the development the Feature Coding for Machines (FCM) standard. The software is available publicly at https://github.com/InterDigitalInc/CompressAI-Vision.
LGDec 27, 2025
Generalized Regularized Evidential Deep Learning Models: Theory and Comprehensive EvaluationDeep Shankar Pandey, Hyomin Choi, Qi Yu
Evidential deep learning (EDL) models, based on Subjective Logic, introduce a principled and computationally efficient way to make deterministic neural networks uncertainty-aware. The resulting evidential models can quantify fine-grained uncertainty using learned evidence. However, the Subjective-Logic framework constrains evidence to be non-negative, requiring specific activation functions whose geometric properties can induce activation-dependent learning-freeze behavior: a regime where gradients become extremely small for samples mapped into low-evidence regions. We theoretically characterize this behavior and analyze how different evidential activations influence learning dynamics. Building on this analysis, we design a general family of activation functions and corresponding evidential regularizers that provide an alternative pathway for consistent evidence updates across activation regimes. Extensive experiments on four benchmark classification problems (MNIST, CIFAR-10, CIFAR-100, and Tiny-ImageNet), two few-shot classification problems, and blind face restoration problem empirically validate the developed theory and demonstrate the effectiveness of the proposed generalized regularized evidential models.
CVOct 19, 2025
How Universal Are SAM2 Features?Masoud Khairi Atani, Alon Harell, Hyomin Choi et al.
The trade-off between general-purpose foundation vision models and their specialized counterparts is critical for efficient feature coding design and is not yet fully understood. We investigate this trade-off by comparing the feature versatility of the general-purpose Hiera encoder against the segmentation-specialized Segment Anything Model 2 (SAM2). Using a lightweight, trainable neck to probe the adaptability of their frozen features, we quantify the information-theoretic cost of specialization. Our results reveal that while SAM2's specialization is highly effective for spatially-related tasks like depth estimation, it comes at a cost. The specialized SAM2 encoder underperforms its generalist predecessor, Hiera, on conceptually distant tasks such as pose estimation and image captioning, demonstrating a measurable loss of broader semantic information. A novel cross-neck analysis on SAM2 reveals that each level of adaptation creates a further representational bottleneck. Our analysis illuminates these trade-offs in feature universality, providing a quantitative foundation for designing efficient feature coding and adaptation strategies for diverse downstream applications.
IVOct 15, 2025
Dedelayed: Deleting remote inference delay via on-device correctionDan Jacobellis, Mateen Ulhaq, Fabien Racapé et al.
Video comprises the vast majority of bits that are generated daily, and is the primary signal driving current innovations in robotics, remote sensing, and wearable technology. Yet, the most powerful video understanding models are too expensive for the resource-constrained platforms used in these applications. One approach is to offload inference to the cloud; this gives access to GPUs capable of processing high-resolution videos in real time. But even with reliable, high-bandwidth communication channels, the combined latency of video encoding, model inference, and round-trip communication prohibits use for certain real-time applications. The alternative is to use fully local inference; but this places extreme constraints on computational and power costs, requiring smaller models and lower resolution, leading to degraded accuracy. To address these challenges, we propose Dedelayed, a real-time inference system that divides computation between a remote model operating on delayed video frames and a local model with access to the current frame. The remote model is trained to make predictions on anticipated future frames, which the local model incorporates into its prediction for the current frame. The local and remote models are jointly optimized with an autoencoder that limits the transmission bitrate required by the available downlink communication channel. We evaluate Dedelayed on the task of real-time streaming video segmentation using the BDD100k driving dataset. For a round trip delay of 100 ms, Dedelayed improves performance by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference -- an equivalent improvement to using a model ten times larger.
MLFeb 4, 2025
A Scalable Crawling Algorithm Utilizing Noisy Change-Indicating SignalsRóbert Busa-Fekete, Julian Zimmert, András György et al.
Web refresh crawling is the problem of keeping a cache of web pages fresh, that is, having the most recent copy available when a page is requested, given a limited bandwidth available to the crawler. Under the assumption that the change and request events, resp., to each web page follow independent Poisson processes, the optimal scheduling policy was derived by Azar et al. 2018. In this paper, we study an extension of this problem where side information indicating content changes, such as various types of web pings, for example, signals from sitemaps, content delivery networks, etc., is available. Incorporating such side information into the crawling policy is challenging, because (i) the signals can be noisy with false positive events and with missing change events; and (ii) the crawler should achieve a fair performance over web pages regardless of the quality of the side information, which might differ from web page to web page. We propose a scalable crawling algorithm which (i) uses the noisy side information in an optimal way under mild assumptions; (ii) can be deployed without heavy centralized computation; (iii) is able to crawl web pages at a constant total rate without spikes in the total bandwidth usage over any time interval, and automatically adapt to the new optimal solution when the total bandwidth changes without centralized computation. Experiments clearly demonstrate the versatility of our approach.
CVDec 30, 2021
SFU-HW-Tracks-v1: Object Tracking Dataset on Raw Video SequencesTakehiro Tanaka, Hyomin Choi, Ivan V. Bajić
We present a dataset that contains object annotations with unique object identities (IDs) for the High Efficiency Video Coding (HEVC) v1 Common Test Conditions (CTC) sequences. Ground-truth annotations for 13 sequences were prepared and released as the dataset called SFU-HW-Tracks-v1. For each video frame, ground truth annotations include object class ID, object ID, and bounding box location and its dimensions. The dataset can be used to evaluate object tracking performance on uncompressed video sequences and study the relationship between video compression and object tracking.
LGMay 15, 2021
Lightweight Compression of Intermediate Neural Network Features for Collaborative IntelligenceRobert A. Cohen, Hyomin Choi, Ivan V. Bajić
In collaborative intelligence applications, part of a deep neural network (DNN) is deployed on a lightweight device such as a mobile phone or edge device, and the remaining portion of the DNN is processed where more computing resources are available, such as in the cloud. This paper presents a novel lightweight compression technique designed specifically to quantize and compress the features output by the intermediate layer of a split DNN, without requiring any retraining of the network weights. Mathematical models for estimating the clipping and quantization error of ReLU and leaky-ReLU activations at this intermediate layer are developed and used to compute optimal clipping ranges for coarse quantization. We also present a modified entropy-constrained design algorithm for quantizing clipped activations. When applied to popular object-detection and classification DNNs, we were able to compress the 32-bit floating point intermediate activations down to 0.6 to 0.8 bits, while keeping the loss in accuracy to less than 1%. When compared to HEVC, we found that the lightweight codec consistently provided better inference accuracy, by up to 1.3%. The performance and simplicity of this lightweight compression technique makes it an attractive option for coding an intermediate layer of a split neural network for edge/cloud applications.
LGMay 12, 2021
Lightweight compression of neural network feature tensors for collaborative intelligenceRobert A. Cohen, Hyomin Choi, Ivan V. Bajić
In collaborative intelligence applications, part of a deep neural network (DNN) is deployed on a relatively low-complexity device such as a mobile phone or edge device, and the remainder of the DNN is processed where more computing resources are available, such as in the cloud. This paper presents a novel lightweight compression technique designed specifically to code the activations of a split DNN layer, while having a low complexity suitable for edge devices and not requiring any retraining. We also present a modified entropy-constrained quantizer design algorithm optimized for clipped activations. When applied to popular object-detection and classification DNNs, we were able to compress the 32-bit floating point activations down to 0.6 to 0.8 bits, while keeping the loss in accuracy to less than 1%. When compared to HEVC, we found that the lightweight codec consistently provided better inference accuracy, by up to 1.3%. The performance and simplicity of this lightweight compression technique makes it an attractive option for coding a layer's activations in split neural networks for edge/cloud applications.
LGFeb 14, 2020
Back-and-Forth prediction for deep tensor compressionHyomin Choi, Robert A. Cohen, Ivan V. Bajic
Recent AI applications such as Collaborative Intelligence with neural networks involve transferring deep feature tensors between various computing devices. This necessitates tensor compression in order to optimize the usage of bandwidth-constrained channels between devices. In this paper we present a prediction scheme called Back-and-Forth (BaF) prediction, developed for deep feature tensors, which allows us to dramatically reduce tensor size and improve its compressibility. Our experiments with a state-of-the-art object detector demonstrate that the proposed method allows us to significantly reduce the number of bits needed for compressing feature tensors extracted from deep within the model, with negligible degradation of the detection performance and without requiring any retraining of the network weights. We achieve a 62% and 75% reduction in tensor size while keeping the loss in accuracy of the network to less than 1% and 2%, respectively.
IVDec 31, 2018
Deep Frame Prediction for Video CodingHyomin Choi, Ivan V. Bajic
We propose a novel frame prediction method using a deep neural network (DNN), with the goal of improving video coding efficiency. The proposed DNN makes use of decoded frames, at both encoder and decoder, to predict textures of the current coding block. Unlike conventional inter-prediction, the proposed method does not require any motion information to be transferred between the encoder and the decoder. Still, both uni-directional and bi-directional prediction are possible using the proposed DNN, which is enabled by the use of the temporal index channel, in addition to color channels. In this study, we developed a jointly trained DNN for both uni- and bi- directional prediction, as well as separate networks for uni- and bi-directional prediction, and compared the efficacy of both approaches. The proposed DNNs were compared with the conventional motion-compensated prediction in the latest video coding standard, HEVC, in terms of BD-Bitrate. The experiments show that the proposed joint DNN (for both uni- and bi-directional prediction) reduces the luminance bitrate by about 4.4%, 2.4%, and 2.3% in the Low delay P, Low delay, and Random access configurations, respectively. In addition, using the separately trained DNNs brings further bit savings of about 0.3%-0.5%.
IVApr 26, 2018
Near-Lossless Deep Feature Compression for Collaborative IntelligenceHyomin Choi, Ivan V. Bajic
Collaborative intelligence is a new paradigm for efficient deployment of deep neural networks across the mobile-cloud infrastructure. By dividing the network between the mobile and the cloud, it is possible to distribute the computational workload such that the overall energy and/or latency of the system is minimized. However, this necessitates sending deep feature data from the mobile to the cloud in order to perform inference. In this work, we examine the differences between the deep feature data and natural image data, and propose a simple and effective near-lossless deep feature compressor. The proposed method achieves up to 5% bit rate reduction compared to HEVC-Intra and even more against other popular image codecs. Finally, we suggest an approach for reconstructing the input image from compressed deep features in the cloud, that could serve to supplement the inference performed by the deep model.
CVFeb 12, 2018
Deep feature compression for collaborative object detectionHyomin Choi, Ivan V. Bajic
Recent studies have shown that the efficiency of deep neural networks in mobile applications can be significantly improved by distributing the computational workload between the mobile device and the cloud. This paradigm, termed collaborative intelligence, involves communicating feature data between the mobile and the cloud. The efficiency of such approach can be further improved by lossy compression of feature data, which has not been examined to date. In this work we focus on collaborative object detection and study the impact of both near-lossless and lossy compression of feature data on its accuracy. We also propose a strategy for improving the accuracy under lossy feature compression. Experiments indicate that using this strategy, the communication overhead can be reduced by up to 70% without sacrificing accuracy.
IVOct 30, 2017
High efficiency compression for object detectionHyomin Choi, Ivan V. Bajic
Image and video compression has traditionally been tailored to human vision. However, modern applications such as visual analytics and surveillance rely on computers seeing and analyzing the images before (or instead of) humans. For these applications, it is important to adjust compression to computer vision. In this paper we present a bit allocation and rate control strategy that is tailored to object detection. Using the initial convolutional layers of a state-of-the-art object detector, we create an importance map that can guide bit allocation to areas that are important for object detection. The proposed method enables bit rate savings of 7% or more compared to default HEVC, at the equivalent object detection rate.
CVOct 30, 2017
Can you find a face in a HEVC bitstream?Saeed Ranjbar Alvar, Hyomin Choi, Ivan V. Bajic
Finding faces in images is one of the most important tasks in computer vision, with applications in biometrics, surveillance, human-computer interaction, and other areas. In our earlier work, we demonstrated that it is possible to tell whether or not an image contains a face by only examining the HEVC syntax, without fully reconstructing the image. In the present work we move further in this direction by showing how to localize faces in HEVC-coded images, without full reconstruction. We also demonstrate the benefits that such approach can have in privacy-friendly face localization.
CVSep 9, 2017
Can you tell a face from a HEVC bitstream?Saeed Ranjbar Alvar, Hyomin Choi, Ivan V. Bajic
Image and video analytics are being increasingly used on a massive scale. Not only is the amount of data growing, but the complexity of the data processing pipelines is also increasing, thereby exacerbating the problem. It is becoming increasingly important to save computational resources wherever possible. We focus on one of the poster problems of visual analytics -- face detection -- and approach the issue of reducing the computation by asking: Is it possible to detect a face without full image reconstruction from the High Efficiency Video Coding (HEVC) bitstream? We demonstrate that this is indeed possible, with accuracy comparable to conventional face detection, by training a Convolutional Neural Network on the output of the HEVC entropy decoder.