Ivan V. Bajic

CV
h-index9
22papers
631citations
Novelty40%
AI Score46

22 Papers

CVSep 21, 2022
Rate-Distortion in Image Coding for Machines

Alon Harell, Anderson De Andrade, Ivan V. Bajic

In recent years, there has been a sharp increase in transmission of images to remote servers specifically for the purpose of computer vision. In many applications, such as surveillance, images are mostly transmitted for automated analysis, and rarely seen by humans. Using traditional compression for this scenario has been shown to be inefficient in terms of bit-rate, likely due to the focus on human based distortion metrics. Thus, it is important to create specific image coding methods for joint use by humans and machines. One way to create the machine side of such a codec is to perform feature matching of some intermediate layer in a Deep Neural Network performing the machine task. In this work, we explore the effects of the layer choice used in training a learnable codec for humans and machines. We prove, using the data processing inequality, that matching features from deeper layers is preferable in the sense of rate-distortion. Next, we confirm our findings empirically by re-training an existing model for scalable human-machine coding. In our experiments we show the trade-off between the human and machine sides of such a scalable model, and discuss the benefit of using deeper layers for training in that regard.

SPAug 18, 2022
Efficient Signed Graph Sampling via Balancing & Gershgorin Disc Perfect Alignment

Chinthaka Dinesh, Gene Cheung, Saghar Bagheri et al.

A basic premise in graph signal processing (GSP) is that a graph encoding pairwise (anti-)correlations of the targeted signal as edge weights is exploited for graph filtering. However, existing fast graph sampling schemes are designed and tested only for positive graphs describing positive correlations. In this paper, we show that for datasets with strong inherent anti-correlations, a suitable graph contains both positive and negative edge weights. In response, we propose a linear-time signed graph sampling method centered on the concept of balanced signed graphs. Specifically, given an empirical covariance data matrix $\bar{\bf{C}}$, we first learn a sparse inverse matrix (graph Laplacian) $\mathcal{L}$ corresponding to a signed graph $\mathcal{G}$. We define the eigenvectors of Laplacian $\mathcal{L}_B$ for a balanced signed graph $\mathcal{G}_B$ -- approximating $\mathcal{G}$ via edge weight augmentation -- as graph frequency components. Next, we choose samples to minimize the low-pass filter reconstruction error in two steps. We first align all Gershgorin disc left-ends of Laplacian $\mathcal{L}_B$ at smallest eigenvalue $λ_{\min}(\mathcal{L}_B)$ via similarity transform $\mathcal{L}_p = §\mathcal{L}_B §^{-1}$, leveraging a recent linear algebra theorem called Gershgorin disc perfect alignment (GDPA). We then perform sampling on $\mathcal{L}_p$ using a previous fast Gershgorin disc alignment sampling (GDAS) scheme. Experimental results show that our signed graph sampling method outperformed existing fast sampling schemes noticeably on various datasets.

CVOct 19, 2022
Understanding Key Point Cloud Features for Development Three-dimensional Adversarial Attacks

Hanieh Naderi, Chinthaka Dinesh, Ivan V. Bajic et al.

Adversarial attacks pose serious challenges for deep neural network (DNN)-based analysis of various input signals. In the case of three-dimensional point clouds, methods have been developed to identify points that play a key role in network decision, and these become crucial in generating existing adversarial attacks. For example, a saliency map approach is a popular method for identifying adversarial drop points, whose removal would significantly impact the network decision. This paper seeks to enhance the understanding of three-dimensional adversarial attacks by exploring which point cloud features are most important for predicting adversarial points. Specifically, Fourteen key point cloud features such as edge intensity and distance from the centroid are defined, and multiple linear regression is employed to assess their predictive power for adversarial points. Based on critical feature selection insights, a new attack method has been developed to evaluate whether the selected features can generate an attack successfully. Unlike traditional attack methods that rely on model-specific vulnerabilities, this approach focuses on the intrinsic characteristics of the point clouds themselves. It is demonstrated that these features can predict adversarial points across four different DNN architectures, Point Network (PointNet), PointNet++, Dynamic Graph Convolutional Neural Networks (DGCNN), and Point Convolutional Network (PointConv) outperforming random guessing and achieving results comparable to saliency map-based attacks. This study has important engineering applications, such as enhancing the security and robustness of three-dimensional point cloud-based systems in fields like robotics and autonomous driving.

91.5IVMar 31Code
Prompt-Guided Prefiltering for VLM Image Compression

Bardia Azizian, Ivan V. Bajic

The rapid progress of large Vision-Language Models (VLMs) has enabled a wide range of applications, such as image understanding and Visual Question Answering (VQA). Query images are often uploaded to the cloud, where VLMs are typically hosted, hence efficient image compression becomes crucial. However, traditional human-centric codecs are suboptimal in this setting because they preserve many task-irrelevant details. Existing Image Coding for Machines (ICM) methods also fall short, as they assume a fixed set of downstream tasks and cannot adapt to prompt-driven VLMs with an open-ended variety of objectives. We propose a lightweight, plug-and-play, prompt-guided prefiltering module to identify image regions most relevant to the text prompt, and consequently to the downstream task. The module preserves important details while smoothing out less relevant areas to improve compression efficiency. It is codec-agnostic and can be applied before conventional and learned encoders. Experiments on several VQA benchmarks show that our approach achieves a 25-50% average bitrate reduction while maintaining the same task accuracy. Our source code is available at https://github.com/bardia-az/pgp-vlm-compression.

CVJan 26
Smart Split-Federated Learning over Noisy Channels for Embryo Image Segmentation

Zahra Hafezi Kafshgari, Ivan V. Bajic, Parvaneh Saeedi

Split-Federated (SplitFed) learning is an extension of federated learning that places minimal requirements on the clients computing infrastructure, since only a small portion of the overall model is deployed on the clients hardware. In SplitFed learning, feature values, gradient updates, and model updates are transferred across communication channels. In this paper, we study the effects of noise in the communication channels on the learning process and the quality of the final model. We propose a smart averaging strategy for SplitFed learning with the goal of improving resilience against channel noise. Experiments on a segmentation model for embryo images shows that the proposed smart averaging strategy is able to tolerate two orders of magnitude stronger noise in the communication channels compared to conventional averaging, while still maintaining the accuracy of the final model.

CVOct 19, 2025
How Universal Are SAM2 Features?

Masoud Khairi Atani, Alon Harell, Hyomin Choi et al.

The trade-off between general-purpose foundation vision models and their specialized counterparts is critical for efficient feature coding design and is not yet fully understood. We investigate this trade-off by comparing the feature versatility of the general-purpose Hiera encoder against the segmentation-specialized Segment Anything Model 2 (SAM2). Using a lightweight, trainable neck to probe the adaptability of their frozen features, we quantify the information-theoretic cost of specialization. Our results reveal that while SAM2's specialization is highly effective for spatially-related tasks like depth estimation, it comes at a cost. The specialized SAM2 encoder underperforms its generalist predecessor, Hiera, on conceptually distant tasks such as pose estimation and image captioning, demonstrating a measurable loss of broader semantic information. A novel cross-neck analysis on SAM2 reveals that each level of adaptation creates a further representational bottleneck. Our analysis illuminates these trade-offs in feature universality, providing a quantitative foundation for designing efficient feature coding and adaptation strategies for diverse downstream applications.

IVMay 17, 2023
VVC+M: Plug and Play Scalable Image Coding for Humans and Machines

Alon Harell, Yalda Foroutan, Ivan V. Bajic

Compression for machines is an emerging field, where inputs are encoded while optimizing the performance of downstream automated analysis. In scalable coding for humans and machines, the compressed representation used for machines is further utilized to enable input reconstruction. Often performed by jointly optimizing the compression scheme for both machine task and human perception, this results in sub-optimal rate-distortion (RD) performance for the machine side. We focus on the case of images, proposing to utilize the pre-existing residual coding capabilities of video codecs such as VVC to create a scalable codec from any image compression for machines (ICM) scheme. Using our approach we improve an existing scalable codec to achieve superior RD performance on the machine task, while remaining competitive for human perception. Moreover, our approach can be trained post-hoc for any given ICM scheme, and without creating a coupling between the quality of the machine analysis and human vision.

AISep 16, 2020
Exploring Bayesian Surprise to Prevent Overfitting and to Predict Model Performance in Non-Intrusive Load Monitoring

Richard Jones, Christoph Klemenjak, Stephen Makonin et al.

Non-Intrusive Load Monitoring (NILM) is a field of research focused on segregating constituent electrical loads in a system based only on their aggregated signal. Significant computational resources and research time are spent training models, often using as much data as possible, perhaps driven by the preconception that more data equates to more accurate models and better performing algorithms. When has enough prior training been done? When has a NILM algorithm encountered new, unseen data? This work applies the notion of Bayesian surprise to answer these questions which are important for both supervised and unsupervised algorithms. We quantify the degree of surprise between the predictive distribution (termed postdictive surprise), as well as the transitional probabilities (termed transitional surprise), before and after a window of observations. We compare the performance of several benchmark NILM algorithms supported by NILMTK, in order to establish a useful threshold on the two combined measures of surprise. We validate the use of transitional surprise by exploring the performance of a popular Hidden Markov Model as a function of surprise threshold. Finally, we explore the use of a surprise threshold as a regularization technique to avoid overfitting in cross-dataset performance. Although the generality of the specific surprise threshold discussed herein may be suspect without further testing, this work provides clear evidence that a point of diminishing returns of model performance with respect to dataset size exists. This has implications for future model development, dataset acquisition, as well as aiding in model flexibility during deployment.

SPJul 20, 2020
PowerGAN: Synthesizing Appliance Power Signatures Using Generative Adversarial Networks

Alon Harell, Richard Jones, Stephen Makonin et al.

Non-intrusive load monitoring (NILM) allows users and energy providers to gain insight into home appliance electricity consumption using only the building's smart meter. Most current techniques for NILM are trained using significant amounts of labeled appliances power data. The collection of such data is challenging, making data a major bottleneck in creating well generalizing NILM solutions. To help mitigate the data limitations, we present the first truly synthetic appliance power signature generator. Our solution, PowerGAN, is based on conditional, progressively growing, 1-D Wasserstein generative adversarial network (GAN). Using PowerGAN, we are able to synthesise truly random and realistic appliance power data signatures. We evaluate the samples generated by PowerGAN in a qualitative way as well as numerically by using traditional GAN evaluation methods such as the Inception score.

MMMar 6, 2020
Soft Video Multicasting Using Adaptive Compressed Sensing

Hadi Hadizadeh, Ivan V. bajic

Recently, soft video multicasting has gained a lot of attention, especially in broadcast and mobile scenarios where the bit rate supported by the channel may differ across receivers, and may vary quickly over time. Unlike the conventional designs that force the source to use a single bit rate according to the receiver with the worst channel quality, soft video delivery schemes transmit the video such that the video quality at each receiver is commensurate with its specific instantaneous channel quality. In this paper, we present a soft video multicasting system using an adaptive block-based compressed sensing (BCS) method. The proposed system consists of an encoder, a transmission system, and a decoder. At the encoder side, each block in each frame of the input video is adaptively sampled with a rate that depends on the texture complexity and visual saliency of the block. The obtained BCS samples are then placed into several packets, and the packets are transmitted via a channel-aware OFDM (orthogonal frequency division multiplexing) transmission system with a number of subchannels. At the decoder side, the received BCS samples are first used to build an initial approximation of the transmitted frame. To further improve the reconstruction quality, an iterative BCS reconstruction algorithm is then proposed that uses an adaptive transform and an adaptive soft-thresholding operator, which exploits the temporal similarity between adjacent frames to achieve better reconstruction quality. The extensive objective and subjective experimental results indicate the superiority of the proposed system over the state-of-the-art soft video multicasting systems.

LGFeb 14, 2020
Back-and-Forth prediction for deep tensor compression

Hyomin Choi, Robert A. Cohen, Ivan V. Bajic

Recent AI applications such as Collaborative Intelligence with neural networks involve transferring deep feature tensors between various computing devices. This necessitates tensor compression in order to optimize the usage of bandwidth-constrained channels between devices. In this paper we present a prediction scheme called Back-and-Forth (BaF) prediction, developed for deep feature tensors, which allows us to dramatically reduce tensor size and improve its compressibility. Our experiments with a state-of-the-art object detector demonstrate that the proposed method allows us to significantly reduce the number of bits needed for compressing feature tensors extracted from deep within the model, with negligible degradation of the detection performance and without requiring any retraining of the network weights. We achieve a 62% and 75% reduction in tensor size while keeping the loss in accuracy of the network to less than 1% and 2%, respectively.

CVJan 13, 2020
Towards Automated Swimming Analytics Using Deep Neural Networks

Timothy Woinoski, Alon Harell, Ivan V. Bajic

Methods for creating a system to automate the collection of swimming analytics on a pool-wide scale are considered in this paper. There has not been much work on swimmer tracking or the creation of a swimmer database for machine learning purposes. Consequently, methods for collecting swimmer data from videos of swim competitions are explored and analyzed. The result is a guide to the creation of a comprehensive collection of swimming data suitable for training swimmer detection and tracking systems. With this database in place, systems can then be created to automate the collection of swimming analytics.

CVJun 27, 2019
Datasets for Face and Object Detection in Fisheye Images

Jianglin Fu, Ivan V. Bajic, Rodney G. Vaughan

We present two new fisheye image datasets for training face and object detection models: VOC-360 and Wider-360. The fisheye images are created by post-processing regular images collected from two well-known datasets, VOC2012 and Wider Face, using a model for mapping regular to fisheye images implemented in Matlab. VOC-360 contains 39,575 fisheye images for object detection, segmentation, and classification. Wider-360 contains 63,897 fisheye images for face detection. These datasets will be useful for developing face and object detectors as well as segmentation modules for fisheye images while the efforts to collect and manually annotate true fisheye images are underway.

CVFeb 7, 2019
FDDB-360: Face Detection in 360-degree Fisheye Images

Jianglin Fu, Saeed Ranjbar Alvar, Ivan V. Bajic et al.

360-degree cameras offer the possibility to cover a large area, for example an entire room, without using multiple distributed vision sensors. However, geometric distortions introduced by their lenses make computer vision problems more challenging. In this paper we address face detection in 360-degree fisheye images. We show how a face detector trained on regular images can be re-trained for this purpose, and we also provide a 360-degree fisheye-like version of the popular FDDB face detection dataset, which we call FDDB-360.

IVDec 31, 2018
Deep Frame Prediction for Video Coding

Hyomin Choi, Ivan V. Bajic

We propose a novel frame prediction method using a deep neural network (DNN), with the goal of improving video coding efficiency. The proposed DNN makes use of decoded frames, at both encoder and decoder, to predict textures of the current coding block. Unlike conventional inter-prediction, the proposed method does not require any motion information to be transferred between the encoder and the decoder. Still, both uni-directional and bi-directional prediction are possible using the proposed DNN, which is enabled by the use of the temporal index channel, in addition to color channels. In this study, we developed a jointly trained DNN for both uni- and bi- directional prediction, as well as separate networks for uni- and bi-directional prediction, and compared the efficacy of both approaches. The proposed DNNs were compared with the conventional motion-compensated prediction in the latest video coding standard, HEVC, in terms of BD-Bitrate. The experiments show that the proposed joint DNN (for both uni- and bi-directional prediction) reduces the luminance bitrate by about 4.4%, 2.4%, and 2.3% in the Low delay P, Low delay, and Random access configurations, respectively. In addition, using the separately trained DNNs brings further bit savings of about 0.3%-0.5%.

IVApr 26, 2018
Near-Lossless Deep Feature Compression for Collaborative Intelligence

Hyomin Choi, Ivan V. Bajic

Collaborative intelligence is a new paradigm for efficient deployment of deep neural networks across the mobile-cloud infrastructure. By dividing the network between the mobile and the cloud, it is possible to distribute the computational workload such that the overall energy and/or latency of the system is minimized. However, this necessitates sending deep feature data from the mobile to the cloud in order to perform inference. In this work, we examine the differences between the deep feature data and natural image data, and propose a simple and effective near-lossless deep feature compressor. The proposed method achieves up to 5% bit rate reduction compared to HEVC-Intra and even more against other popular image codecs. Finally, we suggest an approach for reconstructing the input image from compressed deep features in the cloud, that could serve to supplement the inference performed by the deep model.

CVFeb 12, 2018
Deep feature compression for collaborative object detection

Hyomin Choi, Ivan V. Bajic

Recent studies have shown that the efficiency of deep neural networks in mobile applications can be significantly improved by distributing the computational workload between the mobile device and the cloud. This paradigm, termed collaborative intelligence, involves communicating feature data between the mobile and the cloud. The efficiency of such approach can be further improved by lossy compression of feature data, which has not been examined to date. In this work we focus on collaborative object detection and study the impact of both near-lossless and lossy compression of feature data on its accuracy. We also propose a strategy for improving the accuracy under lossy feature compression. Experiments indicate that using this strategy, the communication overhead can be reduced by up to 70% without sacrificing accuracy.

IVOct 30, 2017
High efficiency compression for object detection

Hyomin Choi, Ivan V. Bajic

Image and video compression has traditionally been tailored to human vision. However, modern applications such as visual analytics and surveillance rely on computers seeing and analyzing the images before (or instead of) humans. For these applications, it is important to adjust compression to computer vision. In this paper we present a bit allocation and rate control strategy that is tailored to object detection. Using the initial convolutional layers of a state-of-the-art object detector, we create an importance map that can guide bit allocation to areas that are important for object detection. The proposed method enables bit rate savings of 7% or more compared to default HEVC, at the equivalent object detection rate.

CVOct 30, 2017
Can you find a face in a HEVC bitstream?

Saeed Ranjbar Alvar, Hyomin Choi, Ivan V. Bajic

Finding faces in images is one of the most important tasks in computer vision, with applications in biometrics, surveillance, human-computer interaction, and other areas. In our earlier work, we demonstrated that it is possible to tell whether or not an image contains a face by only examining the HEVC syntax, without fully reconstructing the image. In the present work we move further in this direction by showing how to localize faces in HEVC-coded images, without full reconstruction. We also demonstrate the benefits that such approach can have in privacy-friendly face localization.

CVSep 9, 2017
Can you tell a face from a HEVC bitstream?

Saeed Ranjbar Alvar, Hyomin Choi, Ivan V. Bajic

Image and video analytics are being increasingly used on a massive scale. Not only is the amount of data growing, but the complexity of the data processing pipelines is also increasing, thereby exacerbating the problem. It is becoming increasingly important to save computational resources wherever possible. We focus on one of the poster problems of visual analytics -- face detection -- and approach the issue of reducing the computation by asking: Is it possible to detect a face without full image reconstruction from the High Efficiency Video Coding (HEVC) bitstream? We demonstrate that this is indeed possible, with accuracy comparable to conventional face detection, by training a Convolutional Neural Network on the output of the HEVC entropy decoder.

MMApr 25, 2016
Compressed-domain visual saliency models: A comparative study

Sayed Hossein Khatoonabadi, Ivan V. Bajic, Yufeng Shan

Computational modeling of visual saliency has become an important research problem in recent years, with applications in video quality estimation, video compression, object tracking, retargeting, summarization, and so on. While most visual saliency models for dynamic scenes operate on raw video, several models have been developed for use with compressed-domain information such as motion vectors and transform coefficients. This paper presents a comparative study of eleven such models as well as two high-performing pixel-domain saliency models on two eye-tracking datasets using several comparison metrics. The results indicate that highly accurate saliency estimation is possible based only on a partially decoded video bitstream. The strategies that have shown success in compressed-domain saliency modeling are highlighted, and certain challenges are identified as potential avenues for further improvement.

AIMar 24, 2016
Load Disaggregation Based on Aided Linear Integer Programming

Md. Zulfiquar Ali Bhotto, Stephen Makonin, Ivan V. Bajic

Load disaggregation based on aided linear integer programming (ALIP) is proposed. We start with a conventional linear integer programming (IP) based disaggregation and enhance it in several ways. The enhancements include additional constraints, correction based on a state diagram, median filtering, and linear programming-based refinement. With the aid of these enhancements, the performance of IP-based disaggregation is significantly improved. The proposed ALIP system relies only on the instantaneous load samples instead of waveform signatures, and hence does not crucially depend on high sampling frequency. Experimental results show that the proposed ALIP system performs better than the conventional IP-based load disaggregation system.