LGJun 1
BYORn: Bootstrap Your Own Responses to Defend Large Vision-Language Models Against Backdoor AttacksIvan Sabolić, Marin Oršić, Josip Šarić et al.
Supervised fine-tuning is the predominant approach for adapting autoregressive vision-language models to downstream tasks. Recent work has shown that this paradigm is highly vulnerable to backdoor attacks, and that existing defenses are ineffective in open-ended generation settings. In response, we propose BYORn, a backdoor-robust fine-tuning framework motivated by the observation that poisoned target responses are often semantically implausible given the corresponding image-text inputs and a pretrained model. BYORn identifies such misaligned responses and dynamically replaces them with alternative responses generated by the model, thereby breaking the correlation between triggers and target outputs. The resulting objective gradient corresponds to the gradient of the empirical estimate of the population risk upper bound over the clean data distribution. Empirically, BYORn consistently improves robustness to backdoor attacks while preserving clean-task performance, establishing a new trade-off frontier between generalization and attack success rate. Finally, we demonstrate that BYORn remains effective against adaptive attacks specifically designed to circumvent the proposed defense.
CVMay 13
Sparse Code Uplifting for Efficient 3D Language Gaussian SplattingLovre Antonio Budimir, Yushi Guan, Steve Ryhner et al.
3D Language Gaussian Splatting (3DLGS) augments 3D Gaussian Splatting with language-aligned visual features for open-vocabulary 3D scene understanding. A core challenge is efficiently associating high-dimensional vision-language embeddings with millions of 3D Gaussians while preserving efficient feature rendering for text-based querying. Existing methods either store dense features directly on Gaussians, causing high storage costs and slow rendering, or learn compact representations through expensive per-scene optimization with repeated feature rasterization. No existing method simultaneously achieves fast 3D semantic reconstruction, efficient storage, and fast rendering. We propose SCOUP (Sparse COde UPlifting), which addresses all three by decoupling language representation learning from 3D Gaussian optimization. Rather than working directly in 3D, we learn sparse codebook-based representations entirely using features associated with 2D image regions, associating each region with a sparse set of codebook coefficients. We then uplift these coefficients to 3D Gaussians with our weighted sparse aggregation using Gaussian-to-pixel associations, where each Gaussian accumulates coefficients over codebook atoms across views. Top-$K$ filtering then extracts the most dominant multi-view coefficients per Gaussian, enabling efficient storage and fast rendering. Our method achieves up to $400\times$ training speedup while being $3\times$ more memory efficient during training compared to the state-of-the-art in rendering speed. Across multiple benchmarks, SCOUP matches or outperforms existing methods in open-vocabulary querying accuracy.
CVNov 19, 2020Code
The Cube++ Illumination Estimation DatasetEgor Ershov, Alex Savchik, Illya Semenkov et al.
Computational color constancy has the important task of reducing the influence of the scene illumination on the object colors. As such, it is an essential part of the image processing pipelines of most digital cameras. One of the important parts of the computational color constancy is illumination estimation, i.e. estimating the illumination color. When an illumination estimation method is proposed, its accuracy is usually reported by providing the values of error metrics obtained on the images of publicly available datasets. However, over time it has been shown that many of these datasets have problems such as too few images, inappropriate image quality, lack of scene diversity, absence of version tracking, violation of various assumptions, GDPR regulation violation, lack of additional shooting procedure info, etc. In this paper, a new illumination estimation dataset is proposed that aims to alleviate many of the mentioned problems and to help the illumination estimation research. It consists of 4890 images with known illumination colors as well as with additional semantic data that can further make the learning process more accurate. Due to the usage of the SpyderCube color target, for every image there are two ground-truth illumination records covering different directions. Because of that, the dataset can be used for training and testing of methods that perform single or two-illuminant estimation. This makes it superior to many similar existing datasets. The datasets, it's smaller version SimpleCube++, and the accompanying code are available at https://github.com/Visillect/CubePlusPlus/.
CVMar 29, 2019Code
CroP: Color Constancy Benchmark Dataset GeneratorNikola Banić, Karlo Koščević, Marko Subašić et al.
Implementing color constancy as a pre-processing step in contemporary digital cameras is of significant importance as it removes the influence of scene illumination on object colors. Several benchmark color constancy datasets have been created for the purpose of developing and testing new color constancy methods. However, they all have numerous drawbacks including a small number of images, erroneously extracted ground-truth illuminations, long histories of misuses, violations of their stated assumptions, etc. To overcome such and similar problems, in this paper a color constancy benchmark dataset generator is proposed. For a given camera sensor it enables generation of any number of realistic raw images taken in a subset of the real world, namely images of printed photographs. Datasets with such images share many positive features with other existing real-world datasets, while some of the negative features are completely eliminated. The generated images can be successfully used to train methods that afterward achieve high accuracy on real-world datasets. This opens the way for creating large enough datasets for advanced deep learning techniques. Experimental results are presented and discussed. The source code is available at http://www.fer.unizg.hr/ipg/resources/color_constancy/.
CVFeb 2, 2018Code
Green Stability Assumption: Unsupervised Learning for Statistics-Based Illumination EstimationNikola Banić, Sven Lončarić
In the image processing pipeline of almost every digital camera there is a part dedicated to computational color constancy i.e. to removing the influence of illumination on the colors of the image scene. Some of the best known illumination estimation methods are the so called statistics-based methods. They are less accurate than the learning-based illumination estimation methods, but they are faster and simpler to implement in embedded systems, which is one of the reasons for their widespread usage. Although in the relevant literature it often appears as if they require no training, this is not true because they have parameter values that need to be fine-tuned in order to be more accurate. In this paper it is first shown that the accuracy of statistics-based methods reported in most papers was not obtained by means of the necessary cross-validation, but by using the whole benchmark datasets for both training and testing. After that the corrected results are given for the best known benchmark datasets. Finally, the so called green stability assumption is proposed that can be used to fine-tune the values of the parameters of the statistics-based methods by using only non-calibrated images without known ground-truth illumination. The obtained accuracy is practically the same as when using calibrated training images, but the whole process is much faster. The experimental results are presented and discussed. The source code is available at http://www.fer.unizg.hr/ipg/resources/color_constancy/.
CVOct 1, 2013Code
Using the Random Sprays Retinex Algorithm for Global Illumination EstimationNikola Banić, Sven Lončarić
In this paper the use of Random Sprays Retinex (RSR) algorithm for global illumination estimation is proposed and its feasibility tested. Like other algorithms based on the Retinex model, RSR also provides local illumination estimation and brightness adjustment for each pixel and it is faster than other path-wise Retinex algorithms. As the assumption of the uniform illumination holds in many cases, it should be possible to use the mean of local illumination estimations of RSR as a global illumination estimation for images with (assumed) uniform illumination allowing also the accuracy to be easily measured. Therefore we propose a method for estimating global illumination estimation based on local RSR results. To our best knowledge this is the first time that RSR algorithm is used to obtain global illumination estimation. For our tests we use a publicly available color constancy image database for testing. The results are presented and discussed and it turns out that the proposed method outperforms many existing unsupervised color constancy algorithms. The source code is available at http://www.fer.unizg.hr/ipg/resources/color_constancy/.
CVOct 22, 2024
A Survey on Deep Learning-based Gaze Direction Regression: Searching for the State-of-the-artFranko Šikić, Donik Vršnak, Sven Lončarić
In this paper, we present a survey of deep learning-based methods for the regression of gaze direction vector from head and eye images. We describe in detail numerous published methods with a focus on the input data, architecture of the model, and loss function used to supervise the model. Additionally, we present a list of datasets that can be used to train and evaluate gaze direction regression methods. Furthermore, we noticed that the results reported in the literature are often not comparable one to another due to differences in the validation or even test subsets used. To address this problem, we re-evaluated several methods on the commonly used in-the-wild Gaze360 dataset using the same validation setup. The experimental results show that the latest methods, although claiming state-of-the-art results, significantly underperform compared with some older methods. Finally, we show that the temporal models outperform the static models under static test conditions.
CVOct 18, 2025
OOS-DSD: Improving Out-of-stock Detection in Retail Images using Auxiliary TasksFranko Šikić, Sven Lončarić
Out-of-stock (OOS) detection is a very important retail verification process that aims to infer the unavailability of products in their designated areas on the shelf. In this paper, we introduce OOS-DSD, a novel deep learning-based method that advances OOS detection through auxiliary learning. In particular, we extend a well-established YOLOv8 object detection architecture with additional convolutional branches to simultaneously detect OOS, segment products, and estimate scene depth. While OOS detection and product segmentation branches are trained using ground truth data, the depth estimation branch is trained using pseudo-labeled annotations produced by the state-of-the-art (SOTA) depth estimation model Depth Anything V2. Furthermore, since the aforementioned pseudo-labeled depth estimates display relative depth, we propose an appropriate depth normalization procedure that stabilizes the training process. The experimental results show that the proposed method surpassed the performance of the SOTA OOS detection methods by 1.8% of the mean average precision (mAP). In addition, ablation studies confirm the effectiveness of auxiliary learning and the proposed depth normalization procedure, with the former increasing mAP by 3.7% and the latter by 4.2%.
CVJun 3, 2025
Deep Learning for Retinal Degeneration Assessment: A Comprehensive Analysis of the MARIO AMD Progression ChallengeRachid Zeghlache, Ikram Brahim, Pierre-Henri Conze et al.
The MARIO challenge, held at MICCAI 2024, focused on advancing the automated detection and monitoring of age-related macular degeneration (AMD) through the analysis of optical coherence tomography (OCT) images. Designed to evaluate algorithmic performance in detecting neovascular activity changes within AMD, the challenge incorporated unique multi-modal datasets. The primary dataset, sourced from Brest, France, was used by participating teams to train and test their models. The final ranking was determined based on performance on this dataset. An auxiliary dataset from Algeria was used post-challenge to evaluate population and device shifts from submitted solutions. Two tasks were involved in the MARIO challenge. The first one was the classification of evolution between two consecutive 2D OCT B-scans. The second one was the prediction of future AMD evolution over three months for patients undergoing anti-vascular endothelial growth factor (VEGF) therapy. Thirty-five teams participated, with the top 12 finalists presenting their methods. This paper outlines the challenge's structure, tasks, data characteristics, and winning methodologies, setting a benchmark for AMD monitoring using OCT, infrared imaging, and clinical data (such as the number of visits, age, gender, etc.). The results of this challenge indicate that artificial intelligence (AI) performs as well as a physician in measuring AMD progression (Task 1) but is not yet able of predicting future evolution (Task 2).
CVMay 13, 2025
DHECA-SuperGaze: Dual Head-Eye Cross-Attention and Super-Resolution for Unconstrained Gaze EstimationFranko Šikić, Donik Vršnak, Sven Lončarić
Unconstrained gaze estimation is the process of determining where a subject is directing their visual attention in uncontrolled environments. Gaze estimation systems are important for a myriad of tasks such as driver distraction monitoring, exam proctoring, accessibility features in modern software, etc. However, these systems face challenges in real-world scenarios, partially due to the low resolution of in-the-wild images and partially due to insufficient modeling of head-eye interactions in current state-of-the-art (SOTA) methods. This paper introduces DHECA-SuperGaze, a deep learning-based method that advances gaze prediction through super-resolution (SR) and a dual head-eye cross-attention (DHECA) module. Our dual-branch convolutional backbone processes eye and multiscale SR head images, while the proposed DHECA module enables bidirectional feature refinement between the extracted visual features through cross-attention mechanisms. Furthermore, we identified critical annotation errors in one of the most diverse and widely used gaze estimation datasets, Gaze360, and rectified the mislabeled data. Performance evaluation on Gaze360 and GFIE datasets demonstrates superior within-dataset performance of the proposed method, reducing angular error (AE) by 0.48° (Gaze360) and 2.95° (GFIE) in static configurations, and 0.59° (Gaze360) and 3.00° (GFIE) in temporal settings compared to prior SOTA methods. Cross-dataset testing shows improvements in AE of more than 1.53° (Gaze360) and 3.99° (GFIE) in both static and temporal settings, validating the robust generalization properties of our approach.
CVDec 31, 2020
Illumination Estimation Challenge: experience of past two yearsEgor Ershov, Alex Savchik, Ilya Semenkov et al.
Illumination estimation is the essential step of computational color constancy, one of the core parts of various image processing pipelines of modern digital cameras. Having an accurate and reliable illumination estimation is important for reducing the illumination influence on the image colors. To motivate the generation of new ideas and the development of new algorithms in this field, the 2nd Illumination estimation challenge~(IEC\#2) was conducted. The main advantage of testing a method on a challenge over testing in on some of the known datasets is the fact that the ground-truth illuminations for the challenge test images are unknown up until the results have been submitted, which prevents any potential hyperparameter tuning that may be biased. The challenge had several tracks: general, indoor, and two-illuminant with each of them focusing on different parameters of the scenes. Other main features of it are a new large dataset of images (about 5000) taken with the same camera sensor model, a manual markup accompanying each image, diverse content with scenes taken in numerous countries under a huge variety of illuminations extracted by using the SpyderCube calibration object, and a contest-like markup for the images from the Cube+ dataset that was used in IEC\#1. This paper focuses on the description of the past two challenges, algorithms which won in each track, and the conclusions that were drawn based on the results obtained during the 1st and 2nd challenge that can be useful for similar future developments.
CVMay 3, 2019
Computational analysis of laminar structure of the human cortex based on local neuron featuresAndrija Štajduhar, Tomislav Lipić, Goran Sedmak et al.
In this paper, we present a novel method for analysis and segmentation of laminar structure of the cortex based on tissue characteristics whose change across the gray matter underlies distinctive between cortical layers. We develop and analyze features of individual neurons to investigate changes in cytoarchitectonic differentiation and present a novel high-performance, automated framework for neuron-level histological image analysis. Local tissue and cell descriptors such as density, neuron size and other measures are used for development of more complex neuron features used in machine learning model trained on data manually labeled by three human experts. Final neuron layer classifications were obtained by training a separate model for each expert and combining their probability outputs. Importances of developed neuron features on both global model level and individual prediction level are presented and discussed.
CVJun 1, 2018
Automatic Detection of Neurons in NeuN-stained Histological Images of Human BrainAndrija Štajduhar, Domagoj Džaja, Miloš Judaš et al.
In this paper, we present a novel use of an anisotropic diffusion model for automatic detection of neurons in histological sections of the adult human brain cortex. We use a partial differential equation model to process high resolution images to acquire locations of neuronal bodies. We also present a novel approach in model training and evaluation that considers variability among the human experts, addressing the issue of existence and correctness of the golden standard for neuron and cell counting, used in most of relevant papers. Our method, trained on dataset manually labeled by three experts, has correctly distinguished over 95% of neuron bodies in test data, doing so in time much shorter than other comparable methods.
CVDec 1, 2017
Unsupervised Learning for Color ConstancyNikola Banić, Karlo Koščević, Sven Lončarić
Most digital camera pipelines use color constancy methods to reduce the influence of illumination and camera sensor on the colors of scene objects. The highest accuracy of color correction is obtained with learning-based color constancy methods, but they require a significant amount of calibrated training images with known ground-truth illumination. Such calibration is time consuming, preferably done for each sensor individually, and therefore a major bottleneck in acquiring high color constancy accuracy. Statistics-based methods do not require calibrated training images, but they are less accurate. In this paper an unsupervised learning-based method is proposed that learns its parameter values after approximating the unknown ground-truth illumination of the training images, thus avoiding calibration. In terms of accuracy the proposed method outperforms all statistics-based and many learning-based methods. An extension of the method is also proposed, which learns the needed parameters from non-calibrated images taken with one sensor and which can then be successfully applied to images taken with another sensor. This effectively enables inter-camera unsupervised learning for color constancy. Additionally, a new high quality color constancy benchmark dataset with 1707 calibrated images is created, used for testing, and made publicly available. The results are presented and discussed. The source code and the dataset are available at http://www.fer.unizg.hr/ipg/resources/color_constancy/.
CVOct 16, 2015
An Extension to Hough Transform Based on Gradient OrientationTomislav Petković, Sven Lončarić
The Hough transform is one of the most common methods for line detection. In this paper we propose a novel extension of the regular Hough transform. The proposed extension combines the extension of the accumulator space and the local gradient orientation resulting in clutter reduction and yielding more prominent peaks, thus enabling better line identification. We demonstrate benefits in applications such as visual quality inspection and rectangle detection.
CVOct 1, 2013
Second Croatian Computer Vision Workshop (CCVW 2013)Sven Lončarić, Siniša Šegvić
Proceedings of the Second Croatian Computer Vision Workshop (CCVW 2013, http://www.fer.unizg.hr/crv/ccvw2013) held September 19, 2013, in Zagreb, Croatia. Workshop was organized by the Center of Excellence for Computer Vision of the University of Zagreb.
CVOct 1, 2013
Flexible Visual Quality Inspection in Discrete ManufacturingTomislav Petković, Darko Jurić, Sven Lončarić
Most visual quality inspections in discrete manufacturing are composed of length, surface, angle or intensity measurements. Those are implemented as end-user configurable inspection tools that should not require an image processing expert to set up. Currently available software solutions providing such capability use a flowchart based programming environment, but do not fully address an inspection flowchart robustness and can require a redefinition of the flowchart if a small variation is introduced. In this paper we propose an acquire-register-analyze image processing pattern designed for discrete manufacturing that aims to increase the robustness of the inspection flowchart by consistently addressing variations in product position, orientation and size. A proposed pattern is transparent to the end-user and simplifies the flowchart. We describe a developed software solution that is a practical implementation of the proposed pattern. We give an example of its real-life use in industrial production of electric components.