Nicolai Petkov

CV
h-index36
19papers
358citations
Novelty40%
AI Score27

19 Papers

CVMar 21, 2023Code
Data-efficient Large Scale Place Recognition with Graded Similarity Supervision

Maria Leyva-Vallina, Nicola Strisciuglio, Nicolai Petkov

Visual place recognition (VPR) is a fundamental task of computer vision for visual localization. Existing methods are trained using image pairs that either depict the same place or not. Such a binary indication does not consider continuous relations of similarity between images of the same place taken from different positions, determined by the continuous nature of camera pose. The binary similarity induces a noisy supervision signal into the training of VPR methods, which stall in local minima and require expensive hard mining algorithms to guarantee convergence. Motivated by the fact that two images of the same place only partially share visual cues due to camera pose differences, we deploy an automatic re-annotation strategy to re-label VPR datasets. We compute graded similarity labels for image pairs based on available localization metadata. Furthermore, we propose a new Generalized Contrastive Loss (GCL) that uses graded similarity labels for training contrastive networks. We demonstrate that the use of the new labels and GCL allow to dispense from hard-pair mining, and to train image descriptors that perform better in VPR by nearest neighbor search, obtaining superior or comparable results than methods that require expensive hard-pair mining and re-ranking techniques. Code and models available at: https://github.com/marialeyvallina/generalized_contrastive_loss

CVMar 11, 2021Code
Generalized Contrastive Optimization of Siamese Networks for Place Recognition

María Leyva-Vallina, Nicola Strisciuglio, Nicolai Petkov

Visual place recognition is a challenging task in computer vision and a key component of camera-based localization and navigation systems. Recently, Convolutional Neural Networks (CNNs) achieved high results and good generalization capabilities. They are usually trained using pairs or triplets of images labeled as either similar or dissimilar, in a binary fashion. In practice, the similarity between two images is not binary, but continuous. Furthermore, training these CNNs is computationally complex and involves costly pair and triplet mining strategies. We propose a Generalized Contrastive loss (GCL) function that relies on image similarity as a continuous measure, and use it to train a siamese CNN. Furthermore, we present three techniques for automatic annotation of image pairs with labels indicating their degree of similarity, and deploy them to re-annotate the MSLS, TB-Places, and 7Scenes datasets. We demonstrate that siamese CNNs trained using the GCL function and the improved annotations consistently outperform their binary counterparts. Our models trained on MSLS outperform the state-of-the-art methods, including NetVLAD, NetVLAD-SARE, AP-GeM and Patch-NetVLAD, and generalize well on the Pittsburgh30k, Tokyo 24/7, RobotCar Seasons v2 and Extended CMU Seasons datasets. Furthermore, training a siamese network using the GCL function does not require complex pair mining. We release the source code at https://github.com/marialeyvallina/generalized_contrastive_loss.

CVJun 27, 2020Code
MTStereo 2.0: improved accuracy of stereo depth estimation withMax-trees

Rafael Brandt, Nicola Strisciuglio, Nicolai Petkov

Efficient yet accurate extraction of depth from stereo image pairs is required by systems with low power resources, such as robotics and embedded systems. State-of-the-art stereo matching methods based on convolutional neural networks require intensive computations on GPUs and are difficult to deploy on embedded systems. In this paper, we propose a stereo matching method, called MTStereo 2.0, for limited-resource systems that require efficient and accurate depth estimation. It is based on a Max-tree hierarchical representation of image pairs, which we use to identify matching regions along image scan-lines. The method includes a cost function that considers similarity of region contextual information based on the Max-trees and a disparity border preserving cost aggregation approach. MTStereo 2.0 improves on its predecessor MTStereo 1.0 as it a) deploys a more robust cost function, b) performs more thorough detection of incorrect matches, c) computes disparity maps with pixel-level rather than node-level precision. MTStereo provides accurate sparse and semi-dense depth estimation and does not require intensive GPU computations like methods based on CNNs. Thus it can run on embedded and robotics devices with low-power requirements. We tested the proposed approach on several benchmark data sets, namely KITTI 2015, Driving, FlyingThings3D, Middlebury 2014, Monkaa and the TrimBot2020 garden data sets, and achieved competitive accuracy and efficiency. The code is available at https://github.com/rbrandt1/MaxTreeS.

CVJan 29, 2024
Regressing Transformers for Data-efficient Visual Place Recognition

María Leyva-Vallina, Nicola Strisciuglio, Nicolai Petkov

Visual place recognition is a critical task in computer vision, especially for localization and navigation systems. Existing methods often rely on contrastive learning: image descriptors are trained to have small distance for similar images and larger distance for dissimilar ones in a latent space. However, this approach struggles to ensure accurate distance-based image similarity representation, particularly when training with binary pairwise labels, and complex re-ranking strategies are required. This work introduces a fresh perspective by framing place recognition as a regression problem, using camera field-of-view overlap as similarity ground truth for learning. By optimizing image descriptors to align directly with graded similarity labels, this approach enhances ranking capabilities without expensive re-ranking, offering data-efficient training and strong generalization across several benchmark datasets.

CVJun 28, 2019
Place recognition in gardens by learning visual representations: data set and benchmark analysis

Maria Leyva-Vallina, Nicola Strisciuglio, Nicolai Petkov

Visual place recognition is an important component of systems for camera localization and loop closure detection. It concerns the recognition of a previously visited place based on visual cues only. Although it is a widely studied problem for indoor and urban environments, the recent use of robots for automation of agricultural and gardening tasks has created new problems, due to the challenging appearance of garden-like environments. Garden scenes predominantly contain green colors, as well as repetitive patterns and textures. The lack of available data recorded in gardens and natural environments makes the improvement of visual localization algorithms difficult. In this paper we propose an extended version of the TB-Places data set, which is designed for testing algorithms for visual place recognition. It contains images with ground truth camera pose recorded in real gardens in different seasons, with varying light conditions. We constructed and released a ground truth for all possible pairs of images, indicating whether they depict the same place or not. We present the results of a benchmark analysis of methods based on convolutional neural networks for holistic image description and place recognition. We train existing networks (i.e. ResNet, DenseNet and VGG NetVLAD) as backbone of a two-way architecture with a contrastive loss function. The results that we obtained demonstrate that learning garden-tailored representations contribute to an improvement of performance, although the generalization capabilities are limited.

CVMay 10, 2019
Towards Emotion Retrieval in Egocentric PhotoStream

Estefania Talavera, Petia Radeva, Nicolai Petkov

The availability and use of egocentric data are rapidly increasing due to the growing use of wearable cameras. Our aim is to study the effect (positive, neutral or negative) of egocentric images or events on an observer. Given egocentric photostreams capturing the wearer's days, we propose a method that aims to assign sentiment to events extracted from egocentric photostreams. Such moments can be candidates to retrieve according to their possibility of representing a positive experience for the camera's wearer. The proposed approach obtained a classification accuracy of 75% on the test set, with a deviation of 8%. Our model makes a step forward opening the door to sentiment recognition in egocentric photostreams.

CVMay 10, 2019
Hierarchical approach to classify food scenes in egocentric photo-streams

Estefania Talavera, Maria Leyva-Vallina, Md. Mostafa Kamal Sarker et al.

Recent studies have shown that the environment where people eat can affect their nutritional behaviour. In this work, we provide automatic tools for a personalised analysis of a person's health habits by the examination of daily recorded egocentric photo-streams. Specifically, we propose a new automatic approach for the classification of food-related environments, that is able to classify up to 15 such scenes. In this way, people can monitor the context around their food intake in order to get an objective insight into their daily eating routine. We propose a model that classifies food-related scenes organized in a semantic hierarchy. Additionally, we present and make available a new egocentric dataset composed of more than 33000 images recorded by a wearable camera, over which our proposed model has been tested. Our approach obtains an accuracy and F-score of 56\% and 65\%, respectively, clearly outperforming the baseline methods.

CVMay 10, 2019
Towards Unsupervised Familiar Scene Recognition in Egocentric Videos

Estefania Talavera, Nicolai Petkov, Petia Radeva

Nowadays, there is an upsurge of interest in using lifelogging devices. Such devices generate huge amounts of image data; consequently, the need for automatic methods for analyzing and summarizing these data is drastically increasing. We present a new method for familiar scene recognition in egocentric videos, based on background pattern detection through automatically configurable COSFIRE filters. We present some experiments over egocentric data acquired with the Narrative Clip.

CVMay 10, 2019
Unsupervised routine discovery in egocentric photo-streams

Estefania Talavera, Nicolai Petkov, Petia Radeva

The routine of a person is defined by the occurrence of activities throughout different days, and can directly affect the person's health. In this work, we address the recognition of routine related days. To do so, we rely on egocentric images, which are recorded by a wearable camera and allow to monitor the life of the user from a first-person view perspective. We propose an unsupervised model that identifies routine related days, following an outlier detection approach. We test the proposed framework over a total of 72 days in the form of photo-streams covering around 2 weeks of the life of 5 different camera wearers. Our model achieves an average of 76% Accuracy and 68% Weighted F-Score for all the users. Thus, we show that our framework is able to recognise routine related days and opens the door to the understanding of the behaviour of people.

CVMay 10, 2019
Towards Egocentric Person Re-identification and Social Pattern Analysis

Estefania Talavera, Alexandre Cola, Nicolai Petkov et al.

Wearable cameras capture a first-person view of the daily activities of the camera wearer, offering a visual diary of the user behaviour. Detection of the appearance of people the camera user interacts with for social interactions analysis is of high interest. Generally speaking, social events, lifestyle and health are highly correlated, but there is a lack of tools to monitor and analyse them. We consider that egocentric vision provides a tool to obtain information and understand users social interactions. We propose a model that enables us to evaluate and visualize social traits obtained by analysing social interactions appearance within egocentric photostreams. Given sets of egocentric images, we detect the appearance of faces within the days of the camera wearer, and rely on clustering algorithms to group their feature descriptors in order to re-identify persons. Recurrence of detected faces within photostreams allows us to shape an idea of the social pattern of behaviour of the user. We validated our model over several weeks recorded by different camera wearers. Our findings indicate that social profiles are potentially useful for social behaviour interpretation.

CVJan 29, 2019
A Push-Pull Layer Improves Robustness of Convolutional Neural Networks

Nicola Strisciuglio, Manuel Lopez-Antequera, Nicolai Petkov

We propose a new layer in Convolutional Neural Networks (CNNs) to increase their robustness to several types of noise perturbations of the input images. We call this a push-pull layer and compute its response as the combination of two half-wave rectified convolutions, with kernels of opposite polarity. It is based on a biologically-motivated non-linear model of certain neurons in the visual system that exhibit a response suppression phenomenon, known as push-pull inhibition. We validate our method by substituting the first convolutional layer of the LeNet-5 and WideResNet architectures with our push-pull layer. We train the networks on nonperturbed training images from the MNIST, CIFAR-10 and CIFAR-100 data sets, and test on images perturbed by noise that is unseen by the training process. We demonstrate that our push-pull layers contribute to a considerable improvement in robustness of classification of images perturbed by noise, while maintaining state-of-the-art performance on the original image classification task.

ASJan 21, 2019
Learning sound representations using trainable COPE feature extractors

Nicola Strisciuglio, Mario Vento, Nicolai Petkov

Sound analysis research has mainly been focused on speech and music processing. The deployed methodologies are not suitable for analysis of sounds with varying background noise, in many cases with very low signal-to-noise ratio (SNR). In this paper, we present a method for the detection of patterns of interest in audio signals. We propose novel trainable feature extractors, which we call COPE (Combination of Peaks of Energy). The structure of a COPE feature extractor is determined using a single prototype sound pattern in an automatic configuration process, which is a type of representation learning. We construct a set of COPE feature extractors, configured on a number of training patterns. Then we take their responses to build feature vectors that we use in combination with a classifier to detect and classify patterns of interest in audio signals. We carried out experiments on four public data sets: MIVIA audio events, MIVIA road events, ESC-10 and TU Dortmund data sets. The results that we achieved (recognition rate equal to 91.71% on the MIVIA audio events, 94% on the MIVIA road events, 81.25% on the ESC-10 and 94.27% on the TU Dortmund) demonstrate the effectiveness of the proposed method and are higher than the ones obtained by other existing approaches. The COPE feature extractors have high robustness to variations of SNR. Real-time performance is achieved even when the value of a large number of features is computed.

CVNov 26, 2018
Brain-inspired robust delineation operator

Nicola Strisciuglio, George Azzopardi, Nicolai Petkov

In this paper we present a novel filter, based on the existing COSFIRE filter, for the delineation of patterns of interest. It includes a mechanism of push-pull inhibition that improves robustness to noise in terms of spurious texture. Push-pull inhibition is a phenomenon that is observed in neurons in area V1 of the visual cortex, which suppresses the response of certain simple cells for stimuli of preferred orientation but of non-preferred contrast. This type of inhibition allows for sharper detection of the patterns of interest and improves the quality of delineation especially in images with spurious texture. We performed experiments on images from different applications, namely the detection of rose stems for automatic gardening, the delineation of cracks in pavements and road surfaces, and the segmentation of blood vessels in retinal images. Push-pull inhibition helped to improve results considerably in all applications.

ROApr 5, 2018
TrimBot2020: an outdoor robot for automatic gardening

Nicola Strisciuglio, Radim Tylecek, Michael Blaich et al.

Robots are increasingly present in modern industry and also in everyday life. Their applications range from health-related situations, for assistance to elderly people or in surgical operations, to automatic and driver-less vehicles (on wheels or flying) or for driving assistance. Recently, an interest towards robotics applied in agriculture and gardening has arisen, with applications to automatic seeding and cropping or to plant disease control, etc. Autonomous lawn mowers are succesful market applications of gardening robotics. In this paper, we present a novel robot that is developed within the TrimBot2020 project, funded by the EU H2020 program. The project aims at prototyping the first outdoor robot for automatic bush trimming and rose pruning.

CVAug 2, 2017
Action recognition by learning pose representations

Alessia Saggese, Nicola Strisciuglio, Mario Vento et al.

Pose detection is one of the fundamental steps for the recognition of human actions. In this paper we propose a novel trainable detector for recognizing human poses based on the analysis of the skeleton. The main idea is that a skeleton pose can be described by the spatial arrangements of its joints. Starting from this consideration, we propose a trainable pose detector, that can be configured on a prototype skeleton in an automatic configuration process. The result of the configuration is a model of the position of the joints in the concerned skeleton. In the application phase, the joint positions contained in the model are compared with the ones of their homologous joints in the skeleton under test. The similarity of two skeletons is computed as a combination of the position scores achieved by homologous joints. In this paper we describe an action classification method based on the use of the proposed trainable detectors to extract features from the skeletons. We performed experiments on the publicly available MSDRA data set and the achieved results confirm the effectiveness of the proposed approach.

CVJul 24, 2017
Detection of curved lines with B-COSFIRE filters: A case study on crack delineation

Nicola Strisciuglio, George Azzopardi, Nicolai Petkov

The detection of curvilinear structures is an important step for various computer vision applications, ranging from medical image analysis for segmentation of blood vessels, to remote sensing for the identification of roads and rivers, and to biometrics and robotics, among others. %The visual system of the brain has remarkable abilities to detect curvilinear structures in noisy images. This is a nontrivial task especially for the detection of thin or incomplete curvilinear structures surrounded with noise. We propose a general purpose curvilinear structure detector that uses the brain-inspired trainable B-COSFIRE filters. It consists of four main steps, namely nonlinear filtering with B-COSFIRE, thinning with non-maximum suppression, hysteresis thresholding and morphological closing. We demonstrate its effectiveness on a data set of noisy images with cracked pavements, where we achieve state-of-the-art results (F-measure=0.865). The proposed method can be employed in any computer vision methodology that requires the delineation of curvilinear and elongated structures.

CVJul 24, 2017
Delineation of line patterns in images using B-COSFIRE filters

Nicola Strisciuglio, Nicolai Petkov

Delineation of line patterns in images is a basic step required in various applications such as blood vessel detection in medical images, segmentation of rivers or roads in aerial images, detection of cracks in walls or pavements, etc. In this paper we present trainable B-COSFIRE filters, which are a model of some neurons in area V1 of the primary visual cortex, and apply it to the delineation of line patterns in different kinds of images. B-COSFIRE filters are trainable as their selectivity is determined in an automatic configuration process given a prototype pattern of interest. They are configurable to detect any preferred line structure (e.g. segments, corners, cross-overs, etc.), so usable for automatic data representation learning. We carried out experiments on two data sets, namely a line-network data set from INRIA and a data set of retinal fundus images named IOSTAR. The results that we achieved confirm the robustness of the proposed approach and its effectiveness in the delineation of line structures in different kinds of images.

CVMar 29, 2017
Sentiment Recognition in Egocentric Photostreams

Estefania Talavera, Nicola Strisciuglio, Nicolai Petkov et al.

Lifelogging is a process of collecting rich source of information about daily life of people. In this paper, we introduce the problem of sentiment analysis in egocentric events focusing on the moments that compose the images recalling positive, neutral or negative feelings to the observer. We propose a method for the classification of the sentiments in egocentric pictures based on global and semantic image features extracted by Convolutional Neural Networks. We carried out experiments on an egocentric dataset, which we organized in 3 classes on the basis of the sentiment that is recalled to the user (positive, negative or neutral).

CVMay 27, 2015
Training a Convolutional Neural Network for Appearance-Invariant Place Recognition

Ruben Gomez-Ojeda, Manuel Lopez-Antequera, Nicolai Petkov et al.

Place recognition is one of the most challenging problems in computer vision, and has become a key part in mobile robotics and autonomous driving applications for performing loop closure in visual SLAM systems. Moreover, the difficulty of recognizing a revisited location increases with appearance changes caused, for instance, by weather or illumination variations, which hinders the long-term application of such algorithms in real environments. In this paper we present a convolutional neural network (CNN), trained for the first time with the purpose of recognizing revisited locations under severe appearance changes, which maps images to a low dimensional space where Euclidean distances represent place dissimilarity. In order for the network to learn the desired invariances, we train it with triplets of images selected from datasets which present a challenging variability in visual appearance. The triplets are selected in such way that two samples are from the same location and the third one is taken from a different place. We validate our system through extensive experimentation, where we demonstrate better performance than state-of-art algorithms in a number of popular datasets.