Konstantinos Karantzalos

CV
h-index43
20papers
660citations
Novelty42%
AI Score55

20 Papers

CVMar 23, 2022Code
What to Hide from Your Students: Attention-Guided Masked Image Modeling

Ioannis Kakogeorgiou, Spyros Gidaris, Bill Psomas et al.

Transformers and masked language modeling are quickly being adopted and explored in computer vision as vision transformers and masked image modeling (MIM). In this work, we argue that image token masking differs from token masking in text, due to the amount and correlation of tokens in an image. In particular, to generate a challenging pretext task for MIM, we advocate a shift from random masking to informed masking. We develop and exhibit this idea in the context of distillation-based MIM, where a teacher transformer encoder generates an attention map, which we use to guide masking for the student. We thus introduce a novel masking strategy, called attention-guided masking (AttMask), and we demonstrate its effectiveness over random masking for dense distillation-based MIM as well as plain distillation-based self-supervised learning on classification tokens. We confirm that AttMask accelerates the learning process and improves the performance on a variety of downstream tasks. We provide the implementation code at https://github.com/gkakogeorgiou/attmask.

CVAug 21, 2022Code
Objects Can Move: 3D Change Detection by Geometric Transformation Constistency

Aikaterini Adam, Torsten Sattler, Konstantinos Karantzalos et al.

AR/VR applications and robots need to know when the scene has changed. An example is when objects are moved, added, or removed from the scene. We propose a 3D object discovery method that is based only on scene changes. Our method does not need to encode any assumptions about what is an object, but rather discovers objects by exploiting their coherent move. Changes are initially detected as differences in the depth maps and segmented as objects if they undergo rigid motions. A graph cut optimization propagates the changing labels to geometrically consistent regions. Experiments show that our method achieves state-of-the-art performance on the 3RScan dataset against competitive baselines. The source code of our method can be found at https://github.com/katadam/ObjectsCanMove.

CVSep 13, 2023Code
Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?

Bill Psomas, Ioannis Kakogeorgiou, Konstantinos Karantzalos et al.

Convolutional networks and vision transformers have different forms of pairwise interactions, pooling across layers and pooling at the end of the network. Does the latter really need to be different? As a by-product of pooling, vision transformers provide spatial attention for free, but this is most often of low quality unless self-supervised, which is not well studied. Is supervision really the problem? In this work, we develop a generic pooling framework and then we formulate a number of existing methods as instantiations. By discussing the properties of each group of methods, we derive SimPool, a simple attention-based pooling mechanism as a replacement of the default one for both convolutional and transformer encoders. We find that, whether supervised or self-supervised, this improves performance on pre-training and downstream tasks and provides attention maps delineating object boundaries in all cases. One could thus call SimPool universal. To our knowledge, we are the first to obtain attention maps in supervised transformers of at least as good quality as self-supervised, without explicit losses or modifying the architecture. Code at: https://github.com/billpsomas/simpool.

CVNov 6, 2023Code
FLOGA: A machine learning ready dataset, a benchmark and a novel deep learning model for burnt area mapping with Sentinel-2

Maria Sdraka, Alkinoos Dimakos, Alexandros Malounis et al.

Over the last decade there has been an increasing frequency and intensity of wildfires across the globe, posing significant threats to human and animal lives, ecosystems, and socio-economic stability. Therefore urgent action is required to mitigate their devastating impact and safeguard Earth's natural resources. Robust Machine Learning methods combined with the abundance of high-resolution satellite imagery can provide accurate and timely mappings of the affected area in order to assess the scale of the event, identify the impacted assets and prioritize and allocate resources effectively for the proper restoration of the damaged region. In this work, we create and introduce a machine-learning ready dataset we name FLOGA (Forest wiLdfire Observations for the Greek Area). This dataset is unique as it comprises of satellite imagery acquired before and after a wildfire event, it contains information from Sentinel-2 and MODIS modalities with variable spatial and spectral resolution, and contains a large number of events where the corresponding burnt area ground truth has been annotated by domain experts. FLOGA covers the wider region of Greece, which is characterized by a Mediterranean landscape and climatic conditions. We use FLOGA to provide a thorough comparison of multiple Machine Learning and Deep Learning algorithms for the automatic extraction of burnt areas, approached as a change detection task. We also compare the results to those obtained using standard specialized spectral indices for burnt area mapping. Finally, we propose a novel Deep Learning model, namely BAM-CD. Our benchmark results demonstrate the efficacy of the proposed technique in the automatic extraction of burnt areas, outperforming all other methods in terms of accuracy and robustness. Our dataset and code are publicly available at: https://github.com/Orion-AI-Lab/FLOGA.

49.6CVMay 23Code
Benchmarking Composed Image Retrieval for Applied Earth Observation

Bill Psomas, Dionysis Christopoulos, Thanasis Petropoulos et al.

Remote sensing composed image retrieval (RSCIR) enables search in large satellite image archives using composed queries that combine a reference image with a textual modifier. Although RSCIR offers a flexible interface for expressing targeted retrieval intent, the transferability of modern composition methods to Earth observation (EO) imagery and their relevance to operational EO workflows remain underexplored. We address this gap through a unified benchmark and an application-oriented study. First, we systematically adapt and evaluate representative composed image retrieval methods with six vision-language backbones on PatternCom under a standardized protocol, analyzing their behavior across backbones, composition strategies, and query types. Second, we introduce xView2-CIR, a change-centric dataset for disaster and damage monitoring, where retrieval is conditioned on scene identity and a target post-event state. Our results show that training-free composition methods provide strong and scalable baselines for EO retrieval, while change-centric retrieval presents different challenges from attribute-based retrieval, particularly due to the need to preserve scene identity. Overall, this study establishes a practical benchmark for RSCIR and positions composed retrieval as a complementary tool for remote sensing image retrieval, archive exploration, and change analysis. The dataset and code are available at https://github.com/billpsomas/rscir.

LGJul 16, 2024
Mapping savannah woody vegetation at the species level with multispecral drone and hyperspectral EnMAP data

Christina Karakizi, Akpona Okujeni, Eleni Sofikiti et al.

Savannahs are vital ecosystems whose sustainability is endangered by the spread of woody plants. This research targets the accurate mapping of fractional woody cover (FWC) at the species level in a South African savannah, using EnMAP hyperspectral data. Field annotations were combined with very high-resolution multispectral drone data to produce land cover maps that included three woody species. The high-resolution labelled maps were then used to generate FWC samples for each woody species class at the 30-m spatial resolution of EnMAP. Four machine learning regression algorithms were tested for FWC mapping on dry season EnMAP imagery. The contribution of multitemporal information was also assessed by incorporating as additional regression features, spectro-temporal metrics from Sentinel-2 data of both the dry and wet seasons. The results demonstrated the suitability of our approach for accurately mapping FWC at the species level. The highest accuracy rates achieved from the combined EnMAP and Sentinel-2 experiments highlighted their synergistic potential for species-level vegetation mapping.

CVAug 6, 2022Code
Transformer-based assignment decision network for multiple object tracking

Athena Psalta, Vasileios Tsironis, Konstantinos Karantzalos

Data association is a crucial component for any multiple object tracking (MOT) method that follows the tracking-by-detection paradigm. To generate complete trajectories such methods employ a data association process to establish assignments between detections and existing targets during each timestep. Recent data association approaches try to solve either a multi-dimensional linear assignment task or a network flow minimization problem or tackle it via multiple hypotheses tracking. However, during inference an optimization step that computes optimal assignments is required for every sequence frame inducing additional complexity to any given solution. To this end, in the context of this work we introduce Transformer-based Assignment Decision Network (TADN) that tackles data association without the need of any explicit optimization during inference. In particular, TADN can directly infer assignment pairs between detections and active targets in a single forward pass of the network. We have integrated TADN in a rather simple MOT framework, designed a novel training strategy for efficient end-to-end training and demonstrated the high potential of our approach for online visual tracking-by-detection MOT on several popular benchmarks, i.e. MOT17, MOT20 and UA-DETRAC. Our proposed approach demonstrates strong performance in most evaluation metrics despite its simple nature as a tracker lacking significant auxiliary components such as occlusion handling or re-identification. The implementation of our method is publicly available at https://github.com/psaltaath/tadn-mot.

CVDec 20, 2022
Seafloor-Invariant Caustics Removal from Underwater Imagery

Panagiotis Agrafiotis, Konstantinos Karantzalos, Andreas Georgopoulos

Mapping the seafloor with underwater imaging cameras is of significant importance for various applications including marine engineering, geology, geomorphology, archaeology and biology. For shallow waters, among the underwater imaging challenges, caustics i.e., the complex physical phenomena resulting from the projection of light rays being refracted by the wavy surface, is likely the most crucial one. Caustics is the main factor during underwater imaging campaigns that massively degrade image quality and affect severely any 2D mosaicking or 3D reconstruction of the seabed. In this work, we propose a novel method for correcting the radiometric effects of caustics on shallow underwater imagery. Contrary to the state-of-the-art, the developed method can handle seabed and riverbed of any anaglyph, correcting the images using real pixel information, thus, improving image matching and 3D reconstruction processes. In particular, the developed method employs deep learning architectures in order to classify image pixels to "non-caustics" and "caustics". Then, exploits the 3D geometry of the scene to achieve a pixel-wise correction, by transferring appropriate color values between the overlapping underwater images. Moreover, to fill the current gap, we have collected, annotated and structured a real-world caustic dataset, namely R-CAUSTIC, which is openly available. Overall, based on the experimental results and validation the developed methodology is quite promising in both detecting caustics and reconstructing their intensity.

CVDec 9, 2025Code
What really matters for person re-identification? A Mixture-of-Experts Framework for Semantic Attribute Importance

Athena Psalta, Vasileios Tsironis, Konstantinos Karantzalos

State-of-the-art person re-identification methods achieve impressive accuracy but remain largely opaque, leaving open the question: which high-level semantic attributes do these models actually rely on? We propose MoSAIC-ReID, a Mixture-of-Experts framework that systematically quantifies the importance of pedestrian attributes for re-identification. Our approach uses LoRA-based experts, each linked to a single attribute, and an oracle router that enables controlled attribution analysis. While MoSAIC-ReID achieves competitive performance on Market-1501 and DukeMTMC under the assumption that attribute annotations are available at test time, its primary value lies in providing a large-scale, quantitative study of attribute importance across intrinsic and extrinsic cues. Using generalized linear models, statistical tests, and feature-importance analyses, we reveal which attributes, such as clothing colors and intrinsic characteristics, contribute most strongly, while infrequent cues (e.g. accessories) have limited effect. This work offers a principled framework for interpretable ReID and highlights the requirements for integrating explicit semantic knowledge in practice. Code is available at https://github.com/psaltaath/MoSAIC-ReID

51.6CRMay 21
Building Europe's Quantum Shield: The Strategic view for a Continent-Wide Quantum Key Ditribution (QKD) Infrastructure

Leandros Maglaras, Ilias Papastamatiou, Alexios Aivaliotis et al.

The fast growth of quantum computing can lead to amazing scientific breakthroughs while on the same time can be used to break today's security systems, raising new risks for existing digital systems. Facing this challenge, the European's Union's deployement of the European Communication Infrastructure (EuroQCI) is crucial. The SEEWQCI project combines fiber cables, satellite communications and enhanced security rules to build a strong digital shield. Its focus is to protect vital services like power grids and hospitals keeping Europeans' data safe.

CVMay 24, 2024Code
Composed Image Retrieval for Remote Sensing

Bill Psomas, Ioannis Kakogeorgiou, Nikos Efthymiadis et al.

This work introduces composed image retrieval to remote sensing. It allows to query a large image archive by image examples alternated by a textual description, enriching the descriptive power over unimodal queries, either visual or textual. Various attributes can be modified by the textual part, such as shape, color, or context. A novel method fusing image-to-image and text-to-image similarity is introduced. We demonstrate that a vision-language model possesses sufficient descriptive power and no further learning step or training data are necessary. We present a new evaluation benchmark focused on color, context, density, existence, quantity, and shape modifications. Our work not only sets the state-of-the-art for this task, but also serves as a foundational step in addressing a gap in the field of remote sensing image retrieval. Code at: https://github.com/billpsomas/rscir

CVDec 4, 2024Code
Composed Image Retrieval for Training-Free Domain Conversion

Nikos Efthymiadis, Bill Psomas, Zakaria Laskar et al.

This work addresses composed image retrieval in the context of domain conversion, where the content of a query image is retrieved in the domain specified by the query text. We show that a strong vision-language model provides sufficient descriptive power without additional training. The query image is mapped to the text input space using textual inversion. Unlike common practice that invert in the continuous space of text tokens, we use the discrete word space via a nearest-neighbor search in a text vocabulary. With this inversion, the image is softly mapped across the vocabulary and is made more robust using retrieval-based augmentation. Database images are retrieved by a weighted ensemble of text queries combining mapped words with the domain text. Our method outperforms prior art by a large margin on standard and newly introduced benchmarks. Code: https://github.com/NikosEfth/freedom

CVJun 11, 2025Code
Attention, Please! Revisiting Attentive Probing Through the Lens of Efficiency

Bill Psomas, Dionysis Christopoulos, Eirini Baltzi et al.

As fine-tuning becomes increasingly impractical at scale, probing is emerging as the preferred evaluation protocol. Yet, the standard linear probing fails to adequately reflect the potential of models whose pre-training optimizes representations of patch tokens rather than an explicit global representation. This motivates the need for attentive probing, an alternative that uses attention to selectively aggregate patch-level features. Despite its growing adoption, attentive probing remains under-explored, with existing methods suffering from excessive parameterization and poor computational efficiency. In this work, we revisit attentive probing through the lens of the accuracy vs. parameter efficiency trade-off. We present the first comprehensive study of existing methods, analyzing their design choices and benchmarking their performance. Building on this, we propose efficient probing (EP), a simple yet effective multi-query cross-attention mechanism that eliminates redundant projections and reduces the number of trainable parameters. Despite its simplicity, EP outperforms linear probing and prior attentive probing approaches across seven benchmarks, generalizes well to diverse pre-training paradigms, and delivers strong low-shot and layer-wise gains. Beyond evaluation, our analysis uncovers emerging properties of EP, such as complementary attention maps, which open new directions for leveraging probing beyond protocol design. Code available at https://github.com/billpsomas/efficient-probing.

CVJul 7, 2024
Addressing single object tracking in satellite imagery through prompt-engineered solutions

Athena Psalta, Vasileios Tsironis, Andreas El Saer et al.

Object tracking in satellite videos remains a complex endeavor in remote sensing due to the intricate and dynamic nature of satellite imagery. Existing state-of-the-art trackers in computer vision integrate sophisticated architectures, attention mechanisms, and multi-modal fusion to enhance tracking accuracy across diverse environments. However, the challenges posed by satellite imagery, such as background variations, atmospheric disturbances, and low-resolution object delineation, significantly impede the precision and reliability of traditional Single Object Tracking (SOT) techniques. Our study delves into these challenges and proposes prompt engineering methodologies, leveraging the Segment Anything Model (SAM) and TAPIR (Tracking Any Point with per-frame Initialization and temporal Refinement), to create a training-free point-based tracking method for small-scale objects on satellite videos. Experiments on the VISO dataset validate our strategy, marking a significant advancement in robust tracking solutions tailored for satellite imagery in remote sensing applications.

CVMay 29, 2025
Skin Lesion Phenotyping via Nested Multi-modal Contrastive Learning

Dionysis Christopoulos, Sotiris Spanos, Eirini Baltzi et al.

We introduce SLIMP (Skin Lesion Image-Metadata Pre-training) for learning rich representations of skin lesions through a novel nested contrastive learning approach that captures complex relationships between images and metadata. Melanoma detection and skin lesion classification based solely on images, pose significant challenges due to large variations in imaging conditions (lighting, color, resolution, distance, etc.) and lack of clinical and phenotypical context. Clinicians typically follow a holistic approach for assessing the risk level of the patient and for deciding which lesions may be malignant and need to be excised, by considering the patient's medical history as well as the appearance of other lesions of the patient. Inspired by this, SLIMP combines the appearance and the metadata of individual skin lesions with patient-level metadata relating to their medical record and other clinically relevant information. By fully exploiting all available data modalities throughout the learning process, the proposed pre-training strategy improves performance compared to other pre-training strategies on downstream skin lesions classification tasks highlighting the learned representations quality.

CVNov 13, 2024
TRACE: Transformer-based Risk Assessment for Clinical Evaluation

Dionysis Christopoulos, Sotiris Spanos, Valsamis Ntouskos et al.

We present TRACE (Transformer-based Risk Assessment for Clinical Evaluation), a novel method for clinical risk assessment based on clinical data, leveraging the self-attention mechanism for enhanced feature interaction and result interpretation. Our approach is able to handle different data modalities, including continuous, categorical and multiple-choice (checkbox) attributes. The proposed architecture features a shared representation of the clinical data obtained by integrating specialized embeddings of each data modality, enabling the detection of high-risk individuals using Transformer encoder layers. To assess the effectiveness of the proposed method, a strong baseline based on non-negative multi-layer perceptrons (MLPs) is introduced. The proposed method outperforms various baselines widely used in the domain of clinical risk assessment, while effectively handling missing values. In terms of explainability, our Transformer-based method offers easily interpretable results via attention weights, further enhancing the clinicians' decision-making process.

LGJun 9, 2021
It Takes Two to Tango: Mixup for Deep Metric Learning

Shashanka Venkataramanan, Bill Psomas, Ewa Kijak et al.

Metric learning involves learning a discriminative representation such that embeddings of similar classes are encouraged to be close, while embeddings of dissimilar classes are pushed far apart. State-of-the-art methods focus mostly on sophisticated loss functions or mining strategies. On the one hand, metric learning losses consider two or more examples at a time. On the other hand, modern data augmentation methods for classification consider two or more examples at a time. The combination of the two ideas is under-studied. In this work, we aim to bridge this gap and improve representations using mixup, which is a powerful data augmentation approach interpolating two or more examples and corresponding target labels at a time. This task is challenging because unlike classification, the loss functions used in metric learning are not additive over examples, so the idea of interpolating target labels is not straightforward. To the best of our knowledge, we are the first to investigate mixing both examples and target labels for deep metric learning. We develop a generalized formulation that encompasses existing metric learning loss functions and modify it to accommodate for mixup, introducing Metric Mix, or Metrix. We also introduce a new metric - utilization, to demonstrate that by mixing examples during training, we are exploring areas of the embedding space beyond the training classes, thereby improving representations. To validate the effect of improved representations, we show that mixing inputs, intermediate representations or embeddings along with target labels significantly outperforms state-of-the-art metric learning methods on four benchmark deep metric learning datasets.

LGApr 3, 2021
Evaluating explainable artificial intelligence methods for multi-label deep learning classification tasks in remote sensing

Ioannis Kakogeorgiou, Konstantinos Karantzalos

Although deep neural networks hold the state-of-the-art in several remote sensing tasks, their black-box operation hinders the understanding of their decisions, concealing any bias and other shortcomings in datasets and model performance. To this end, we have applied explainable artificial intelligence (XAI) methods in remote sensing multi-label classification tasks towards producing human-interpretable explanations and improve transparency. In particular, we utilized and trained deep learning models with state-of-the-art performance in the benchmark BigEarthNet and SEN12MS datasets. Ten XAI methods were employed towards understanding and interpreting models' predictions, along with quantitative metrics to assess and compare their performance. Numerous experiments were performed to assess the overall performance of XAI methods for straightforward prediction cases, competing multiple labels, as well as misclassification cases. According to our findings, Occlusion, Grad-CAM and Lime were the most interpretable and reliable XAI methods. However, none delivers high-resolution outputs, while apart from Grad-CAM, both Lime and Occlusion are computationally expensive. We also highlight different aspects of XAI performance and elaborate with insights on black-box decisions in order to improve transparency, understand their behavior and reveal, as well, datasets' particularities.

CVOct 17, 2019
Detecting Urban Changes with Recurrent Neural Networks from Multitemporal Sentinel-2 Data

Maria Papadomanolaki, Sagar Verma, Maria Vakalopoulou et al.

\begin{abstract} The advent of multitemporal high resolution data, like the Copernicus Sentinel-2, has enhanced significantly the potential of monitoring the earth's surface and environmental dynamics. In this paper, we present a novel deep learning framework for urban change detection which combines state-of-the-art fully convolutional networks (similar to U-Net) for feature representation and powerful recurrent networks (such as LSTMs) for temporal modeling. We report our results on the recently publicly available bi-temporal Onera Satellite Change Detection (OSCD) Sentinel-2 dataset, enhancing the temporal information with additional images of the same region on different dates. Moreover, we evaluate the performance of the recurrent networks as well as the use of the additional dates on the unseen test-set using an ensemble cross-validation strategy. All the developed models during the validation phase have scored an overall accuracy of more than 95%, while the use of LSTMs and further temporal information, boost the F1 rate of the change class by an additional 1.5%.

CVFeb 27, 2019
Shallow Water Bathymetry Mapping from UAV Imagery based on Machine Learning

Panagiotis Agrafiotis, Dimitrios Skarlatos, Andreas Georgopoulos et al.

The determination of accurate bathymetric information is a key element for near offshore activities, hydrological studies such as coastal engineering applications, sedimentary processes, hydrographic surveying as well as archaeological mapping and biological research. UAV imagery processed with Structure from Motion (SfM) and Multi View Stereo (MVS) techniques can provide a low-cost alternative to established shallow seabed mapping techniques offering as well the important visual information. Nevertheless, water refraction poses significant challenges on depth determination. Till now, this problem has been addressed through customized image-based refraction correction algorithms or by modifying the collinearity equation. In this paper, in order to overcome the water refraction errors, we employ machine learning tools that are able to learn the systematic underestimation of the estimated depths. In the proposed approach, based on known depth observations from bathymetric LiDAR surveys, an SVR model was developed able to estimate more accurately the real depths of point clouds derived from SfM-MVS procedures. Experimental results over two test sites along with the performed quantitative validation indicated the high potential of the developed approach.