Erdem Akagündüz

CV
h-index: 18
Papers: 12
Citations: 51
Novelty: 38%
AI Score: 47

12 Papers

44.3 · CV · Apr 17 · Code
Robust Multispectral Semantic Segmentation under Missing or Full Modalities via Structured Latent Projection

Irem Ulku, Erdem Akagündüz, Ömer Özgür Tanrıöver

Multimodal remote sensing data provide complementary information for semantic segmentation, but in real-world deployments, some modalities may be unavailable due to sensor failures, acquisition issues, or challenging atmospheric conditions. Existing multimodal segmentation models typically address missing modalities by learning a shared representation across inputs. However, this approach can introduce a trade-off by compromising modality-specific complementary information and reducing performance when all modalities are available. In this paper, we tackle this limitation with CBC-SLP, a multimodal semantic segmentation model designed to preserve both modality-invariant and modality-specific information. Inspired by theoretical results on modality alignment, which state that perfectly aligned multimodal representations can lead to sub-optimal performance in downstream prediction tasks, we propose a novel structured latent projection approach as an architectural inductive bias. Rather than enforcing this strategy through a loss term, we incorporate it directly into the architecture. In particular, to use the complementary information effectively while maintaining robustness under random modality dropout, we structure the latent representations into shared and modality-specific components and adaptively transfer them to the decoder according to the random modality availability mask. Extensive experiments on three multimodal remote sensing image sets demonstrate that CBC-SLP consistently outperforms state-of-the-art multimodal models across full and missing modality scenarios. We also demonstrate empirically that the proposed strategy can recover complementary information that may not be preserved in a shared representation. The code is available at https://github.com/iremulku/Multispectral-Semantic-Segmentation-via-Structured-Latent-Projection-CBC-SLP-.
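
The abstract does not spell out how the shared and modality-specific components are passed to the decoder, so the following is only a hypothetical illustration of mask-conditioned routing, not the CBC-SLP architecture. It assumes each modality yields a shared vector and a specific vector, averages the shared vectors over the available modalities, and zeroes the specific vectors of missing ones. The function name `route_latents` and the modality names are invented for this sketch.

```python
def route_latents(shared, specific, mask):
    """Combine per-modality latents according to an availability mask.

    shared, specific: dict mapping modality name -> list of floats.
    mask: dict mapping modality name -> 1 (available) or 0 (missing).
    All names and the routing rule are assumptions for illustration.
    """
    available = [m for m in mask if mask[m]]
    if not available:
        raise ValueError("at least one modality must be available")
    dim = len(shared[available[0]])
    # Shared component: average over the modalities that are present.
    fused_shared = [
        sum(shared[m][i] for m in available) / len(available)
        for i in range(dim)
    ]
    # Specific components: kept when present, replaced by zeros when missing.
    routed_specific = {
        m: (list(specific[m]) if mask[m] else [0.0] * len(specific[m]))
        for m in mask
    }
    return fused_shared, routed_specific


shared = {"rgb": [1.0, 3.0], "nir": [3.0, 5.0]}
specific = {"rgb": [0.5], "nir": [0.7]}
full = route_latents(shared, specific, {"rgb": 1, "nir": 1})
rgb_only = route_latents(shared, specific, {"rgb": 1, "nir": 0})
```

With both modalities present the decoder would receive the averaged shared vector plus both specific vectors; with `nir` missing, only the `rgb` information is passed on.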

11.4 · CV · May 19 · Code
Latent Space Guided Scenario Sampling for Multimodal Segmentation Under Missing Modalities

Irem Ulku, Ö. Özgür Tanrıöver, Erdem Akagündüz

Multimodal semantic segmentation benefits remote sensing analysis by combining complementary information from different sensor modalities. In real-world remote sensing applications, one or more modalities may be unavailable due to sensor failures, adverse atmospheric conditions, or data acquisition problems. Even with pretrained multimodal representations and existing fine-tuning or adaptation strategies, performance may remain limited because all modality availability scenarios are typically treated as equally informative during training. In this paper, we propose a novel training strategy that learns a scenario sampling distribution directly from the pretrained latent space. Instead of relying on uniform random modality dropout, the proposed method guides fine-tuning toward more informative modality availability scenarios. More specifically, we quantify the effect of each scenario independently based on the distortion it induces in the shared latent representation. We then capture scenario relations using a radial basis function kernel and derive refined scenario scores through a regularized kernel smoothing. These scores are then converted into a probability distribution during scenario sampling for fine-tuning. We evaluate this strategy on three remote sensing image sets, namely DSTL, Potsdam, and Hunan, using CBC-SLP, CBC, and CMX backbones. The experimental results with different image sets and backbones show that our method outperforms standard fine-tuning and LoRA-based adaptation. These findings suggest that the pretrained latent representation can serve as an effective basis for sampling during missing modality fine-tuning. Code is available at https://github.com/iremulku/Latent-Space-Guided-Scenario-Sampling
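
The scoring pipeline described above (per-scenario distortion, RBF kernel between scenarios, smoothing, conversion to probabilities) can be sketched roughly as follows. This is one possible reading under stated assumptions: scenarios are represented as binary availability masks, the smoother is a normalized kernel average with a damping term `lam` in the denominator, and the final conversion is a softmax. The distortion values are placeholder numbers, and none of the function names come from the paper or its repository.

```python
import math
from itertools import product


def rbf_kernel(a, b, gamma=0.5):
    """RBF similarity between two binary availability masks."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def smooth_scores(masks, distortions, gamma=0.5, lam=0.1):
    """Kernel-weighted average of raw scores; lam damps the result (assumed form)."""
    smoothed = []
    for m_i in masks:
        weights = [rbf_kernel(m_i, m_j, gamma) for m_j in masks]
        numerator = sum(w * d for w, d in zip(weights, distortions))
        smoothed.append(numerator / (sum(weights) + lam))
    return smoothed


def to_distribution(scores, temperature=1.0):
    """Softmax over scores, giving scenario sampling probabilities."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# All scenarios for 3 modalities with at least one modality present.
masks = [m for m in product([0, 1], repeat=3) if any(m)]
# Hypothetical latent distortion per scenario (placeholder numbers).
distortions = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.0]
probs = to_distribution(smooth_scores(masks, distortions))
```

During fine-tuning, a scenario could then be drawn with `random.choices(masks, weights=probs)` instead of uniform random modality dropout.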

LG · Dec 4, 2025 · Code
TimesNet-Gen: Deep Learning-based Site Specific Strong Motion Generation

Baris Yilmaz, Bevan Deniz Cilgin, Erdem Akagündüz et al.

Effective earthquake risk reduction relies on accurate site-specific evaluations. This requires models that can represent the influence of local site conditions on ground motion characteristics. In this context, data-driven approaches that learn site-controlled signatures from recorded ground motions offer a promising direction. We address strong ground motion generation from time-domain accelerometer records and introduce TimesNet-Gen, a time-domain conditional generator. The approach uses a station-specific latent bottleneck. We evaluate generation by comparing horizontal-to-vertical spectral ratio (HVSR) curves and fundamental site-frequency $f_0$ distributions between real and generated records per station, and summarize station specificity with a score based on the $f_0$ distribution confusion matrices. TimesNet-Gen achieves strong station-wise alignment and compares favorably with a spectrogram-based conditional VAE baseline for site-specific strong motion synthesis. Our code is available at https://github.com/brsylmz23/TimesNet-Gen.

CV · Aug 27, 2024
Deep Learning-based Average Shear Wave Velocity Prediction using Accelerometer Records

Barış Yılmaz, Melek Türkmen, Sanem Meral et al.

Assessing seismic hazards and thereby designing earthquake-resilient structures or evaluating structural damage that has been incurred after an earthquake are important objectives in earthquake engineering. Both tasks require critical evaluation of strong ground motion records, and the knowledge of site conditions at the earthquake stations plays a major role in achieving the aforementioned objectives. Site conditions are generally represented by the time-averaged shear wave velocity in the upper 30 meters of the geological materials (Vs30). Several strong motion stations lack Vs30 measurements, resulting in potentially inaccurate assessment of seismic hazards and evaluation of ground motion records. In this study, we present a deep learning-based approach for predicting Vs30 at strong motion station locations using three-channel earthquake records. For this purpose, Convolutional Neural Networks (CNNs) with dilated and causal convolutional layers are used to extract deep features from accelerometer records collected from over 700 stations located in Turkey. To overcome the limited availability of labeled data, we propose a two-phase training approach. In the first phase, a CNN is trained to estimate the epicenters, for which ground truth is available for all records. In the second phase, the pre-trained encoder is fine-tuned based on the Vs30 ground truth. The performance of the proposed method is compared with machine learning models that utilize hand-crafted features. The results demonstrate that the deep convolutional encoder-based Vs30 prediction model outperforms the machine learning models that rely on hand-crafted features.
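
To illustrate what a dilated, causal convolution computes on a single accelerometer channel, here is a minimal pure-Python sketch. It is not the paper's network: the function name, the kernel values, and the single-channel setting are assumptions chosen only to show that each output sample depends on the current and past samples, spaced `dilation` steps apart.

```python
def causal_dilated_conv1d(signal, kernel, dilation=1):
    """Compute y[t] = sum_k kernel[k] * x[t - k * dilation].

    Samples before t = 0 are treated as zero, so no output ever
    depends on future input (the causal property).
    """
    output = []
    for t in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # skip taps that would reach before the record start
                acc += weight * signal[idx]
        output.append(acc)
    return output


# Toy example: a two-tap kernel looking two samples back.
result = causal_dilated_conv1d([1, 2, 3, 4, 5], [1, 1], dilation=2)
```

Stacking such layers with growing dilation widens the receptive field quickly without increasing the kernel size, which is the usual motivation for using them on long time series.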

CV · Apr 12, 2024 · Code
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts

Övgü Özdemir, Erdem Akagündüz

Visual question answering (VQA) is known as an AI-complete task as it requires understanding, reasoning, and inferring about the vision and the language content. Over the past few years, numerous neural architectures have been suggested for the VQA problem. However, achieving success in zero-shot VQA remains a challenge due to its requirement for advanced generalization and reasoning skills. This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline. Specifically, we explore the efficacy of utilizing image captions instead of images and leveraging large language models (LLMs) to establish a zero-shot setting. Since image captioning is the most crucial step in this process, we compare the impact of state-of-the-art image captioning models on VQA performance across various question types in terms of structure and semantics. We propose a straightforward and efficient question-driven image captioning approach within this pipeline to transfer contextual information into the question-answering (QA) model. This method involves extracting keywords from the question, generating a caption for each image-question pair using the keywords, and incorporating the question-driven caption into the LLM prompt. We evaluate the efficacy of using general-purpose and question-driven image captions in the VQA pipeline. Our study highlights the potential of employing image captions and harnessing the capabilities of LLMs to achieve competitive performance on GQA under the zero-shot setting. Our code is available at https://github.com/ovguyo/captions-in-VQA.
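
The three steps named above (keyword extraction, keyword-guided captioning, caption-in-prompt) can be sketched as plain string handling. Everything here is hypothetical: the stopword list, the prompt wording, and the function names are invented for illustration, and the captioning model and LLM calls themselves are left out.

```python
import re

# Small illustrative stopword list (an assumption, not the paper's method).
STOPWORDS = {
    "what", "which", "who", "where", "is", "are", "the", "a", "an",
    "of", "in", "on", "to", "there", "does", "do", "this", "that",
}


def extract_keywords(question):
    """Lowercase the question and keep the non-stopword tokens."""
    tokens = re.findall(r"[a-z]+", question.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_caption_request(keywords):
    """Text that would steer a captioning model toward the question's focus."""
    return "Describe the image, focusing on: " + ", ".join(keywords) + "."


def build_qa_prompt(caption, question):
    """Prompt for the LLM, which sees only the caption and never the image."""
    return f"Context: {caption}\nQuestion: {question}\nShort answer:"


question = "What color is the car next to the tree?"
keywords = extract_keywords(question)
caption_request = build_caption_request(keywords)
# Placeholder caption standing in for the captioning model's output.
prompt = build_qa_prompt("A red car is parked next to a tall tree.", question)
```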

CV · Sep 12, 2022
Detecting Driver Drowsiness as an Anomaly Using LSTM Autoencoders

Gülin Tüfekci, Alper Kayabaşi, Erdem Akagündüz et al.

In this paper, an LSTM autoencoder-based architecture is used for drowsiness detection, with ResNet-34 as the feature extractor. The problem is treated as anomaly detection for a single subject; therefore, only the normal driving representations are learned, and drowsiness representations are expected to yield higher reconstruction losses and thus be distinguishable from normal driving. We investigate the confidence levels of normal and anomaly clips through the label assignment methodology, analyzing the training performance of the LSTM autoencoder and the interpretation of anomalies encountered during testing under varying confidence rates. Our method is evaluated on NTHU-DDD and benchmarked against a state-of-the-art anomaly detection method for driver drowsiness. Results show that the proposed model achieves an area under the curve (AUC) of 0.8740 and provides significant improvements in certain scenarios.
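
The idea of scoring clips by reconstruction loss and summarizing detection quality with AUC can be illustrated without any model. In this minimal sketch the autoencoder is omitted and the feature vectors are placeholders; the two helper functions are assumptions introduced here. The AUC is computed by its pairwise definition: the fraction of (anomaly, normal) pairs in which the anomaly receives the higher score.

```python
def reconstruction_error(original, reconstructed):
    """Mean squared error between a feature vector and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


def auc(scores, labels):
    """Probability that a random anomaly (label 1) outscores a random normal clip (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Placeholder reconstruction errors for two normal and two drowsy clips.
separable = auc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
mixed = auc([0.1, 0.9, 0.2, 0.8], [0, 0, 1, 1])
```

An autoencoder trained only on normal driving should reconstruct drowsy clips poorly, so their errors rank higher and the AUC moves toward 1.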

CV · Jul 4, 2023
EANet: Enhanced Attribute-based RGBT Tracker Network

Abbas Türkoğlu, Erdem Akagündüz

Tracking objects can be a difficult task in computer vision, especially when faced with challenges such as occlusion, changes in lighting, and motion blur. Recent advances in deep learning have shown promise in addressing these conditions. However, most deep learning-based object trackers only use visible band (RGB) images. Thermal infrared (TIR) electromagnetic waves can provide additional information about an object, including its temperature, when faced with challenging conditions. We propose a deep learning-based image tracking approach that fuses RGB and thermal images (RGBT). The proposed model consists of two main components: a feature extractor and a tracker. The feature extractor encodes deep features from both the RGB and the TIR images. The tracker then uses these features to track the object using an enhanced attribute-based architecture. We propose a fusion of attribute-specific feature selection with an aggregation module. The proposed methods are evaluated on the RGBT234 and LasHeR datasets, which are the most widely used RGBT object-tracking datasets in the literature. The results show that the proposed system outperforms state-of-the-art RGBT object trackers on these datasets, with a relatively smaller number of parameters.

CV · Aug 25, 2024
Infrared Domain Adaptation with Zero-Shot Quantization

Burak Sevsay, Erdem Akagündüz

Quantization is one of the most popular techniques for reducing computation time and shrinking model size. However, ensuring the accuracy of quantized models typically involves calibration using training data, which may be inaccessible due to privacy concerns. In such cases, zero-shot quantization, a technique that relies on pretrained models and statistical information without the need for specific training data, becomes valuable. Exploring zero-shot quantization in the infrared domain is important due to the prevalence of infrared imaging in sensitive fields like medical and security applications. In this work, we demonstrate how to apply zero-shot quantization to an object detection model retrained with thermal imagery. We use batch normalization statistics of the model to distill data for calibration. RGB image-trained models and thermal image-trained models are compared in the context of zero-shot quantization. Our investigation focuses on the contributions of mean and standard deviation statistics to zero-shot quantization performance. Additionally, we compare zero-shot quantization with post-training quantization on a thermal dataset. We demonstrate that zero-shot quantization successfully generates data that represents the training dataset for the quantization of object detection models. Our results indicate that our zero-shot quantization framework is effective in the absence of training data and is well-suited for the infrared domain.
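
Distilling calibration data from batch normalization statistics is commonly done by optimizing synthetic inputs until the activations they produce match each layer's stored mean and variance. The sketch below shows only the matching objective for one layer, in pure Python. The exact loss used in the paper is not given in the abstract, so the squared-difference form, the function name, and the data layout are assumptions.

```python
import math


def bn_stat_loss(activations, running_mean, running_var):
    """Mismatch between synthetic-batch statistics and stored BN statistics.

    activations: list of channels, each a list of values observed at the
    BN layer input for the synthetic batch.
    running_mean, running_var: the layer's stored per-channel statistics.
    """
    loss = 0.0
    for values, mu_ref, var_ref in zip(activations, running_mean, running_var):
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values)
        # Mean term and standard deviation term, kept separate so that
        # their individual contributions can be inspected.
        loss += (mu - mu_ref) ** 2
        loss += (math.sqrt(var) - math.sqrt(var_ref)) ** 2
    return loss


matched = bn_stat_loss([[1.0, 3.0]], running_mean=[2.0], running_var=[1.0])
mismatched = bn_stat_loss([[0.0, 0.0]], running_mean=[1.0], running_var=[0.0])
```

Summing this quantity over all BN layers and minimizing it with respect to the input images would yield calibration data without touching the original training set.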

LG · Mar 7, 2025 · Code
Deep Sequence Models for Predicting Average Shear Wave Velocity from Strong Motion Records

Baris Yilmaz, Erdem Akagündüz, Salih Tileylioglu

This study explores the use of deep learning for predicting the time-averaged shear wave velocity in the top 30 m of the subsurface ($V_{s30}$) at strong motion recording stations in Türkiye. $V_{s30}$ is a key parameter in site characterization and, as a result, in seismic hazard assessment. However, it is often unavailable due to the lack of direct measurements and is therefore estimated using empirical correlations. Such correlations, however, are commonly inadequate in capturing complex, site-specific variability, which motivates the need for data-driven approaches. In this study, we employ a hybrid deep learning model combining convolutional neural networks (CNNs) and long short-term memory (LSTM) networks to capture both spatial and temporal dependencies in strong motion records. Furthermore, we explore how using different parts of the signal influences our deep learning model. Our results suggest that the hybrid approach effectively learns complex, nonlinear relationships within seismic signals. We observed that an improved P-wave arrival time model increased the prediction accuracy of $V_{s30}$. We believe the study provides valuable insights into improving $V_{s30}$ predictions using a CNN-LSTM framework, demonstrating its potential for improving site characterization for seismic studies. Our code is available at https://github.com/brsylmz23/CNNLSTM_DeepEQ
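
For reference, $V_{s30}$ is conventionally obtained from a layered shear wave velocity profile as the travel-time average over the top 30 m. The layer thickness $h_i$, layer velocity $v_{s,i}$, and layer count $N$ below are notation introduced here and do not appear in the abstract.

```latex
V_{s30} = \frac{30\,\text{m}}{\sum_{i=1}^{N} \dfrac{h_i}{v_{s,i}}},
\qquad \sum_{i=1}^{N} h_i = 30\,\text{m}
```

For example, a profile with 10 m at 200 m/s over 20 m at 400 m/s gives a travel time of $0.05 + 0.05 = 0.1$ s and therefore $V_{s30} = 300$ m/s.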

SP · Mar 12, 2024
Exploring Challenges in Deep Learning of Single-Station Ground Motion Records

Ümit Mert Çağlar, Baris Yilmaz, Melek Türkmen et al.

Contemporary deep learning models have demonstrated promising results across various applications within seismology and earthquake engineering. These models rely primarily on utilizing ground motion records for tasks such as earthquake event classification, localization, earthquake early warning systems, and structural health monitoring. However, the extent to which these models truly extract meaningful patterns from these complex time-series signals remains underexplored. In this study, our objective is to evaluate the degree to which auxiliary information, such as seismic phase arrival times or seismic station distribution within a network, dominates the process of deep learning from ground motion records, potentially hindering its effectiveness. Our experimental results reveal a strong dependence on the highly correlated Primary (P) and Secondary (S) phase arrival times. These findings expose a critical gap in the current research landscape, highlighting the lack of robust methodologies for deep learning from single-station ground motion recordings that do not rely on auxiliary inputs.

CV · Feb 1, 2024
FuseFormer: A Transformer for Visual and Thermal Image Fusion

Aytekin Erdogan, Erdem Akagündüz

Due to the lack of a definitive ground truth for the image fusion problem, the loss functions are structured based on evaluation metrics, such as the structural similarity index measure (SSIM). However, in doing so, a bias is introduced toward the SSIM and, consequently, the input visual band image. The objective of this study is to propose a novel methodology for the image fusion problem that mitigates the limitations associated with using classical evaluation metrics as loss functions. Our approach integrates a transformer-based multi-scale fusion strategy that adeptly addresses local and global context information. This integration not only refines the individual components of the image fusion process but also significantly enhances the overall efficacy of the method. Our proposed method follows a two-stage training approach, where an auto-encoder is initially trained to extract deep features at multiple scales in the first stage. For the second stage, we integrate our fusion block and change the loss function as mentioned. The multi-scale features are fused using a combination of Convolutional Neural Networks (CNNs) and Transformers. The CNNs are utilized to capture local features, while the Transformer handles the integration of general context features. Through extensive experiments on various benchmark datasets, our proposed method, along with the novel loss function definition, demonstrates superior performance compared to other competitive fusion algorithms.

LG · Jul 6, 2021
Dynamical System Parameter Identification using Deep Recurrent Cell Networks

Erdem Akagündüz, Oguzhan Cifdaloz

In this paper, we investigate the parameter identification problem in dynamical systems through a deep learning approach. Focusing mainly on second-order, linear time-invariant dynamical systems, we study damping factor identification. By utilizing a six-layer deep neural network with different recurrent cells, namely GRUs, LSTMs, or BiLSTMs, and by feeding input-output sequence pairs captured from a dynamical system simulator, we search for an effective deep recurrent architecture to solve the damping factor identification problem. Our results show that, although not previously used for this task in the literature, bidirectional gated recurrent cells (BiLSTMs) provide better parameter identification results than unidirectional gated recurrent memory cells such as GRUs and LSTMs. This indicates that an input-output sequence pair of finite length, collected from a dynamical system and observed anachronistically, may carry information in both time directions for predicting a dynamical system's parameter.
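
To make the data-generation side concrete, here is a minimal sketch of a simulator for a second-order linear time-invariant system in its standard form, $\ddot{y} + 2\zeta\omega_n\dot{y} + \omega_n^2 y = \omega_n^2 u$, where $\zeta$ is the damping factor. It is not the simulator used in the paper: the integration scheme (semi-implicit Euler), the step size, the step input, and the function name are all assumptions. Each call produces one input-output sequence pair of the kind a recurrent network could be trained on, with $\zeta$ as the regression target.

```python
def simulate_second_order(u, zeta, omega_n=1.0, dt=0.01):
    """Integrate y'' + 2*zeta*omega_n*y' + omega_n**2*y = omega_n**2*u from rest."""
    y, v = 0.0, 0.0
    outputs = []
    for u_t in u:
        accel = omega_n ** 2 * (u_t - y) - 2.0 * zeta * omega_n * v
        v += accel * dt  # update velocity first (semi-implicit Euler)
        y += v * dt      # then position, using the new velocity
        outputs.append(y)
    return outputs


# Unit step input over 20 simulated seconds.
step_input = [1.0] * 2000
underdamped = simulate_second_order(step_input, zeta=0.1)
critically_damped = simulate_second_order(step_input, zeta=1.0)
```

The underdamped response overshoots the step and oscillates, while the critically damped one rises to the setpoint without overshoot, so the damping factor leaves a visible signature across the whole output sequence.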