Savas Ozkan

CV
h-index7
20papers
329citations
Novelty52%
AI Score53

20 Papers

28.1CVApr 15
Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

Ahmed Bourouis, Savas Ozkan, Andrea Maracani et al.

We tackle a new problem: generating geometrically consistent multi-view scenes from a single freehand sketch. Freehand sketches are the most geometrically impoverished input one could offer a multi-view generator. They convey scene intent through abstract strokes while introducing spatial distortions that actively conflict with any consistent 3D interpretation. No prior method attempts this; existing multi-view approaches require photographs or text, while sketch-to-3D methods need multiple views or costly per-scene optimisation. We address three compounding challenges; absent training data, the need for geometric reasoning from distorted 2D input, and cross-view consistency, through three mutually reinforcing contributions: (i) a curated dataset of $\sim$9k sketch-to-multiview samples, constructed via an automated generation and filtering pipeline; (ii) Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer; and (iii) a Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions. Our framework synthesizes all views in a single denoising process without requiring reference images, iterative refinement, or per-scene optimization. Our approach significantly outperforms state-of-the-art two-stage baselines, improving realism (FID) by over 60% and geometric consistency (Corr-Acc) by 23%, while providing up to a 3.7$\times$ inference speedup.

CVDec 12, 2020Code
Spectral Unmixing With Multinomial Mixture Kernel and Wasserstein Generative Adversarial Loss

Savas Ozkan, Gozde Bozdagi Akar

This study proposes a novel framework for spectral unmixing by using 1D convolution kernels and spectral uncertainty. High-level representations are computed from data, and they are further modeled with the Multinomial Mixture Model to estimate fractions under severe spectral uncertainty. Furthermore, a new trainable uncertainty term based on a nonlinear neural network model is introduced in the reconstruction step. All uncertainty models are optimized by Wasserstein Generative Adversarial Network (WGAN) to improve stability and capture uncertainty. Experiments are performed on both real and synthetic datasets. The results validate that the proposed method obtains state-of-the-art performance, especially for the real datasets compared to the baselines. Project page at: https://github.com/savasozkan/dscn.

15.2CLApr 22
Decoding Text Spans for Efficient and Accurate Named-Entity Recognition

Andrea Maracani, Savas Ozkan, Junyi Zhu et al.

Named Entity Recognition (NER) is a key component in industrial information extraction pipelines, where systems must satisfy strict latency and throughput constraints in addition to strong accuracy. State-of-the-art NER accuracy is often achieved by span-based frameworks, which construct span representations from token encodings and classify candidate spans. However, many span-based methods enumerate large numbers of candidates and process each candidate with marker-augmented inputs, substantially increasing inference cost and limiting scalability in large-scale deployments. In this work, we propose SpanDec, an efficient span-based NER framework that targets this bottleneck. Our main insight is that span representation interactions can be computed effectively at the final transformer stage, avoiding redundant computation in earlier layers via a lightweight decoder dedicated to span representations. We further introduce a span filtering mechanism during enumeration to prune unlikely candidates before expensive processing. Across multiple benchmarks, SpanDec matches competitive span-based baselines while improving throughput and reducing computational cost, yielding a better accuracy-efficiency trade-off suitable for high-volume serving and on-device applications.

CVMar 20, 2025
Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation

Andrea Maracani, Savas Ozkan, Sijun Cho et al.

Scaling architectures have been proven effective for improving Scene Text Recognition (STR), but the individual contribution of vision encoder and text decoder scaling remain under-explored. In this work, we present an in-depth empirical analysis and demonstrate that, contrary to previous observations, scaling the decoder yields significant performance gains, always exceeding those achieved by encoder scaling alone. We also identify label noise as a key challenge in STR, particularly in real-world data, which can limit the effectiveness of STR models. To address this, we propose Cloze Self-Distillation (CSD), a method that mitigates label noise by distilling a student model from context-aware soft predictions and pseudolabels generated by a teacher model. Additionally, we enhance the decoder architecture by introducing differential cross-attention for STR. Our methodology achieves state-of-the-art performance on 10 out of 11 benchmarks using only real data, while significantly reducing the parameter size and computational costs.

CVNov 20, 2025
Mem-MLP: Real-Time 3D Human Motion Generation from Sparse Inputs

Sinan Mutlu, Georgios F. Angelis, Savas Ozkan et al.

Realistic and smooth full-body tracking is crucial for immersive AR/VR applications. Existing systems primarily track head and hands via Head Mounted Devices (HMDs) and controllers, making the 3D full-body reconstruction in-complete. One potential approach is to generate the full-body motions from sparse inputs collected from limited sensors using a Neural Network (NN) model. In this paper, we propose a novel method based on a multi-layer perceptron (MLP) backbone that is enhanced with residual connections and a novel NN-component called Memory-Block. In particular, Memory-Block represents missing sensor data with trainable code-vectors, which are combined with the sparse signals from previous time instances to improve the temporal consistency. Furthermore, we formulate our solution as a multi-task learning problem, allowing our MLP-backbone to learn robust representations that boost accuracy. Our experiments show that our method outperforms state-of-the-art baselines by substantially reducing prediction errors. Moreover, it achieves 72 FPS on mobile HMDs that ultimately improves the accuracy-running time tradeoff.

CLOct 8, 2025
Multi-Task Pre-Finetuning of Lightweight Transformer Encoders for Text Classification and NER

Junyi Zhu, Savas Ozkan, Andrea Maracani et al.

Deploying natural language processing (NLP) models on mobile platforms requires models that can adapt across diverse applications while remaining efficient in memory and computation. We investigate pre-finetuning strategies to enhance the adaptability of lightweight BERT-like encoders for two fundamental NLP task families: named entity recognition (NER) and text classification. While pre-finetuning improves downstream performance for each task family individually, we find that naïve multi-task pre-finetuning introduces conflicting optimization signals that degrade overall performance. To address this, we propose a simple yet effective multi-task pre-finetuning framework based on task-primary LoRA modules, which enables a single shared encoder backbone with modular adapters. Our approach achieves performance comparable to individual pre-finetuning while meeting practical deployment constraint. Experiments on 21 downstream tasks show average improvements of +0.8% for NER and +8.8% for text classification, demonstrating the effectiveness of our method for versatile mobile NLP applications.

CVMay 3, 2025
Efficient 3D Full-Body Motion Generation from Sparse Tracking Inputs with Temporal Windows

Georgios Fotios Angelis, Savas Ozkan, Sinan Mutlu et al.

To have a seamless user experience on immersive AR/VR applications, the importance of efficient and effective Neural Network (NN) models is undeniable, since missing body parts that cannot be captured by limited sensors should be generated using these models for a complete 3D full-body reconstruction in virtual environment. However, the state-of-the-art NN-models are typically computational expensive and they leverage longer sequences of sparse tracking inputs to generate full-body movements by capturing temporal context. Inevitably, longer sequences increase the computation overhead and introduce noise in longer temporal dependencies that adversely affect the generation performance. In this paper, we propose a novel Multi-Layer Perceptron (MLP)-based method that enhances the overall performance while balancing the computational cost and memory overhead for efficient 3D full-body generation. Precisely, we introduce a NN-mechanism that divides the longer sequence of inputs into smaller temporal windows. Later, the current motion is merged with the information from these windows through latent representations to utilize the past context for the generation. Our experiments demonstrate that generation accuracy of our method with this NN-mechanism is significantly improved compared to the state-of-the-art methods while greatly reducing computational costs and memory overhead, making our method suitable for resource-constrained devices.

LGMar 26, 2025
Guided Model Merging for Hybrid Data Learning: Leveraging Centralized Data to Refine Decentralized Models

Junyi Zhu, Ruicong Yao, Taha Ceritli et al.

Current network training paradigms primarily focus on either centralized or decentralized data regimes. However, in practice, data availability often exhibits a hybrid nature, where both regimes coexist. This hybrid setting presents new opportunities for model training, as the two regimes offer complementary trade-offs: decentralized data is abundant but subject to heterogeneity and communication constraints, while centralized data, though limited in volume and potentially unrepresentative, enables better curation and high-throughput access. Despite its potential, effectively combining these paradigms remains challenging, and few frameworks are tailored to hybrid data regimes. To address this, we propose a novel framework that constructs a model atlas from decentralized models and leverages centralized data to refine a global model within this structured space. The refined model is then used to reinitialize the decentralized models. Our method synergizes federated learning (to exploit decentralized data) and model merging (to utilize centralized data), enabling effective training under hybrid data availability. Theoretically, we show that our approach achieves faster convergence than methods relying solely on decentralized data, due to variance reduction in the merging process. Extensive experiments demonstrate that our framework consistently outperforms purely centralized, purely decentralized, and existing hybrid-adaptable methods. Notably, our method remains robust even when the centralized and decentralized data domains differ or when decentralized data contains noise, significantly broadening its applicability.

CVMar 24, 2025
Efficient and Accurate Scene Text Recognition with Cascaded-Transformers

Savas Ozkan, Andrea Maracani, Hyowon Kim et al.

In recent years, vision transformers with text decoder have demonstrated remarkable performance on Scene Text Recognition (STR) due to their ability to capture long-range dependencies and contextual relationships with high learning capacity. However, the computational and memory demands of these models are significant, limiting their deployment in resource-constrained applications. To address this challenge, we propose an efficient and accurate STR system. Specifically, we focus on improving the efficiency of encoder models by introducing a cascaded-transformers structure. This structure progressively reduces the vision token size during the encoding step, effectively eliminating redundant tokens and reducing computational cost. Our experimental results confirm that our STR system achieves comparable performance to state-of-the-art baselines while substantially decreasing computational requirements. In particular, for large-models, the accuracy remains same, 92.77 to 92.68, while computational complexity is almost halved with our structure.

CVMay 9, 2021
Binarized Weight Error Networks With a Transition Regularization Term

Savas Ozkan, Gozde Bozdagi Akar

This paper proposes a novel binarized weight network (BT) for a resource-efficient neural structure. The proposed model estimates a binary representation of weights by taking into account the approximation error with an additional term. This model increases representation capacity and stability, particularly for shallow networks, while the computation load is theoretically reduced. In addition, a novel regularization term is introduced that is suitable for all threshold-based binary precision networks. This term penalizes the trainable parameters that are far from the thresholds at which binary transitions occur. This step promotes a swift modification for binary-precision responses at train time. The experimental results are carried out for two sets of tasks: visual classification and visual inverse problems. Benchmarks for Cifar10, SVHN, Fashion, ImageNet2012, Set5, Set14, Urban and BSD100 datasets show that our method outperforms all counterparts with binary precision.

IVJun 8, 2020
Cross-Domain Segmentation with Adversarial Loss and Covariate Shift for Biomedical Imaging

Bora Baydar, Savas Ozkan, A. Emre Kavur et al.

Despite the widespread use of deep learning methods for semantic segmentation of images that are acquired from a single source, clinicians often use multi-domain data for a detailed analysis. For instance, CT and MRI have advantages over each other in terms of imaging quality, artifacts, and output characteristics that lead to differential diagnosis. The capacity of current segmentation techniques is only allow to work for an individual domain due to their differences. However, the models that are capable of working on all modalities are essentially needed for a complete solution. Furthermore, robustness is drastically affected by the number of samples in the training step, especially for deep learning models. Hence, there is a necessity that all available data regardless of data domain should be used for reliable methods. For this purpose, this manuscript aims to implement a novel model that can learn robust representations from cross-domain data by encapsulating distinct and shared patterns from different modalities. Precisely, covariate shift property is retained with structural modification and adversarial loss where sparse and rich representations are obtained. Hence, a single parameter set is used to perform cross-domain segmentation task. The superiority of the proposed method is that no information related to modalities are provided in either training or inference phase. The tests on CT and MRI liver data acquired in routine clinical workflows show that the proposed model outperforms all other baseline with a large margin. Experiments are also conducted on Covid-19 dataset that it consists of CT data where significant intra-class visual differences are observed. Similarly, the proposed method achieves the best performance.

CVNov 28, 2018
Automatic Liver Segmentation with Adversarial Loss and Convolutional Neural Network

Bora Baydar, Savas Ozkan, Gozde Bozdagi Akar

Automatic segmentation of medical images is among most demanded works in the medical information field since it saves time of the experts in the field and avoids human error factors. In this work, a method based on Conditional Adversarial Networks and Fully Convolutional Networks is proposed for the automatic segmentation of the liver MRIs. The proposed method, without any post-processing, is achieved the second place in the SIU Liver Segmentation Challenge 2018, data of which is provided by Dokuz Eylül University. In this paper, some improvements for the post-processing step are also proposed and it is shown that with these additions, the method outperforms other baseline methods.

CVAug 3, 2018
Improved Deep Spectral Convolution Network For Hyperspectral Unmixing With Multinomial Mixture Kernel and Endmember Uncertainty

Savas Ozkan, Gozde Bozdagi Akar

In this study, we propose a novel framework for hyperspectral unmixing by using an improved deep spectral convolution network (DSCN++) combined with endmember uncertainty. DSCN++ is used to compute high-level representations which are further modeled with Multinomial Mixture Model to estimate abundance maps. In the reconstruction step, a new trainable uncertainty term based on a nonlinear neural network model is introduced to provide robustness to endmember uncertainty. For the optimization of the coefficients of the multinomial model and the uncertainty term, Wasserstein Generative Adversarial Network (WGAN) is exploited to improve stability and to capture uncertainty. Experiments are performed on both real and synthetic datasets. The results validate that the proposed method obtains state-of-the-art hyperspectral unmixing performance particularly on the real datasets compared to the baseline techniques.

CVAug 3, 2018
Exploiting Local Indexing and Deep Feature Confidence Scores for Fast Image-to-Video Search

Savas Ozkan, Gozde Bozdagi Akar

The cost-effective visual representation and fast query-by-example search are two challenging goals that should be maintained for web-scale visual retrieval tasks on moderate hardware. This paper introduces a fast and robust method that ensures both of these goals by obtaining state-of-the-art performance for an image-to-video search scenario. Hence, we present critical enhancements to well-known indexing and visual representation techniques by promoting faster, better and moderate retrieval performance. We also boost the superiority of our method for some visual challenges by exploiting individual decisions of local and global descriptors at query time. For instance, local content descriptors represent copied/duplicated scenes with large geometric deformations such as scale, orientation and affine transformation. In contrast, the use of global content descriptors is more practical for near-duplicate and semantic searches. Experiments are conducted on a large-scale Stanford I2V dataset. The experimental results show that our method is useful in terms of complexity and query processing time for large-scale visual retrieval scenarios, even if local and global representations are used together. The proposed method is superior and achieves state-of-the-art performance based on the mean average precision (MAP) score of this dataset. Lastly, we report additional MAP scores after updating the ground annotations unveiled by retrieval results of the proposed method, and it shows that the actual performance.

CVJun 22, 2018
KinshipGAN: Synthesizing of Kinship Faces From Family Photos by Regularizing a Deep Face Network

Savas Ozkan, Akin Ozkan

In this paper, we propose a kinship generator network that can synthesize a possible child face by analyzing his/her parent's photo. For this purpose, we focus on to handle the scarcity of kinship datasets throughout the paper by proposing novel solutions in particular. To extract robust features, we integrate a pre-trained face model to the kinship face generator. Moreover, the generator network is regularized with an additional face dataset and adversarial loss to decrease the overfitting of the limited samples. Lastly, we adapt cycle-domain transformation to attain a more stable results. Experiments are conducted on Families in the Wild (FIW) dataset. The experimental results show that the contributions presented in the paper provide important performance improvements compared to the baseline architecture and our proposed method yields promising perceptual results.

CVJun 22, 2018
Deep Spectral Convolution Network for HyperSpectral Unmixing

Savas Ozkan, Gozde Bozdagi Akar

In this paper, we propose a novel hyperspectral unmixing technique based on deep spectral convolution networks (DSCN). Particularly, three important contributions are presented throughout this paper. First, fully-connected linear operation is replaced with spectral convolutions to extract local spectral characteristics from hyperspectral signatures with a deeper network architecture. Second, instead of batch normalization, we propose a spectral normalization layer which improves the selectivity of filters by normalizing their spectral responses. Third, we introduce two fusion configurations that produce ideal abundance maps by using the abstract representations computed from previous layers. In experiments, we use two real datasets to evaluate the performance of our method with other baseline techniques. The experimental results validate that the proposed method outperforms baselines based on Root Mean Square Error (RMSE).

CVJan 26, 2018
Cloud Detection From RGB Color Remote Sensing Images With Deep Pyramid Networks

Savas Ozkan, Mehmet Efendioglu, Caner Demirpolat

Cloud detection from remotely observed data is a critical pre-processing step for various remote sensing applications. In particular, this problem becomes even harder for RGB color images, since there is no distinct spectral pattern for clouds, which is directly separable from the Earth surface. In this paper, we adapt a deep pyramid network (DPN) to tackle this problem. For this purpose, the network is enhanced with a pre-trained parameter model at the encoder layer. Moreover, the method is able to obtain accurate pixel-level segmentation and classification results from a set of noisy labeled RGB color images. In order to demonstrate the superiority of the method, we collect and label data with the corresponding cloud/non-cloudy masks acquired from low-orbit Gokturk-2 and RASAT satellites. The experimental results validates that the proposed method outperforms several baselines even for hard cases (e.g. snowy mountains) that are perceptually difficult to distinguish by human eyes.

CVAug 24, 2017
Relaxed Spatio-Temporal Deep Feature Aggregation for Real-Fake Expression Prediction

Savas Ozkan, Gozde Bozdagi Akar

Frame-level visual features are generally aggregated in time with the techniques such as LSTM, Fisher Vectors, NetVLAD etc. to produce a robust video-level representation. We here introduce a learnable aggregation technique whose primary objective is to retain short-time temporal structure between frame-level features and their spatial interdependencies in the representation. Also, it can be easily adapted to the cases where there have very scarce training samples. We evaluate the method on a real-fake expression prediction dataset to demonstrate its superiority. Our method obtains 65% score on the test dataset in the official MAP evaluation and there is only one misclassified decision with the best reported result in the Chalearn Challenge (i.e. 66:7%) . Lastly, we believe that this method can be extended to different problems such as action/event recognition in future.

CVAug 6, 2017
EndNet: Sparse AutoEncoder Network for Endmember Extraction and Hyperspectral Unmixing

Savas Ozkan, Berk Kaya, Gozde Bozdagi Akar

Data acquired from multi-channel sensors is a highly valuable asset to interpret the environment for a variety of remote sensing applications. However, low spatial resolution is a critical limitation for previous sensors and the constituent materials of a scene can be mixed in different fractions due to their spatial interactions. Spectral unmixing is a technique that allows us to obtain the material spectral signatures and their fractions from hyperspectral data. In this paper, we propose a novel endmember extraction and hyperspectral unmixing scheme, so called \textit{EndNet}, that is based on a two-staged autoencoder network. This well-known structure is completely enhanced and restructured by introducing additional layers and a projection metric (i.e., spectral angle distance (SAD) instead of inner product) to achieve an optimum solution. Moreover, we present a novel loss function that is composed of a Kullback-Leibler divergence term with SAD similarity and additional penalty terms to improve the sparsity of the estimates. These modifications enable us to set the common properties of endmembers such as non-linearity and sparsity for autoencoder networks. Lastly, due to the stochastic-gradient based approach, the method is scalable for large-scale data and it can be accelerated on Graphical Processing Units (GPUs). To demonstrate the superiority of our proposed method, we conduct extensive experiments on several well-known datasets. The results confirm that the proposed method considerably improves the performance compared to the state-of-the-art techniques in literature.

CVJul 25, 2016
Large-Scale Video Search with Efficient Temporal Voting Structure

Ersin Esen, Savas Ozkan, Ilkay Atil

In this work, we propose a fast content-based video querying system for large-scale video search. The proposed system is distinguished from similar works with two major contributions. First contribution is superiority of joint usage of repeated content representation and efficient hashing mechanisms. Repeated content representation is utilized with a simple yet robust feature, which is based on edge energy of frames. Each of the representation is converted into hash code with Hamming Embedding method for further queries. Second contribution is novel queue-based voting scheme that leads to modest memory requirements with gradual memory allocation capability, contrary to complete brute-force temporal voting schemes. This aspect enables us to make queries on large video databases conveniently, even on commodity computers with limited memory capacity. Our results show that the system can respond to video queries on a large video database with fast query times, high recall rate and very low memory and disk requirements.