Osman Tursun

CV
h-index65
13papers
272citations
Novelty45%
AI Score39

13 Papers

CVJul 5, 2022Code
SESS: Saliency Enhancing with Scaling and Sliding

Osman Tursun, Simon Denman, Sridha Sridharan et al.

High-quality saliency maps are essential in several machine learning application areas including explainable AI and weakly supervised object detection and segmentation. Many techniques have been developed to generate better saliency using neural networks. However, they are often limited to specific saliency visualisation methods or saliency issues. We propose a novel saliency enhancing approach called SESS (Saliency Enhancing with Scaling and Sliding). It is a method and model agnostic extension to existing saliency map generation methods. With SESS, existing saliency approaches become robust to scale variance, multiple occurrences of target objects, presence of distractors and generate less noisy and more discriminative saliency maps. SESS improves saliency by fusing saliency maps extracted from multiple patches at different scales from different areas, and combines these individual maps using a novel fusion scheme that incorporates channel-wise weights and spatial weighted average. To improve efficiency, we introduce a pre-filtering step that can exclude uninformative saliency maps to improve efficiency while still enhancing overall results. We evaluate SESS on object recognition and detection benchmarks where it achieves significant improvement. The code is released publicly to enable researchers to verify performance and further development. Code is available at: https://github.com/neouyghur/SESS

CVApr 5, 2023
Towards Self-Explainability of Deep Neural Networks with Heatmap Captioning and Large-Language Models

Osman Tursun, Simon Denman, Sridha Sridharan et al.

Heatmaps are widely used to interpret deep neural networks, particularly for computer vision tasks, and the heatmap-based explainable AI (XAI) techniques are a well-researched topic. However, most studies concentrate on enhancing the quality of the generated heatmap or discovering alternate heatmap generation techniques, and little effort has been devoted to making heatmap-based XAI automatic, interactive, scalable, and accessible. To address this gap, we propose a framework that includes two modules: (1) context modelling and (2) reasoning. We proposed a template-based image captioning approach for context modelling to create text-based contextual information from the heatmap and input data. The reasoning module leverages a large language model to provide explanations in combination with specialised knowledge. Our qualitative experiments demonstrate the effectiveness of our framework and heatmap captioning approach. The code for the proposed template-based heatmap captioning approach will be publicly available.

CLFeb 16, 2025
TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages

Jafar Isbarov, Arofat Akhundjanova, Mammad Hajili et al.

Being able to thoroughly assess massive multi-task language understanding (MMLU) capabilities is essential for advancing the applicability of multilingual language models. However, preparing such benchmarks in high quality native language is often costly and therefore limits the representativeness of evaluation datasets. While recent efforts focused on building more inclusive MMLU benchmarks, these are conventionally built using machine translation from high-resource languages, which may introduce errors and fail to account for the linguistic and cultural intricacies of the target languages. In this paper, we address the lack of native language MMLU benchmark especially in the under-represented Turkic language family with distinct morphosyntactic and cultural characteristics. We propose two benchmarks for Turkic language MMLU: TUMLU is a comprehensive, multilingual, and natively developed language understanding benchmark specifically designed for Turkic languages. It consists of middle- and high-school level questions spanning 11 academic subjects in Azerbaijani, Crimean Tatar, Karakalpak, Kazakh, Tatar, Turkish, Uyghur, and Uzbek. We also present TUMLU-mini, a more concise, balanced, and manually verified subset of the dataset. Using this dataset, we systematically evaluate a diverse range of open and proprietary multilingual large language models (LLMs), including Claude, Gemini, GPT, and LLaMA, offering an in-depth analysis of their performance across different languages, subjects, and alphabets. To promote further research and development in multilingual language understanding, we release TUMLU-mini and all corresponding evaluation scripts.

CVFeb 11, 2025
PDV: Prompt Directional Vectors for Zero-shot Composed Image Retrieval

Osman Tursun, Sinan Kalkan, Simon Denman et al.

Zero-shot Composed Image Retrieval (ZS-CIR) enables image search using a reference image and a text prompt without requiring specialized text-image composition networks trained on large-scale paired data. However, current ZS-CIR approaches suffer from three critical limitations in their reliance on composed text embeddings: static query embedding representations, insufficient utilization of image embeddings, and suboptimal performance when fusing text and image embeddings. To address these challenges, we introduce the \textbf{Prompt Directional Vector (PDV)}, a simple yet effective training-free enhancement that captures semantic modifications induced by user prompts. PDV enables three key improvements: (1) Dynamic composed text embeddings where prompt adjustments are controllable via a scaling factor, (2) composed image embeddings through semantic transfer from text prompts to image features, and (3) weighted fusion of composed text and image embeddings that enhances retrieval by balancing visual and semantic similarity. Our approach serves as a plug-and-play enhancement for existing ZS-CIR methods with minimal computational overhead. Extensive experiments across multiple benchmarks demonstrate that PDV consistently improves retrieval performance when integrated with state-of-the-art ZS-CIR approaches, particularly for methods that generate accurate compositional embeddings. The code will be released upon publication.

CLOct 28, 2025
Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures

Tyler A. Chang, Catherine Arnett, Abdelrahman Eldesokey et al. · uw

To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. We find that state-of-the-art LLMs perform well on Global PIQA in aggregate, but they exhibit weaker performance in lower-resource languages (up to a 37% accuracy gap, despite random chance at 50%). Open models generally perform worse than proprietary models. Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond its uses for LLM evaluation, we hope that Global PIQA provides a glimpse into the wide diversity of cultures in which human language is embedded.

LGMay 22, 2024
Part-based Quantitative Analysis for Heatmaps

Osman Tursun, Sinan Kalkan, Simon Denman et al.

Heatmaps have been instrumental in helping understand deep network decisions, and are a common approach for Explainable AI (XAI). While significant progress has been made in enhancing the informativeness and accessibility of heatmaps, heatmap analysis is typically very subjective and limited to domain experts. As such, developing automatic, scalable, and numerical analysis methods to make heatmap-based XAI more objective, end-user friendly, and cost-effective is vital. In addition, there is a need for comprehensive evaluation metrics to assess heatmap quality at a granular level.

CVApr 15, 2021
Learning Regional Attention over Multi-resolution Deep Convolutional Features for Trademark Retrieval

Osman Tursun, Simon Denman, Sridha Sridharan et al.

Large-scale trademark retrieval is an important content-based image retrieval task. A recent study shows that off-the-shelf deep features aggregated with Regional-Maximum Activation of Convolutions (R-MAC) achieve state-of-the-art results. However, R-MAC suffers in the presence of background clutter/trivial regions and scale variance, and discards important spatial information. We introduce three simple but effective modifications to R-MAC to overcome these drawbacks. First, we propose the use of both sum and max pooling to minimise the loss of spatial information. We also employ domain-specific unsupervised soft-attention to eliminate background clutter and unimportant regions. Finally, we add multi-resolution inputs to enhance the scale-invariance of R-MAC. We evaluate these three modifications on the million-scale METU dataset. Our results show that all modifications bring non-trivial improvements, and surpass previous state-of-the-art results.

CVFeb 8, 2021
An Efficient Framework for Zero-Shot Sketch-Based Image Retrieval

Osman Tursun, Simon Denman, Sridha Sridharan et al.

Recently, Zero-shot Sketch-based Image Retrieval (ZS-SBIR) has attracted the attention of the computer vision community due to it's real-world applications, and the more realistic and challenging setting than found in SBIR. ZS-SBIR inherits the main challenges of multiple computer vision problems including content-based Image Retrieval (CBIR), zero-shot learning and domain adaptation. The majority of previous studies using deep neural networks have achieved improved results through either projecting sketch and images into a common low-dimensional space or transferring knowledge from seen to unseen classes. However, those approaches are trained with complex frameworks composed of multiple deep convolutional neural networks (CNNs) and are dependent on category-level word labels. This increases the requirements on training resources and datasets. In comparison, we propose a simple and efficient framework that does not require high computational training resources, and can be trained on datasets without semantic categorical labels. Furthermore, at training and inference stages our method only uses a single CNN. In this work, a pre-trained ImageNet CNN (e.g., ResNet50) is fine-tuned with three proposed learning objects: domain-aware quadruplet loss, semantic classification loss, and semantic knowledge preservation loss. The domain-aware quadruplet and semantic classification losses are introduced to learn discriminative, semantic and domain invariant features through considering ZS-SBIR as object detection and verification problem. ...

CVFeb 5, 2020
Learning Test-time Augmentation for Content-based Image Retrieval

Osman Tursun, Simon Denman, Sridha Sridharan et al.

Off-the-shelf convolutional neural network features achieve outstanding results in many image retrieval tasks. However, their invariance to target data is pre-defined by the network architecture and training data. Existing image retrieval approaches require fine-tuning or modification of pre-trained networks to adapt to variations unique to the target data. In contrast, our method enhances the invariance of off-the-shelf features by aggregating features extracted from images augmented at test-time, with augmentations guided by a policy learned through reinforcement learning. The learned policy assigns different magnitudes and weights to the selected transformations, which are selected from a list of image transformations. Policies are evaluated using a metric learning protocol to learn the optimal policy. The model converges quickly and the cost of each policy iteration is minimal as we propose an off-line caching technique to greatly reduce the computational cost of extracting features from augmented images. Experimental results on large trademark retrieval (METU trademark dataset) and landmark retrieval (ROxford5k and RParis6k scene datasets) tasks show that the learned ensemble of transformations is highly effective for improving performance, and is practical, and transferable.

CVDec 16, 2019
MTRNet++: One-stage Mask-based Scene Text Eraser

Osman Tursun, Simon Denman, Rui Zeng et al.

A precise, controllable, interpretable and easily trainable text removal approach is necessary for both user-specific and large-scale text removal applications. To achieve this, we propose a one-stage mask-based text inpainting network, MTRNet++. It has a novel architecture that includes mask-refine, coarse-inpainting and fine-inpainting branches, and attention blocks. With this architecture, MTRNet++ can remove text either with or without an external mask. It achieves state-of-the-art results on both the Oxford and SCUT datasets without using external ground-truth masks. The results of ablation studies demonstrate that the proposed multi-branch architecture with attention blocks is effective and essential. It also demonstrates controllability and interpretability.

CVMar 11, 2019
MTRNet: A Generic Scene Text Eraser

Osman Tursun, Rui Zeng, Simon Denman et al.

Text removal algorithms have been proposed for uni-lingual scripts with regular shapes and layouts. However, to the best of our knowledge, a generic text removal method which is able to remove all or user-specified text regions regardless of font, script, language or shape is not available. Developing such a generic text eraser for real scenes is a challenging task, since it inherits all the challenges of multi-lingual and curved text detection and inpainting. To fill this gap, we propose a mask-based text removal network (MTRNet). MTRNet is a conditional adversarial generative network (cGAN) with an auxiliary mask. The introduced auxiliary mask not only makes the cGAN a generic text eraser, but also enables stable training and early convergence on a challenging large-scale synthetic dataset, initially proposed for text detection in real scenes. What's more, MTRNet achieves state-of-the-art results on several real-world datasets including ICDAR 2013, ICDAR 2017 MLT, and CTW1500, without being explicitly trained on this data, outperforming previous state-of-the-art methods trained directly on these datasets.

CVNov 7, 2018
Component-based Attention for Large-scale Trademark Retrieval

Osman Tursun, Simon Denman, Sabesan Sivapalan et al.

The demand for large-scale trademark retrieval (TR) systems has significantly increased to combat the rise in international trademark infringement. Unfortunately, the ranking accuracy of current approaches using either hand-crafted or pre-trained deep convolution neural network (DCNN) features is inadequate for large-scale deployments. We show in this paper that the ranking accuracy of TR systems can be significantly improved by incorporating hard and soft attention mechanisms, which direct attention to critical information such as figurative elements and reduce attention given to distracting and uninformative elements such as text and background. Our proposed approach achieves state-of-the-art results on a challenging large-scale trademark dataset.

CVJan 20, 2017
A Large-scale Dataset and Benchmark for Similar Trademark Retrieval

Osman Tursun, Cemal Aker, Sinan Kalkan

Trademark retrieval (TR) has become an important yet challenging problem due to an ever increasing trend in trademark applications and infringement incidents. There have been many promising attempts for the TR problem, which, however, fell impracticable since they were evaluated with limited and mostly trivial datasets. In this paper, we provide a large-scale dataset with benchmark queries with which different TR approaches can be evaluated systematically. Moreover, we provide a baseline on this benchmark using the widely-used methods applied to TR in the literature. Furthermore, we identify and correct two important issues in TR approaches that were not addressed before: reversal of contrast, and presence of irrelevant text in trademarks severely affect the TR methods. Lastly, we applied deep learning, namely, several popular Convolutional Neural Network models, to the TR problem. To the best of the authors, this is the first attempt to do so.