CVSep 10, 2023Code
Transformers in Small Object Detection: A Benchmark and Survey of State-of-the-ArtAref Miri Rekavandi, Shima Rashidi, Farid Boussaid et al.
Transformers have rapidly gained popularity in computer vision, especially in the field of object recognition and detection. Upon examining the outcomes of state-of-the-art object detection methods, we noticed that transformers consistently outperformed well-established CNN-based detectors in almost every video or image dataset. While transformer-based approaches remain at the forefront of small object detection (SOD) techniques, this paper aims to explore the performance benefits offered by such extensive networks and identify potential reasons for their SOD superiority. Small objects have been identified as one of the most challenging object types in detection frameworks due to their low visibility. We aim to investigate potential strategies that could enhance transformers' performance in SOD. This survey presents a taxonomy of over 60 research studies on developed transformers for the task of SOD, spanning the years 2020 to 2023. These studies encompass a variety of detection applications, including small object detection in generic images, aerial images, medical images, active millimeter images, underwater images, and videos. We also compile and present a list of 12 large-scale datasets suitable for SOD that were overlooked in previous studies and compare the performance of the reviewed studies using popular metrics such as mean Average Precision (mAP), Frames Per Second (FPS), number of parameters, and more. Researchers can keep track of newer studies on our web page, which is available at \url{https://github.com/arekavandi/Transformer-SOD}.
CVJan 3, 2023Code
Correlation Loss: Enforcing Correlation between Classification and LocalizationFehmi Kahraman, Kemal Oksuz, Sinan Kalkan et al.
Object detectors are conventionally trained by a weighted sum of classification and localization losses. Recent studies (e.g., predicting IoU with an auxiliary head, Generalized Focal Loss, Rank & Sort Loss) have shown that forcing these two loss terms to interact with each other in non-conventional ways creates a useful inductive bias and improves performance. Inspired by these works, we focus on the correlation between classification and localization and make two main contributions: (i) We provide an analysis about the effects of correlation between classification and localization tasks in object detectors. We identify why correlation affects the performance of various NMS-based and NMS-free detectors, and we devise measures to evaluate the effect of correlation and use them to analyze common detectors. (ii) Motivated by our observations, e.g., that NMS-free detectors can also benefit from correlation, we propose Correlation Loss, a novel plug-in loss function that improves the performance of various object detectors by directly optimizing correlation coefficients: E.g., Correlation Loss on Sparse R-CNN, an NMS-free method, yields 1.6 AP gain on COCO and 1.8 AP gain on Cityscapes dataset. Our best model on Sparse R-CNN reaches 51.0 AP without test-time augmentation on COCO test-dev, reaching state-of-the-art. Code is available at https://github.com/fehmikahraman/CorrLoss
CVSep 9, 2024Code
Early-exit Convolutional Neural NetworksEdanur Demir, Emre Akbas
This paper is aimed at developing a method that reduces the computational cost of convolutional neural networks (CNN) during inference. Conventionally, the input data pass through a fixed neural network architecture. However, easy examples can be classified at early stages of processing and conventional networks do not take this into account. In this paper, we introduce 'Early-exit CNNs', EENets for short, which adapt their computational cost based on the input by stopping the inference process at certain exit locations. In EENets, there are a number of exit blocks each of which consists of a confidence branch and a softmax branch. The confidence branch computes the confidence score of exiting (i.e. stopping the inference process) at that location; while the softmax branch outputs a classification probability vector. Both branches are learnable and their parameters are separate. During training of EENets, in addition to the classical classification loss, the computational cost of inference is taken into account as well. As a result, the network adapts its many confidence branches to the inputs so that less computation is spent for easy examples. Inference works as in conventional feed-forward networks, however, when the output of a confidence branch is larger than a certain threshold, the inference stops for that specific example. The idea of EENets is applicable to available CNN architectures such as ResNets. Through comprehensive experiments on MNIST, SVHN, CIFAR10 and Tiny-ImageNet datasets, we show that early-exit (EE) ResNets achieve similar accuracy with their non-EE versions while reducing the computational cost to 20% of the original. Code is available at https://github.com/eksuas/eenets.pytorch
CVJul 19, 2024Code
Bucketed Ranking-based Losses for Efficient Training of Object DetectorsFeyza Yavuz, Baris Can Cam, Adnan Harun Dogan et al.
Ranking-based loss functions, such as Average Precision Loss and Rank&Sort Loss, outperform widely used score-based losses in object detection. These loss functions better align with the evaluation criteria, have fewer hyperparameters, and offer robustness against the imbalance between positive and negative classes. However, they require pairwise comparisons among $P$ positive and $N$ negative predictions, introducing a time complexity of $\mathcal{O}(PN)$, which is prohibitive since $N$ is often large (e.g., $10^8$ in ATSS). Despite their advantages, the widespread adoption of ranking-based losses has been hindered by their high time and space complexities. In this paper, we focus on improving the efficiency of ranking-based loss functions. To this end, we propose Bucketed Ranking-based Losses which group negative predictions into $B$ buckets ($B \ll N$) in order to reduce the number of pairwise comparisons so that time complexity can be reduced. Our method enhances the time complexity, reducing it to $\mathcal{O}(\max (N \log(N), P^2))$. To validate our method and show its generality, we conducted experiments on 2 different tasks, 3 different datasets, 7 different detectors. We show that Bucketed Ranking-based (BR) Losses yield the same accuracy with the unbucketed versions and provide $2\times$ faster training on average. We also train, for the first time, transformer-based object detectors using ranking-based losses, thanks to the efficiency of our BR. When we train CoDETR, a state-of-the-art transformer-based object detector, using our BR Loss, we consistently outperform its original results over several different backbones. Code is available at https://github.com/blisgard/BucketedRankingBasedLosses
CVJan 20, 2023Code
Improving Sketch Colorization using Adversarial Segmentation ConsistencySamet Hicsonmez, Nermin Samet, Emre Akbas et al.
We propose a new method for producing color images from sketches. Current solutions in sketch colorization either necessitate additional user instruction or are restricted to the "paired" translation strategy. We leverage semantic image segmentation from a general-purpose panoptic segmentation network to generate an additional adversarial loss function. The proposed loss function is compatible with any GAN model. Our method is not restricted to datasets with segmentation labels and can be applied to unpaired translation tasks as well. Using qualitative, and quantitative analysis, and based on a user study, we demonstrate the efficacy of our method on four distinct image datasets. On the FID metric, our model improves the baseline by up to 35 points. Our code, pretrained models, scripts to produce newly introduced datasets and corresponding sketch images are available at https://github.com/giddyyupp/AdvSegLoss.
CVOct 7, 2023Code
WAIT: Feature Warping for Animation to Illustration video Translation using GANsSamet Hicsonmez, Nermin Samet, Fidan Samet et al.
In this paper, we explore a new domain for video-to-video translation. Motivated by the availability of animation movies that are adopted from illustrated books for children, we aim to stylize these videos with the style of the original illustrations. Current state-of-the-art video-to-video translation models rely on having a video sequence or a single style image to stylize an input video. We introduce a new problem for video stylizing where an unordered set of images are used. This is a challenging task for two reasons: i) we do not have the advantage of temporal consistency as in video sequences; ii) it is more difficult to obtain consistent styles for video frames from a set of unordered images compared to using a single image. Most of the video-to-video translation methods are built on an image-to-image translation model, and integrate additional networks such as optical flow, or temporal predictors to capture temporal relations. These additional networks make the model training and inference complicated and slow down the process. To ensure temporal coherency in video-to-video style transfer, we propose a new generator network with feature warping layers which overcomes the limitations of the previous methods. We show the effectiveness of our method on three datasets both qualitatively and quantitatively. Code and pretrained models are available at https://github.com/giddyyupp/wait.
CVMar 30Code
Explaining CLIP Zero-shot Predictions Through ConceptsOnat Ozdemir, Anders Christensen, Stephan Alaniz et al.
Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and lack the ability to generalize to unseen classes. We introduce EZPC that bridges these two paradigms by explaining CLIP's zero-shot predictions through human-understandable concepts. Our method projects CLIP's joint image-text embeddings into a concept space learned from language descriptions, enabling faithful and transparent explanations without additional supervision. The model learns this projection via a combination of alignment and reconstruction objectives, ensuring that concept activations preserve CLIP's semantic structure while remaining interpretable. Extensive experiments on five benchmark datasets, CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k, demonstrate that our approach maintains CLIP's strong zero-shot classification accuracy while providing meaningful concept-level explanations. By grounding open-vocabulary predictions in explicit semantic concepts, our method offers a principled step toward interpretable and trustworthy vision-language models. Code is available at https://github.com/oonat/ezpc.
CVApr 13, 2022
Does depth estimation help object detection?Bedrettin Cetinkaya, Sinan Kalkan, Emre Akbas
Ground-truth depth, when combined with color data, helps improve object detection accuracy over baseline models that only use color. However, estimated depth does not always yield improvements. Many factors affect the performance of object detection when estimated depth is used. In this paper, we comprehensively investigate these factors with detailed experiments, such as using ground-truth vs. estimated depth, effects of different state-of-the-art depth estimation networks, effects of using different indoor and outdoor RGB-D datasets as training data for depth estimation, and different architectural choices for integrating depth to the base object detector network. We propose an early concatenation strategy of depth, which yields higher mAP than previous works' while using significantly fewer parameters.
CLJul 21, 2024
A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical textsGokcen Gokceoglu, Devrim Cavusoglu, Emre Akbas et al.
This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents. The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian. It is the first study to apply large language models (LLMs) to this dataset, sourced from prominent literary periodicals of the era. The texts have been meticulously organized and labeled. This was done according to a taxonomic framework that takes into account both their structural and semantic attributes. Articles are categorized and tagged with bibliometric metadata by human experts. We present baseline classification results using a classical bag-of-words (BoW) naive Bayes model and three modern LLMs: multilingual BERT, Falcon, and Llama-v2. We found that in certain cases, Bag of Words (BoW) outperforms Large Language Models (LLMs), emphasizing the need for additional research, especially in low-resource language settings. This dataset is expected to be a valuable resource for researchers in natural language processing and machine learning, especially for historical and low-resource languages. The dataset is publicly available^1.
CVApr 28, 2022
Representation Recycling for Streaming Video AnalysisCan Ufuk Ertenli, Ramazan Gokberk Cinbis, Emre Akbas
We present StreamDEQ, a method that aims to infer frame-wise representations on videos with minimal per-frame computation. Conventional deep networks do feature extraction from scratch at each frame in the absence of ad-hoc solutions. We instead aim to build streaming recognition models that can natively exploit temporal smoothness between consecutive video frames. We observe that the recently emerging implicit layer models provide a convenient foundation to construct such models, as they define representations as the fixed-points of shallow networks, which need to be estimated using iterative methods. Our main insight is to distribute the inference iterations over the temporal axis by using the most recent representation as a starting point at each frame. This scheme effectively recycles the recent inference computations and greatly reduces the needed processing time. Through extensive experimental analysis, we show that StreamDEQ is able to recover near-optimal representations in a few frames' time and maintain an up-to-date representation throughout the video duration. Our experiments on video semantic segmentation, video object detection, and human pose estimation in videos show that StreamDEQ achieves on-par accuracy with the baseline while being more than 2-4x faster.
CVFeb 24Code
MatchED: Crisp Edge Detection Using End-to-End, Matching-based SupervisionBedrettin Cetinkaya, Sinan Kalkan, Emre Akbas
Generating crisp, i.e., one-pixel-wide, edge maps remains one of the fundamental challenges in edge detection, affecting both traditional and learning-based methods. To obtain crisp edges, most existing approaches rely on two hand-crafted post-processing algorithms, Non-Maximum Suppression (NMS) and skeleton-based thinning, which are non-differentiable and hinder end-to-end optimization. Moreover, all existing crisp edge detection methods still depend on such post-processing to achieve satisfactory results. To address this limitation, we propose \MethodLPP, a lightweight, only $\sim$21K additional parameters, and plug-and-play matching-based supervision module that can be appended to any edge detection model for joint end-to-end learning of crisp edges. At each training iteration, \MethodLPP performs one-to-one matching between predicted and ground-truth edges based on spatial distance and confidence, ensuring consistency between training and testing protocols. Extensive experiments on four popular datasets demonstrate that integrating \MethodLPP substantially improves the performance of existing edge detection models. In particular, \MethodLPP increases the Average Crispness (AC) metric by up to 2--4$\times$ compared to baseline models. Under the crispness-emphasized evaluation (CEval), \MethodLPP further boosts baseline performance by up to 20--35\% in ODS and achieves similar gains in OIS and AP, achieving SOTA performance that matches or surpasses standard post-processing for the first time. Code is available at https://cvpr26-matched.github.io.
CVMar 4, 2024Code
RankED: Addressing Imbalance and Uncertainty in Edge Detection Using Ranking-based LossesBedrettin Cetinkaya, Sinan Kalkan, Emre Akbas
Detecting edges in images suffers from the problems of (P1) heavy imbalance between positive and negative classes as well as (P2) label uncertainty owing to disagreement between different annotators. Existing solutions address P1 using class-balanced cross-entropy loss and dice loss and P2 by only predicting edges agreed upon by most annotators. In this paper, we propose RankED, a unified ranking-based approach that addresses both the imbalance problem (P1) and the uncertainty problem (P2). RankED tackles these two problems with two components: One component which ranks positive pixels over negative pixels, and the second which promotes high confidence edge pixels to have more label certainty. We show that RankED outperforms previous studies and sets a new state-of-the-art on NYUD-v2, BSDS500 and Multi-cue datasets. Code is available at https://ranked-cvpr24.github.io.
CVAug 3, 2022
Character Generation through Self-Supervised VectorizationGokcen Gokceoglu, Emre Akbas
The prevalent approach in self-supervised image generation is to operate on pixel level representations. While this approach can produce high quality images, it cannot benefit from the simplicity and innate quality of vectorization. Here we present a drawing agent that operates on stroke-level representation of images. At each time step, the agent first assesses the current canvas and decides whether to stop or keep drawing. When a 'draw' decision is made, the agent outputs a program indicating the stroke to be drawn. As a result, it produces a final raster image by drawing the strokes on a canvas, using a minimal number of strokes and dynamically deciding when to stop. We train our agent through reinforcement learning on MNIST and Omniglot datasets for unconditional generation and parsing (reconstruction) tasks. We utilize our parsing agent for exemplar generation and type conditioned concept generation in Omniglot challenge without any further training. We present successful results on all three generation tasks and the parsing task. Crucially, we do not need any stroke-level or vector supervision; we only use raster images for training.
CVJul 24, 2025Code
Q-Former Autoencoder: A Modern Framework for Medical Anomaly DetectionFrancesco Dalmonte, Emirhan Bayar, Emre Akbas et al.
Anomaly detection in medical images is an important yet challenging task due to the diversity of possible anomalies and the practical impossibility of collecting comprehensively annotated data sets. In this work, we tackle unsupervised medical anomaly detection proposing a modernized autoencoder-based framework, the Q-Former Autoencoder, that leverages state-of-the-art pretrained vision foundation models, such as DINO, DINOv2 and Masked Autoencoder. Instead of training encoders from scratch, we directly utilize frozen vision foundation models as feature extractors, enabling rich, multi-stage, high-level representations without domain-specific fine-tuning. We propose the usage of the Q-Former architecture as the bottleneck, which enables the control of the length of the reconstruction sequence, while efficiently aggregating multiscale features. Additionally, we incorporate a perceptual loss computed using features from a pretrained Masked Autoencoder, guiding the reconstruction towards semantically meaningful structures. Our framework is evaluated on four diverse medical anomaly detection benchmarks, achieving state-of-the-art results on BraTS2021, RESC, and RSNA. Our results highlight the potential of vision foundation model encoders, pretrained on natural images, to generalize effectively to medical image analysis tasks without further fine-tuning. We release the code and models at https://github.com/emirhanbayar/QFAE.
CVOct 19, 2021Code
Mask-aware IoU for Anchor Assignment in Real-time Instance SegmentationKemal Oksuz, Baris Can Cam, Fehmi Kahraman et al.
This paper presents Mask-aware Intersection-over-Union (maIoU) for assigning anchor boxes as positives and negatives during training of instance segmentation methods. Unlike conventional IoU or its variants, which only considers the proximity of two boxes; maIoU consistently measures the proximity of an anchor box with not only a ground truth box but also its associated ground truth mask. Thus, additionally considering the mask, which, in fact, represents the shape of the object, maIoU enables a more accurate supervision during training. We present the effectiveness of maIoU on a state-of-the-art (SOTA) assigner, ATSS, by replacing IoU operation by our maIoU and training YOLACT, a SOTA real-time instance segmentation method. Using ATSS with maIoU consistently outperforms (i) ATSS with IoU by $\sim 1$ mask AP, (ii) baseline YOLACT with fixed IoU threshold assigner by $\sim 2$ mask AP over different image sizes and (iii) decreases the inference time by $25 \%$ owing to using less anchors. Then, exploiting this efficiency, we devise maYOLACT, a faster and $+6$ AP more accurate detector than YOLACT. Our best model achieves $37.7$ mask AP at $25$ fps on COCO test-dev establishing a new state-of-the-art for real-time instance segmentation. Code is available at https://github.com/kemaloksuz/Mask-aware-IoU
CVJul 24, 2021Code
Rank & Sort Loss for Object Detection and Instance SegmentationKemal Oksuz, Baris Can Cam, Emre Akbas et al.
We propose Rank & Sort (RS) Loss, a ranking-based loss function to train deep object detection and instance segmentation methods (i.e. visual detectors). RS Loss supervises the classifier, a sub-network of these methods, to rank each positive above all negatives as well as to sort positives among themselves with respect to (wrt.) their localisation qualities (e.g. Intersection-over-Union - IoU). To tackle the non-differentiable nature of ranking and sorting, we reformulate the incorporation of error-driven update with backpropagation as Identity Update, which enables us to model our novel sorting error among positives. With RS Loss, we significantly simplify training: (i) Thanks to our sorting objective, the positives are prioritized by the classifier without an additional auxiliary head (e.g. for centerness, IoU, mask-IoU), (ii) due to its ranking-based nature, RS Loss is robust to class imbalance, and thus, no sampling heuristic is required, and (iii) we address the multi-task nature of visual detectors using tuning-free task-balancing coefficients. Using RS Loss, we train seven diverse visual detectors only by tuning the learning rate, and show that it consistently outperforms baselines: e.g. our RS Loss improves (i) Faster R-CNN by ~ 3 box AP and aLRP Loss (ranking-based baseline) by ~ 2 box AP on COCO dataset, (ii) Mask R-CNN with repeat factor sampling (RFS) by 3.5 mask AP (~ 7 AP for rare classes) on LVIS dataset; and also outperforms all counterparts. Code is available at: https://github.com/kemaloksuz/RankSortLoss
CVJun 8, 2021Code
HPRNet: Hierarchical Point Regression for Whole-Body Human Pose EstimationNermin Samet, Emre Akbas
In this paper, we present a new bottom-up one-stage method for whole-body pose estimation, which we call "hierarchical point regression," or HPRNet for short. In standard body pose estimation, the locations of $\sim 17$ major joints on the human body are estimated. Differently, in whole-body pose estimation, the locations of fine-grained keypoints (68 on face, 21 on each hand and 3 on each foot) are estimated as well, which creates a scale variance problem that needs to be addressed. To handle the scale variance among different body parts, we build a hierarchical point representation of body parts and jointly regress them. The relative locations of fine-grained keypoints in each part (e.g. face) are regressed in reference to the center of that part, whose location itself is estimated relative to the person center. In addition, unlike the existing two-stage methods, our method predicts whole-body pose in a constant time independent of the number of people in an image. On the COCO WholeBody dataset, HPRNet significantly outperforms all previous bottom-up methods on the keypoint detection of all whole-body parts (i.e. body, foot, face and hand); it also achieves state-of-the-art results on face (75.4 AP) and hand (50.4 AP) keypoint detection. Code and models are available at \url{https://github.com/nerminsamet/HPRNet}.
CVApr 14, 2021Code
HoughNet: Integrating near and long-range evidence for visual detectionNermin Samet, Samet Hicsonmez, Emre Akbas
This paper presents HoughNet, a one-stage, anchor-free, voting-based, bottom-up object detection method. Inspired by the Generalized Hough Transform, HoughNet determines the presence of an object at a certain location by the sum of the votes cast on that location. Votes are collected from both near and long-distance locations based on a log-polar vote field. Thanks to this voting mechanism, HoughNet is able to integrate both near and long-range, class-conditional evidence for visual recognition, thereby generalizing and enhancing current object detection methodology, which typically relies on only local evidence. On the COCO dataset, HoughNet's best model achieves $46.4$ $AP$ (and $65.1$ $AP_{50}$), performing on par with the state-of-the-art in bottom-up object detection and outperforming most major one-stage and two-stage methods. We further validate the effectiveness of our proposal in other visual detection tasks, namely, video object detection, instance segmentation, 3D object detection and keypoint detection for human pose estimation, and an additional "labels to photo" image generation task, where the integration of our voting module consistently improves performance in all cases. Code is available at https://github.com/nerminsamet/houghnet.
CVFeb 11, 2021Code
Adversarial Segmentation Loss for Sketch ColorizationSamet Hicsonmez, Nermin Samet, Emre Akbas et al.
We introduce a new method for generating color images from sketches or edge maps. Current methods either require some form of additional user-guidance or are limited to the "paired" translation approach. We argue that segmentation information could provide valuable guidance for sketch colorization. To this end, we propose to leverage semantic image segmentation, as provided by a general purpose panoptic segmentation network, to create an additional adversarial loss function. Our loss function can be integrated to any baseline GAN model. Our method is not limited to datasets that contain segmentation labels, and it can be trained for "unpaired" translation tasks. We show the effectiveness of our method on four different datasets spanning scene level indoor, outdoor, and children book illustration images using qualitative, quantitative and user study analysis. Our model improves its baseline up to 35 points on the FID metric. Our code and pretrained models can be found at https://github.com/giddyyupp/AdvSegLoss.
CVNov 21, 2020Code
One Metric to Measure them All: Localisation Recall Precision (LRP) for Evaluating Visual Detection TasksKemal Oksuz, Baris Can Cam, Sinan Kalkan et al.
Despite being widely used as a performance measure for visual detection tasks, Average Precision (AP) is limited in (i) reflecting localisation quality, (ii) interpretability and (iii) robustness to the design choices regarding its computation, and its applicability to outputs without confidence scores. Panoptic Quality (PQ), a measure proposed for evaluating panoptic segmentation (Kirillov et al., 2019), does not suffer from these limitations but is limited to panoptic segmentation. In this paper, we propose Localisation Recall Precision (LRP) Error as the average matching error of a visual detector computed based on both its localisation and classification qualities for a given confidence score threshold. LRP Error, initially proposed only for object detection by Oksuz et al. (2018), does not suffer from the aforementioned limitations and is applicable to all visual detection tasks. We also introduce Optimal LRP (oLRP) Error as the minimum LRP Error obtained over confidence scores to evaluate visual detectors and obtain optimal thresholds for deployment. We provide a detailed comparative analysis of LRP Error with AP and PQ, and use nearly 100 state-of-the-art visual detectors from seven visual detection tasks (i.e. object detection, keypoint detection, instance segmentation, panoptic segmentation, visual relationship detection, zero-shot detection and generalised zero-shot detection) using ten datasets to empirically show that LRP Error provides richer and more discriminative information than its counterparts. Code available at: https://github.com/kemaloksuz/LRP-Error
CVSep 28, 2020Code
A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object DetectionKemal Oksuz, Baris Can Cam, Emre Akbas et al.
We propose average Localisation-Recall-Precision (aLRP), a unified, bounded, balanced and ranking-based loss function for both classification and localisation tasks in object detection. aLRP extends the Localisation-Recall-Precision (LRP) performance metric (Oksuz et al., 2018) inspired from how Average Precision (AP) Loss extends precision to a ranking-based loss function for classification (Chen et al., 2020). aLRP has the following distinct advantages: (i) aLRP is the first ranking-based loss function for both classification and localisation tasks. (ii) Thanks to using ranking for both tasks, aLRP naturally enforces high-quality localisation for high-precision classification. (iii) aLRP provides provable balance between positives and negatives. (iv) Compared to on average $\sim$6 hyperparameters in the loss functions of state-of-the-art detectors, aLRP Loss has only one hyperparameter, which we did not tune in practice. On the COCO dataset, aLRP Loss improves its ranking-based predecessor, AP Loss, up to around $5$ AP points, achieves $48.9$ AP without test time augmentation and outperforms all one-stage detectors. Code available at: https://github.com/kemaloksuz/aLRPLoss .
CVAug 3, 2020Code
Reducing Label Noise in Anchor-Free Object DetectionNermin Samet, Samet Hicsonmez, Emre Akbas
Current anchor-free object detectors label all the features that spatially fall inside a predefined central region of a ground-truth box as positive. This approach causes label noise during training, since some of these positively labeled features may be on the background or an occluder object, or they are simply not discriminative features. In this paper, we propose a new labeling strategy aimed to reduce the label noise in anchor-free detectors. We sum-pool predictions stemming from individual features into a single prediction. This allows the model to reduce the contributions of non-discriminatory features during training. We develop a new one-stage, anchor-free object detector, PPDet, to employ this labeling strategy during training and a similar prediction pooling method during inference. On the COCO dataset, PPDet achieves the best performance among anchor-free top-down detectors and performs on-par with the other state-of-the-art methods. It also outperforms all major one-stage and two-stage methods in small object detection (${AP}_{S}$ $31.4$). Code is available at https://github.com/nerminsamet/ppdet
CVJul 5, 2020Code
HoughNet: Integrating near and long-range evidence for bottom-up object detectionNermin Samet, Samet Hicsonmez, Emre Akbas
This paper presents HoughNet, a one-stage, anchor-free, voting-based, bottom-up object detection method. Inspired by the Generalized Hough Transform, HoughNet determines the presence of an object at a certain location by the sum of the votes cast on that location. Votes are collected from both near and long-distance locations based on a log-polar vote field. Thanks to this voting mechanism, HoughNet is able to integrate both near and long-range, class-conditional evidence for visual recognition, thereby generalizing and enhancing current object detection methodology, which typically relies on only local evidence. On the COCO dataset, HoughNet's best model achieves 46.4 $AP$ (and 65.1 $AP_{50}$), performing on par with the state-of-the-art in bottom-up object detection and outperforming most major one-stage and two-stage methods. We further validate the effectiveness of our proposal in another task, namely, "labels to photo" image generation by integrating the voting module of HoughNet to two different GAN models and showing that the accuracy is significantly improved in both cases. Code is available at https://github.com/nerminsamet/houghnet.
CVFeb 13, 2020Code
GANILLA: Generative Adversarial Networks for Image to Illustration TranslationSamet Hicsonmez, Nermin Samet, Emre Akbas et al.
In this paper, we explore illustrations in children's books as a new domain in unpaired image-to-image translation. We show that although the current state-of-the-art image-to-image translation models successfully transfer either the style or the content, they fail to transfer both at the same time. We propose a new generator network to address this issue and show that the resulting network strikes a better balance between style and content. There are no well-defined or agreed-upon evaluation metrics for unpaired image-to-image translation. So far, the success of image translation models has been based on subjective, qualitative visual comparison on a limited number of images. To address this problem, we propose a new framework for the quantitative evaluation of image-to-illustration models, where both content and style are taken into account using separate classifiers. In this new evaluation framework, our proposed model performs better than the current state-of-the-art models on the illustrations dataset. Our code and pretrained models can be found at https://github.com/giddyyupp/ganilla.
CVSep 21, 2019Code
Generating Positive Bounding Boxes for Balanced Training of Object DetectorsKemal Oksuz, Baris Can Cam, Emre Akbas et al.
Two-stage deep object detectors generate a set of regions-of-interest (RoI) in the first stage, then, in the second stage, identify objects among the proposed RoIs that sufficiently overlap with a ground truth (GT) box. The second stage is known to suffer from a bias towards RoIs that have low intersection-over-union (IoU) with the associated GT boxes. To address this issue, we first propose a sampling method to generate bounding boxes (BB) that overlap with a given reference box more than a given IoU threshold. Then, we use this BB generation method to develop a positive RoI (pRoI) generator that produces RoIs following any desired spatial or IoU distribution, for the second-stage. We show that our pRoI generator is able to simulate other sampling methods for positive examples such as hard example mining and prime sampling. Using our generator as an analysis tool, we show that (i) IoU imbalance has an adverse effect on performance, (ii) hard positive example mining improves the performance only for certain input IoU distributions, and (iii) the imbalance among the foreground classes has an adverse effect on performance and that it can be alleviated at the batch level. Finally, we train Faster R-CNN using our pRoI generator and, compared to conventional training, obtain better or on-par performance for low IoUs and significant improvements when trained for higher IoUs for Pascal VOC and MS COCO datasets. The code is available at: https://github.com/kemaloksuz/BoundingBoxGenerator.
CVAug 31, 2019Code
Imbalance Problems in Object Detection: A ReviewKemal Oksuz, Baris Can Cam, Sinan Kalkan et al.
In this paper, we present a comprehensive review of the imbalance problems in object detection. To analyze the problems in a systematic manner, we introduce a problem-based taxonomy. Following this taxonomy, we discuss each problem in depth and present a unifying yet critical perspective on the solutions in the literature. In addition, we identify major open issues regarding the existing imbalance problems as well as imbalance problems that have not been discussed before. Moreover, in order to keep our review up to date, we provide an accompanying webpage which catalogs papers addressing imbalance problems, according to our problem-based taxonomy. Researchers can track newer studies on this webpage available at: https://github.com/kemaloksuz/ObjectDetectionImbalance .
CVMar 6, 2019Code
Self-Supervised Learning of 3D Human Pose using Multi-view GeometryMuhammed Kocabas, Salih Karagoz, Emre Akbas
Training accurate 3D human pose estimators requires large amount of 3D ground-truth data which is costly to collect. Various weakly or self supervised pose estimation methods have been proposed due to lack of 3D data. Nevertheless, these methods, in addition to 2D ground-truth poses, require either additional supervision in various forms (e.g. unpaired 3D ground truth data, a small subset of labels) or the camera parameters in multiview settings. To address these problems, we present EpipolarPose, a self-supervised learning method for 3D human pose estimation, which does not need any 3D ground-truth data or camera extrinsics. During training, EpipolarPose estimates 2D poses from multi-view images, and then, utilizes epipolar geometry to obtain a 3D pose and camera geometry which are subsequently used to train a 3D pose estimator. We demonstrate the effectiveness of our approach on standard benchmark datasets i.e. Human3.6M and MPI-INF-3DHP where we set the new state-of-the-art among weakly/self-supervised methods. Furthermore, we propose a new performance measure Pose Structure Score (PSS) which is a scale invariant, structure aware measure to evaluate the structural plausibility of a pose with respect to its ground truth. Code and pretrained models are available at https://github.com/mkocabas/EpipolarPose
CVJul 11, 2018Code
MultiPoseNet: Fast Multi-Person Pose Estimation using Pose Residual NetworkMuhammed Kocabas, Salih Karagoz, Emre Akbas
In this paper, we present MultiPoseNet, a novel bottom-up multi-person pose estimation architecture that combines a multi-task model with a novel assignment method. MultiPoseNet can jointly handle person detection, keypoint detection, person segmentation and pose estimation problems. The novel assignment method is implemented by the Pose Residual Network (PRN) which receives keypoint and person detections, and produces accurate poses by assigning keypoints to person instances. On the COCO keypoints dataset, our pose estimation method outperforms all previous bottom-up methods both in accuracy (+4-point mAP over previous best result) and speed; it also performs on par with the best top-down methods while being at least 4x faster. Our method is the fastest real time system with 23 frames/sec. Source code is available at: https://github.com/mkocabas/pose-residual-network
CVJul 4, 2018Code
Localization Recall Precision (LRP): A New Performance Metric for Object DetectionKemal Oksuz, Baris Can Cam, Emre Akbas et al.
Average precision (AP), the area under the recall-precision (RP) curve, is the standard performance measure for object detection. Despite its wide acceptance, it has a number of shortcomings, the most important of which are (i) the inability to distinguish very different RP curves, and (ii) the lack of directly measuring bounding box localization accuracy. In this paper, we propose 'Localization Recall Precision (LRP) Error', a new metric which we specifically designed for object detection. LRP Error is composed of three components related to localization, false negative (FN) rate and false positive (FP) rate. Based on LRP, we introduce the 'Optimal LRP', the minimum achievable LRP error representing the best achievable configuration of the detector in terms of recall-precision and the tightness of the boxes. In contrast to AP, which considers precisions over the entire recall domain, Optimal LRP determines the 'best' confidence score threshold for a class, which balances the trade-off between localization and recall-precision. In our experiments, we show that, for state-of-the-art object (SOTA) detectors, Optimal LRP provides richer and more discriminative information than AP. We also demonstrate that the best confidence score thresholds vary significantly among classes and detectors. Moreover, we present LRP results of a simple online video object detector which uses a SOTA still image object detector and show that the class-specific optimized thresholds increase the accuracy against the common approach of using a general threshold for all classes. At https://github.com/cancam/LRP we provide the source code that can compute LRP for the PASCAL VOC and MSCOCO datasets. Our source code can easily be adapted to other datasets as well.
CVApr 10, 2024
MoCap-to-Visual Domain Adaptation for Efficient Human Mesh Estimation from 2D KeypointsBedirhan Uguz, Ozhan Suat, Batuhan Karagoz et al.
This paper presents Key2Mesh, a model that takes a set of 2D human pose keypoints as input and estimates the corresponding body mesh. Since this process does not involve any visual (i.e. RGB image) data, the model can be trained on large-scale motion capture (MoCap) datasets, thereby overcoming the scarcity of image datasets with 3D labels. To enable the model's application on RGB images, we first run an off-the-shelf 2D pose estimator to obtain the 2D keypoints, and then feed these 2D keypoints to Key2Mesh. To improve the performance of our model on RGB images, we apply an adversarial domain adaptation (DA) method to bridge the gap between the MoCap and visual domains. Crucially, our DA method does not require 3D labels for visual data, which enables adaptation to target sets without the need for costly labels. We evaluate Key2Mesh for the task of estimating 3D human meshes from 2D keypoints, in the absence of RGB and mesh label pairs. Our results on widely used H3.6M and 3DPW datasets show that Key2Mesh sets the new state-of-the-art by outperforming other models in PA-MPJPE for both datasets, and in MPJPE and PVE for the 3DPW dataset. Thanks to our model's simple architecture, it operates at least 12x faster than the prior state-of-the-art model, LGD. Additional qualitative samples and code are available on the project website: https://key2mesh.github.io/.
CVNov 20, 2025
Unsupervised Image Classification with Adaptive Nearest Neighbor Selection and Cluster EnsemblesMelih Baydar, Emre Akbas
Unsupervised image classification, or image clustering, aims to group unlabeled images into semantically meaningful categories. Early methods integrated representation learning and clustering within an iterative framework. However, the rise of foundational models have recently shifted focus solely to clustering, bypassing the representation learning step. In this work, we build upon a recent multi-head clustering approach by introducing adaptive nearest neighbor selection and cluster ensembling strategies to improve clustering performance. Our method, "Image Clustering through Cluster Ensembles" (ICCE), begins with a clustering stage, where we train multiple clustering heads on a frozen backbone, producing diverse image clusterings. We then employ a cluster ensembling technique to consolidate these potentially conflicting results into a unified consensus clustering. Finally, we train an image classifier using the consensus clustering result as pseudo-labels. ICCE achieves state-of-the-art performance on ten image classification benchmarks, achieving 99.3% accuracy on CIFAR10, 89% on CIFAR100, and 70.4% on ImageNet datasets, narrowing the performance gap with supervised methods. To the best of our knowledge, ICCE is the first fully unsupervised image classification method to exceed 70% accuracy on ImageNet.
CVNov 17, 2025
RSPose: Ranking Based Losses for Human Pose EstimationMuhammed Can Keles, Bedrettin Cetinkaya, Sinan Kalkan et al.
While heatmap-based human pose estimation methods have shown strong performance, they suffer from three main problems: (P1) "Commonly used Mean Squared Error (MSE)" Loss may not always improve joint localization because it penalizes all pixel deviations equally, without focusing explicitly on sharpening and correctly localizing the peak corresponding to the joint; (P2) heatmaps are spatially and class-wise imbalanced; and, (P3) there is a discrepancy between the evaluation metric (i.e., mAP) and the loss functions. We propose ranking-based losses to address these issues. Both theoretically and empirically, we show that our proposed losses are superior to commonly used heatmap losses (MSE, KL-Divergence). Our losses considerably increase the correlation between confidence scores and localization qualities, which is desirable because higher correlation leads to more accurate instance selection during Non-Maximum Suppression (NMS) and better Average Precision (mAP) performance. We refer to the models trained with our losses as RSPose. We show the effectiveness of RSPose across two different modes: one-dimensional and two-dimensional heatmaps, on three different datasets (COCO, CrowdPose, MPII). To the best of our knowledge, we are the first to propose losses that align with the evaluation metric (mAP) for human pose estimation. RSPose outperforms the previous state of the art on the COCO-val set and achieves an mAP score of 79.9 with ViTPose-H, a vision transformer model for human pose estimation. We also improve SimCC Resnet-50, a coordinate classification-based pose estimation method, by 1.5 AP on the COCO-val set, achieving 73.6 AP.
CVAug 11, 2025
3D Human Mesh Estimation from Single View RGBDOzhan Suat, Bedirhan Uguz, Batuhan Karagoz et al.
Despite significant progress in 3D human mesh estimation from RGB images; RGBD cameras, offering additional depth data, remain underutilized. In this paper, we present a method for accurate 3D human mesh estimation from a single RGBD view, leveraging the affordability and widespread adoption of RGBD cameras for real-world applications. A fully supervised approach for this problem, requires a dataset with RGBD image and 3D mesh label pairs. However, collecting such a dataset is costly and challenging, hence, existing datasets are small, and limited in pose and shape diversity. To overcome this data scarcity, we leverage existing Motion Capture (MoCap) datasets. We first obtain complete 3D meshes from the body models found in MoCap datasets, and create partial, single-view versions of them by projection to a virtual camera. This simulates the depth data provided by an RGBD camera from a single viewpoint. Then, we train a masked autoencoder to complete the partial, single-view mesh. During inference, our method, which we name as M$^3$ for ``Masked Mesh Modeling'', matches the depth values coming from the sensor to vertices of a template human mesh, which creates a partial, single-view mesh. We effectively recover parts of the 3D human body mesh model that are not visible, resulting in a full body mesh. M$^3$ achieves 16.8 mm and 22.0 mm per-vertex-error (PVE) on the SURREAL and CAPE datasets, respectively; outperforming existing methods that use full-body point clouds as input. We obtain a competitive 70.9 PVE on the BEHAVE dataset, outperforming a recently published RGB based method by 18.4 mm, highlighting the usefulness of depth data. Code will be released.
CVJul 7, 2025
Colorectal Cancer Tumor Grade Segmentation in Digital Histopathology Images: From Giga to Mini ChallengeAlper Bahcekapili, Duygu Arslan, Umut Ozdemir et al.
Colorectal cancer (CRC) is the third most diagnosed cancer and the second leading cause of cancer-related death worldwide. Accurate histopathological grading of CRC is essential for prognosis and treatment planning but remains a subjective process prone to observer variability and limited by global shortages of trained pathologists. To promote automated and standardized solutions, we organized the ICIP Grand Challenge on Colorectal Cancer Tumor Grading and Segmentation using the publicly available METU CCTGS dataset. The dataset comprises 103 whole-slide images with expert pixel-level annotations for five tissue classes. Participants submitted segmentation masks via Codalab, evaluated using metrics such as macro F-score and mIoU. Among 39 participating teams, six outperformed the Swin Transformer baseline (62.92 F-score). This paper presents an overview of the challenge, dataset, and the top-performing methods
CVFeb 16, 2021
Just Noticeable Difference for Machine Perception and Generation of Regularized Adversarial Images with Minimal PerturbationAdil Kaan Akan, Emre Akbas, Fatos T. Yarman Vural
In this study, we introduce a measure for machine perception, inspired by the concept of Just Noticeable Difference (JND) of human perception. Based on this measure, we suggest an adversarial image generation algorithm, which iteratively distorts an image by an additive noise until the model detects the change in the image by outputting a false label. The noise added to the original image is defined as the gradient of the cost function of the model. A novel cost function is defined to explicitly minimize the amount of perturbation applied to the input image while enforcing the perceptual similarity between the adversarial and input images. For this purpose, the cost function is regularized by the well-known total variation and bounded range terms to meet the natural appearance of the adversarial image. We evaluate the adversarial images generated by our algorithm both qualitatively and quantitatively on CIFAR10, ImageNet, and MS COCO datasets. Our experiments on image classification and object detection tasks show that adversarial images generated by our JND method are both more successful in deceiving the recognition/detection models and less perturbed compared to the images generated by the state-of-the-art methods, namely, FGV, FSGM, and DeepFool methods.
CVFeb 4, 2021
A Deeper Look into Convolutions via Eigenvalue-based PruningIlke Cugu, Emre Akbas
Convolutional neural networks (CNNs) are able to attain better visual recognition performance than fully connected neural networks despite having much fewer parameters due to their parameter sharing principle. Modern architectures usually contain a small number of fully-connected layers, often at the end, after multiple layers of convolutions. In some cases, most of the convolutions can be eliminated without suffering any loss in recognition performance. However, there is no solid recipe to detect the hidden subset of convolutional neurons that is responsible for the majority of the recognition work. In this work, we formulate this as a pruning problem where the aim is to prune as many kernels as possible while preserving the vanilla generalization performance. To this end, we use the matrix characteristics based on eigenvalues for pruning, in comparison to the average absolute weight of a kernel which is the de facto standard in the literature to assess the importance of an individual convolutional kernel, to shed light on the internal mechanisms of a widely used family of CNNs, namely residual neural networks (ResNets), for the image classification problem using CIFAR-10, CIFAR-100 and Tiny ImageNet datasets.
LGApr 9, 2017
Supervised Infinite Feature SelectionSadegh Eskandari, Emre Akbas
In this paper, we present a new feature selection method that is suitable for both unsupervised and supervised problems. We build upon the recently proposed Infinite Feature Selection (IFS) method where feature subsets of all sizes (including infinity) are considered. We extend IFS in two ways. First, we propose a supervised version of it. Second, we propose new ways of forming the feature adjacency matrix that perform better for unsupervised problems. We extensively evaluate our methods on many benchmark datasets, including large image-classification datasets (PASCAL VOC), and show that our methods outperform both the IFS and the widely used "minimum-redundancy maximum-relevancy (mRMR)" feature selection algorithm.
CVAug 4, 2014
Object Detection Through Exploration With A Foveated Visual FieldEmre Akbas, Miguel P. Eckstein
We present a foveated object detector (FOD) as a biologically-inspired alternative to the sliding window (SW) approach which is the dominant method of search in computer vision object detection. Similar to the human visual system, the FOD has higher resolution at the fovea and lower resolution at the visual periphery. Consequently, more computational resources are allocated at the fovea and relatively fewer at the periphery. The FOD processes the entire scene, uses retino-specific object detection classifiers to guide eye movements, aligns its fovea with regions of interest in the input image and integrates observations across multiple fixations. Our approach combines modern object detectors from computer vision with a recent model of peripheral pooling regions found at the V1 layer of the human visual system. We assessed various eye movement strategies on the PASCAL VOC 2007 dataset and show that the FOD performs on par with the SW detector while bringing significant computational cost savings.