CVMar 13, 2023
Transformation-Invariant Network for Few-Shot Object Detection in Remote Sensing ImagesNanqing Liu, Xun Xu, Turgay Celik et al.
Object detection in remote sensing images relies on a large amount of labeled data for training. However, the increasing number of new categories and class imbalance make exhaustive annotation impractical. Few-shot object detection (FSOD) addresses this issue by leveraging meta-learning on seen base classes and fine-tuning on novel classes with limited labeled samples. Nonetheless, the substantial scale and orientation variations of objects in remote sensing images pose significant challenges to existing few-shot object detection methods. To overcome these challenges, we propose integrating a feature pyramid network and utilizing prototype features to enhance query features, thereby improving existing FSOD methods. We refer to this modified FSOD approach as a Strong Baseline, which has demonstrated significant performance improvements compared to the original baselines. Furthermore, we tackle the issue of spatial misalignment caused by orientation variations between the query and support images by introducing a Transformation-Invariant Network (TINet). TINet ensures geometric invariance and explicitly aligns the features of the query and support branches, resulting in additional performance gains while maintaining the same inference speed as the Strong Baseline. Extensive experiments on three widely used remote sensing object detection datasets, i.e., NWPU VHR-10.v2, DIOR, and HRRSD demonstrated the effectiveness of the proposed method.
CLJun 26, 2023
Integrating Bidirectional Long Short-Term Memory with Subword Embedding for Authorship AttributionAbiodun Modupe, Turgay Celik, Vukosi Marivate et al.
The problem of unveiling the author of a given text document from multiple candidate authors is called authorship attribution. Manifold word-based stylistic markers have been successfully used in deep learning methods to deal with the intrinsic problem of authorship attribution. Unfortunately, the performance of word-based authorship attribution systems is limited by the vocabulary of the training corpus. Literature has recommended character-based stylistic markers as an alternative to overcome the hidden word problem. However, character-based methods often fail to capture the sequential relationship of words in texts which is a chasm for further improvement. The question addressed in this paper is whether it is possible to address the ambiguity of hidden words in text documents while preserving the sequential context of words. Consequently, a method based on bidirectional long short-term memory (BLSTM) with a 2-dimensional convolutional neural network (CNN) is proposed to capture sequential writing styles for authorship attribution. The BLSTM was used to obtain the sequential relationship among characteristics using subword information. The 2-dimensional CNN was applied to understand the local syntactical position of the style from unlabeled input text. The proposed method was experimentally evaluated against numerous state-of-the-art methods across the public corporal of CCAT50, IMDb62, Blog50, and Twitter50. Experimental results indicate accuracy improvement of 1.07\%, and 0.96\% on CCAT50 and Twitter, respectively, and produce comparable results on the remaining datasets.
CVMay 3, 2024Code
FER-YOLO-Mamba: Facial Expression Detection and Classification Based on Selective State SpaceHui Ma, Sen Lei, Turgay Celik et al.
Facial Expression Recognition (FER) plays a pivotal role in understanding human emotional cues. However, traditional FER methods based on visual information have some limitations, such as preprocessing, feature extraction, and multi-stage classification procedures. These not only increase computational complexity but also require a significant amount of computing resources. Considering Convolutional Neural Network (CNN)-based FER schemes frequently prove inadequate in identifying the deep, long-distance dependencies embedded within facial expression images, and the Transformer's inherent quadratic computational complexity, this paper presents the FER-YOLO-Mamba model, which integrates the principles of Mamba and YOLO technologies to facilitate efficient coordination in facial expression image recognition and localization. Within the FER-YOLO-Mamba model, we further devise a FER-YOLO-VSS dual-branch module, which combines the inherent strengths of convolutional layers in local feature extraction with the exceptional capability of State Space Models (SSMs) in revealing long-distance dependencies. To the best of our knowledge, this is the first Vision Mamba model designed for facial expression detection and classification. To evaluate the performance of the proposed FER-YOLO-Mamba model, we conducted experiments on two benchmark datasets, RAF-DB and SFEW. The experimental results indicate that the FER-YOLO-Mamba model achieved better results compared to other models. The code is available from https://github.com/SwjtuMa/FER-YOLO-Mamba.
CLNov 25, 2025
Language-Independent Sentiment Labelling with Distant Supervision: A Case Study for English, Sepedi and SetswanaKoena Ronny Mabokela, Tim Schlippe, Mpho Raborife et al.
Sentiment analysis is a helpful task to automatically analyse opinions and emotions on various topics in areas such as AI for Social Good, AI in Education or marketing. While many of the sentiment analysis systems are developed for English, many African languages are classified as low-resource languages due to the lack of digital language resources like text labelled with corresponding sentiment classes. One reason for that is that manually labelling text data is time-consuming and expensive. Consequently, automatic and rapid processes are needed to reduce the manual effort as much as possible making the labelling process as efficient as possible. In this paper, we present and analyze an automatic language-independent sentiment labelling method that leverages information from sentiment-bearing emojis and words. Our experiments are conducted with tweets in the languages English, Sepedi and Setswana from SAfriSenti, a multilingual sentiment corpus for South African languages. We show that our sentiment labelling approach is able to label the English tweets with an accuracy of 66%, the Sepedi tweets with 69%, and the Setswana tweets with 63%, so that on average only 34% of the automatically generated labels remain to be corrected.
LGJul 30, 2025
FedCVD++: Communication-Efficient Federated Learning for Cardiovascular Risk Prediction with Parametric and Non-Parametric Model OptimizationAbdelrhman Gaber, Hassan Abd-Eltawab, John Elgallab et al.
Cardiovascular diseases (CVD) cause over 17 million deaths annually worldwide, highlighting the urgent need for privacy-preserving predictive systems. We introduce FedCVD++, an enhanced federated learning (FL) framework that integrates both parametric models (logistic regression, SVM, neural networks) and non-parametric models (Random Forest, XGBoost) for coronary heart disease risk prediction. To address key FL challenges, we propose: (1) tree-subset sampling that reduces Random Forest communication overhead by 70%, (2) XGBoost-based feature extraction enabling lightweight federated ensembles, and (3) federated SMOTE synchronization for resolving cross-institutional class imbalance. Evaluated on the Framingham dataset (4,238 records), FedCVD++ achieves state-of-the-art results: federated XGBoost (F1 = 0.80) surpasses its centralized counterpart (F1 = 0.78), and federated Random Forest (F1 = 0.81) matches non-federated performance. Additionally, our communication-efficient strategies reduce bandwidth consumption by 3.2X while preserving 95% accuracy. Compared to existing FL frameworks, FedCVD++ delivers up to 15% higher F1-scores and superior scalability for multi-institutional deployment. This work represents the first practical integration of non-parametric models into federated healthcare systems, providing a privacy-preserving solution validated under real-world clinical constraints.
CVApr 19, 2025
Exploring Generalizable Pre-training for Real-world Change Detection via Geometric EstimationYitao Zhao, Sen Lei, Nanqing Liu et al.
As an essential procedure in earth observation system, change detection (CD) aims to reveal the spatial-temporal evolution of the observation regions. A key prerequisite for existing change detection algorithms is aligned geo-references between multi-temporal images by fine-grained registration. However, in the majority of real-world scenarios, a prior manual registration is required between the original images, which significantly increases the complexity of the CD workflow. In this paper, we proposed a self-supervision motivated CD framework with geometric estimation, called "MatchCD". Specifically, the proposed MatchCD framework utilizes the zero-shot capability to optimize the encoder with self-supervised contrastive representation, which is reused in the downstream image registration and change detection to simultaneously handle the bi-temporal unalignment and object change issues. Moreover, unlike the conventional change detection requiring segmenting the full-frame image into small patches, our MatchCD framework can directly process the original large-scale image (e.g., 6K*4K resolutions) with promising performance. The performance in multiple complex scenarios with significant geometric distortion demonstrates the effectiveness of our proposed framework.
CVMay 28, 2021
The Wits Intelligent Teaching System: Detecting Student Engagement During Lectures Using Convolutional Neural NetworksRichard Klein, Turgay Celik
To perform contingent teaching and be responsive to students' needs during class, lecturers must be able to quickly assess the state of their audience. While effective teachers are able to gauge easily the affective state of the students, as class sizes grow this becomes increasingly difficult and less precise. The Wits Intelligent Teaching System (WITS) aims to assist lecturers with real-time feedback regarding student affect. The focus is primarily on recognising engagement or lack thereof. Student engagement is labelled based on behaviour and postures that are common to classroom settings. These proxies are then used in an observational checklist to construct a dataset of engagement upon which a CNN based on AlexNet is successfully trained and which significantly outperforms a Support Vector Machine approach. The deep learning approach provides satisfactory results on a challenging, real-world dataset with significant occlusion, lighting and resolution constraints.
CVFeb 19, 2020
SummaryNet: A Multi-Stage Deep Learning Model for Automatic Video SummarisationZiyad Jappie, David Torpey, Turgay Celik
Video summarisation can be posed as the task of extracting important parts of a video in order to create an informative summary of what occurred in the video. In this paper we introduce SummaryNet as a supervised learning framework for automated video summarisation. SummaryNet employs a two-stream convolutional network to learn spatial (appearance) and temporal (motion) representations. It utilizes an encoder-decoder model to extract the most salient features from the learned video representations. Lastly, it uses a sigmoid regression network with bidirectional long short-term memory cells to predict the probability of a frame being a summary frame. Experimental results on benchmark datasets show that the proposed method achieves comparable or significantly better results than the state-of-the-art video summarisation methods.
CVFeb 19, 2020
Human Action Recognition using Local Two-Stream Convolution Neural Network Features and Support Vector MachinesDavid Torpey, Turgay Celik
This paper proposes a simple yet effective method for human action recognition in video. The proposed method separately extracts local appearance and motion features using state-of-the-art three-dimensional convolutional neural networks from sampled snippets of a video. These local features are then concatenated to form global representations which are then used to train a linear SVM to perform the action classification using full context of the video, as partial context as used in previous works. The videos undergo two simple proposed preprocessing techniques, optical flow scaling and crop filling. We perform an extensive evaluation on three common benchmark dataset to empirically show the benefit of the SVM, and the two preprocessing steps.