CVJul 5, 2022Code
SESS: Saliency Enhancing with Scaling and SlidingOsman Tursun, Simon Denman, Sridha Sridharan et al.
High-quality saliency maps are essential in several machine learning application areas including explainable AI and weakly supervised object detection and segmentation. Many techniques have been developed to generate better saliency using neural networks. However, they are often limited to specific saliency visualisation methods or saliency issues. We propose a novel saliency enhancing approach called SESS (Saliency Enhancing with Scaling and Sliding). It is a method and model agnostic extension to existing saliency map generation methods. With SESS, existing saliency approaches become robust to scale variance, multiple occurrences of target objects, presence of distractors and generate less noisy and more discriminative saliency maps. SESS improves saliency by fusing saliency maps extracted from multiple patches at different scales from different areas, and combines these individual maps using a novel fusion scheme that incorporates channel-wise weights and spatial weighted average. To improve efficiency, we introduce a pre-filtering step that can exclude uninformative saliency maps to improve efficiency while still enhancing overall results. We evaluate SESS on object recognition and detection benchmarks where it achieves significant improvement. The code is released publicly to enable researchers to verify performance and further development. Code is available at: https://github.com/neouyghur/SESS
CVAug 8, 2022
Vision-Based Activity Recognition in Children with Autism-Related BehaviorsPengbo Wei, David Ahmedt-Aristizabal, Harshala Gammulle et al.
Advances in machine learning and contactless sensors have enabled the understanding complex human behaviors in a healthcare setting. In particular, several deep learning systems have been introduced to enable comprehensive analysis of neuro-developmental conditions such as Autism Spectrum Disorder (ASD). This condition affects children from their early developmental stages onwards, and diagnosis relies entirely on observing the child's behavior and detecting behavioral cues. However, the diagnosis process is time-consuming as it requires long-term behavior observation, and the scarce availability of specialists. We demonstrate the effect of a region-based computer vision system to help clinicians and parents analyze a child's behavior. For this purpose, we adopt and enhance a dataset for analyzing autism-related actions using videos of children captured in uncontrolled environments (e.g. videos collected with consumer-grade cameras, in varied environments). The data is pre-processed by detecting the target child in the video to reduce the impact of background noise. Motivated by the effectiveness of temporal convolutional models, we propose both light-weight and conventional models capable of extracting action features from video frames and classifying autism-related behaviors by analyzing the relationships between frames in a video. Through extensive evaluations on the feature extraction and learning strategies, we demonstrate that the best performance is achieved with an Inflated 3D Convnet and Multi-Stage Temporal Convolutional Networks, achieving a 0.83 Weighted F1-score for classification of the three autism-related actions, outperforming existing methods. We also propose a light-weight solution by employing the ESNet backbone within the same system, achieving competitive results of 0.71 Weighted F1-score, and enabling potential deployment on embedded systems.
SPJun 19, 2023
Multi-task Learning for Radar Signal CharacterisationZi Huang, Akila Pemasiri, Simon Denman et al.
Radio signal recognition is a crucial task in both civilian and military applications, as accurate and timely identification of unknown signals is an essential part of spectrum management and electronic warfare. The majority of research in this field has focused on applying deep learning for modulation classification, leaving the task of signal characterisation as an understudied area. This paper addresses this gap by presenting an approach for tackling radar signal classification and characterisation as a multi-task learning (MTL) problem. We propose the IQ Signal Transformer (IQST) among several reference architectures that allow for simultaneous optimisation of multiple regression and classification tasks. We demonstrate the performance of our proposed MTL model on a synthetic radar dataset, while also providing a first-of-its-kind benchmark for radar signal characterisation.
CVApr 5, 2022
Towards On-Board Panoptic Segmentation of Multispectral Satellite ImagesTharindu Fernando, Clinton Fookes, Harshala Gammulle et al.
With tremendous advancements in low-power embedded computing devices and remote sensing instruments, the traditional satellite image processing pipeline which includes an expensive data transfer step prior to processing data on the ground is being replaced by on-board processing of captured data. This paradigm shift enables critical and time-sensitive analytic intelligence to be acquired in a timely manner on-board the satellite itself. However, at present, the on-board processing of multi-spectral satellite images is limited to classification and segmentation tasks. Extending this processing to its next logical level, in this paper we propose a lightweight pipeline for on-board panoptic segmentation of multi-spectral satellite images. Panoptic segmentation offers major economic and environmental insights, ranging from yield estimation from agricultural lands to intelligence for complex military applications. Nevertheless, the on-board intelligence extraction raises several challenges due to the loss of temporal observations and the need to generate predictions from a single image sample. To address this challenge, we propose a multimodal teacher network based on a cross-modality attention-based fusion strategy to improve the segmentation accuracy by exploiting data from multiple modes. We also propose an online knowledge distillation framework to transfer the knowledge learned by this multi-modal teacher network to a uni-modal student which receives only a single frame input, and is more appropriate for an on-board environment. We benchmark our approach against existing state-of-the-art panoptic segmentation models using the PASTIS multi-spectral panoptic segmentation dataset considering an on-board processing setting. Our evaluations demonstrate a substantial increase in accuracy metrics compared to the existing state-of-the-art models.
CVApr 5, 2023
Towards Self-Explainability of Deep Neural Networks with Heatmap Captioning and Large-Language ModelsOsman Tursun, Simon Denman, Sridha Sridharan et al.
Heatmaps are widely used to interpret deep neural networks, particularly for computer vision tasks, and the heatmap-based explainable AI (XAI) techniques are a well-researched topic. However, most studies concentrate on enhancing the quality of the generated heatmap or discovering alternate heatmap generation techniques, and little effort has been devoted to making heatmap-based XAI automatic, interactive, scalable, and accessible. To address this gap, we propose a framework that includes two modules: (1) context modelling and (2) reasoning. We proposed a template-based image captioning approach for context modelling to create text-based contextual information from the heatmap and input data. The reasoning module leverages a large language model to provide explanations in combination with specialised knowledge. Our qualitative experiments demonstrate the effectiveness of our framework and heatmap captioning approach. The code for the proposed template-based heatmap captioning approach will be publicly available.
IVDec 1, 2022
Automated Coronary Arteries Labeling Via Geometric Deep LearningYadan Li, Mohammad Ali Armin, Simon Denman et al.
Automatic labelling of anatomical structures, such as coronary arteries, is critical for diagnosis, yet existing (non-deep learning) methods are limited by a reliance on prior topological knowledge of the expected tree-like structures. As the structure such vascular systems is often difficult to conceptualize, graph-based representations have become popular due to their ability to capture the geometric and topological properties of the morphology in an orientation-independent and abstract manner. However, graph-based learning for automated labeling of tree-like anatomical structures has received limited attention in the literature. The majority of prior studies have limitations in the entity graph construction, are dependent on topological structures, and have limited accuracy due to the anatomical variability between subjects. In this paper, we propose an intuitive graph representation method, well suited to use with 3D coordinate data obtained from angiography scans. We subsequently seek to analyze subject-specific graphs using geometric deep learning. The proposed models leverage expert annotated labels from 141 patients to learn representations of each coronary segment, while capturing the effects of anatomical variability within the training data. We investigate different variants of so-called message passing neural networks. Through extensive evaluations, our pipeline achieves a promising weighted F1-score of 0.805 for labeling coronary artery (13 classes) for a five-fold cross-validation. Considering the ability of graph models in dealing with irregular data, and their scalability for data segmentation, this work highlights the potential of such methods to provide quantitative evidence to support the decisions of medical experts.
CVNov 12, 2025
HOTFLoc++: End-to-End Hierarchical LiDAR Place Recognition, Re-Ranking, and 6-DoF Metric Localisation in ForestsEthan Griffiths, Maryam Haghighat, Simon Denman et al.
This article presents HOTFLoc++, an end-to-end framework for LiDAR place recognition, re-ranking, and 6-DoF metric localisation in forests. Leveraging an octree-based transformer, our approach extracts hierarchical local descriptors at multiple granularities to increase robustness to clutter, self-similarity, and viewpoint changes in challenging scenarios, including ground-to-ground and ground-to-aerial in forest and urban environments. We propose a learnable multi-scale geometric verification module to reduce re-ranking failures in the presence of degraded single-scale correspondences. Our coarse-to-fine registration approach achieves comparable or lower localisation errors to baselines, with runtime improvements of two orders of magnitude over RANSAC for dense point clouds. Experimental results on public datasets show the superiority of our approach compared to state-of-the-art methods, achieving an average Recall@1 of 90.7% on CS-Wild-Places: an improvement of 29.6 percentage points over baselines, while maintaining high performance on single-source benchmarks with an average Recall@1 of 91.7% and 96.0% on Wild-Places and MulRan, respectively. Our method achieves under 2 m and 5 degrees error for 97.2% of 6-DoF registration attempts, with our multi-scale re-ranking module reducing localisation errors by ~2$\times$ on average. The code will be available upon acceptance.
CVDec 18, 2023
Deep Learning Approaches for Seizure Video Analysis: A ReviewDavid Ahmedt-Aristizabal, Mohammad Ali Armin, Zeeshan Hayder et al.
Seizure events can manifest as transient disruptions in the control of movements which may be organized in distinct behavioral sequences, accompanied or not by other observable features such as altered facial expressions. The analysis of these clinical signs, referred to as semiology, is subject to observer variations when specialists evaluate video-recorded events in the clinical setting. To enhance the accuracy and consistency of evaluations, computer-aided video analysis of seizures has emerged as a natural avenue. In the field of medical applications, deep learning and computer vision approaches have driven substantial advancements. Historically, these approaches have been used for disease detection, classification, and prediction using diagnostic data; however, there has been limited exploration of their application in evaluating video-based motion detection in the clinical epileptology setting. While vision-based technologies do not aim to replace clinical expertise, they can significantly contribute to medical decision-making and patient care by providing quantitative evidence and decision support. Behavior monitoring tools offer several advantages such as providing objective information, detecting challenging-to-observe events, reducing documentation efforts, and extending assessment capabilities to areas with limited expertise. The main applications of these could be (1) improved seizure detection methods; (2) refined semiology analysis for predicting seizure type and cerebral localization. In this paper, we detail the foundation technologies used in vision-based systems in the analysis of seizure videos, highlighting their success in semiology detection and analysis, focusing on work published in the last 7 years. Additionally, we illustrate how existing technologies can be interconnected through an integrated system for video-based semiology analysis.
LGDec 15, 2023
Multi-stage Learning for Radar Pulse Activity SegmentationZi Huang, Akila Pemasiri, Simon Denman et al.
Radio signal recognition is a crucial function in electronic warfare. Precise identification and localisation of radar pulse activities are required by electronic warfare systems to produce effective countermeasures. Despite the importance of these tasks, deep learning-based radar pulse activity recognition methods have remained largely underexplored. While deep learning for radar modulation recognition has been explored previously, classification tasks are generally limited to short and non-interleaved IQ signals, limiting their applicability to military applications. To address this gap, we introduce an end-to-end multi-stage learning approach to detect and localise pulse activities of interleaved radar signals across an extended time horizon. We propose a simple, yet highly effective multi-stage architecture for incrementally predicting fine-grained segmentation masks that localise radar pulse activities across multiple channels. We demonstrate the performance of our approach against several reference models on a novel radar dataset, while also providing a first-of-its-kind benchmark for radar pulse activity segmentation.
CVMay 8, 2025
Cross-Branch Orthogonality for Improved Generalization in Face Deepfake DetectionTharindu Fernando, Clinton Fookes, Sridha Sridharan et al.
Remarkable advancements in generative AI technology have given rise to a spectrum of novel deepfake categories with unprecedented leaps in their realism, and deepfakes are increasingly becoming a nuisance to law enforcement authorities and the general public. In particular, we observe alarming levels of confusion, deception, and loss of faith regarding multimedia content within society caused by face deepfakes, and existing deepfake detectors are struggling to keep up with the pace of improvements in deepfake generation. This is primarily due to their reliance on specific forgery artifacts, which limits their ability to generalise and detect novel deepfake types. To combat the spread of malicious face deepfakes, this paper proposes a new strategy that leverages coarse-to-fine spatial information, semantic information, and their interactions while ensuring feature distinctiveness and reducing the redundancy of the modelled features. A novel feature orthogonality-based disentanglement strategy is introduced to ensure branch-level and cross-branch feature disentanglement, which allows us to integrate multiple feature vectors without adding complexity to the feature space or compromising generalisation. Comprehensive experiments on three public benchmarks: FaceForensics++, Celeb-DF, and the Deepfake Detection Challenge (DFDC) show that these design choices enable the proposed approach to outperform current state-of-the-art methods by 5% on the Celeb-DF dataset and 7% on the DFDC dataset in a cross-dataset evaluation setting.
CVFeb 11, 2025
PDV: Prompt Directional Vectors for Zero-shot Composed Image RetrievalOsman Tursun, Sinan Kalkan, Simon Denman et al.
Zero-shot Composed Image Retrieval (ZS-CIR) enables image search using a reference image and a text prompt without requiring specialized text-image composition networks trained on large-scale paired data. However, current ZS-CIR approaches suffer from three critical limitations in their reliance on composed text embeddings: static query embedding representations, insufficient utilization of image embeddings, and suboptimal performance when fusing text and image embeddings. To address these challenges, we introduce the \textbf{Prompt Directional Vector (PDV)}, a simple yet effective training-free enhancement that captures semantic modifications induced by user prompts. PDV enables three key improvements: (1) Dynamic composed text embeddings where prompt adjustments are controllable via a scaling factor, (2) composed image embeddings through semantic transfer from text prompts to image features, and (3) weighted fusion of composed text and image embeddings that enhances retrieval by balancing visual and semantic similarity. Our approach serves as a plug-and-play enhancement for existing ZS-CIR methods with minimal computational overhead. Extensive experiments across multiple benchmarks demonstrate that PDV consistently improves retrieval performance when integrated with state-of-the-art ZS-CIR approaches, particularly for methods that generate accurate compositional embeddings. The code will be released upon publication.
LGJan 7, 2025
Few-Shot Radar Signal Recognition through Self-Supervised Learning and Radio Frequency Domain AdaptationZi Huang, Simon Denman, Akila Pemasiri et al.
Radar signal recognition (RSR) plays a pivotal role in electronic warfare (EW), as accurately classifying radar signals is critical for informing decision-making. Recent advances in deep learning have shown significant potential in improving RSR in domains with ample annotated data. However, these methods fall short in EW scenarios where annotated radio frequency (RF) data are scarce or impractical to obtain. To address these challenges, we introduce a self-supervised learning (SSL) method which utilises masked signal modelling and RF domain adaption to perform few-shot RSR and enhance performance in environments with limited RF samples and annotations. We propose a two-step approach, first pre-training masked autoencoders (MAE) on baseband in-phase and quadrature (I/Q) signals from diverse RF domains, and then transferring the learned representations to the radar domain, where annotated data are scarce. Empirical results show that our lightweight self-supervised ResNet1D model with domain adaptation achieves up to a 17.5% improvement in 1-shot classification accuracy when pre-trained on in-domain signals (i.e., radar signals) and up to a 16.31% improvement when pre-trained on out-of-domain signals (i.e., comm signals), compared to its baseline without using SSL. We also present reference results for several MAE designs and pre-training strategies, establishing a new benchmark for few-shot radar signal classification.
CVOct 9, 2025
Dual-Stream Alignment for Action SegmentationHarshala Gammulle, Clinton Fookes, Sridha Sridharan et al.
Action segmentation is a challenging yet active research area that involves identifying when and where specific actions occur in continuous video streams. Most existing work has focused on single-stream approaches that model the spatio-temporal aspects of frame sequences. However, recent research has shifted toward two-stream methods that learn action-wise features to enhance action segmentation performance. In this work, we propose the Dual-Stream Alignment Network (DSA Net) and investigate the impact of incorporating a second stream of learned action features to guide segmentation by capturing both action and action-transition cues. Communication between the two streams is facilitated by a Temporal Context (TC) block, which fuses complementary information using cross-attention and Quantum-based Action-Guided Modulation (Q-ActGM), enhancing the expressive power of the fused features. To the best of our knowledge, this is the first study to introduce a hybrid quantum-classical machine learning framework for action segmentation. Our primary objective is for the two streams (frame-wise and action-wise) to learn a shared feature space through feature alignment. This is encouraged by the proposed Dual-Stream Alignment Loss, which comprises three components: relational consistency, cross-level contrastive, and cycle-consistency reconstruction losses. Following prior work, we evaluate DSA Net on several diverse benchmark datasets: GTEA, Breakfast, 50Salads, and EgoProcel. We further demonstrate the effectiveness of each component through extensive ablation studies. Notably, DSA Net achieves state-of-the-art performance, significantly outperforming existing
CVMar 11, 2025
HOTFormerLoc: Hierarchical Octree Transformer for Versatile Lidar Place Recognition Across Ground and Aerial ViewsEthan Griffiths, Maryam Haghighat, Simon Denman et al.
We present HOTFormerLoc, a novel and versatile Hierarchical Octree-based TransFormer, for large-scale 3D place recognition in both ground-to-ground and ground-to-aerial scenarios across urban and forest environments. We propose an octree-based multi-scale attention mechanism that captures spatial and semantic features across granularities. To address the variable density of point distributions from spinning lidar, we present cylindrical octree attention windows to reflect the underlying distribution during attention. We introduce relay tokens to enable efficient global-local interactions and multi-scale representation learning at reduced computational cost. Our pyramid attentional pooling then synthesises a robust global descriptor for end-to-end place recognition in challenging environments. In addition, we introduce CS-Wild-Places, a novel 3D cross-source dataset featuring point cloud data from aerial and ground lidar scans captured in dense forests. Point clouds in CS-Wild-Places contain representational gaps and distinctive attributes such as varying point densities and noise patterns, making it a challenging benchmark for cross-view localisation in the wild. HOTFormerLoc achieves a top-1 average recall improvement of 5.5% - 11.5% on the CS-Wild-Places benchmark. Furthermore, it consistently outperforms SOTA 3D place recognition methods, with an average performance gain of 4.9% on well-established urban and forest datasets. The code and CS-Wild-Places benchmark is available at https://csiro-robotics.github.io/HOTFormerLoc.
LGMay 22, 2024
Part-based Quantitative Analysis for HeatmapsOsman Tursun, Sinan Kalkan, Simon Denman et al.
Heatmaps have been instrumental in helping understand deep network decisions, and are a common approach for Explainable AI (XAI). While significant progress has been made in enhancing the informativeness and accessibility of heatmaps, heatmap analysis is typically very subjective and limited to domain experts. As such, developing automatic, scalable, and numerical analysis methods to make heatmap-based XAI more objective, end-user friendly, and cost-effective is vital. In addition, there is a need for comprehensive evaluation metrics to assess heatmap quality at a granular level.
CVMay 19, 2023
Remembering What Is Important: A Factorised Multi-Head Retrieval and Auxiliary Memory Stabilisation Scheme for Human Motion PredictionTharindu Fernando, Harshala Gammulle, Sridha Sridharan et al.
Humans exhibit complex motions that vary depending on the task that they are performing, the interactions they engage in, as well as subject-specific preferences. Therefore, forecasting future poses based on the history of the previous motions is a challenging task. This paper presents an innovative auxiliary-memory-powered deep neural network framework for the improved modelling of historical knowledge. Specifically, we disentangle subject-specific, task-specific, and other auxiliary information from the observed pose sequences and utilise these factorised features to query the memory. A novel Multi-Head knowledge retrieval scheme leverages these factorised feature embeddings to perform multiple querying operations over the historical observations captured within the auxiliary memory. Moreover, our proposed dynamic masking strategy makes this feature disentanglement process dynamic. Two novel loss functions are introduced to encourage diversity within the auxiliary memory while ensuring the stability of the memory contents, such that it can locate and store salient information that can aid the long-term prediction of future motion, irrespective of data imbalances or the diversity of the input data distribution. With extensive experiments conducted on two public benchmarks, Human3.6M and CMU-Mocap, we demonstrate that these design choices collectively allow the proposed approach to outperform the current state-of-the-art methods by significant margins: $>$ 17\% on the Human3.6M dataset and $>$ 9\% on the CMU-Mocap dataset.
CVFeb 26, 2022
Continuous Human Action Recognition for Human-Machine Interaction: A ReviewHarshala Gammulle, David Ahmedt-Aristizabal, Simon Denman et al.
With advances in data-driven machine learning research, a wide variety of prediction models have been proposed to capture spatio-temporal features for the analysis of video streams. Recognising actions and detecting action transitions within an input video are challenging but necessary tasks for applications that require real-time human-machine interaction. By reviewing a large body of recent related work in the literature, we thoroughly analyse, explain and compare action segmentation methods and provide details on the feature extraction and learning strategies that are used on most state-of-the-art methods. We cover the impact of the performance of object detection and tracking techniques on human action segmentation methodologies. We investigate the application of such models to real-world scenarios and discuss several limitations and key research directions towards improving interpretability, generalisation, optimisation and deployment.
CVFeb 22, 2022
Privacy-Preserving In-Bed Pose Monitoring: A Fusion and Reconstruction StudyThisun Dayarathna, Thamidu Muthukumarana, Yasiru Rathnayaka et al.
Recently, in-bed human pose estimation has attracted the interest of researchers due to its relevance to a wide range of healthcare applications. Compared to the general problem of human pose estimation, in-bed pose estimation has several inherent challenges, the most prominent being frequent and severe occlusions caused by bedding. In this paper we explore the effective use of images from multiple non-visual and privacy-preserving modalities such as depth, long-wave infrared (LWIR) and pressure maps for the task of in-bed pose estimation in two settings. First, we explore the effective fusion of information from different imaging modalities for better pose estimation. Secondly, we propose a framework that can estimate in-bed pose estimation when visible images are unavailable, and demonstrate the applicability of fusion methods to scenarios where only LWIR images are available. We analyze and demonstrate the effect of fusing features from multiple modalities. For this purpose, we consider four different techniques: 1) Addition, 2) Concatenation, 3) Fusion via learned modal weights, and 4) End-to-end fully trainable approach; with a state-of-the-art pose estimation model. We also evaluate the effect of reconstructing a data-rich modality (i.e., visible modality) from a privacy-preserving modality with data scarcity (i.e., long-wavelength infrared) for in-bed human pose estimation. For reconstruction, we use a conditional generative adversarial network. We conduct ablative studies across different design decisions of our framework. This includes selecting features with different levels of granularity, using different fusion techniques, and varying model parameters. Through extensive evaluations, we demonstrate that our method produces on par or better results compared to the state-of-the-art.
CVNov 30, 2021
In-Bed Human Pose Estimation from Unseen and Privacy-Preserving Image DomainsTing Cao, Mohammad Ali Armin, Simon Denman et al.
Medical applications have benefited greatly from the rapid advancement in computer vision. Considering patient monitoring in particular, in-bed human posture estimation offers important health-related metrics with potential value in medical condition assessments. Despite great progress in this domain, it remains challenging due to substantial ambiguity during occlusions, and the lack of large corpora of manually labeled data for model training, particularly with domains such as thermal infrared imaging which are privacy-preserving, and thus of great interest. Motivated by the effectiveness of self-supervised methods in learning features directly from data, we propose a multi-modal conditional variational autoencoder (MC-VAE) capable of reconstructing features from missing modalities seen during training. This approach is used with HRNet to enable single modality inference for in-bed pose estimation. Through extensive evaluations, we demonstrate that body positions can be effectively recognized from the available modality, achieving on par results with baseline models that are highly dependent on having access to multiple modes at inference time. The proposed framework supports future research towards self-supervised learning that generates a robust model from a single source, and expects it to generalize over many unknown distributions in clinical environments.
IVAug 9, 2021
Multi-Slice Net: A novel light weight framework for COVID-19 DiagnosisHarshala Gammulle, Tharindu Fernando, Sridha Sridharan et al.
This paper presents a novel lightweight COVID-19 diagnosis framework using CT scans. Our system utilises a novel two-stage approach to generate robust and efficient diagnoses across heterogeneous patient level inputs. We use a powerful backbone network as a feature extractor to capture discriminative slice-level features. These features are aggregated by a lightweight network to obtain a patient level diagnosis. The aggregation network is carefully designed to have a small number of trainable parameters while also possessing sufficient capacity to generalise to diverse variations within different CT volumes and to adapt to noise introduced during the data acquisition. We achieve a significant performance increase over the baselines when benchmarked on the SPGC COVID-19 Radiomics Dataset, despite having only 2.5 million trainable parameters and requiring only 0.623 seconds on average to process a single patient's CT volume using an Nvidia-GeForce RTX 2080 GPU.
LGJul 1, 2021
A Survey on Graph-Based Deep Learning for Computational HistopathologyDavid Ahmedt-Aristizabal, Mohammad Ali Armin, Simon Denman et al.
With the remarkable success of representation learning for prediction problems, we have witnessed a rapid expansion of the use of machine learning and deep learning for the analysis of digital pathology and biopsy image patches. However, learning over patch-wise features using convolutional neural networks limits the ability of the model to capture global contextual information and comprehensively model tissue composition. The phenotypical and topological distribution of constituent histological entities play a critical role in tissue diagnosis. As such, graph data representations and deep learning have attracted significant attention for encoding tissue representations, and capturing intra- and inter- entity level interactions. In this review, we provide a conceptual grounding for graph analytics in digital pathology, including entity-graph construction and graph architectures, and present their current success for tumor localization and classification, tumor invasion and staging, image retrieval, and survival prediction. We provide an overview of these methods in a systematic manner organized by the graph representation of the input image, scale, and organ on which they operate. We also outline the limitations of existing techniques, and suggest potential future research directions in this domain.
SDJun 30, 2021
Robust and Interpretable Temporal Convolution Network for Event Detection in Lung Sound RecordingsTharindu Fernando, Sridha Sridharan, Simon Denman et al.
This paper proposes a novel framework for lung sound event detection, segmenting continuous lung sound recordings into discrete events and performing recognition on each event. Exploiting the lightweight nature of Temporal Convolution Networks (TCNs) and their superior results compared to their recurrent counterparts, we propose a lightweight, yet robust, and completely interpretable framework for lung sound event detection. We propose the use of a multi-branch TCN architecture and exploit a novel fusion strategy to combine the resultant features from these branches. This not only allows the network to retain the most salient information across different temporal granularities and disregards irrelevant information, but also allows our network to process recordings of arbitrary length. Results: The proposed method is evaluated on multiple public and in-house benchmarks of irregular and noisy recordings of the respiratory auscultation process for the identification of numerous auscultation events including inhalation, exhalation, crackles, wheeze, stridor, and rhonchi. We exceed the state-of-the-art results in all evaluations. Furthermore, we empirically analyse the effect of the proposed multi-branch TCN architecture and the feature fusion strategy and provide quantitative and qualitative evaluations to illustrate their efficiency. Moreover, we provide an end-to-end model interpretation pipeline that interprets the operations of all the components of the proposed framework. Our analysis of different feature fusion strategies shows that the proposed feature concatenation method leads to better suppression of non-informative features, which drastically reduces the classifier overhead resulting in a robust lightweight network.The lightweight nature of our model allows it to be deployed in end-user devices such as smartphones, and it has the ability to generate predictions in real-time.
LGMay 27, 2021
Graph-Based Deep Learning for Medical Diagnosis and Analysis: Past, Present and FutureDavid Ahmedt-Aristizabal, Mohammad Ali Armin, Simon Denman et al.
With the advances of data-driven machine learning research, a wide variety of prediction problems have been tackled. It has become critical to explore how machine learning and specifically deep learning methods can be exploited to analyse healthcare data. A major limitation of existing methods has been the focus on grid-like data; however, the structure of physiological recordings are often irregular and unordered which makes it difficult to conceptualise them as a matrix. As such, graph neural networks have attracted significant attention by exploiting implicit information that resides in a biological system, with interactive nodes connected by edges whose weights can be either temporal associations or anatomical junctions. In this survey, we thoroughly review the different types of graph architectures and their applications in healthcare. We provide an overview of these methods in a systematic manner, organized by their domain of application including functional connectivity, anatomical structure and electrical-based analysis. We also outline the limitations of existing techniques and discuss potential directions for future research.
CVMay 27, 2021
Towards Interpretable Attention Networks for Cervical Cancer AnalysisRuiqi Wang, Mohammad Ali Armin, Simon Denman et al.
Recent advances in deep learning have enabled the development of automated frameworks for analysing medical images and signals, including analysis of cervical cancer. Many previous works focus on the analysis of isolated cervical cells, or do not offer sufficient methods to explain and understand how the proposed models reach their classification decisions on multi-cell images. Here, we evaluate various state-of-the-art deep learning models and attention-based frameworks for the classification of images of multiple cervical cells. As we aim to provide interpretable deep learning models to address this task, we also compare their explainability through the visualization of their gradients. We demonstrate the importance of using images that contain multiple cells over using isolated single-cell images. We show the effectiveness of the residual channel attention model for extracting important features from a group of cells, and demonstrate this model's efficiency for this classification task. This work highlights the benefits of channel attention mechanisms in analyzing multiple-cell images for potential relations and distributions within a group of cells. It also provides interpretable models to address the classification of cervical cells.
CVMay 27, 2021
Video-Based Inpatient Fall Risk Assessment: A Case StudyZiqing Wang, Mohammad Ali Armin, Simon Denman et al.
Inpatient falls are a serious safety issue in hospitals and healthcare facilities. Recent advances in video analytics for patient monitoring provide a non-intrusive avenue to reduce this risk through continuous activity monitoring. However, in-bed fall risk assessment systems have received less attention in the literature. The majority of prior studies have focused on fall event detection, and do not consider the circumstances that may indicate an imminent inpatient fall. Here, we propose a video-based system that can monitor the risk of a patient falling, and alert staff of unsafe behaviour to help prevent falls before they occur. We propose an approach that leverages recent advances in human localisation and skeleton pose estimation to extract spatial features from video frames recorded in a simulated environment. We demonstrate that body positions can be effectively recognised and provide useful evidence for fall risk assessment. This work highlights the benefits of video-based models for analysing behaviours of interest, and demonstrates how such a system could enable sufficient lead time for healthcare professionals to respond and address patient needs, which is necessary for the development of fall intervention programs.
CVApr 28, 2021
Semantic Consistency and Identity Mapping Multi-Component Generative Adversarial Network for Person Re-IdentificationAmena Khatun, Simon Denman, Sridha Sridharan et al.
In a real world environment, person re-identification (Re-ID) is a challenging task due to variations in lighting conditions, viewing angles, pose and occlusions. Despite recent performance gains, current person Re-ID algorithms still suffer heavily when encountering these variations. To address this problem, we propose a semantic consistency and identity mapping multi-component generative adversarial network (SC-IMGAN) which provides style adaptation from one to many domains. To ensure that transformed images are as realistic as possible, we propose novel identity mapping and semantic consistency losses to maintain identity across the diverse domains. For the Re-ID task, we propose a joint verification-identification quartet network which is trained with generated and real images, followed by an effective quartet loss for verification. Our proposed method outperforms state-of-the-art techniques on six challenging person Re-ID datasets: CUHK01, CUHK03, VIPeR, PRID2011, iLIDS and Market-1501.
CVApr 28, 2021
Pose-driven Attention-guided Image Generation for Person Re-IdentificationAmena Khatun, Simon Denman, Sridha Sridharan et al.
Person re-identification (re-ID) concerns the matching of subject images across different camera views in a multi camera surveillance system. One of the major challenges in person re-ID is pose variations across the camera network, which significantly affects the appearance of a person. Existing development data lack adequate pose variations to carry out effective training of person re-ID systems. To solve this issue, in this paper we propose an end-to-end pose-driven attention-guided generative adversarial network, to generate multiple poses of a person. We propose to attentively learn and transfer the subject pose through an attention mechanism. A semantic-consistency loss is proposed to preserve the semantic information of the person during pose transfer. To ensure fine image details are realistic after pose translation, an appearance discriminator is used while a pose discriminator is used to ensure the pose of the transferred images will exactly be the same as the target pose. We show that by incorporating the proposed approach in a person re-identification framework, realistic pose transferred images and state-of-the-art re-identification results can be achieved.
CVApr 15, 2021
Learning Regional Attention over Multi-resolution Deep Convolutional Features for Trademark RetrievalOsman Tursun, Simon Denman, Sridha Sridharan et al.
Large-scale trademark retrieval is an important content-based image retrieval task. A recent study shows that off-the-shelf deep features aggregated with Regional-Maximum Activation of Convolutions (R-MAC) achieve state-of-the-art results. However, R-MAC suffers in the presence of background clutter/trivial regions and scale variance, and discards important spatial information. We introduce three simple but effective modifications to R-MAC to overcome these drawbacks. First, we propose the use of both sum and max pooling to minimise the loss of spatial information. We also employ domain-specific unsupervised soft-attention to eliminate background clutter and unimportant regions. Finally, we add multi-resolution inputs to enhance the scale-invariance of R-MAC. We evaluate these three modifications on the million-scale METU dataset. Our results show that all modifications bring non-trivial improvements, and surpass previous state-of-the-art results.
CVFeb 8, 2021
An Efficient Framework for Zero-Shot Sketch-Based Image RetrievalOsman Tursun, Simon Denman, Sridha Sridharan et al.
Recently, Zero-shot Sketch-based Image Retrieval (ZS-SBIR) has attracted the attention of the computer vision community due to it's real-world applications, and the more realistic and challenging setting than found in SBIR. ZS-SBIR inherits the main challenges of multiple computer vision problems including content-based Image Retrieval (CBIR), zero-shot learning and domain adaptation. The majority of previous studies using deep neural networks have achieved improved results through either projecting sketch and images into a common low-dimensional space or transferring knowledge from seen to unseen classes. However, those approaches are trained with complex frameworks composed of multiple deep convolutional neural networks (CNNs) and are dependent on category-level word labels. This increases the requirements on training resources and datasets. In comparison, we propose a simple and efficient framework that does not require high computational training resources, and can be trained on datasets without semantic categorical labels. Furthermore, at training and inference stages our method only uses a single CNN. In this work, a pre-trained ImageNet CNN (e.g., ResNet50) is fine-tuned with three proposed learning objects: domain-aware quadruplet loss, semantic classification loss, and semantic knowledge preservation loss. The domain-aware quadruplet and semantic classification losses are introduced to learn discriminative, semantic and domain invariant features through considering ZS-SBIR as object detection and verification problem. ...
LGDec 4, 2020
Deep Learning for Medical Anomaly Detection -- A SurveyTharindu Fernando, Harshala Gammulle, Simon Denman et al.
Machine learning-based medical anomaly detection is an important problem that has been extensively studied. Numerous approaches have been proposed across various medical application domains and we observe several similarities across these distinct applications. Despite this comparability, we observe a lack of structured organisation of these diverse research applications such that their advantages and limitations can be studied. The principal aim of this survey is to provide a thorough theoretical analysis of popular deep learning techniques in medical anomaly detection. In particular, we contribute a coherent and systematic review of state-of-the-art techniques, comparing and contrasting their architectural differences as well as training algorithms. Furthermore, we provide a comprehensive overview of deep model interpretation strategies that can be used to interpret model decisions. In addition, we outline the key limitations of existing deep medical anomaly detection techniques and propose key research directions for further investigation.
CVDec 2, 2020
Sparse Convolutions on Continuous Domains for Point Cloud and Event Stream NetworksDominic Jack, Frederic Maire, Simon Denman et al.
Image convolutions have been a cornerstone of a great number of deep learning advances in computer vision. The research community is yet to settle on an equivalent operator for sparse, unstructured continuous data like point clouds and event streams however. We present an elegant sparse matrix-based interpretation of the convolution operator for these cases, which is consistent with the mathematical definition of convolution and efficient during training. On benchmark point cloud classification problems we demonstrate networks built with these operations can train an order of magnitude or more faster than top existing methods, whilst maintaining comparable accuracy and requiring a tiny fraction of the memory. We also apply our operator to event stream processing, achieving state-of-the-art results on multiple tasks with streams of hundreds of thousands of events.
CVNov 18, 2020
Patient-independent Epileptic Seizure Prediction using Deep Learning ModelsTheekshana Dissanayake, Tharindu Fernando, Simon Denman et al.
Objective: Epilepsy is one of the most prevalent neurological diseases among humans and can lead to severe brain injuries, strokes, and brain tumors. Early detection of seizures can help to mitigate injuries, and can be used to aid the treatment of patients with epilepsy. The purpose of a seizure prediction system is to successfully identify the pre-ictal brain stage, which occurs before a seizure event. Patient-independent seizure prediction models are designed to offer accurate performance across multiple subjects within a dataset, and have been identified as a real-world solution to the seizure prediction problem. However, little attention has been given for designing such models to adapt to the high inter-subject variability in EEG data. Methods: We propose two patient-independent deep learning architectures with different learning strategies that can learn a global function utilizing data from multiple subjects. Results: Proposed models achieve state-of-the-art performance for seizure prediction on the CHB-MIT-EEG dataset, demonstrating 88.81% and 91.54% accuracy respectively. Conclusions: The Siamese model trained on the proposed learning strategy is able to learn patterns related to patient variations in data while predicting seizures. Significance: Our models show superior performance for patient-independent seizure prediction, and the same architecture can be used as a patient-specific classifier after model adaptation. We are the first study that employs model interpretation to understand classifier behavior for the task for seizure prediction, and we also show that the MFCC feature map utilized by our models contains predictive biomarkers related to interictal and pre-ictal brain states.
CVNov 12, 2020
Domain Generalization in Biosignal ClassificationTheekshana Dissanayake, Tharindu Fernando, Simon Denman et al.
Objective: When training machine learning models, we often assume that the training data and evaluation data are sampled from the same distribution. However, this assumption is violated when the model is evaluated on another unseen but similar database, even if that database contains the same classes. This problem is caused by domain-shift and can be solved using two approaches: domain adaptation and domain generalization. Simply, domain adaptation methods can access data from unseen domains during training; whereas in domain generalization, the unseen data is not available during training. Hence, domain generalization concerns models that perform well on inaccessible, domain-shifted data. Method: Our proposed domain generalization method represents an unseen domain using a set of known basis domains, afterwhich we classify the unseen domain using classifier fusion. To demonstrate our system, we employ a collection of heart sound databases that contain normal and abnormal sounds (classes). Results: Our proposed classifier fusion method achieves accuracy gains of up to 16% for four completely unseen domains. Conclusion: Recognizing the complexity induced by the inherent temporal nature of biosignal data, the two-stage method proposed in this study is able to effectively simplify the whole process of domain generalization while demonstrating good results on unseen domains and the adopted basis domains. Significance: To our best knowledge, this is the first study that investigates domain generalization for biosignal data. Our proposed learning strategy can be used to effectively learn domain-relevant features while being aware of the class differences in the data.
CVNov 10, 2020
Fast & Slow Learning: Incorporating Synthetic Gradients in Neural Memory ControllersTharindu Fernando, Simon Denman, Sridha Sridharan et al.
Neural Memory Networks (NMNs) have received increased attention in recent years compared to deep architectures that use a constrained memory. Despite their new appeal, the success of NMNs hinges on the ability of the gradient-based optimiser to perform incremental training of the NMN controllers, determining how to leverage their high capacity for knowledge retrieval. This means that while excellent performance can be achieved when the training data is consistent and well distributed, rare data samples are hard to learn from as the controllers fail to incorporate them effectively during model training. Drawing inspiration from the human cognition process, in particular the utilisation of neuromodulators in the human brain, we propose to decouple the learning process of the NMN controllers to allow them to achieve flexible, rapid adaptation in the presence of new information. This trait is highly beneficial for meta-learning tasks where the memory controllers must quickly grasp abstract concepts in the target domain, and adapt stored knowledge. This allows the NMN controllers to quickly determine which memories are to be retained and which are to be erased, and swiftly adapt their strategy to the new task at hand. Through both quantitative and qualitative evaluations on multiple public benchmarks, including classification and regression tasks, we demonstrate the utility of the proposed approach. Our evaluations not only highlight the ability of the proposed NMN architecture to outperform the current state-of-the-art methods, but also provide insights on how the proposed augmentations help achieve such superior results. In addition, we demonstrate the practical implications of the proposed learning strategy, where the feedback path can be shared among multiple neural memory networks as a mechanism for knowledge sharing.
CVNov 10, 2020
Multi-modal Fusion for Single-Stage Continuous Gesture RecognitionHarshala Gammulle, Simon Denman, Sridha Sridharan et al.
Gesture recognition is a much studied research area which has myriad real-world applications including robotics and human-machine interaction. Current gesture recognition methods have focused on recognising isolated gestures, and existing continuous gesture recognition methods are limited to two-stage approaches where independent models are required for detection and classification, with the performance of the latter being constrained by detection performance. In contrast, we introduce a single-stage continuous gesture recognition framework, called Temporal Multi-Modal Fusion (TMMF), that can detect and classify multiple gestures in a video via a single model. This approach learns the natural transitions between gestures and non-gestures without the need for a pre-processing segmentation step to detect individual gestures. To achieve this, we introduce a multi-modal fusion mechanism to support the integration of important information that flows from multi-modal inputs, and is scalable to any number of modes. Additionally, we propose Unimodal Feature Mapping (UFM) and Multi-modal Feature Mapping (MFM) models to map uni-modal features and the fused multi-modal features respectively. To further enhance performance, we propose a mid-point based loss function that encourages smooth alignment between the ground truth and the prediction, helping the model to learn natural gesture transitions. We demonstrate the utility of our proposed framework, which can handle variable-length input videos, and outperforms the state-of-the-art on three challenging datasets: EgoGesture, IPN hand, and ChaLearn LAP Continuous Gesture Dataset (ConGD). Furthermore, ablation experiments show the importance of different components of the proposed framework.
ASSep 23, 2020
Attention Driven Fusion for Multi-Modal Emotion RecognitionDarshana Priyasad, Tharindu Fernando, Simon Denman et al.
Deep learning has emerged as a powerful alternative to hand-crafted methods for emotion recognition on combined acoustic and text modalities. Baseline systems model emotion information in text and acoustic modes independently using Deep Convolutional Neural Networks (DCNN) and Recurrent Neural Networks (RNN), followed by applying attention, fusion, and classification. In this paper, we present a deep learning-based approach to exploit and fuse text and acoustic data for emotion classification. We utilize a SincNet layer, based on parameterized sinc functions with band-pass filters, to extract acoustic features from raw audio followed by a DCNN. This approach learns filter banks tuned for emotion recognition and provides more effective features compared to directly applying convolutions over the raw speech signal. For text processing, we use two branches (a DCNN and a Bi-direction RNN followed by a DCNN) in parallel where cross attention is introduced to infer the N-gram level correlations on hidden representations received from the Bi-RNN. Following existing state-of-the-art, we evaluate the performance of the proposed system on the IEMOCAP dataset. Experimental results indicate that the proposed system outperforms existing methods, achieving 3.5% improvement in weighted accuracy.
LGJul 16, 2020
Memory based fusion for multi-modal deep learningDarshana Priyasad, Tharindu Fernando, Simon Denman et al.
The use of multi-modal data for deep machine learning has shown promise when compared to uni-modal approaches with fusion of multi-modal features resulting in improved performance in several applications. However, most state-of-the-art methods use naive fusion which processes feature streams independently, ignoring possible long-term dependencies within the data during fusion. In this paper, we present a novel Memory based Attentive Fusion layer, which fuses modes by incorporating both the current features and longterm dependencies in the data, thus allowing the model to understand the relative importance of modes over time. We introduce an explicit memory block within the fusion layer which stores features containing long-term dependencies of the fused data. The feature inputs from uni-modal encoders are fused through attentive composition and transformation followed by naive fusion of the resultant memory derived features with layer inputs. Following state-of-the-art methods, we have evaluated the performance and the generalizability of the proposed fusion approach on two different datasets with different modalities. In our experiments, we replace the naive fusion layer in benchmark networks with our proposed layer to enable a fair comparison. Experimental results indicate that the MBAF layer can generalise across different modalities and networks to enhance fusion and improve performance.
CVJul 12, 2020
Two-Stream Deep Feature Modelling for Automated Video Endoscopy Data AnalysisHarshala Gammulle, Simon Denman, Sridha Sridharan et al.
Automating the analysis of imagery of the Gastrointestinal (GI) tract captured during endoscopy procedures has substantial potential benefits for patients, as it can provide diagnostic support to medical practitioners and reduce mistakes via human error. To further the development of such methods, we propose a two-stream model for endoscopic image analysis. Our model fuses two streams of deep feature inputs by mapping their inherent relations through a novel relational network model, to better model symptoms and classify the image. In contrast to handcrafted feature-based models, our proposed network is able to learn features automatically and outperforms existing state-of-the-art methods on two public datasets: KVASIR and Nerthus. Our extensive evaluations illustrate the importance of having two streams of inputs instead of a single stream and also demonstrates the merits of the proposed relational network architecture to combine those streams.
CVJun 23, 2020
Meta Transfer Learning for Emotion RecognitionDung Nguyen, Sridha Sridharan, Duc Thanh Nguyen et al.
Deep learning has been widely adopted in automatic emotion recognition and has lead to significant progress in the field. However, due to insufficient annotated emotion datasets, pre-trained models are limited in their generalization capability and thus lead to poor performance on novel test sets. To mitigate this challenge, transfer learning performing fine-tuning on pre-trained models has been applied. However, the fine-tuned knowledge may overwrite and/or discard important knowledge learned from pre-trained models. In this paper, we address this issue by proposing a PathNet-based transfer learning method that is able to transfer emotional knowledge learned from one visual/audio emotion domain to another visual/audio emotion domain, and transfer the emotional knowledge learned from multiple audio emotion domains into one another to improve overall emotion recognition accuracy. To show the robustness of our proposed system, various sets of experiments for facial expression recognition and speech emotion recognition task on three emotion datasets: SAVEE, EMODB, and eNTERFACE have been carried out. The experimental results indicate that our proposed system is capable of improving the performance of emotion recognition, making its performance substantially superior to the recent proposed fine-tuning/pre-trained models based transfer learning methods.
SDMay 21, 2020
A Robust Interpretable Deep Learning Classifier for Heart Anomaly Detection Without SegmentationTheekshana Dissanayake, Tharindu Fernando, Simon Denman et al.
Traditionally, abnormal heart sound classification is framed as a three-stage process. The first stage involves segmenting the phonocardiogram to detect fundamental heart sounds; after which features are extracted and classification is performed. Some researchers in the field argue the segmentation step is an unwanted computational burden, whereas others embrace it as a prior step to feature extraction. When comparing accuracies achieved by studies that have segmented heart sounds before analysis with those who have overlooked that step, the question of whether to segment heart sounds before feature extraction is still open. In this study, we explicitly examine the importance of heart sound segmentation as a prior step for heart sound classification, and then seek to apply the obtained insights to propose a robust classifier for abnormal heart sound detection. Furthermore, recognizing the pressing need for explainable Artificial Intelligence (AI) models in the medical domain, we also unveil hidden representations learned by the classifier using model interpretation techniques. Experimental results demonstrate that the segmentation plays an essential role in abnormal heart sound classification. Our new classifier is also shown to be robust, stable and most importantly, explainable, with an accuracy of almost 100% on the widely used PhysioNet dataset.
CVMay 7, 2020
End-to-End Domain Adaptive Attention Network for Cross-Domain Person Re-IdentificationAmena Khatun, Simon Denman, Sridha Sridharan et al.
Person re-identification (re-ID) remains challenging in a real-world scenario, as it requires a trained network to generalise to totally unseen target data in the presence of variations across domains. Recently, generative adversarial models have been widely adopted to enhance the diversity of training data. These approaches, however, often fail to generalise to other domains, as existing generative person re-identification models have a disconnect between the generative component and the discriminative feature learning stage. To address the on-going challenges regarding model generalisation, we propose an end-to-end domain adaptive attention network to jointly translate images between domains and learn discriminative re-id features in a single framework. To address the domain gap challenge, we introduce an attention module for image translation from source to target domains without affecting the identity of a person. More specifically, attention is directed to the background instead of the entire image of the person, ensuring identifying characteristics of the subject are preserved. The proposed joint learning network results in a significant performance improvement over state-of-the-art methods on several benchmark datasets.
CVMay 7, 2020
Hierarchical Attention Network for Action SegmentationHarshala Gammulle, Simon Denman, Sridha Sridharan et al.
The temporal segmentation of events is an essential task and a precursor for the automatic recognition of human actions in the video. Several attempts have been made to capture frame-level salient aspects through attention but they lack the capacity to effectively map the temporal relationships in between the frames as they only capture a limited span of temporal dependencies. To this end we propose a complete end-to-end supervised learning approach that can better learn relationships between actions over time, thus improving the overall segmentation performance. The proposed hierarchical recurrent attention framework analyses the input video at multiple temporal scales, to form embeddings at frame level and segment level, and perform fine-grained action segmentation. This generates a simple, lightweight, yet extremely effective architecture for segmenting continuous video streams and has multiple application domains. We evaluate our system on multiple challenging public benchmark datasets, including MERL Shopping, 50 salads, and Georgia Tech Egocentric datasets, and achieves state-of-the-art performance. The evaluated datasets encompass numerous video capture settings which are inclusive of static overhead camera views and dynamic, ego-centric head-mounted camera views, demonstrating the direct applicability of the proposed framework in a variety of settings.
ASApr 2, 2020
Temporarily-Aware Context Modelling using Generative Adversarial Networks for Speech Activity DetectionTharindu Fernando, Sridha Sridharan, Mitchell McLaren et al.
This paper presents a novel framework for Speech Activity Detection (SAD). Inspired by the recent success of multi-task learning approaches in the speech processing domain, we propose a novel joint learning framework for SAD. We utilise generative adversarial networks to automatically learn a loss function for joint prediction of the frame-wise speech/ non-speech classifications together with the next audio segment. In order to exploit the temporal relationships within the input signal, we propose a temporal discriminator which aims to ensure that the predicted signal is temporally consistent. We evaluate the proposed framework on multiple public benchmarks, including NIST OpenSAT' 17, AMI Meeting and HAVIC, where we demonstrate its capability to outperform state-of-the-art SAD approaches. Furthermore, our cross-database evaluations demonstrate the robustness of the proposed approach across different languages, accents, and acoustic environments.
SPApr 2, 2020
Heart Sound Segmentation using Bidirectional LSTMs with AttentionTharindu Fernando, Houman Ghaemmaghami, Simon Denman et al.
This paper proposes a novel framework for the segmentation of phonocardiogram (PCG) signals into heart states, exploiting the temporal evolution of the PCG as well as considering the salient information that it provides for the detection of the heart state. We propose the use of recurrent neural networks and exploit recent advancements in attention based learning to segment the PCG signal. This allows the network to identify the most salient aspects of the signal and disregard uninformative information. The proposed method attains state-of-the-art performance on multiple benchmarks including both human and animal heart recordings. Furthermore, we empirically analyse different feature combinations including envelop features, wavelet and Mel Frequency Cepstral Coefficients (MFCC), and provide quantitative measurements that explore the importance of different features in the proposed approach. We demonstrate that a recurrent neural network coupled with attention mechanisms can effectively learn from irregular and noisy PCG recordings. Our analysis of different feature combinations shows that MFCC features and their derivatives offer the best performance compared to classical wavelet and envelop features. Heart sound segmentation is a crucial pre-processing step for many diagnostic applications. The proposed method provides a cost effective alternative to labour extensive manual segmentation, and provides a more accurate segmentation than existing methods. As such, it can improve the performance of further analysis including the detection of murmurs and ejection clicks. The proposed method is also applicable for detection and segmentation of other one dimensional biomedical signals.
CVMar 24, 2020
Joint Deep Cross-Domain Transfer Learning for Emotion RecognitionDung Nguyen, Sridha Sridharan, Duc Thanh Nguyen et al.
Deep learning has been applied to achieve significant progress in emotion recognition. Despite such substantial progress, existing approaches are still hindered by insufficient training data, and the resulting models do not generalize well under mismatched conditions. To address this challenge, we propose a learning strategy which jointly transfers the knowledge learned from rich datasets to source-poor datasets. Our method is also able to learn cross-domain features which lead to improved recognition performance. To demonstrate the robustness of our proposed framework, we conducted experiments on three benchmark emotion datasets including eNTERFACE, SAVEE, and EMODB. Experimental results show that the proposed method surpassed state-of-the-art transfer learning schemes by a significant margin.
CVFeb 5, 2020
Learning Test-time Augmentation for Content-based Image RetrievalOsman Tursun, Simon Denman, Sridha Sridharan et al.
Off-the-shelf convolutional neural network features achieve outstanding results in many image retrieval tasks. However, their invariance to target data is pre-defined by the network architecture and training data. Existing image retrieval approaches require fine-tuning or modification of pre-trained networks to adapt to variations unique to the target data. In contrast, our method enhances the invariance of off-the-shelf features by aggregating features extracted from images augmented at test-time, with augmentations guided by a policy learned through reinforcement learning. The learned policy assigns different magnitudes and weights to the selected transformations, which are selected from a list of image transformations. Policies are evaluated using a metric learning protocol to learn the optimal policy. The model converges quickly and the cost of each policy iteration is minimal as we propose an off-line caching technique to greatly reduce the computational cost of extracting features from augmented images. Experimental results on large trademark retrieval (METU trademark dataset) and landmark retrieval (ROxford5k and RParis6k scene datasets) tasks show that the learned ensemble of transformations is highly effective for improving performance, and is practical, and transferable.
CVDec 16, 2019
MTRNet++: One-stage Mask-based Scene Text EraserOsman Tursun, Simon Denman, Rui Zeng et al.
A precise, controllable, interpretable and easily trainable text removal approach is necessary for both user-specific and large-scale text removal applications. To achieve this, we propose a one-stage mask-based text inpainting network, MTRNet++. It has a novel architecture that includes mask-refine, coarse-inpainting and fine-inpainting branches, and attention blocks. With this architecture, MTRNet++ can remove text either with or without an external mask. It achieves state-of-the-art results on both the Oxford and SCUT datasets without using external ground-truth masks. The results of ablation studies demonstrate that the proposed multi-branch architecture with attention blocks is effective and essential. It also demonstrates controllability and interpretability.
CVDec 16, 2019
Predicting the Future: A Jointly Learnt Model for Action AnticipationHarshala Gammulle, Simon Denman, Sridha Sridharan et al.
Inspired by human neurological structures for action anticipation, we present an action anticipation model that enables the prediction of plausible future actions by forecasting both the visual and temporal future. In contrast to current state-of-the-art methods which first learn a model to predict future video features and then perform action anticipation using these features, the proposed framework jointly learns to perform the two tasks, future visual and temporal representation synthesis, and early action anticipation. The joint learning framework ensures that the predicted future embeddings are informative to the action anticipation task. Furthermore, through extensive experimental evaluations we demonstrate the utility of using both visual and temporal semantics of the scene, and illustrate how this representation synthesis could be achieved through a recurrent Generative Adversarial Network (GAN) framework. Our model outperforms the current state-of-the-art methods on multiple datasets: UCF101, UCF101-24, UT-Interaction and TV Human Interaction.
SPDec 10, 2019
Neural Memory Networks for Seizure Type ClassificationDavid Ahmedt-Aristizabal, Tharindu Fernando, Simon Denman et al.
Classification of seizure type is a key step in the clinical process for evaluating an individual who presents with seizures. It determines the course of clinical diagnosis and treatment, and its impact stretches beyond the clinical domain to epilepsy research and the development of novel therapies. Automated identification of seizure type may facilitate understanding of the disease, and seizure detection and prediction has been the focus of recent research that has sought to exploit the benefits of machine learning and deep learning architectures. Nevertheless, there is not yet a definitive solution for automating the classification of seizure type, a task that must currently be performed by an expert epileptologist. Inspired by recent advances in neural memory networks (NMNs), we introduce a novel approach for the classification of seizure type using electrophysiological data. We first explore the performance of traditional deep learning techniques which use convolutional and recurrent neural networks, and enhance these architectures by using external memory modules with trainable neural plasticity. We show that our model achieves a state-of-the-art weighted F1 score of 0.945 for seizure type classification on the TUH EEG Seizure Corpus with the IBM TUSZ preprocessed data. This work highlights the potential of neural memory networks to support the field of epilepsy research, along with biomedical research and signal analysis more broadly.
CVNov 17, 2019
Exploiting Human Social Cognition for the Detection of Fake and Fraudulent Faces via Memory NetworksTharindu Fernando, Clinton Fookes, Simon Denman et al.
Advances in computer vision have brought us to the point where we have the ability to synthesise realistic fake content. Such approaches are seen as a source of disinformation and mistrust, and pose serious concerns to governments around the world. Convolutional Neural Networks (CNNs) demonstrate encouraging results when detecting fake images that arise from the specific type of manipulation they are trained on. However, this success has not transitioned to unseen manipulation types, resulting in a significant gap in the line-of-defense. We propose a Hierarchical Memory Network (HMN) architecture, which is able to successfully detect faked faces by utilising knowledge stored in neural memories as well as visual cues to reason about the perceived face and anticipate its future semantic embeddings. This renders a generalisable face tampering detection framework. Experimental results demonstrate the proposed approach achieves superior performance for fake and fraudulent face detection compared to the state-of-the-art.