Syed Mohammed Shamsul Islam

CV
h-index: 10
Papers: 5
Citations: 62
Novelty: 33%
AI Score: 37

5 Papers

CV · Aug 1, 2023
MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers

Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam et al.

In line with the human capacity to perceive the world by simultaneously processing and integrating high-dimensional inputs from multiple modalities like vision and audio, we propose a novel model, MAiVAR-T (Multimodal Audio-Image to Video Action Recognition Transformer). This model employs an intuitive approach for combining audio-image and video modalities, with the primary aim of improving the effectiveness of multimodal human action recognition (MHAR). At its core, MAiVAR-T distills representations from the audio modality and converts them into the image domain. This audio-image representation is then fused with the video modality to form a unified representation. The combined approach exploits the contextual richness inherent in both audio and video modalities to improve action recognition. MAiVAR-T outperforms existing state-of-the-art strategies that focus solely on audio or video modalities. Our extensive empirical evaluations on a benchmark action recognition dataset corroborate the model's performance and underscore the gains available from integrating audio and video modalities for action recognition.

CV · Sep 11, 2022
MAiVAR: Multimodal Audio-Image and Video Action Recognizer

Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam et al.

Currently, action recognition is predominantly performed on video data processed by CNNs. We investigate whether the representation process of CNNs can also be leveraged for multimodal action recognition by incorporating image-based audio representations of actions in a task. To this end, we propose the Multimodal Audio-Image and Video Action Recognizer (MAiVAR), a CNN-based audio-image to video fusion model that accounts for video and audio modalities to achieve superior action recognition performance. MAiVAR extracts meaningful image representations of audio and fuses them with the video representation, achieving better performance than either modality individually on a large-scale action recognition dataset.

22.7 · CR · May 6
Assessing Generalisation Capability of Machine Learning Models for Intrusion Detection

Md Zakir Hossain, Md Ayshik Rahman Khan, Md Rafiqul Islam et al.

The growth of networked and IoT systems has intensified cyber-security threats and exposed the limits of traditional signature-based intrusion detection. Although machine-learning-based intrusion detection systems often report strong benchmark performance, high accuracy within a single dataset does not guarantee reliable performance in unseen network environments. This study investigates the generalisation capability of supervised machine learning models for intrusion detection using UNSW-NB15 and TON_IoT. Random Forest, Logistic Regression, and Naive Bayes were evaluated under same-dataset and cross-dataset settings. Random Forest achieved the strongest same-dataset performance, with 95.08% accuracy on UNSW-NB15 and 99.79% on TON_IoT, but performance dropped sharply in cross-dataset testing: when trained on UNSW-NB15 and tested on TON_IoT, or vice versa, accuracy fell below 40%. These results reveal a significant generalisation gap in intrusion detection. We connect this challenge to affective computing and human-centric AI, where behavioural signal analysis, anomaly detection, domain shift, and context-sensitive modelling are also central. This framing highlights the need for adaptive, generalisable cyber-security models that can operate across changing network and IoT environments.
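The same-dataset versus cross-dataset protocol described in this abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy two-feature data, the variable names, and the nearest-centroid classifier, which stands in for the Random Forest, Logistic Regression, and Naive Bayes models the study actually evaluates. It does not reproduce the paper's numbers.

```python
from statistics import mean


def fit_centroids(X, y):
    """Fit a nearest-centroid classifier (a stand-in, not the paper's models)."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        # Column-wise mean of all rows belonging to this class.
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sqdist(centroids[label]))


def accuracy(centroids, X, y):
    """Fraction of samples in (X, y) that the classifier labels correctly."""
    hits = sum(predict(centroids, x) == t for x, t in zip(X, y))
    return hits / len(y)


# Hypothetical dataset A: label 0 = benign, label 1 = attack.
a_train_X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
a_train_y = [0, 0, 1, 1]
a_test_X = [[0.15, 0.15], [0.85, 0.85]]
a_test_y = [0, 1]

# Hypothetical dataset B: same labels, but some benign traffic has shifted
# feature values (a toy form of domain shift between network environments).
b_test_X = [[0.6, 0.7], [0.1, 0.2], [0.9, 0.9], [0.7, 0.6]]
b_test_y = [0, 0, 1, 1]

model = fit_centroids(a_train_X, a_train_y)
same_acc = accuracy(model, a_test_X, a_test_y)    # train on A, test on A
cross_acc = accuracy(model, b_test_X, b_test_y)   # train on A, test on B

print(f"same-dataset accuracy:  {same_acc:.2f}")
print(f"cross-dataset accuracy: {cross_acc:.2f}")
```

The point of the sketch is the protocol rather than the classifier: the model is fitted once on one dataset and scored separately on a held-out split of that dataset and on a different dataset, so any gap between the two scores reflects domain shift rather than a change in the model.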

CV · Sep 21, 2020 · Code
Exploring Intensity Invariance in Deep Neural Networks for Brain Image Registration

Hassan Mahmood, Asim Iqbal, Syed Mohammed Shamsul Islam

Image registration is a widely used technique for analysing large-scale datasets captured through various imaging modalities and techniques in biomedical imaging, such as MRI and X-ray. These datasets are typically collected from various sites and under different imaging protocols using a variety of scanners. Such heterogeneity in the data collection process causes inhomogeneity or variation in intensity (brightness) and noise distribution. These variations degrade the performance of image registration, segmentation and detection algorithms. Classical image registration methods are computationally expensive but handle these artifacts relatively well. Deep learning-based techniques, in contrast, have been shown to be computationally efficient for automated brain registration but are sensitive to intensity variations. In this study, we investigate the effect of variation in intensity distribution among input image pairs for deep learning-based image registration methods. We find that the performance of these models degrades when brain image pairs with different intensity distributions are presented, even when their structures are similar. To overcome this limitation, we incorporate a structural similarity-based loss function in a deep neural network and test its performance on the validation split separated before training as well as on a completely unseen new dataset. We report that deep learning models trained with the structural similarity-based loss seem to perform better on both datasets. This investigation highlights a possible performance-limiting factor in deep learning-based registration models and suggests a potential solution that accounts for intensity distribution variation in the input image pairs. Our code and models are available at https://github.com/hassaanmahmood/DeepIntense.
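To make the idea of a structural similarity-based loss concrete, here is a minimal sketch that compares it with mean squared error on a toy example. It assumes a global (single-window) SSIM computed over a flattened list of pixel values in [0, 1], with the stabilising constants 0.01 and 0.03 from the common SSIM definition. The function and variable names are hypothetical, and this is not the authors' implementation, which is in the repository linked above.

```python
def ssim_global(x, y, data_range=1.0):
    """Global SSIM over two equal-length pixel lists (no sliding window)."""
    n = len(x)
    c1 = (0.01 * data_range) ** 2   # stabilises the luminance term
    c2 = (0.03 * data_range) ** 2   # stabilises the contrast/structure term
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    denominator = (mx * mx + my * my + c1) * (vx + vy + c2)
    return numerator / denominator


def ssim_loss(x, y):
    """Loss that is near 0 for structurally identical images."""
    return 1.0 - ssim_global(x, y)


def mse_loss(x, y):
    """Plain intensity-based loss, for comparison."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Hypothetical 4-pixel "images".
fixed = [0.1, 0.2, 0.8, 0.9]
moving_brighter = [v + 0.05 for v in fixed]   # same structure, brighter
moving_scrambled = list(reversed(fixed))      # same intensities, wrong structure

print("SSIM loss, brighter copy: ", ssim_loss(fixed, moving_brighter))
print("SSIM loss, scrambled copy:", ssim_loss(fixed, moving_scrambled))
print("MSE loss, brighter copy:  ", mse_loss(fixed, moving_brighter))
print("MSE loss, scrambled copy: ", mse_loss(fixed, moving_scrambled))
```

In this toy case the brightness-shifted copy keeps a low SSIM loss because its covariance with the fixed image is unchanged, while the scrambled copy is penalised heavily. Global SSIM is not fully invariant to intensity changes, since the luminance term still depends on the means; windowed variants used in practice behave similarly but are computed locally.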

CV · May 22, 2024
From CNNs to Transformers in Multimodal Human Action Recognition: A Survey

Muhammad Bilal Shaikh, Syed Mohammed Shamsul Islam, Douglas Chai et al.

Due to its widespread applications, human action recognition is one of the most widely studied research problems in Computer Vision. Recent studies have shown that addressing it using multimodal data leads to superior performance compared with relying on a single data modality. During the adoption of deep learning for visual modelling in the last decade, action recognition approaches have mainly relied on Convolutional Neural Networks (CNNs). However, the recent rise of Transformers in visual modelling is now also causing a paradigm shift for the action recognition task. This survey captures this transition while focusing on Multimodal Human Action Recognition (MHAR). Unique to the induction of multimodal computational models is the process of "fusing" the features of the individual data modalities. Hence, we specifically focus on the fusion design aspects of MHAR approaches. We analyze the classic and emerging techniques in this regard, while also highlighting the popular trends in the adaptation of CNN and Transformer building blocks for the overall problem. In particular, we emphasize recent design choices that have led to more efficient MHAR models. Unlike existing reviews, which discuss Human Action Recognition from a broad perspective, this survey is specifically aimed at pushing the boundaries of MHAR research by identifying promising architectural and fusion design choices to train practicable models. We also provide an outlook on multimodal datasets from the viewpoint of their scale and evaluation. Finally, building on the reviewed literature, we discuss the challenges and future avenues for MHAR.