Muhammad Ishfaq Hussain

CV
h-index28
4papers
9citations
Novelty48%
AI Score36

4 Papers

CVNov 13, 2025
Regional Attention-Enhanced Swin Transformer for Clinically Relevant Medical Image Captioning

Zubia Naz, Farhan Asghar, Muhammad Ishfaq Hussain et al.

Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder-decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluated on ROCO, our model achieves state-of-the-art semantic fidelity while remaining compact and interpretable. We report results as mean$\pm$std over three seeds and include $95\%$ confidence intervals. Compared with baselines, our approach improves ROUGE (proposed 0.603, ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore (proposed 0.807, BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR. We further provide ablations (regional attention on/off and token-count sweep), per-modality analysis (CT/MRI/X-ray), paired significance tests, and qualitative heatmaps that visualize the regions driving each description. Decoding uses beam search (beam size $=4$), length penalty $=1.1$, $no\_repeat\_ngram\_size$ $=3$, and max length $=128$. The proposed design yields accurate, clinically phrased captions and transparent regional attributions, supporting safe research use with a human in the loop.

CVOct 19, 2024Code
3D Multi-Object Tracking Employing MS-GLMB Filter for Autonomous Driving

Linh Van Ma, Muhammad Ishfaq Hussain, Kin-Choong Yow et al.

The MS-GLMB filter offers a robust framework for tracking multiple objects through the use of multi-sensor data. Building on this, the MV-GLMB and MV-GLMB-AB filters enhance the MS-GLMB capabilities by employing cameras for 3D multi-sensor multi-object tracking, effectively addressing occlusions. However, both filters depend on overlapping fields of view from the cameras to combine complementary information. In this paper, we introduce an improved approach that integrates an additional sensor, such as LiDAR, into the MS-GLMB framework for 3D multi-object tracking. Specifically, we present a new LiDAR measurement model, along with a multi-camera and LiDAR multi-object measurement model. Our experimental results demonstrate a significant improvement in tracking performance compared to existing MS-GLMB-based methods. Importantly, our method eliminates the need for overlapping fields of view, broadening the applicability of the MS-GLMB filter. Our source code for nuScenes dataset is available at https://github.com/linh-gist/ms-glmb-nuScenes.

CVDec 4, 2023
Adaptive Confidence Threshold for ByteTrack in Multi-Object Tracking

Linh Van Ma, Muhammad Ishfaq Hussain, JongHyun Park et al.

We investigate the application of ByteTrack in the realm of multiple object tracking. ByteTrack, a simple tracking algorithm, enables the simultaneous tracking of multiple objects by strategically incorporating detections with a low confidence threshold. Conventionally, objects are initially associated with high confidence threshold detections. When the association between objects and detections becomes ambiguous, ByteTrack extends the association to lower confidence threshold detections. One notable drawback of the existing ByteTrack approach is its reliance on a fixed threshold to differentiate between high and low-confidence detections. In response to this limitation, we introduce a novel and adaptive approach. Our proposed method entails a dynamic adjustment of the confidence threshold, leveraging insights derived from overall detections. Through experimentation, we demonstrate the effectiveness of our adaptive confidence threshold technique while maintaining running time compared to ByteTrack.

LGOct 15, 2025
SWIR-LightFusion: Multi-spectral Semantic Fusion of Synthetic SWIR with Thermal IR (LWIR/MWIR) and RGB

Muhammad Ishfaq Hussain, Ma Van Linh, Zubia Naz et al.

Enhancing scene understanding in adverse visibility conditions remains a critical challenge for surveillance and autonomous navigation systems. Conventional imaging modalities, such as RGB and thermal infrared (MWIR / LWIR), when fused, often struggle to deliver comprehensive scene information, particularly under conditions of atmospheric interference or inadequate illumination. To address these limitations, Short-Wave Infrared (SWIR) imaging has emerged as a promising modality due to its ability to penetrate atmospheric disturbances and differentiate materials with improved clarity. However, the advancement and widespread implementation of SWIR-based systems face significant hurdles, primarily due to the scarcity of publicly accessible SWIR datasets. In response to this challenge, our research introduces an approach to synthetically generate SWIR-like structural/contrast cues (without claiming spectral reproduction) images from existing LWIR data using advanced contrast enhancement techniques. We then propose a multimodal fusion framework integrating synthetic SWIR, LWIR, and RGB modalities, employing an optimized encoder-decoder neural network architecture with modality-specific encoders and a softmax-gated fusion head. Comprehensive experiments on public RGB-LWIR benchmarks (M3FD, TNO, CAMEL, MSRS, RoadScene) and an additional private real RGB-MWIR-SWIR dataset demonstrate that our synthetic-SWIR-enhanced fusion framework improves fused-image quality (contrast, edge definition, structural fidelity) while maintaining real-time performance. We also add fair trimodal baselines (LP, LatLRR, GFF) and cascaded trimodal variants of U2Fusion/SwinFusion under a unified protocol. The outcomes highlight substantial potential for real-world applications in surveillance and autonomous systems.