AKM Mahbubur Rahman

h-index3

7papers

28citations

Novelty42%

AI Score41

Ranked #68,906 of 194,257 authors (top 35%)#23,457 in CV (top 40%)

7 Papers

12.8CLJul 6Code

BaFCo: A Document Understanding Benchmark for Complex Bangla Form Comprehension

Abu Tyeb Azad, Ishita Sur Apan, Fahim Ahmed et al.

Document comprehension is a challenging yet impactful task for Multimodal Large Language Models, especially as these systems see growing adoption in real-world, human-centric applications. However, this adoption is limited for low-resource languages such as Bangla due to the scarcity of high-quality annotated data. To address this gap, we introduce BaFCo, a benchmark dataset for Bangla form comprehension with a focus on Document Layout Analysis (DLA) and Key Information Extraction (KIE). BaFCo curates 200 multi-page complex Bangladeshi government forms, sourced from across diverse sectors including agriculture, education, banking, and land management. To accurately capture the structural and contextual complexity of these forms, we define a fine-grained annotation schema comprising 26 types of form entities, along with a separate coarse form entity set consisting of 5 types. We evaluate the latest MLLMs from the ChatGPT, Gemini, Claude, Qwen, and Kimi series using zero-shot and chain-of-thought prompts under both low and high reasoning setups. Our results reveal limitations in current MLLMs' ability in comprehending Bangla forms, particularly in accurately localizing highly granular form entities. Our dataset and code is available at: https://huggingface.co/datasets/Mausul/bafco

2.8CVJan 13Code

Route, Retrieve, Reflect, Repair: Self-Improving Agentic Framework for Visual Detection and Linguistic Reasoning in Medical Imaging

Md. Faiyaz Abdullah Sayeedi, Rashedur Rahman, Siam Tahsin Bhuiyan et al.

Medical image analysis increasingly relies on large vision-language models (VLMs), yet most systems remain single-pass black boxes that offer limited control over reasoning, safety, and spatial grounding. We propose R^4, an agentic framework that decomposes medical imaging workflows into four coordinated agents: a Router that configures task- and specialization-aware prompts from the image, patient history, and metadata; a Retriever that uses exemplar memory and pass@k sampling to jointly generate free-text reports and bounding boxes; a Reflector that critiques each draft-box pair for key clinical error modes (negation, laterality, unsupported claims, contradictions, missing findings, and localization errors); and a Repairer that iteratively revises both narrative and spatial outputs under targeted constraints while curating high-quality exemplars for future cases. Instantiated on chest X-ray analysis with multiple modern VLM backbones and evaluated on report generation and weakly supervised detection, R^4 consistently boosts LLM-as-a-Judge scores by roughly +1.7-+2.5 points and mAP50 by +2.5-+3.5 absolute points over strong single-VLM baselines, without any gradient-based fine-tuning. These results show that agentic routing, reflection, and repair can turn strong but brittle VLMs into more reliable and better grounded tools for clinical image interpretation. Our code can be found at: https://github.com/faiyazabdullah/MultimodalMedAgent

5.8HEP-PHMay 3

E-PCN: Jet Tagging with Explainable Particle Chebyshev Networks Using Kinematic Features

Md Raqibul Islam, Adrita Khan, Mir Sazzat Hossain et al.

The identification and classification of collimated particle sprays, or jets, are essential for interpreting data from high-energy collider experiments. While deep learning has improved jet classification, it often lacks interpretability. We introduce the Explainable Particle Chebyshev Network (E-PCN), a graph neural network extending the Particle Chebyshev Network (PCN). E-PCN integrates kinematic variables into jet classification by constructing four graph representations per jet, each weighted by a distinct variable: angular separation ($Δ$), transverse momentum ($k_T$), momentum fraction ($z$), and invariant mass squared ($m^2$). We use the concept of Gradient-weighted Class Activation Mapping (Grad-CAM) to determine which kinematic variables dominate classification outcomes. Analysis reveals that angular separation and transverse momentum collectively account for approximately 76% of classification decisions (40.72% and 35.67%, respectively), with momentum fraction and invariant mass contributing the remaining 24%. Evaluated on the JetClass dataset with 10 signal classes, E-PCN achieves a macro-accuracy of 94.67%, macro-AUC of 96.78%, and macro-AUPR of 86.79%, representing improvements of 2.36%, 4.13%, and 24.88% respectively over the baseline PCN implementation, while demonstrating physically interpretable feature learning.

2.0CVJun 9, 2024

BD-SAT: High-resolution Land Use Land Cover Dataset & Benchmark Results for Developing Division: Dhaka, BD

Ovi Paul, Abu Bakar Siddik Nayem, Anis Sarker et al.

Land Use Land Cover (LULC) analysis on satellite images using deep learning-based methods is significantly helpful in understanding the geography, socio-economic conditions, poverty levels, and urban sprawl in developing countries. Recent works involve segmentation with LULC classes such as farmland, built-up areas, forests, meadows, water bodies, etc. Training deep learning methods on satellite images requires large sets of images annotated with LULC classes. However, annotated data for developing countries are scarce due to a lack of funding, absence of dedicated residential/industrial/economic zones, a large population, and diverse building materials. BD-SAT provides a high-resolution dataset that includes pixel-by-pixel LULC annotations for Dhaka metropolitan city and surrounding rural/urban areas. Using a strict and standardized procedure, the ground truth is created using Bing satellite imagery with a ground spatial distance of 2.22 meters per pixel. A three-stage, well-defined annotation process has been followed with support from GIS experts to ensure the reliability of the annotations. We performed several experiments to establish benchmark results. The results show that the annotated BD-SAT is sufficient to train large deep learning models with adequate accuracy for five major LULC classes: forest, farmland, built-up areas, water bodies, and meadows.

1.6LGAug 29, 2021

Deep Dive into Semi-Supervised ELBO for Improving Classification Performance

Fahim Faisal Niloy, M. Ashraful Amin, AKM Mahbubur Rahman et al.

Decomposition of the evidence lower bound (ELBO) objective of VAE used for density estimation revealed the deficiency of VAE for representation learning and suggested ways to improve the model. In this paper, we investigate whether we can get similar insights by decomposing the ELBO for semi-supervised classification using VAE model. Specifically, we show that mutual information between input and class labels decreases during maximization of ELBO objective. We propose a method to address this issue. We also enforce cluster assumption to aid in classification. Experiments on a diverse datasets verify that our method can be used to improve the classification performance of existing VAE based semi-supervised models. Experiments also show that, this can be achieved without sacrificing the generative power of the model.

4.7CVJun 24, 2021

Attention Toward Neighbors: A Context Aware Framework for High Resolution Image Segmentation

Fahim Faisal Niloy, M. Ashraful Amin, Amin Ahsan Ali et al.

High-resolution image segmentation remains challenging and error-prone due to the enormous size of intermediate feature maps. Conventional methods avoid this problem by using patch based approaches where each patch is segmented independently. However, independent patch segmentation induces errors, particularly at the patch boundary due to the lack of contextual information in very high-resolution images where the patch size is much smaller compared to the full image. To overcome these limitations, in this paper, we propose a novel framework to segment a particular patch by incorporating contextual information from its neighboring patches. This allows the segmentation network to see the target patch with a wider field of view without the need of larger feature maps. Comparative analysis from a number of experiments shows that our proposed framework is able to segment high resolution images with significantly improved mean Intersection over Union and overall accuracy.

1.2CVNov 25, 2020

Deep-learning coupled with novel classification method to classify the urban environment of the developing world

Qianwei Cheng, AKM Mahbubur Rahman, Anis Sarker et al.

Rapid globalization and the interdependence of humanity that engender tremendous in-flow of human migration towards the urban spaces. With advent of high definition satellite images, high resolution data, computational methods such as deep neural network, capable hardware; urban planning is seeing a paradigm shift. Legacy data on urban environments are now being complemented with high-volume, high-frequency data. In this paper we propose a novel classification method that is readily usable for machine analysis and show applicability of the methodology on a developing world setting. The state-of-the-art is mostly dominated by classification of building structures, building types etc. and largely represents the developed world which are insufficient for developing countries such as Bangladesh where the surrounding is crucial for the classification. Moreover, the traditional methods propose small-scale classifications, which give limited information with poor scalability and are slow to compute. We categorize the urban area in terms of informal and formal spaces taking the surroundings into account. 50 km x 50 km Google Earth image of Dhaka, Bangladesh was visually annotated and categorized by an expert. The classification is based broadly on two dimensions: urbanization and the architectural form of urban environment. Consequently, the urban space is divided into four classes: 1) highly informal; 2) moderately informal; 3) moderately formal; and 4) highly formal areas. In total 16 sub-classes were identified. For semantic segmentation, Google's DeeplabV3+ model was used which increases the field of view of the filters to incorporate larger context. Image encompassing 70% of the urban space was used for training and the remaining 30% was used for testing and validation. The model is able to segment with 75% accuracy and 60% Mean IoU.