29.6CVJun 3
Biomazon: A Multimodal Dataset for 3D Forest Structure and Biomass Modeling in the Amazon BasinSayan Mandal, Rocco Sedona, Simon Besnard et al.
Accurate, spatially explicit characterization of tropical forest structure is essential for carbon accounting and ecosystem monitoring, yet most ML pipelines predict canopy-top height proxies (e.g., RH95/RH98) or AGBD as separate scalar targets, rather than learning the forest vertical structure as an ordered profile. The community lacks a ML-ready multimodal benchmark for predicting the entire GEDI RH profile jointly with AGBD, or for evaluating methods that enforce physically consistent ordering across RH percentiles. We address this with Biomazon, a 20 m multimodal benchmark dataset over the Amazon Basin that pairs GEDI RH and AGBD targets with multi-sensor predictors (Sentinel-1/2, ALOS-2 PALSAR-2, Copernicus DEM, Dynamic World LULC, and AlphaEarth embeddings) under standardized spatial splits and evaluation protocols. Using a shared encoder-decoder with task-specific heads as a baseline framework, we conduct a comprehensive ablation study of (i) backbone/model scale, (ii) modality contributions, and (iii) the use of auxiliary embeddings under standalone and fusion settings, and we report both single-target and joint-target results to quantify tradeoffs under a unified training protocol. Finally, we contextualize baseline performance through regionally aligned comparisons against existing gridded products, including GEDI L4D RH10-RH98 and AGBD, at matching temporal scale. Biomazon, together with the accompanying protocols and baseline results, establishes a reference benchmark for future work on structurally consistent RH-profile prediction and structure-biomass modeling in tropical forests.
DCFeb 15
AEG: A Baremetal Framework for AI Acceleration via Direct Hardware Access in Heterogeneous AcceleratorsHua Jiang, Sayan Mandal, Brandon Kirincich et al.
This paper introduces a unified, hardware-independent baremetal runtime architecture designed to enable high-performance machine learning (ML) inference on heterogeneous accelerators, such as AI Engine (AIE) arrays, without the overhead of an underlying real-time or general-purpose operating system. Existing edge-deployment frameworks, such as TinyML, often rely on real-time operating systems (RTOS), which introduce unnecessary complexity and performance bottlenecks. To address this, our solution fundamentally decouples the runtime from hardware specifics by flattening complex control logic into linear, executable Runtime Control Blocks (RCBs). This "Control as Data" paradigm allows high-level models, including Adaptive Data Flow (ADF) graphs, to be executed by a generic engine through a minimal Runtime Hardware Abstraction Layer (RHAL). We further integrate Runtime Platform Management (RTPM) to handle system-level orchestration (including a lightweight network stack) and a Runtime In-Memory File System (RIMFS) to manage data in OS-free environments. We demonstrate the framework's efficacy with a ResNet-18 image classification implementation. Experimental results show 9.2$\times$ higher compute efficiency (throughput per AIE tile) compared to Linux-based Vitis AI deployment, 3--7$\times$ reduction in data movement overhead, and near-zero latency variance (CV~$=0.03\%$). The system achieves 68.78\% Top-1 accuracy on ImageNet using only 28 AIE tiles compared to Vitis AI's 304 tiles, validating both the efficiency and correctness of this unified bare-metal architecture.
SEOct 11, 2025
Grounded AI for Code Review: Resource-Efficient Large-Model Serving in Enterprise PipelinesSayan Mandal, Hua Jiang
Automated code review adoption lags in compliance-heavy settings, where static analyzers produce high-volume, low-rationale outputs, and naive LLM use risks hallucination and incurring cost overhead. We present a production system for grounded, PR-native review that pairs static-analysis findings with AST-guided context extraction and a single-GPU, on-demand serving stack (quantized open-weight model, multi-tier caching) to deliver concise explanations and remediation guidance. Evaluated on safety-oriented C/C++ standards, the approach achieves sub-minute median first-feedback (offline p50 build+LLM 59.8s) while maintaining competitive violation reduction and lower violation rates versus larger proprietary models. The architecture is decoupled: teams can adopt the grounding/prompting layer or the serving layer independently. A small internal survey (n=8) provides directional signals of reduced triage effort and moderate perceived grounding, with participants reporting fewer human review iterations. We outline operational lessons and limitations, emphasizing reproducibility, auditability, and pathways to broader standards and assisted patching.
CVOct 11, 2025
SAM2LoRA: Composite Loss-Guided, Parameter-Efficient Finetuning of SAM2 for Retinal Fundus SegmentationSayan Mandal, Divyadarshini Karthikeyan, Manas Paldhe
We propose SAM2LoRA, a parameter-efficient fine-tuning strategy that adapts the Segment Anything Model 2 (SAM2) for fundus image segmentation. SAM2 employs a masked autoencoder-pretrained Hierarchical Vision Transformer for multi-scale feature decoding, enabling rapid inference in low-resource settings; however, fine-tuning remains challenging. To address this, SAM2LoRA integrates a low-rank adapter into both the image encoder and mask decoder, requiring fewer than 5\% of the original trainable parameters. Our analysis indicates that for cross-dataset fundus segmentation tasks, a composite loss function combining segmentationBCE, SoftDice, and FocalTversky losses is essential for optimal network tuning. Evaluated on 11 challenging fundus segmentation datasets, SAM2LoRA demonstrates high performance in both blood vessel and optic disc segmentation under cross-dataset training conditions. It achieves Dice scores of up to 0.86 and 0.93 for blood vessel and optic disc segmentation, respectively, and AUC values of up to 0.98 and 0.99, achieving state-of-the-art performance while substantially reducing training overhead.
CVJun 9, 2024
Deep Learning to Predict Glaucoma Progression using Structural Changes in the EyeSayan Mandal
Glaucoma is a chronic eye disease characterized by optic neuropathy, leading to irreversible vision loss. It progresses gradually, often remaining undiagnosed until advanced stages. Early detection is crucial to monitor atrophy and develop treatment strategies to prevent further vision impairment. Data-centric methods have enabled computer-aided algorithms for precise glaucoma diagnosis. In this study, we use deep learning models to identify complex disease traits and progression criteria, detecting subtle changes indicative of glaucoma. We explore the structure-function relationship in glaucoma progression and predict functional impairment from structural eye deterioration. We analyze statistical and machine learning methods, including deep learning techniques with optical coherence tomography (OCT) scans for accurate progression prediction. Addressing challenges like age variability, data imbalances, and noisy labels, we develop novel semi-supervised time-series algorithms: 1. Weakly-Supervised Time-Series Learning: We create a CNN-LSTM model to encode spatiotemporal features from OCT scans. This approach uses age-related progression and positive-unlabeled data to establish robust pseudo-progression criteria, bypassing gold-standard labels. 2. Semi-Supervised Time-Series Learning: Using labels from Guided Progression Analysis (GPA) in a contrastive learning scheme, the CNN-LSTM architecture learns from potentially mislabeled data to improve prediction accuracy. Our methods outperform conventional and state-of-the-art techniques.
IVOct 4, 2021
Assessing glaucoma in retinal fundus photographs using Deep Feature Consistent Variational AutoencodersSayan Mandal, Alessandro A. Jammal, Felipe A. Medeiros
One of the leading causes of blindness is glaucoma, which is challenging to detect since it remains asymptomatic until the symptoms are severe. Thus, diagnosis is usually possible until the markers are easy to identify, i.e., the damage has already occurred. Early identification of glaucoma is generally made based on functional, structural, and clinical assessments. However, due to the nature of the disease, researchers still debate which markers qualify as a consistent glaucoma metric. Deep learning methods have partially solved this dilemma by bypassing the marker identification stage and analyzing high-level information directly to classify the data. Although favorable, these methods make expert analysis difficult as they provide no insight into the model discrimination process. In this paper, we overcome this using deep generative networks, a deep learning model that learns complicated, high-dimensional probability distributions. We train a Deep Feature consistent Variational Autoencoder (DFC-VAE) to reconstruct optic disc images. We show that a small-sized latent space obtained from the DFC-VAE can learn the high-dimensional glaucoma data distribution and provide discriminatory evidence between normal and glaucoma eyes. Latent representations of size as low as 128 from our model got a 0.885 area under the receiver operating characteristic curve when trained with Support Vector Classifier.
ASSep 21, 2020
End-to-End Bengali Speech RecognitionSayan Mandal, Sarthak Yadav, Atul Rai
Bengali is a prominent language of the Indian subcontinent. However, while many state-of-the-art acoustic models exist for prominent languages spoken in the region, research and resources for Bengali are few and far between. In this work, we apply CTC based CNN-RNN networks, a prominent deep learning based end-to-end automatic speech recognition technique, to the Bengali ASR task. We also propose and evaluate the applicability and efficacy of small 7x3 and 3x3 convolution kernels which are prominently used in the computer vision domain primarily because of their FLOPs and parameter efficient nature. We propose two CNN blocks, 2-layer Block A and 4-layer Block B, with the first layer comprising of 7x3 kernel and the subsequent layers comprising solely of 3x3 kernels. Using the publicly available Large Bengali ASR Training data set, we benchmark and evaluate the performance of seven deep neural network configurations of varying complexities and depth on the Bengali ASR task. Our best model, with Block B, has a WER of 13.67, having an absolute reduction of 1.39% over comparable model with larger convolution kernels of size 41x11 and 21x11.