Sayeed Shafayet Chowdhury

h-index14

9papers

243citations

Novelty49%

AI Score45

Ranked #43,668 of 194,257 authors (top 22%)#15,339 in CV (top 26%)

9 Papers

7.0CVMay 29

Detect Before You Leap: Mirage Detection in Vision-Language Models

Sayeed Shafayet Chowdhury, Md. Shaown Miah

Vision-language models (VLMs) can produce confident visual answers even when the required visual evidence is missing, blank, or unrelated to the question. This failure mode, known as mirage (Asadi et al. 2026), is especially concerning in medical and document visual question answering, where plausible but visually ungrounded responses may be mistaken for image-based evidence. We study pre-release mirage detection: given an image-question pair, the goal is to determine whether a VLM should answer or abstain before producing a response. We propose Text-Conditioned Layer-wise Internal Alignment (TC-LIA), a model-agnostic method that probes patch-token representations across the layers of a CLIP ViT-H/14 vision encoder. TC-LIA projects layer-wise image patch tokens into the final CLIP embedding space and measures their similarity to the question embedding, allowing the method to track whether question-relevant visual evidence emerges across vision layers. The resulting alignment trajectory is summarized using final image-text cosine similarity, late-layer top-k patch-text alignment, early-to-late gain, and layer-wise slope. These features are combined with pixel-statistic blank/noise detection, zero-shot domain routing, and structured VLM self-assessment in an ensemble. Across five VQA domains, three input conditions, and twelve VLM backbones, the best systems achieve approximately 94.6-94.7% three-class detection accuracy with mirage rates below 3%, while baseline mirage rates range from 21.7% to 66.6%.

7.1LGDec 1, 2025

2D-ThermAl: Physics-Informed Framework for Thermal Analysis of Circuits using Generative AI

Soumyadeep Chandra, Sayeed Shafayet Chowdhury, Kaushik Roy

Thermal analysis is increasingly critical in modern integrated circuits, where non-uniform power dissipation and high transistor densities can cause rapid temperature spikes and reliability concerns. Traditional methods, such as FEM-based simulations offer high accuracy but computationally prohibitive for early-stage design, often requiring multiple iterative redesign cycles to resolve late-stage thermal failures. To address these challenges, we propose 'ThermAl', a physics-informed generative AI framework which effectively identifies heat sources and estimates full-chip transient and steady-state thermal distributions directly from input activity profiles. ThermAl employs a hybrid U-Net architecture enhanced with positional encoding and a Boltzmann regularizer to maintain physical fidelity. Our model is trained on an extensive dataset of heat dissipation maps, ranging from simple logic gates (e.g., inverters, NAND, XOR) to complex designs, generated via COMSOL. Experimental results demonstrate that ThermAl delivers precise temperature mappings for large circuits, with a root mean squared error (RMSE) of only 0.71°C, and outperforms conventional FEM tools by running up to ~200 times faster. We analyze performance across diverse layouts and workloads, and discuss its applicability to large-scale EDA workflows. While thermal reliability assessments often extend beyond 85°C for post-layout signoff, our focus here is on early-stage hotspot detection and thermal pattern learning. To ensure generalization beyond the nominal operating range 25-55°C, we additionally performed cross-validation on an extended dataset spanning 25-95°C maintaining a high accuracy (<2.2% full-scale RMSE) even under elevated temperature conditions representative of peak power and stress scenarios.

7.6CVMay 4, 2024

ViTALS: Vision Transformer for Action Localization in Surgical Nephrectomy

Soumyadeep Chandra, Sayeed Shafayet Chowdhury, Courtney Yong et al.

Surgical action localization is a challenging computer vision problem. While it has promising applications including automated training of surgery procedures, surgical workflow optimization, etc., appropriate model design is pivotal to accomplishing this task. Moreover, the lack of suitable medical datasets adds an additional layer of complexity. To that effect, we introduce a new complex dataset of nephrectomy surgeries called UroSlice. To perform the action localization from these videos, we propose a novel model termed as `ViTALS' (Vision Transformer for Action Localization in Surgical Nephrectomy). Our model incorporates hierarchical dilated temporal convolution layers and inter-layer residual connections to capture the temporal correlations at finer as well as coarser granularities. The proposed approach achieves state-of-the-art performance on Cholec80 and UroSlice datasets (89.8% and 66.1% accuracy, respectively), validating its effectiveness.

2.0CVJan 30, 2024

Towards Visual Syntactical Understanding

Sayeed Shafayet Chowdhury, Soumyadeep Chandra, Kaushik Roy

Syntax is usually studied in the realm of linguistics and refers to the arrangement of words in a sentence. Similarly, an image can be considered as a visual 'sentence', with the semantic parts of the image acting as 'words'. While visual syntactic understanding occurs naturally to humans, it is interesting to explore whether deep neural networks (DNNs) are equipped with such reasoning. To that end, we alter the syntax of natural images (e.g. swapping the eye and nose of a face), referred to as 'incorrect' images, to investigate the sensitivity of DNNs to such syntactic anomaly. Through our experiments, we discover an intriguing property of DNNs where we observe that state-of-the-art convolutional neural networks, as well as vision transformers, fail to discriminate between syntactically correct and incorrect images when trained on only correct ones. To counter this issue and enable visual syntactic understanding with DNNs, we propose a three-stage framework- (i) the 'words' (or the sub-features) in the image are detected, (ii) the detected words are sequentially masked and reconstructed using an autoencoder, (iii) the original and reconstructed parts are compared at each location to determine syntactic correctness. The reconstruction module is trained with BERT-like masked autoencoding for images, with the motivation to leverage language model inspired training to better capture the syntax. Note, our proposed approach is unsupervised in the sense that the incorrect images are only used during testing and the correct versus incorrect labels are never used for training. We perform experiments on CelebA, and AFHQ datasets and obtain classification accuracy of 92.10%, and 90.89%, respectively. Notably, the approach generalizes well to ImageNet samples which share common classes with CelebA and AFHQ without explicitly training on them.

21.1CLMay 24, 2023

Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model

Yinghan Long, Sayeed Shafayet Chowdhury, Kaushik Roy

Transformers have shown dominant performance across a range of domains including language and vision. However, their computational cost grows quadratically with the sequence length, making their usage prohibitive for resource-constrained applications. To counter this, our approach is to divide the whole sequence into segments and apply attention to the individual segments. We propose a segmented recurrent transformer (SRformer) that combines segmented (local) attention with recurrent attention. The loss caused by reducing the attention window length is compensated by aggregating information across segments with recurrent attention. SRformer leverages Recurrent Accumulate-and-Fire (RAF) neurons' inherent memory to update the cumulative product of keys and values. The segmented attention and lightweight RAF neurons ensure the efficiency of the proposed transformer. Such an approach leads to models with sequential processing capability at a lower computation/memory cost. We apply the proposed method to T5 and BART transformers. The modified models are tested on summarization datasets including CNN-dailymail, XSUM, ArXiv, and MediaSUM. Notably, using segmented inputs of varied sizes, the proposed model achieves $6-22\%$ higher ROUGE1 scores than a segmented transformer and outperforms other recurrent transformer approaches. Furthermore, compared to full attention, the proposed model reduces the computational complexity of cross attention by around $40\%$.

1.2SYOct 2, 2021

Implementation of MPPT Technique of Solar Module with Supervised Machine Learning

Ruhi Sharmin, Sayeed Shafayet Chowdhury, Farihal Abedin et al.

In this paper, we proposed a method using supervised ML in solar PV system for MPPT analysis. For this purpose, an overall schematic diagram of a PV system is designed and simulated to create a dataset in MATLAB/ Simulink. Thus, by analyzing the output characteristics of a solar cell, an improved MPPT algorithm on the basis of neural network (NN) method is put forward to track the maximum power point (MPP) of solar cell modules. To perform the task, Bayesian Regularization method was chosen as the training algorithm as it works best even for smaller data supporting the wide range of the train data set. The theoretical results show that the improved NN MPPT algorithm has higher efficiency compared with the Perturb and Observe method in the same environment, and the PV system can keep working at MPP without oscillation and probability of any kind of misjudgment. So it can not only reduce misjudgment, but also avoid power loss around the MPP. Moreover, we implemented the algorithm in a hardware set-up and verified the theoretical result comparing it with the empirical data.

3.7IVJul 17, 2020

Anomaly Detection in Unsupervised Surveillance Setting Using Ensemble of Multimodal Data with Adversarial Defense

Sayeed Shafayet Chowdhury, Kaji Mejbaul Islam, Rouhan Noor

Autonomous aerial surveillance using drone feed is an interesting and challenging research domain. To ensure safety from intruders and potential objects posing threats to the zone being protected, it is crucial to be able to distinguish between normal and abnormal states in real-time. Additionally, we also need to consider any device malfunction. However, the inherent uncertainty embedded within the type and level of abnormality makes supervised techniques less suitable since the adversary may present a unique anomaly for intrusion. As a result, an unsupervised method for anomaly detection is preferable taking the unpredictable nature of attacks into account. Again in our case, the autonomous drone provides heterogeneous data streams consisting of images and other analog or digital sensor data, all of which can play a role in anomaly detection if they are ensembled synergistically. To that end, an ensemble detection mechanism is proposed here which estimates the degree of abnormality of analyzing the real-time image and IMU (Inertial Measurement Unit) sensor data in an unsupervised manner. First, we have implemented a Convolutional Neural Network (CNN) regression block, named AngleNet to estimate the angle between a reference image and current test image, which provides us with a measure of the anomaly of the device. Moreover, the IMU data are used in autoencoders to predict abnormality. Finally, the results from these two pipelines are ensembled to estimate the final degree of abnormality. Furthermore, we have applied adversarial attack to test the robustness and security of the proposed approach and integrated defense mechanism. The proposed method performs satisfactorily on the IEEE SP Cup-2020 dataset with an accuracy of 97.8%. Additionally, we have also tested this approach on an in-house dataset to validate its robustness.

17.5NEJun 15, 2020

Towards Understanding the Effect of Leak in Spiking Neural Networks

Sayeed Shafayet Chowdhury, Chankyu Lee, Kaushik Roy

Spiking Neural Networks (SNNs) are being explored to emulate the astounding capabilities of human brain that can learn and compute functions robustly and efficiently with noisy spiking activities. A variety of spiking neuron models have been proposed to resemble biological neuronal functionalities. With varying levels of bio-fidelity, these models often contain a leak path in their internal states, called membrane potentials. While the leaky models have been argued as more bioplausible, a comparative analysis between models with and without leak from a purely computational point of view demands attention. In this paper, we investigate the questions regarding the justification of leak and the pros and cons of using leaky behavior. Our experimental results reveal that leaky neuron model provides improved robustness and better generalization compared to models with no leak. However, leak decreases the sparsity of computation contrary to the common notion. Through a frequency domain analysis, we demonstrate the effect of leak in eliminating the high-frequency components from the input, thus enabling SNNs to be more robust against noisy spike-inputs.

2.3CVJun 5, 2020

Unsupervised Abnormality Detection Using Heterogeneous Autonomous Systems

Sayeed Shafayet Chowdhury, Kazi Mejbaul Islam, Rouhan Noor

Anomaly detection (AD) in a surveillance scenario is an emerging and challenging field of research. For autonomous vehicles like drones or cars, it is immensely important to distinguish between normal and abnormal states in real-time. Additionally, we also need to detect any device malfunction. But the nature and degree of abnormality may vary depending upon the actual environment and adversary. As a result, it is impractical to model all cases a-priori and use supervised methods to classify. Also, an autonomous vehicle provides various data types like images and other analog or digital sensor data, all of which can be useful in anomaly detection if leveraged fruitfully. To that effect, in this paper, a heterogeneous system is proposed which estimates the degree of abnormality of an unmanned surveillance drone, analyzing real-time image and IMU (Inertial Measurement Unit) sensor data in an unsupervised manner. Here, we have demonstrated a Convolutional Neural Network (CNN) architecture, named AngleNet to estimate the angle between a normal image and another image under consideration, which provides us with a measure of anomaly of the device. Moreover, the IMU data are used in autoencoder to predict abnormality. Finally, the results from these two algorithms are ensembled to estimate the final degree of abnormality. The proposed method performs satisfactorily on the IEEE SP Cup-2020 dataset with an accuracy of 97.3%. Additionally, we have also tested this approach on an in-house dataset to validate its robustness.