Adnan Iftekhar

h-index: 11

Papers: 12

Citations: 388

Novelty: 52%

AI Score: 40

Ranked #76,036 of 194,257 authors (top 39%); #25,796 in CV (top 44%)

12 Papers

20.1 · CV · Apr 2, 2022

What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions

A S M Iftekhar, Hao Chen, Kaustav Kundu et al.

We propose a novel one-stage Transformer-based model, the Semantic and Spatial Refined Transformer (SSRT), to solve the Human-Object Interaction (HOI) detection task, which requires localizing humans and objects and predicting their interactions. Unlike previous Transformer-based HOI approaches, which mostly focus on improving the design of the decoder outputs for the final detection, SSRT introduces two new modules that help select the most relevant object-action pairs within an image and refine the queries' representation using rich semantic and spatial features. These enhancements lead to state-of-the-art results on the two most popular HOI benchmarks: V-COCO and HICO-DET.

5.0 · CV · Jan 18, 2023

DDS: Decoupled Dynamic Scene-Graph Generation Network

A S M Iftekhar, Raphael Ruschel, Satish Kumar et al.

Scene-graph generation involves creating a structural representation of the relationships between objects in a scene by predicting subject-object-relation triplets from input data. Existing methods show poor performance in detecting triplets outside of a predefined set, primarily due to their reliance on dependent feature learning. To address this issue, we propose DDS -- a decoupled dynamic scene-graph generation network -- that consists of two independent branches that can disentangle extracted features. The key innovation of the current paper is the decoupling of the features representing the relationships from those of the objects, which enables the detection of novel object-relationship combinations. The DDS model is evaluated on three datasets and outperforms previous methods by a significant margin, especially in detecting previously unseen triplets.

8.1 · CV · Jun 1, 2022

Context-Driven Detection of Invertebrate Species in Deep-Sea Video

R. Austin McEver, Bowen Zhang, Connor Levenson et al.

Each year, underwater remotely operated vehicles (ROVs) collect thousands of hours of video of unexplored ocean habitats, revealing a plethora of information regarding biodiversity on Earth. However, fully utilizing this information remains a challenge, as proper annotation and analysis require trained scientists' time, which is both limited and costly. To this end, we present a Dataset for Underwater Substrate and Invertebrate Analysis (DUSIA), a benchmark suite and growing large-scale dataset to train, validate, and test methods for temporally localizing four underwater substrates as well as temporally and spatially localizing 59 underwater invertebrate species. DUSIA currently includes over ten hours of footage across 25 videos captured in 1080p at 30 fps by an ROV following pre-planned transects across the ocean floor near the Channel Islands of California. Each video includes annotations indicating the start and end times of substrates across the video in addition to counts of species of interest. Some frames are annotated with precise bounding box locations for invertebrate species of interest, as seen in Figure 1. To our knowledge, DUSIA is the first dataset of its kind for deep sea exploration, with video from a moving camera, that includes substrate annotations and invertebrate species that are present at significant depths where sunlight does not penetrate. Additionally, we present the novel context-driven object detector (CDD), in which explicit substrate classification influences an object detection network so that it simultaneously predicts a substrate and a species class influenced by that substrate. We also present a method for improving training on partially annotated bounding box frames. Finally, we offer a baseline method for automating the counting of invertebrate species of interest.

3.8 · LG · Oct 16, 2023 · Code

BLoad: Enhancing Neural Network Training with Efficient Sequential Data Handling

Raphael Ruschel, A. S. M. Iftekhar, B. S. Manjunath et al.

The increasing complexity of modern deep neural network models and the expanding sizes of datasets necessitate the development of optimized and scalable training methods. In this white paper, we address the challenge of efficiently training neural network models on sequences of varying sizes. We propose a novel training scheme that enables efficient distributed data-parallel training on sequences of different sizes with minimal overhead. Using this scheme, we were able to reduce the amount of padding by more than 100x without deleting a single frame, improving both training time and recall in our experiments.

13.1 · CV · May 22, 2025 · Code

DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos

Zijia Lu, A S M Iftekhar, Gaurav Mittal et al.

Long Video Temporal Grounding (LVTG) aims at identifying specific moments within lengthy videos based on user-provided text queries for effective content retrieval. The approach taken by existing methods, dividing a video into clips and processing each clip via a full-scale expert encoder, is challenging to scale due to the prohibitive computational cost of processing a large number of clips in long videos. To address this issue, we introduce DeCafNet, an approach employing a "delegate-and-conquer" strategy to achieve computational efficiency without sacrificing grounding performance. DeCafNet introduces a sidekick encoder that performs dense feature extraction over all video clips in a resource-efficient manner, while generating a saliency map to identify the most relevant clips for full processing by the expert encoder. To effectively leverage features from the sidekick and expert encoders, which exist at different temporal resolutions, we introduce DeCaf-Grounder, which unifies and refines them via query-aware temporal aggregation and multi-scale temporal refinement for accurate grounding. Experiments on two LVTG benchmark datasets demonstrate that DeCafNet reduces computation by up to 47% while still outperforming existing methods, establishing a new state of the art for LVTG in terms of both efficiency and performance. Our code is available at https://github.com/ZijiaLewisLu/CVPR2025-DeCafNet.

5.8 · CV · Nov 18, 2020 · Code

StressNet: Detecting Stress in Thermal Videos

Satish Kumar, A S M Iftekhar, Michael Goebel et al.

Precise measurement of physiological signals is critical for the effective monitoring of human vital signs. Recent developments in computer vision have demonstrated that signals such as pulse rate and respiration rate can be extracted from digital video of humans, increasing the possibility of contactless monitoring. This paper presents a novel approach to obtaining physiological signals and classifying stress states from thermal video. The proposed network, "StressNet", features a hybrid emission representation model that models the direct emission and absorption of heat by the skin and underlying blood vessels. This results in an information-rich feature representation of the face, which is used by a spatio-temporal network to reconstruct the ISTI (Initial Systolic Time Interval: a measure of change in cardiac sympathetic activity that is considered to be a quantitative index of stress in humans). The reconstructed ISTI signal is fed into a stress-detection model to detect and classify the individual's stress state (i.e., stress or no stress). A detailed evaluation demonstrates that StressNet estimates the ISTI signal with 95% accuracy and detects stress with an average precision of 0.842. The source code is available on GitHub.

8.4 · CV · Feb 7, 2025

Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment

Minh-Quan Le, Gaurav Mittal, Tianjian Meng et al.

While diffusion models are powerful in generating high-quality, diverse synthetic data for object-centric tasks, existing methods struggle with scene-aware tasks such as Visual Question Answering (VQA) and Human-Object Interaction (HOI) Reasoning, where it is critical to preserve scene attributes in generated images consistent with a multimodal context, i.e., a reference image with an accompanying text guidance query. To address this, we introduce Hummingbird, the first diffusion-based image generator which, given a multimodal context, generates highly diverse images with respect to the reference image while ensuring high fidelity by accurately preserving scene attributes, such as object interactions and spatial relationships, from the text guidance. Hummingbird employs a novel Multimodal Context Evaluator that simultaneously optimizes our formulated Global Semantic and Fine-grained Consistency Rewards to ensure generated images preserve the scene attributes of reference images in relation to the text guidance while maintaining diversity. As the first model to address the task of maintaining both diversity and fidelity given a multimodal context, we introduce a new benchmark formulation incorporating the MME Perception and Bongard HOI datasets. Benchmark experiments show that Hummingbird outperforms all existing methods by achieving superior fidelity while maintaining diversity, validating Hummingbird's potential as a robust multimodal context-aligned image generator in complex visual tasks. Project page: https://roar-ai.github.io/hummingbird

3.8 · CR · Sep 15, 2021

Anti-Tamper Protection for Internet of Things System Using Hyperledger Fabric Blockchain Technology

Adnan Iftekhar, Xiaohui Cui

Automated and industrial Internet of Things (IoT) devices are increasing daily. As the number of IoT devices grows, the volume of data they generate will also grow. Efficiently managing these rapidly expanding IoT devices and their enormous data, so that the data is available to all authorized users without compromising its integrity, will become essential in the near future. At the same time, many information security incidents have been recorded, increasing the need for countermeasures. While safeguards against hostile third parties have been commonplace until now, operators and other parties have seen an increase in demand for detecting and blocking data falsification. Blockchain technology is well known for its privacy, immutability, and decentralized nature. Single-board computers are becoming more powerful and more affordable as IoT platforms, and they are gaining traction in the automation industry. This study focuses on a paradigm of IoT-Blockchain integration where the blockchain node runs autonomously on the IoT platform itself. It enables the system to conduct machine-to-machine transactions without human intervention and to exert direct access control over IoT devices. This paper assumes that readers are familiar with basic Hyperledger Fabric operations and focuses on the practical approach to integration. A basic introduction to blockchain is provided for newcomers.

8.0 · CV · Aug 2, 2021 · Code

GTNet: Guided Transformer Network for Detecting Human-Object Interactions

A S M Iftekhar, Satish Kumar, R. Austin McEver et al.

The human-object interaction (HOI) detection task refers to localizing humans, localizing objects, and predicting the interactions between each human-object pair. HOI detection is considered one of the fundamental steps in truly understanding complex visual scenes. For detecting HOI, it is important to utilize relative spatial configurations and object semantics to find salient spatial regions of images that highlight the interactions between human-object pairs. This issue is addressed by the novel self-attention-based guided transformer network, GTNet. GTNet encodes this spatial contextual information in human and object visual features via self-attention while achieving state-of-the-art results on both the V-COCO and HICO-DET datasets. Code will be made available online.

2.9 · CR · Dec 28, 2020

Implementation of Security Systems for Detection and Prevention of Data Loss/Leakage at Organization via Traffic Inspection

Mir Hassan, Chen Jincai, Adnan Iftekhar et al.

Data Loss/Leakage Prevention (DLP) continues to be a major issue for many large organizations. There are numerous emerging security attack scenarios and a seemingly limitless number of proposed solutions. A major concern of today's enterprises is protecting confidential information, because a leak that compromises confidential data means that sensitive information is in competitors' hands. Different data types need to be protected. However, our research focuses only on data in motion (DIM), i.e., data transferred through the network. This paper presents a recent survey of information and data leakage incidents, which reveals the importance of the problem, and proposes a model solution that combines previous methodologies with a new approach to pattern matching: an advanced content checker based on machine learning that protects data within an organization and takes action accordingly. This paper also proposes a gateway-level DLP deployment design that shows how data moves through intermediate channels before reaching its final destination, using the Squid proxy server and an ICAP server.

27.9 · CV · Mar 11, 2020 · Code

VSGNet: Spatial Attention Network for Detecting Human Object Interactions Using Graph Convolutions

Oytun Ulutan, A S M Iftekhar, B. S. Manjunath

Comprehensive visual understanding requires detection frameworks that can effectively learn and utilize object interactions while analyzing objects individually. This is the main objective of the Human-Object Interaction (HOI) detection task. In particular, relative spatial reasoning and structural connections between objects are essential cues for analyzing interactions, which are addressed by the proposed Visual-Spatial-Graph Network (VSGNet) architecture. VSGNet extracts visual features from the human-object pairs, refines the features with spatial configurations of the pair, and utilizes the structural connections between the pair via graph convolutions. The performance of VSGNet is thoroughly evaluated using the Verbs in COCO (V-COCO) and HICO-DET datasets. Experimental results indicate that VSGNet outperforms state-of-the-art solutions by 8% or 4 mAP on V-COCO and 16% or 3 mAP on HICO-DET.

0.9 · CV · Aug 23, 2017

CNN-Based Prediction of Frame-Level Shot Importance for Video Summarization

Mohaiminul Al Nahian, A. S. M. Iftekhar, Mohammad Tariqul Islam et al.

On the Internet, the ubiquitous presence of redundant, unedited, raw videos has made video summarization an important problem. Traditional methods of video summarization employ a heuristic set of hand-crafted features, which in many cases fail to capture the subtle abstraction of a scene. This paper presents a deep learning method that maps the context of a video to the importance of a scene, similar to that perceived by humans. In particular, a convolutional neural network (CNN)-based architecture is proposed to mimic the frame-level shot importance for user-oriented video summarization. The weights and biases of the CNN are trained extensively through off-line processing, so that it can provide the importance of a frame of an unseen video almost instantaneously. Experiments on estimating the shot importance are carried out using the publicly available database TVSum50. It is shown that the performance of the proposed network is substantially better than that of commonly referenced feature-based methods for estimating the shot importance in terms of mean absolute error, absolute error variance, and relative F-measure.