CVMay 19
Aero-World: Action-Conditioned Aerial Video Generation from Inertial ControlsAbdul Mohaimen Al Radi, Kunyang Li, Yuzhang Shang et al.
Foundation video models produce visually impressive results, but their use in embodied AI remains limited because they are primarily trained on natural language rather than low-level control signals. This limitation is especially pronounced for aerial flight, where motion occurs in unconstrained 6-DoF space and small errors in ego-motion can produce large trajectory drift. Generating aerial videos that follow fine-grained inertial actions can support scalable training and evaluation of aerial agents by providing a controllable proxy for real-world or expensive simulation data. To address this problem, we propose \textbf{Aero-World}, a method for converting a pretrained image-to-video diffusion model into a controllable aerial video generator. Aero-World injects sequences of translational acceleration and angular velocity into a pretrained latent diffusion transformer through an action-token stream. A frozen latent-space Physics Probe, trained independently on real video--IMU pairs, provides differentiable inertial-consistency supervision during LoRA finetuning while avoiding computationally expensive video decoding. We further propose \textbf{AeroBench}, a benchmark for evaluating whether generated drone videos adhere to low-level action signals. AeroBench uses Action Alignment Score (AAS) to measure agreement with commanded inertial actions and Physical Consistency Rate (PCR) to measure temporal motion stability. On AeroBench, Aero-World improves mean AAS from 57.7 to 63.6 over action-only finetuning and gives a stronger quality-control trade-off than AirScape, with lower FVD (596.5 vs. 1058.6), higher SSIM (0.595 vs. 0.505), and higher Flow-IMU correlation (0.44 vs. 0.20). These results suggest that frozen Physics Probe supervision is a practical mechanism for adapting pretrained video generators toward more action-aligned aerial motion.
CVMay 9
CATS: Curvature Aware Temporal Selection for efficient long video understandingMehrajul Abadin Miraj, Abdul Mohaimen Al Radi, Shariful Islam Rayhan et al.
Understanding long videos with multimodal large language models (MLLMs) requires selecting a small subset of informative frames under strict computational budgets, where exhaustive processing is infeasible and optimal selection is combinatorial. We propose CATS, a curvature-aware frame selection method that explicitly models the temporal geometry of query-frame relevance to identify salient events and their surrounding context. By leveraging temporal curvature to adapt selection density, CATS captures both abrupt transitions and gradually evolving content while suppressing redundant frames. Under a fixed backbone and frame budget, CATS consistently outperforms prior lightweight approaches such as AKS on LongVideoBench and VideoMME. While multi-stage methods such as MIRA achieve higher absolute accuracy, they incur substantial computational overhead; in contrast, CATS retains approximately 93-95% of MIRA's performance while requiring only 3-4% of its preprocessing cost, yielding a favorable efficiency-accuracy trade-off. Beyond answer accuracy, we evaluate description generation using an LLM-as-a-judge protocol, and the obtained results show that CATS produces more coherent and informative outputs, indicating improved grounding in visual evidence. These results position CATS as a computationally efficient and principled approach to long-video understanding.
IVNov 13, 2025
From Attention to Frequency: Integration of Vision Transformer and FFT-ReLU for Enhanced Image DeblurringSyed Mumtahin Mahmud, Mahdi Mohd Hossain Noki, Prothito Shovon Majumder et al.
Image deblurring is vital in computer vision, aiming to recover sharp images from blurry ones caused by motion or camera shake. While deep learning approaches such as CNNs and Vision Transformers (ViTs) have advanced this field, they often struggle with complex or high-resolution blur and computational demands. We propose a new dual-domain architecture that unifies Vision Transformers with a frequency-domain FFT-ReLU module, explicitly bridging spatial attention modeling and frequency sparsity. In this structure, the ViT backbone captures local and global dependencies, while the FFT-ReLU component enforces frequency-domain sparsity to suppress blur-related artifacts and preserve fine details. Extensive experiments on benchmark datasets demonstrate that this architecture achieves superior PSNR, SSIM, and perceptual quality compared to state-of-the-art models. Both quantitative metrics, qualitative comparisons, and human preference evaluations confirm its effectiveness, establishing a practical and generalizable paradigm for real-world image restoration.
CVJun 24, 2025
Deblurring in the Wild: A Real-World Image Deblurring Dataset from Smartphone High-Speed VideosSyed Mumtahin Mahmud, Mahdi Mohd Hossain Noki, Prothito Shovon Majumder et al.
We introduce the largest real-world image deblurring dataset constructed from smartphone slow-motion videos. Using 240 frames captured over one second, we simulate realistic long-exposure blur by averaging frames to produce blurry images, while using the temporally centered frame as the sharp reference. Our dataset contains over 42,000 high-resolution blur-sharp image pairs, making it approximately 10 times larger than widely used datasets, with 8 times the amount of different scenes, including indoor and outdoor environments, with varying object and camera motions. We benchmark multiple state-of-the-art (SOTA) deblurring models on our dataset and observe significant performance degradation, highlighting the complexity and diversity of our benchmark. Our dataset serves as a challenging new benchmark to facilitate robust and generalizable deblurring models.
CVJun 12, 2024
Blind Image Deblurring with FFT-ReLU Sparsity PriorAbdul Mohaimen Al Radi, Prothito Shovon Majumder, Md. Mosaddek Khan
Blind image deblurring is the process of recovering a sharp image from a blurred one without prior knowledge about the blur kernel. It is a small data problem, since the key challenge lies in estimating the unknown degrees of blur from a single image or limited data, instead of learning from large datasets. The solution depends heavily on developing algorithms that effectively model the image degradation process. We introduce a method that leverages a prior which targets the blur kernel to achieve effective deblurring across a wide range of image types. In our extensive empirical analysis, our algorithm achieves results that are competitive with the state-of-the-art blind image deblurring algorithms, and it offers up to two times faster inference, making it a highly efficient solution.
CRNov 11, 2021
An End-to-End Authentication Mechanism for Wireless Body Area NetworksMosarrat Jahan, Fatema Tuz Zohra, Md. Kamal Parvez et al.
Wireless Body Area Network (WBAN) ensures high-quality healthcare services by endowing distant and continual monitoring of patients' health conditions. The security and privacy of the sensitive health-related data transmitted through the WBAN should be preserved to maximize its benefits. In this regard, user authentication is one of the primary mechanisms to protect health data that verifies the identities of entities involved in the communication process. Since WBAN carries crucial health data, every entity engaged in the data transfer process must be authenticated. In literature, an end-to-end user authentication mechanism covering each communicating party is absent. Besides, most of the existing user authentication mechanisms are designed assuming that the patient's mobile phone is trusted. In reality, a patient's mobile phone can be stolen or comprised by malware and thus behaves maliciously. Our work addresses these drawbacks and proposes an end-to-end user authentication and session key agreement scheme between sensor nodes and medical experts in a scenario where the patient's mobile phone is semi-trusted. We present a formal security analysis using BAN logic. Besides, we also provide an informal security analysis of the proposed scheme. Both studies indicate that our method is robust against well-known security attacks. In addition, our scheme achieves comparable computation and communication costs concerning the related existing works. The simulation shows that our method preserves satisfactory network performance.