Zainy M. Malakan

h-index3
2papers

2 Papers

CVOct 6, 2022
Vision Transformer Based Model for Describing a Set of Images as a Story

Zainy M. Malakan, Ghulam Mubashar Hassan, Ajmal Mian

Visual Story-Telling is the process of forming a multi-sentence story from a set of images. Appropriately including visual variation and contextual information captured inside the input images is one of the most challenging aspects of visual storytelling. Consequently, stories developed from a set of images often lack cohesiveness, relevance, and semantic relationship. In this paper, we propose a novel Vision Transformer Based Model for describing a set of images as a story. The proposed method extracts the distinct features of the input images using a Vision Transformer (ViT). Firstly, input images are divided into 16X16 patches and bundled into a linear projection of flattened patches. The transformation from a single image to multiple image patches captures the visual variety of the input visual patterns. These features are used as input to a Bidirectional-LSTM which is part of the sequence encoder. This captures the past and future image context of all image patches. Then, an attention mechanism is implemented and used to increase the discriminatory capacity of the data fed into the language model, i.e. a Mogrifier-LSTM. The performance of our proposed model is evaluated using the Visual Story-Telling dataset (VIST), and the results show that our model outperforms the current state of the art models.

CVOct 24, 2025
Urban 3D Change Detection Using LiDAR Sensor for HD Map Maintenance and Smart Mobility

Hezam Albagami, Haitian Wang, Xinyu Wang et al.

High-definition 3D city maps underpin smart transportation, digital twins, and autonomous driving, where object level change detection across bi temporal LiDAR enables HD map maintenance, construction monitoring, and reliable localization. Classical DSM differencing and image based methods are sensitive to small vertical bias, ground slope, and viewpoint mismatch and yield cellwise outputs without object identity. Point based neural models and voxel encodings demand large memory, assume near perfect pre alignment, degrade thin structures, and seldom enforce class consistent association, which leaves split or merge cases unresolved and ignores uncertainty. We propose an object centric, uncertainty aware pipeline for city scale LiDAR that aligns epochs with multi resolution NDT followed by point to plane ICP, normalizes height, and derives a per location level of detection from registration covariance and surface roughness to calibrate decisions and suppress spurious changes. Geometry only proxies seed cross epoch associations that are refined by semantic and instance segmentation and a class constrained bipartite assignment with augmented dummies to handle splits and merges while preserving per class counts. Tiled processing bounds memory without eroding narrow ground changes, and instance level decisions combine 3D overlap, normal direction displacement, and height and volume differences with a histogram distance, all gated by the local level of detection to remain stable under partial overlap and sampling variation. On 15 representative Subiaco blocks the method attains 95.2% accuracy, 90.4% mF1, and 82.6% mIoU, exceeding Triplet KPConv by 0.2 percentage points in accuracy, 0.2 in mF1, and 0.8 in mIoU, with the largest gain on Decreased where IoU reaches 74.8% and improves by 7.6 points.