Blessing Agyei Kyem

CV
h-index: 21
15 papers
63 citations
Novelty: 44%
AI Score: 50

15 Papers

50.0 CV May 27
A Road-Conditioned Traffic Movie Prediction Network with Spatiotemporal and Structure-Consistent Learning

Joshua Kofi Asamoah, Blessing Agyei Kyem, Armstrong Aboah

City-wide traffic forecasting is important for congestion management, route guidance, and intelligent transportation systems, but accurate prediction remains challenging when future traffic must be generated as spatial maps over an entire urban network. Existing traffic movie prediction methods have improved frame-level accuracy, yet many still treat forecasting mainly as image reconstruction. This can produce traffic maps that are numerically close to the ground truth but weakly constrained by road layout, connectivity, travel direction, and congestion propagation, especially in cross-city settings where both traffic behavior and road structure change. To address this limitation, this study proposes RCSNet, a road-conditioned spatiotemporal network that reformulates traffic movie prediction as topology-guided future-state generation. RCSNet extracts road-aware representations from static road maps, models multi-horizon traffic dynamics from historical observations, aligns directional traffic features with local road structure, and progressively generates future traffic maps for improved temporal consistency. A structure-consistent learning objective further encourages predictions to remain accurate, road-aligned, and spatially stable. Experiments across multiple cities show that RCSNet improves both forecasting accuracy and structural consistency. In same-city forecasting on Berlin, Antwerp, and Moscow, RCSNet reduces average MAE, MSE, and RMSE by 11.5%, 10.0%, and 5.1%, respectively, compared with the closest baseline. In cross-city testing on unseen Chicago and Bangkok, it reduces RMSE by 10.6% and 10.5% without target-city fine-tuning. Additional horizon-wise, road-structure, explainability, statistical, and efficiency analyses show that RCSNet produces more accurate, transferable, road-aligned, and computationally efficient traffic forecasts.
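The error metrics quoted in this abstract can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names and the toy values are assumptions made here, and the real evaluation runs over full traffic maps rather than short lists.

```python
import math

def mae(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def rmse(pred, target):
    """Root mean squared error."""
    return math.sqrt(mse(pred, target))

def relative_reduction(baseline_error, model_error):
    """Percentage by which model_error is lower than baseline_error."""
    return 100.0 * (baseline_error - model_error) / baseline_error

# Toy, made-up values (not data from the paper).
target = [10.0, 20.0, 30.0, 40.0]
baseline_pred = [14.0, 16.0, 34.0, 36.0]
model_pred = [12.0, 18.0, 32.0, 38.0]

mae_gain = relative_reduction(mae(baseline_pred, target), mae(model_pred, target))
rmse_gain = relative_reduction(rmse(baseline_pred, target), rmse(model_pred, target))
```

Because RMSE is the square root of MSE, a 10.0% drop in MSE on a single error value corresponds to a drop of about 5.1% in RMSE (sqrt(0.9) is roughly 0.949), which is in line with the averaged figures reported above.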

41.2 CV Jun 4
Multi-Task Crack Foundation Model for Engineering-Reliable Crack Representation and Topology Preservation in Civil Infrastructure

Blessing Agyei Kyem, Joshua Kofi Asamoah, Eugene Denteh et al.

Reliable crack assessment requires not only accurate pixel-level masks but also connected crack geometry and confidence estimates that remain stable under domain shift. However, existing segmentation models can achieve high overlap scores while fragmenting cracks, missing fine branches, and providing no calibrated uncertainty. To address this gap, this paper proposes CrackGeoFM, a multi-task framework that combines a frozen visual foundation backbone with crack-specific adaptation for mask prediction, skeleton reconstruction, and uncertainty estimation. The framework integrates a Frequency-Guided Crack Enhancement Module (FCEM) to enhance high-frequency crack cues, a Crack-Domain Feature Adaptation Module (CFAM) to adapt frozen backbone features to crack-domain patterns, and a Structure-Aware Multi-Task Decoder (SMTD) to jointly decode masks, skeletons, and uncertainty. Across 20 crack datasets, CrackGeoFM achieves state-of-the-art segmentation, improved topology preservation, calibrated uncertainty, and effective few-shot adaptation with only five labeled images. These results support reliable, generalizable, and engineering-oriented crack analysis for infrastructure assessment.

14.2 CV May 6
Hybrid Congestion Classification Framework Using Flow-Guided Attention and Empirical Mode Decomposition

Eugene Kofi Okrah Denteh, Blessing Agyei Kyem, Joshua Kofi Asamoah et al.

Accurate traffic congestion classification requires models that jointly capture roadway scene context and non-stationary traffic motion, yet most prior work treats these requirements in isolation. Vision-based methods often depend on appearance cues with standard temporal pooling, which can bias predictions toward static infrastructure, whereas signal-based approaches characterize temporal dynamics but lack the spatial context needed for scene-level localization. These complementary limitations motivate a unified framework that links motion evidence to spatial feature selection while preserving data-adaptive temporal characterization. This study therefore proposes FLO-EMD, a hybrid approach that couples motion-guided attention with empirical, data-driven temporal decomposition. Dense optical flow guides channel and spatial attention so that RGB features are refined toward motion-relevant regions. In parallel, aggregated flow statistics form compact motion traces that are decomposed using Empirical Mode Decomposition (EMD) to extract intrinsic temporal components. The resulting EMD embedding is fused with learned spatiotemporal representations to classify light, medium, and heavy congestion. Experiments on 1,050 five-second clips from four surveillance networks show that FLO-EMD achieves 97.5% overall test accuracy (weighted F1 = 0.9742), outperforming established baselines and remaining robust across diverse environmental conditions; ablation and sensitivity analyses further quantify the contributions of EMD, the number of intrinsic mode functions, and the selected motion descriptors.
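For readers unfamiliar with the weighted F1 score reported here, the sketch below shows one common way to compute it: per-class F1 averaged with weights proportional to each class's number of true samples. The function name and the toy labels are assumptions for illustration; the abstract does not state which implementation the authors used.

```python
def weighted_f1(y_true, y_pred, labels):
    """Support-weighted average of per-class F1 scores."""
    total = len(y_true)
    score = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true samples of this class
        score += f1 * support / total
    return score

# Toy, made-up labels for the three congestion levels named in the abstract.
labels = ["light", "medium", "heavy"]
y_true = ["light", "light", "medium", "medium", "heavy", "heavy"]
y_pred = ["light", "medium", "medium", "medium", "heavy", "heavy"]
score = weighted_f1(y_true, y_pred, labels)
```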

CV Aug 10, 2024
Advancing Pavement Distress Detection in Developing Countries: A Novel Deep Learning Approach with Locally-Collected Datasets

Blessing Agyei Kyem, Eugene Kofi Okrah Denteh, Joshua Kofi Asamoah et al.

Road infrastructure maintenance in developing countries faces unique challenges due to resource constraints and diverse environmental factors. This study addresses the critical need for efficient, accurate, and locally-relevant pavement distress detection methods in these regions. We present a novel deep learning approach combining YOLO (You Only Look Once) object detection models with a Convolutional Block Attention Module (CBAM) to simultaneously detect and classify multiple pavement distress types. The model demonstrates robust performance in detecting and classifying potholes, longitudinal cracks, alligator cracks, and raveling, with confidence scores ranging from 0.46 to 0.93. While some misclassifications occur in complex scenarios, these provide insights into unique challenges of pavement assessment in developing countries. Additionally, we developed a web-based application for real-time distress detection from images and videos. This research advances automated pavement distress detection and provides a tailored solution for developing countries, potentially improving road safety, optimizing maintenance strategies, and contributing to sustainable transportation infrastructure development.

CV Aug 7, 2024
PaveCap: The First Multimodal Framework for Comprehensive Pavement Condition Assessment with Dense Captioning and PCI Estimation

Blessing Agyei Kyem, Eugene Kofi Okrah Denteh, Joshua Kofi Asamoah et al.

This research introduces PaveCap, the first multimodal framework for automated pavement condition assessment, providing both quantitative Pavement Condition Index (PCI) predictions and qualitative descriptions. The framework consists of two main parts: a Single-Shot PCI Estimation Network and a Dense Captioning Network. The PCI Estimation Network uses YOLOv8 for object detection, the Segment Anything Model (SAM) for zero-shot segmentation, and a four-layer convolutional neural network to predict PCI. The Dense Captioning Network uses a YOLOv8 backbone, a Transformer encoder-decoder architecture, and a convolutional feed-forward module to generate detailed descriptions of pavement conditions. To train and evaluate these networks, we developed a pavement dataset with bounding box annotations, textual annotations, and PCI values. The PCI Estimation Network showed a strong positive correlation (0.70) between predicted and actual PCIs, demonstrating its effectiveness in automating condition assessment. The Dense Captioning Network produced accurate pavement condition descriptions, evidenced by high BLEU (0.7445), GLEU (0.5893), and METEOR (0.7252) scores. Additionally, the dense captioning model handled complex scenarios well, even correcting some errors in the ground truth data. The framework developed here can greatly improve infrastructure management and decision-making in pavement maintenance.
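The abstract reports a correlation of 0.70 between predicted and actual PCI without naming the coefficient. Assuming it is the Pearson correlation, a minimal sketch of that computation looks like this; the function name is an assumption introduced here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 means predicted PCI rises and falls with the actual PCI; a value near 0 means there is no linear relationship.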

14.4 CV Apr 11
Class-Adaptive Cooperative Perception for Multi-Class LiDAR-based 3D Object Detection in V2X Systems

Blessing Agyei Kyem, Joshua Kofi Asamoah, Armstrong Aboah

Cooperative perception allows connected vehicles and roadside infrastructure to share sensor observations, creating a fused scene representation beyond the capability of any single platform. However, most cooperative 3D object detectors use a uniform fusion strategy for all object classes, which limits their ability to handle the different geometric structures and point-sampling patterns of small and large objects. This problem is further reinforced by narrow evaluation protocols that often emphasize a single dominant class or only a few cooperation settings, leaving robust multi-class detection across diverse vehicle-to-everything interactions insufficiently explored. To address this gap, we propose a class-adaptive cooperative perception architecture for multi-class 3D object detection from LiDAR data. The model integrates four components: multi-scale window attention with learned scale routing for spatially adaptive feature extraction, a class-specific fusion module that separates small and large objects into attentive fusion pathways, bird's-eye-view enhancement through parallel dilated convolution and channel recalibration for richer contextual representation, and class-balanced objective weighting to reduce bias toward frequent categories. Experiments on the V2X-Real benchmark cover vehicle-centric, infrastructure-centric, vehicle-to-vehicle, infrastructure-to-infrastructure, and vehicle-to-infrastructure settings under identical backbone and training configurations. The proposed method consistently improves mean detection performance over strong intermediate-fusion baselines, with the largest gains on trucks, clear improvements on pedestrians, and competitive results on cars. These results show that aligning feature extraction and fusion with class-dependent geometry and point density leads to more balanced cooperative perception in realistic vehicle-to-everything deployments.
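The abstract does not specify how the class-balanced objective weighting is defined. One common choice, shown here purely as an assumption, is inverse-frequency weighting normalized so that a perfectly balanced dataset gives every class a weight of 1. The function name and the sample counts are made up for illustration.

```python
def class_balanced_weights(counts):
    """Inverse-frequency loss weights: total / (num_classes * class_count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * count) for name, count in counts.items()}

# Hypothetical per-class object counts (not taken from V2X-Real).
counts = {"car": 800, "pedestrian": 150, "truck": 50}
weights = class_balanced_weights(counts)
```

Under this scheme rare classes such as trucks receive larger weights than frequent classes such as cars, which is the kind of bias reduction the abstract describes.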

CV Dec 23, 2025
PaveSync: A Unified and Comprehensive Dataset for Pavement Distress Analysis and Classification

Blessing Agyei Kyem, Joshua Kofi Asamoah, Anthony Dontoh et al.

Automated pavement defect detection often struggles to generalize across diverse real-world conditions due to the lack of standardized datasets. Existing datasets differ in annotation styles, distress type definitions, and formats, limiting their integration for unified training. To address this gap, we introduce a comprehensive benchmark dataset that consolidates multiple publicly available sources into a standardized collection of 52,747 images from seven countries, with 135,277 bounding box annotations covering 13 distinct distress types. The dataset captures broad real-world variation in image quality, resolution, viewing angles, and weather conditions, offering a unique resource for consistent training and evaluation. Its effectiveness was demonstrated through benchmarking with state-of-the-art object detection models including YOLOv8 through YOLOv12, Faster R-CNN, and DETR, which achieved competitive performance across diverse scenarios. By standardizing class definitions and annotation formats, this dataset provides the first globally representative benchmark for pavement defect detection and enables fair comparison of models, including zero-shot transfer to new environments.

CV Jan 24, 2025
Context-CrackNet: A Context-Aware Framework for Precise Segmentation of Tiny Cracks in Pavement images

Blessing Agyei Kyem, Joshua Kofi Asamoah, Armstrong Aboah

The accurate detection and segmentation of pavement distresses, particularly tiny and small cracks, are critical for early intervention and preventive maintenance in transportation infrastructure. Traditional manual inspection methods are labor-intensive and inconsistent, while existing deep learning models struggle with fine-grained segmentation and computational efficiency. To address these challenges, this study proposes Context-CrackNet, a novel encoder-decoder architecture featuring the Region-Focused Enhancement Module (RFEM) and Context-Aware Global Module (CAGM). These innovations enhance the model's ability to capture fine-grained local details and global contextual dependencies, respectively. Context-CrackNet was rigorously evaluated on ten publicly available crack segmentation datasets, covering diverse pavement distress scenarios. The model consistently outperformed 9 state-of-the-art segmentation frameworks, achieving superior performance metrics such as mIoU and Dice score, while maintaining competitive inference efficiency. Ablation studies confirmed the complementary roles of RFEM and CAGM, with notable improvements in mIoU and Dice score when both modules were integrated. Additionally, the model's balance of precision and computational efficiency highlights its potential for real-time deployment in large-scale pavement monitoring systems.
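The two segmentation metrics named in this abstract, IoU and Dice, can be sketched for binary crack masks as below. The function names and toy masks are assumptions for illustration, and averaging IoU over the crack and background classes is only one common way to obtain mIoU; the paper may define it differently.

```python
def iou_and_dice(pred, target):
    """IoU and Dice for two binary masks given as nested lists of 0/1."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    inter = sum(1 for a, b in zip(p, t) if a and b)
    pred_area = sum(1 for a in p if a)
    target_area = sum(1 for b in t if b)
    union = pred_area + target_area - inter
    if union == 0:  # both masks empty: treat as a perfect match
        return 1.0, 1.0
    return inter / union, 2 * inter / (pred_area + target_area)

def mean_iou(pred, target):
    """Average of crack-class IoU and background-class IoU."""
    inv_pred = [[1 - v for v in row] for row in pred]
    inv_target = [[1 - v for v in row] for row in target]
    crack_iou, _ = iou_and_dice(pred, target)
    background_iou, _ = iou_and_dice(inv_pred, inv_target)
    return (crack_iou + background_iou) / 2

# Toy masks: 1 marks a crack pixel.
pred_mask = [[0, 1, 1], [0, 1, 0]]
true_mask = [[0, 1, 0], [0, 1, 1]]
iou, dice = iou_and_dice(pred_mask, true_mask)
```

Dice counts the overlap twice in the numerator, so it is never lower than IoU for the same pair of masks.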

CV Mar 27, 2025
Integrating Travel Behavior Forecasting and Generative Modeling for Predicting Future Urban Mobility and Spatial Transformations

Eugene Denteh, Andrews Danyo, Joshua Kofi Asamoah et al.

Transportation planning plays a critical role in shaping urban development, economic mobility, and infrastructure sustainability. However, traditional planning methods often struggle to accurately predict long-term urban growth and transportation demands. This can result in infrastructure being demolished to make room for current transportation planning demands. This study integrates a Temporal Fusion Transformer to predict travel patterns from demographic data with a Generative Adversarial Network to predict future urban settings through satellite imagery. The framework achieved an R-squared score of 0.76 in travel behavior prediction and generated high-fidelity satellite images with a Structural Similarity Index of 0.81. The results demonstrate that integrating predictive analytics and spatial visualization can significantly improve the decision-making process, fostering more sustainable and efficient urban development. This research highlights the importance of data-driven methodologies in modern transportation planning and presents a step toward optimizing infrastructure placement, capacity, and long-term viability.

CV Oct 22, 2025
A Unified Detection Pipeline for Robust Object Detection in Fisheye-Based Traffic Surveillance

Neema Jakisa Owor, Joshua Kofi Asamoah, Tanner Wambui Muturi et al.

Fisheye cameras offer an efficient solution for wide-area traffic surveillance by capturing large fields of view from a single vantage point. However, the strong radial distortion and nonuniform resolution inherent in fisheye imagery introduce substantial challenges for standard object detectors, particularly near image boundaries where object appearance is severely degraded. In this work, we present a detection framework designed to operate robustly under these conditions. Our approach employs a simple yet effective pre- and post-processing pipeline that enhances detection consistency across the image, especially in regions affected by severe distortion. We train several state-of-the-art detection models on the fisheye traffic imagery and combine their outputs through an ensemble strategy to improve overall detection accuracy. Our method achieves an F1 score of 0.6366 on the 2025 AI City Challenge Track 4, placing 8th overall out of 62 teams. These results demonstrate the effectiveness of our framework in addressing issues inherent to fisheye imagery.

CV Oct 13, 2025
Prompt-Guided Spatial Understanding with RGB-D Transformers for Fine-Grained Object Relation Reasoning

Tanner Muturi, Blessing Agyei Kyem, Joshua Kofi Asamoah et al.

Spatial reasoning in large-scale 3D environments such as warehouses remains a significant challenge for vision-language systems due to scene clutter, occlusions, and the need for precise spatial understanding. Existing models often struggle with generalization in such settings, as they rely heavily on local appearance and lack explicit spatial grounding. In this work, we introduce a dedicated spatial reasoning framework for the Physical AI Spatial Intelligence Warehouse dataset introduced in Track 3 of the 2025 AI City Challenge. Our approach enhances spatial comprehension by embedding mask dimensions in the form of bounding box coordinates directly into the input prompts, enabling the model to reason over object geometry and layout. We fine-tune the framework across four question categories, namely Distance Estimation, Object Counting, Multi-choice Grounding, and Spatial Relation Inference, using task-specific supervision. To further improve consistency with the evaluation system, normalized answers are appended to the GPT response within the training set. Our comprehensive pipeline achieves a final score of 73.0606, placing 4th overall on the public leaderboard. These results demonstrate the effectiveness of structured prompt enrichment and targeted optimization in advancing spatial reasoning for real-world industrial environments.
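The prompt-enrichment idea described here, turning each object mask into bounding box coordinates and writing those coordinates into the prompt text, can be sketched as below. The prompt layout, the function names, and the sample question are hypothetical; the abstract does not give the authors' exact format.

```python
def box_from_mask(mask):
    """Return the tight (x_min, y_min, x_max, y_max) box around nonzero pixels, or None."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))

def enrich_prompt(question, named_masks):
    """Prepend one line of box coordinates per object to the question text."""
    lines = []
    for name, mask in named_masks.items():
        box = box_from_mask(mask)
        if box is not None:
            lines.append(f"{name}: bbox={list(box)}")
    return "\n".join(lines + [question])

# Toy 3x3 mask; 1 marks pixels belonging to the object.
pallet_mask = [[0, 0, 0], [0, 1, 1], [0, 1, 0]]
prompt = enrich_prompt("How far is the pallet from the forklift?", {"pallet": pallet_mask})
```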

CV Oct 13, 2025
Task-Specific Dual-Model Framework for Comprehensive Traffic Safety Video Description and Analysis

Blessing Agyei Kyem, Neema Jakisa Owor, Andrews Danyo et al.

Traffic safety analysis requires complex video understanding to capture fine-grained behavioral patterns and generate comprehensive descriptions for accident prevention. In this work, we present a unique dual-model framework that strategically utilizes the complementary strengths of VideoLLaMA and Qwen2.5-VL through task-specific optimization to address this issue. The core insight behind our approach is that separating training for captioning and visual question answering (VQA) tasks minimizes task interference and allows each model to specialize more effectively. Experimental results demonstrate that VideoLLaMA is particularly effective in temporal reasoning, achieving a CIDEr score of 1.1001, while Qwen2.5-VL excels in visual understanding with a VQA accuracy of 60.80%. Through extensive experiments on the WTS dataset, our method achieves an S2 score of 45.7572 in the 2025 AI City Challenge Track 2, placing 10th on the challenge leaderboard. Ablation studies validate that our separate training strategy outperforms joint training by 8.6% in VQA accuracy while maintaining captioning quality.

28.7 CV Apr 9
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

Blessing Agyei Kyem, Joshua Kofi Asamoah, Anthony Dontoh et al.

General-purpose vision-language models demonstrate strong performance in everyday domains but struggle with specialized technical fields requiring precise terminology, structured reasoning, and adherence to engineering standards. This work addresses whether domain-specific instruction tuning can enable comprehensive pavement condition assessment through vision-language models. PaveInstruct, a dataset containing 278,889 image-instruction-response pairs spanning 32 task types, was created by unifying annotations from nine heterogeneous pavement datasets. PaveGPT, a pavement foundation model trained on this dataset, was evaluated against state-of-the-art vision-language models across perception, understanding, and reasoning tasks. Instruction tuning transformed model capabilities, achieving improvements exceeding 20% in spatial grounding, reasoning, and generation tasks while producing ASTM D6433-compliant outputs. These results enable transportation agencies to deploy unified conversational assessment tools that replace multiple specialized systems, simplifying workflows and reducing technical expertise requirements. The approach establishes a pathway for developing instruction-driven AI systems across infrastructure domains including bridge inspection, railway maintenance, and building condition assessment.

CV Oct 12, 2025
Self-Supervised Multi-Scale Transformer with Attention-Guided Fusion for Efficient Crack Detection

Blessing Agyei Kyem, Joshua Kofi Asamoah, Eugene Denteh et al.

Pavement crack detection has long depended on costly and time-intensive pixel-level annotations, which limit its scalability for large-scale infrastructure monitoring. To overcome this barrier, this paper examines the feasibility of achieving effective pixel-level crack segmentation entirely without manual annotations. Building on this objective, a fully self-supervised framework, Crack-Segmenter, is developed, integrating three complementary modules: the Scale-Adaptive Embedder (SAE) for robust multi-scale feature extraction, the Directional Attention Transformer (DAT) for maintaining linear crack continuity, and the Attention-Guided Fusion (AGF) module for adaptive feature integration. Through evaluations on ten public datasets, Crack-Segmenter consistently outperforms 13 state-of-the-art supervised methods across all major metrics, including mean Intersection over Union (mIoU), Dice score, XOR, and Hausdorff Distance (HD). These findings demonstrate that annotation-free crack detection is not only feasible but also superior, enabling transportation agencies and infrastructure managers to conduct scalable and cost-effective monitoring. This work advances self-supervised learning and motivates further research on pavement crack detection.
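Of the metrics listed in this abstract, the Hausdorff Distance is the least familiar. A minimal sketch of the symmetric version on two sets of pixel coordinates follows; the function names and toy points are assumptions for illustration, and the paper may use a variant (for example a percentile-based one) that the abstract does not specify.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from any point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Toy crack pixel coordinates: the prediction has one stray pixel at (3, 2).
true_points = [(0, 0), (0, 1), (0, 2)]
pred_points = [(0, 0), (0, 1), (3, 2)]
distance = hausdorff(true_points, pred_points)
```

Unlike mIoU and Dice, this metric is driven by the single worst-placed pixel, so lower values indicate that the predicted crack boundary stays close to the true one everywhere.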

CV Jun 14, 2025
Demographics-Informed Neural Network for Multi-Modal Spatiotemporal forecasting of Urban Growth and Travel Patterns Using Satellite Imagery

Eugene Kofi Okrah Denteh, Andrews Danyo, Joshua Kofi Asamoah et al.

This study presents a novel demographics-informed deep learning framework designed to forecast urban spatial transformations by jointly modeling geographic satellite imagery, socio-demographics, and travel behavior dynamics. The proposed model employs an encoder-decoder architecture with temporal gated residual connections, integrating satellite imagery and demographic data to accurately forecast future spatial transformations. The study also introduces a demographics prediction component which ensures that predicted satellite imagery is consistent with demographic features, significantly enhancing physiological realism and socioeconomic accuracy. The framework is enhanced by a proposed multi-objective loss function complemented by a semantic loss function that balances visual realism with temporal coherence. The experimental results demonstrate the superior performance of the proposed model compared to state-of-the-art models, achieving higher structural similarity (SSIM: 0.8342) and significantly improved demographic consistency (Demo-loss: 0.14 versus 0.95 and 0.96 for baseline models). Additionally, the study validates co-evolutionary theories of urban development, demonstrating quantifiable bidirectional influences between built environment characteristics and population patterns. The study also contributes a comprehensive multimodal dataset pairing satellite imagery sequences (2012-2023) with corresponding demographic and travel behavior attributes, addressing existing gaps in urban and transportation planning resources by explicitly connecting physical landscape evolution with socio-demographic patterns.