Kaicong Huang

CV
h-index: 6
7 papers
15 citations
Novelty: 53%
AI Score: 50

7 Papers

33.9 · CV · May 11 · Code
iPay: Integrated Payment Action Recognition via Multimodal Networks and Adaptive Spatial Prior Learning

Kaicong Huang, Weiheng Oh, Thomas Guggisberg et al.

Automated transit payment analysis is vital for scalable fare auditing and passenger analytics, yet practice still relies on limited manual inspection. Prior vision- and skeleton-based methods remain brittle under noisy onboard surveillance and often depend on poorly generalizable handcrafted features. Building on the success of graph convolutional networks in human action recognition, we observe that skeleton features excel at modeling global spatiotemporal dependencies but tend to underemphasize the subtle local relative motions that distinguish payment actions. In contrast, RGB features preserve fine-grained spatial details yet often lack reliable temporal continuity in surveillance footage. To address both system-level deployment needs and model-level design challenges, we present iPay, an integrated payment action recognition framework for onboard transit surveillance systems. iPay adopts a multimodal mixture-of-experts architecture with four tightly coupled streams: (1) an RGB expert stream emphasizing local evidence via region-focused computation; (2) a skeleton expert stream modeling articulated motion with a graph convolutional backbone; (3) a dual-attention fusion stream enabling skeleton-to-RGB temporal transfer and RGB-to-skeleton spatial enhancement; and (4) a prior-driven Spatial Difference Discriminator (SDD) that explicitly models hand-to-anchor relative motion to improve task-specific discriminability. We also collaborate with local transit agencies to collect over 55 hours of real onboard surveillance footage, yielding 500+ payment clips. Experiments show that iPay outperforms prior methods and achieves 83.45% recognition accuracy with competitive computational efficiency, making it suitable for edge deployment. Code is available at https://github.com/ccoopq/iPay.
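The hand-to-anchor relative motion mentioned for the SDD stream can be illustrated with a small sketch. This is not the iPay implementation: the function name `hand_to_anchor_features`, the use of a single fixed 2D anchor point (for example, a fare box location), and the three returned quantities are all assumptions made for illustration.

```python
import numpy as np

def hand_to_anchor_features(hand_xy, anchor_xy):
    """Hypothetical relative-motion features for one hand keypoint.

    hand_xy:   (T, 2) hand keypoint positions over T frames.
    anchor_xy: (2,) fixed anchor position (assumed, e.g. a fare box).
    """
    hand_xy = np.asarray(hand_xy, dtype=float)
    anchor_xy = np.asarray(anchor_xy, dtype=float)
    offset = hand_xy - anchor_xy            # (T, 2) hand position relative to anchor
    dist = np.linalg.norm(offset, axis=1)   # (T,) hand-to-anchor distance per frame
    approach = -np.diff(dist)               # (T-1,) positive while the hand closes in
    return offset, dist, approach
```

Features like these depend only on the hand's position relative to the anchor, so they stay the same if the whole scene shifts in the image, which is one plausible reason to model relative rather than absolute motion.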

52.4 · QUANT-PH · May 1
Impact-Driven Quantum Decomposition for Traffic Zone Partitioning: A Hybrid Gate-Model Framework

Ruimin Ke, Talha Azfar, Kaicong Huang et al.

Partitioning transportation networks into balanced and spatially coherent traffic zones is a fundamental yet computationally challenging task in intelligent transportation systems. The resulting optimization problem exhibits dense interactions among decision variables and can be formulated as a Quadratic Unconstrained Binary Optimization (QUBO) model. While quantum optimization naturally aligns with such quadratic energy representations, current noisy intermediate-scale quantum hardware imposes limitations on problem size, connectivity, and circuit reliability. This paper proposes an impact-driven hybrid quantum-classical optimization framework for traffic zone partitioning that bridges transportation-scale optimization models and practical gate-based quantum processors. Instead of static geographic decomposition, the method estimates the energy impact of decision variables and selectively assigns quantum computation to influential subproblems while a classical coordination loop maintains global feasibility. The framework is implemented using the Iskay optimizer and evaluated on the IBM Quantum System One backend. Experiments compare direct quantum optimization, classical iterative SubQUBO refinement, and the proposed hybrid approach. Results show that impact-guided decomposition improves convergence behavior and produces more coherent spatial partitions relative to classical refinement, while remaining consistent with hardware constraints. Although the hybrid method does not outperform the best direct quantum solution, it demonstrates a practical pathway toward scalable hybrid optimization for transportation applications under current quantum hardware conditions.
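To make the QUBO formulation concrete, here is a minimal classical sketch for splitting a small network into two balanced zones. The energy (cut weight plus a balance penalty weighted by `lam`), the helper names, and the use of single-bit flip energy as a stand-in for "impact" are assumptions for illustration; the abstract does not specify the paper's objective or its impact estimator, and no quantum backend is involved here.

```python
import itertools
import numpy as np

def build_qubo(n, edges, lam):
    """Upper-triangular Q with E(x) = x^T Q x for binary x (constant term dropped).

    Cut term per edge:  w * (x_i + x_j - 2 x_i x_j)
    Balance penalty:    lam * (sum_i x_i - n/2)^2
    """
    Q = np.zeros((n, n))
    for i, j, w in edges:
        a, b = min(i, j), max(i, j)
        Q[a, a] += w
        Q[b, b] += w
        Q[a, b] -= 2 * w
    for i in range(n):
        Q[i, i] += lam * (1 - n)       # uses x_i^2 = x_i for binary variables
        for j in range(i + 1, n):
            Q[i, j] += 2 * lam
    return Q

def energy(Q, x):
    x = np.asarray(x, dtype=float)
    return float(x @ Q @ x)

def brute_force(Q):
    """Exhaustive search; only feasible for very small n."""
    n = Q.shape[0]
    best = min(itertools.product([0, 1], repeat=n), key=lambda x: energy(Q, x))
    return best, energy(Q, best)

def flip_impact(Q, x):
    """Assumed impact proxy: energy change from flipping each bit in turn."""
    base = energy(Q, x)
    deltas = []
    for i in range(len(x)):
        y = list(x)
        y[i] = 1 - y[i]
        deltas.append(energy(Q, y) - base)
    return deltas

# Two triangles (weight 2 inside each) joined by one weaker link (weight 1).
edges = [(0, 1, 2), (0, 2, 2), (1, 2, 2), (3, 4, 2), (3, 5, 2), (4, 5, 2), (2, 3, 1)]
Q = build_qubo(6, edges, lam=1.0)
best, best_energy = brute_force(Q)
```

On this toy network the minimum-energy assignment separates the two triangles and cuts only the weak link. In a decomposition scheme, variables with the largest flip-energy magnitude would be natural candidates for a sub-QUBO, though that selection rule is an assumption here.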

35.9 · CV · May 13
Real2Sim: A Physics-driven and Editable Gaussian Splatting Framework for Autonomous Driving Scenes

Kaicong Huang, Talha Azfar, Weisong Shi et al.

Reliable autonomous driving relies on large-scale, well-labeled data and robust models. However, manual data collection is resource-intensive, and traditional simulation suffers from a persistent reality gap. While recent generative frameworks and radiance-field methods improve visual fidelity, they still struggle with temporal and spatial consistency and cannot ensure physics-aware behavior, limiting their applicability to driving scenario generation. To address these challenges, we propose Real2Sim, a unified framework that combines 4D Gaussian Splatting (4DGS) with a differentiable Material Point Method (MPM) solver. Real2Sim explicitly reconstructs dynamic driving scenes as temporally continuous Gaussian primitives, supports instance-level editing, and simulates realistic object-object and object-environment interactions. This framework enables physics-aware, high-fidelity synthesis of diverse, editable scenarios, including challenging corner cases such as collisions and post-impact trajectories. Experiments on the Waymo Open Dataset validate Real2Sim's capabilities in rendering, reconstruction, editing, and physics simulation, demonstrating its potential as a scalable tool for data generation in downstream tasks such as perception, tracking, trajectory prediction, and end-to-end policy learning.

ET · Jan 25, 2025
Enhancing Disaster Resilience with UAV-Assisted Edge Computing: A Reinforcement Learning Approach to Managing Heterogeneous Edge Devices

Talha Azfar, Kaicong Huang, Ruimin Ke

Edge sensing and computing is rapidly becoming part of intelligent infrastructure architecture, leading to operational reliance on such systems in disaster or emergency situations. In such scenarios there is a high chance of power supply failure due to power grid issues, and of communication failure due to base stations losing power or being damaged by the elements, e.g., flooding or wildfires. Mobile edge computing in the form of unmanned aerial vehicles (UAVs) has been proposed to provide computation offloading from these devices to conserve their battery, while the use of UAVs as relay network nodes has also been investigated previously. This paper considers the use of UAVs with further constraints on power and connectivity to prolong the life of the network while also ensuring that the data is received from the edge nodes in a timely manner. Reinforcement learning is used to investigate numerous scenarios with various levels of power and communication failure. This approach is able to identify the device most likely to fail in a given scenario, thus providing priority guidance for maintenance personnel. The evacuations of a rural town and an urban downtown area are also simulated to demonstrate the effectiveness of the approach at extending the life of the most critical edge devices.

CV · Apr 15, 2025
TransitReID: Transit OD Data Collection with Occlusion-Resistant Dynamic Passenger Re-Identification

Kaicong Huang, Talha Azfar, Jack Reilly et al.

Transit Origin-Destination (OD) data are fundamental for optimizing public transit services, yet current collection methods, such as manual surveys, Bluetooth and WiFi tracking, or Automated Passenger Counters, are costly, device-dependent, or incapable of individual-level matching. Meanwhile, onboard surveillance cameras already deployed on most transit vehicles provide an underutilized opportunity for automated OD data collection. Leveraging this, we present TransitReID, a novel framework for individual-level and occlusion-resistant passenger re-identification tailored to transit environments. Our approach introduces three key innovations: (1) an occlusion-robust ReID algorithm that integrates a variational autoencoder-guided region-attention mechanism and selective quality feature averaging to dynamically emphasize visible and discriminative body regions under severe occlusions and viewpoint variations; (2) a Hierarchical Storage and Dynamic Matching (HSDM) mechanism that transforms static gallery matching into a dynamic process for robustness, accuracy, and speed in real-world bus operations; and (3) a multi-threaded edge implementation that enables near real-time OD estimation while ensuring privacy by processing all data locally. To support research in this domain, we also construct a new TransitReID dataset with over 17,000 images captured from bus front and rear cameras under diverse occlusion and viewpoint conditions. Experimental results demonstrate that TransitReID achieves state-of-the-art performance, with R-1 accuracy of 88.3 percent and mAP of 92.5 percent, and further sustains 90 percent OD estimation accuracy in bus route simulations on NVIDIA Jetson edge devices. This work advances both the algorithmic and system-level foundations of automated transit OD collection, paving the way for scalable, privacy-preserving deployment in intelligent transportation systems.
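For readers unfamiliar with the R-1 and mAP figures quoted above, the sketch below shows how these two retrieval metrics are commonly computed from a query-to-gallery distance matrix. It is a simplified illustration, not the TransitReID evaluation code: the function name is hypothetical, and the same-camera filtering used by standard ReID protocols is omitted.

```python
import numpy as np

def rank1_and_map(dist, query_ids, gallery_ids):
    """Rank-1 accuracy and mean average precision for a retrieval task.

    dist:        (num_query, num_gallery) distances, smaller means more similar.
    query_ids:   identity label of each query.
    gallery_ids: identity label of each gallery item.
    """
    dist = np.asarray(dist, dtype=float)
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    rank1_hits, average_precisions = [], []
    for q in range(dist.shape[0]):
        order = np.argsort(dist[q], kind="stable")      # gallery sorted by distance
        matches = gallery_ids[order] == query_ids[q]    # True where identity matches
        if not matches.any():
            continue                                    # skip queries with no true match
        rank1_hits.append(float(matches[0]))
        hit_ranks = np.flatnonzero(matches)             # 0-based ranks of correct items
        precisions = (np.arange(len(hit_ranks)) + 1) / (hit_ranks + 1)
        average_precisions.append(precisions.mean())
    return float(np.mean(rank1_hits)), float(np.mean(average_precisions))

# Toy example: 2 queries against 3 gallery images.
r1, mean_ap = rank1_and_map(
    dist=[[0.1, 0.5, 0.3], [0.2, 0.4, 0.9]],
    query_ids=[1, 2],
    gallery_ids=[1, 2, 1],
)
```

In the toy example the first query ranks both of its matches first, while the second query finds its only match at rank 2, giving a Rank-1 of 0.5 and an mAP of 0.75.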

CV · Sep 3, 2025
Background Matters Too: A Language-Enhanced Adversarial Framework for Person Re-Identification

Kaicong Huang, Talha Azfar, Jack M. Reilly et al.

Person re-identification faces two core challenges: precisely locating the foreground target while suppressing background noise and extracting fine-grained features from the target region. Numerous visual-only approaches address these issues by partitioning an image and applying attention modules, yet they rely on costly manual annotations and struggle with complex occlusions. Recent multimodal methods, motivated by CLIP, introduce semantic cues to guide visual understanding. However, they focus solely on foreground information and overlook the potential value of background cues. Inspired by human perception, we argue that background semantics are as important as the foreground semantics in ReID, as humans tend to eliminate background distractions while focusing on target appearance. Therefore, this paper proposes an end-to-end framework that jointly models foreground and background information within a dual-branch cross-modal feature extraction pipeline. To help the network distinguish between the two domains, we propose an intra-semantic alignment and inter-semantic adversarial learning strategy. Specifically, we align visual and textual features that share the same semantics across domains, while simultaneously penalizing similarity between foreground and background features to enhance the network's discriminative power. This strategy drives the model to actively suppress noisy background regions and enhance attention toward identity-relevant foreground cues. Comprehensive experiments on two holistic and two occluded ReID benchmarks demonstrate the effectiveness and generality of the proposed method, with results that match or surpass those of current state-of-the-art approaches.
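The alignment-plus-adversarial idea can be written down as a toy objective. The sketch below is an assumption-laden illustration rather than the paper's loss: it uses cosine similarity, a hinge on the foreground-background similarity, and equal weighting, none of which are stated in the abstract. The names `cosine` and `fg_bg_loss` are hypothetical.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two 1-D feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def fg_bg_loss(v_fg, t_fg, v_bg, t_bg):
    """Toy objective with two parts (assumed form, equal weights).

    Alignment:  pull each visual feature toward the text feature that
                shares its semantics (foreground or background).
    Separation: penalize positive similarity between the foreground and
                background visual features.
    """
    align = (1.0 - cosine(v_fg, t_fg)) + (1.0 - cosine(v_bg, t_bg))
    separate = max(0.0, cosine(v_fg, v_bg))
    return align + separate

# Well-separated features: each branch matches its text, branches are orthogonal.
good = fg_bg_loss(v_fg=[1, 0], t_fg=[2, 0], v_bg=[0, 1], t_bg=[0, 3])
# Collapsed features: the background branch copies the foreground.
bad = fg_bg_loss(v_fg=[1, 0], t_fg=[1, 0], v_bg=[1, 0], t_bg=[0, 1])
```

The collapsed case is penalized twice, once for the background feature failing to match its text and once for being identical to the foreground feature, which is the behavior the strategy is meant to discourage.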

SY · Dec 5, 2024
Traffic Co-Simulation Framework Empowered by Infrastructure Camera Sensing and Reinforcement Learning

Talha Azfar, Kaicong Huang, Andrew Tracy et al.

Traffic simulations are commonly used to optimize urban traffic flow, with reinforcement learning (RL) showing promising potential for automated traffic signal control, particularly in intelligent transportation systems involving connected automated vehicles. Multi-agent reinforcement learning (MARL) is particularly effective for learning control strategies for traffic lights in a network using iterative simulations. However, existing methods often assume perfect vehicle detection, which overlooks real-world limitations related to infrastructure availability and sensor reliability. This study proposes a co-simulation framework integrating CARLA and SUMO, which combines high-fidelity 3D modeling with large-scale traffic flow simulation. Cameras mounted on traffic light poles within the CARLA environment use a YOLO-based computer vision system to detect and count vehicles, providing real-time traffic data as input for adaptive signal control in SUMO. MARL agents trained with four different reward structures leverage this visual feedback to optimize signal timings and improve network-wide traffic flow. Experiments in a multi-intersection testbed demonstrate the effectiveness of the proposed MARL approach in enhancing traffic conditions using real-time camera-based detection. The framework also evaluates the robustness of MARL under faulty or sparse sensing and compares the performance of YOLOv5 and YOLOv8 for vehicle detection. Results show that while better accuracy improves performance, MARL agents can still achieve significant improvements with imperfect detection, demonstrating scalability and adaptability for real-world scenarios.
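One simple way to picture imperfect detection feeding a signal-control reward is shown below. This is a hedged sketch, not the paper's setup: modeling missed detections as independent per-vehicle misses with a fixed `recall`, and using the negative total detected count as the reward, are both assumptions, and neither CARLA, SUMO, nor YOLO is called here.

```python
import numpy as np

def detected_counts(true_counts, recall, rng):
    """Simulate camera counts where each vehicle is detected with probability `recall`."""
    true_counts = np.asarray(true_counts, dtype=int)
    return rng.binomial(true_counts, recall)

def queue_reward(counts):
    """Assumed reward: fewer observed waiting vehicles is better."""
    return -float(np.sum(counts))

rng = np.random.default_rng(0)
truth = [3, 0, 5, 2]                       # vehicles waiting on four approaches
perfect = detected_counts(truth, 1.0, rng)
noisy = detected_counts(truth, 0.7, rng)   # some vehicles are missed
blind = detected_counts(truth, 0.0, rng)   # sensor failure
```

With missed detections the agent sees a reward that is optimistic relative to the true queue, which is one reason robustness to sensing quality is worth testing separately from control performance.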