CV Apr 10, 2025
SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos
Joshua Li, Fernando Jose Pena Cantu, Emily Yu et al.
Video Scene Graph Generation (VidSGG) is an important topic in understanding dynamic kitchen environments. Current models for VidSGG require extensive training to produce scene graphs. Recently, Vision Language Models (VLMs) and Vision Foundation Models (VFMs) have demonstrated impressive zero-shot capabilities in a variety of tasks. However, VLMs like Gemini struggle with the temporal dynamics of VidSGG, failing to maintain stable object identities across frames. To overcome this limitation, we propose SAMJAM, a zero-shot pipeline that combines SAM2's temporal tracking with Gemini's semantic understanding. SAM2 also improves upon Gemini's object grounding by producing more accurate bounding boxes. In our method, we first prompt Gemini to generate a frame-level scene graph. Then, we employ a matching algorithm to map each object in the scene graph to a SAM2-generated or SAM2-propagated mask, producing a temporally consistent scene graph in dynamic environments. Finally, we repeat this process for each subsequent frame. We empirically demonstrate that SAMJAM outperforms Gemini by 8.33% in mean recall on the EPIC-KITCHENS and EPIC-KITCHENS-100 datasets.
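The abstract does not specify the matching algorithm, so the following is only a minimal sketch of one plausible approach: greedy one-to-one matching by bounding-box IoU. The names `iou`, `match_objects_to_masks`, and the `min_iou` threshold are hypothetical, and the boxes are assumed to have already been extracted from the VLM output and from the tracked masks.

```python
from typing import Dict, Tuple

# Box format assumed here: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_objects_to_masks(
    object_boxes: Dict[str, Box],
    mask_boxes: Dict[int, Box],
    min_iou: float = 0.3,  # hypothetical threshold, not from the paper
) -> Dict[str, int]:
    """Greedily assign each scene-graph object to at most one mask ID.

    Pairs are visited from highest to lowest IoU; an object or mask
    that is already assigned is skipped.
    """
    pairs = sorted(
        (
            (iou(obj_box, mask_box), name, mask_id)
            for name, obj_box in object_boxes.items()
            for mask_id, mask_box in mask_boxes.items()
        ),
        reverse=True,
    )
    assignment: Dict[str, int] = {}
    used_masks = set()
    for score, name, mask_id in pairs:
        if score < min_iou:
            break  # all remaining pairs overlap even less
        if name in assignment or mask_id in used_masks:
            continue
        assignment[name] = mask_id
        used_masks.add(mask_id)
    return assignment
```

Because a mask ID stays fixed while the tracker propagates it across frames, attaching each scene-graph node to a mask ID in this way is one route to stable object identities over time. An optimal assignment (for example, the Hungarian algorithm) could replace the greedy loop.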
LG Dec 14, 2025
Network Level Evaluation of Hangup Susceptibility of HRGCs using Deep Learning and Sensing Techniques: A Goal Towards Safer Future
Kaustav Chatterjee, Joshua Li, Kundan Parajulee et al.
Steep-profiled Highway Railway Grade Crossings (HRGCs) pose safety hazards to vehicles with low ground clearance, which may become stranded on the tracks, creating a risk of train-vehicle collisions. This research develops a framework for network-level evaluation of the hang-up susceptibility of HRGCs. Profile data from different crossings in Oklahoma were collected using both a walking profiler and the Pave3D8K Laser Imaging System. A hybrid deep learning model, combining Long Short-Term Memory (LSTM) and Transformer architectures, was developed to reconstruct accurate HRGC profiles from Pave3D8K Laser Imaging System data. Vehicle dimension data from around 350 specialty vehicles were collected at various locations across Oklahoma to provide up-to-date statistical design dimensions. Hang-up susceptibility was analyzed using three vehicle dimension scenarios: (a) median dimension (median wheelbase and ground clearance), (b) 75-25 percentile dimension (75th-percentile wheelbase, 25th-percentile ground clearance), and (c) worst-case dimension (maximum wheelbase and minimum ground clearance). Results indicate 70, 80, and 95 crossings at the highest hang-up risk levels under these scenarios, respectively. An ArcGIS database and a software interface were developed to support transportation agencies in mitigating crossing hazards. This framework advances safety evaluation by integrating next-generation sensing, deep learning, and infrastructure datasets into practical decision support tools.
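The three vehicle dimension scenarios can be derived from raw survey measurements roughly as follows. This is a minimal sketch, not the authors' implementation: the function name `scenario_dimensions` is hypothetical, the units are whatever the input lists use, and the percentile interpolation rule (Python's "inclusive" method) is an assumption, since the abstract does not state one.

```python
import statistics
from typing import Dict, List, Tuple


def scenario_dimensions(
    wheelbases: List[float],
    ground_clearances: List[float],
) -> Dict[str, Tuple[float, float]]:
    """Return (wheelbase, ground clearance) for each design scenario.

    A longer wheelbase and a lower ground clearance both make hang-up
    more likely, so the more severe scenarios pair a high wheelbase
    statistic with a low clearance statistic.
    """
    # quantiles(..., n=4) returns the three quartile cut points [Q1, Q2, Q3].
    wb_q = statistics.quantiles(wheelbases, n=4, method="inclusive")
    gc_q = statistics.quantiles(ground_clearances, n=4, method="inclusive")
    return {
        "median": (
            statistics.median(wheelbases),
            statistics.median(ground_clearances),
        ),
        "75-25 percentile": (wb_q[2], gc_q[0]),
        "worst case": (max(wheelbases), min(ground_clearances)),
    }
```

Each returned pair would then serve as the design vehicle when checking a reconstructed crossing profile for hang-up risk.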
CV Oct 17, 2025
SHARE: Scene-Human Aligned Reconstruction
Joshua Li, Brendan Chharawala, Chang Shu et al.
Animating realistic character interactions with the surrounding environment is important for autonomous agents in gaming, AR/VR, and robotics. However, current methods for human motion reconstruction struggle to place humans accurately in 3D space. We introduce Scene-Human Aligned REconstruction (SHARE), a technique that leverages the scene geometry's inherent spatial cues to accurately ground human motion reconstruction. Each reconstruction relies solely on a monocular RGB video from a stationary camera. SHARE first estimates a human mesh and segmentation mask for every frame, alongside a scene point map at keyframes. It iteratively refines the human's positions at these keyframes by comparing the human mesh against the human point map extracted from the scene using the mask. Crucially, we also ensure that non-keyframe human meshes remain consistent by preserving their root joint positions relative to the keyframe root joints during optimization. Our approach enables more accurate 3D human placement while reconstructing the surrounding scene, facilitating use cases on both curated datasets and in-the-wild web videos. Extensive experiments demonstrate that SHARE outperforms existing methods.
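The idea of keeping non-keyframe meshes consistent with refined keyframes can be illustrated with a small sketch. It rests on assumptions the abstract does not confirm: each frame is tied to its nearest keyframe, and the offset is reapplied once after refinement instead of being enforced inside the optimization loop. The name `propagate_keyframe_refinement` is hypothetical.

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def propagate_keyframe_refinement(
    roots: List[Vec3],
    refined_keyframes: Dict[int, Vec3],
) -> List[Vec3]:
    """Move every frame's root joint along with its nearest keyframe.

    roots: initial per-frame root joint positions (index = frame number).
    refined_keyframes: frame index -> refined root position for keyframes.
    The offset between a frame's root and its keyframe's initial root is
    preserved, so relative motion between frames is unchanged.
    """
    keyframe_ids = sorted(refined_keyframes)
    updated: List[Vec3] = []
    for t, root in enumerate(roots):
        # Nearest keyframe in time (assumption); ties go to the earlier one.
        k = min(keyframe_ids, key=lambda i: abs(i - t))
        offset = tuple(r - kr for r, kr in zip(root, roots[k]))
        new_root = tuple(nk + o for nk, o in zip(refined_keyframes[k], offset))
        updated.append(new_root)
    return updated
```

With keyframes at frames 0 and 3, for instance, frame 1 follows the correction applied to frame 0 and frame 2 follows the correction applied to frame 3. A smoother variant could blend the corrections of the two surrounding keyframes.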