Amirreza Rouhi

CV
h-index3
5papers
5citations
Novelty45%
AI Score46

5 Papers

64.6CVMar 31
PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models

Amirreza Rouhi, Parikshit Sakurikar, Satya Sai Reddy et al.

A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language-models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation - physical AI systems fail not because of poor visual recognition, but because they do not understand space, physical dynamics and embodied action well enough to operate reliably in the world. To this end, PRISM is grounded in a novel three-dimensional knowledge ontology that spans spatial knowledge, temporal and physical knowledge, and embodied action knowledge. It covers 20+ capability probes across four evaluation dimensions - Embodied Reasoning (ER), Common Sense (CS), Spatial Perception (SP), and Intuitive Physics (IP), and to our knowledge, PRISM is the first dataset to instantiate all three knowledge dimensions within a single real-world deployment domain. The corpus captures data from egocentric, exocentric and 360° viewpoints across five supermarket locations and includes open-ended, chain-of-thought, and multiple-choice supervision. At 4 fps, PRISM spans approximately 11.8M video frames and approximately 730M tokens, placing it among the largest domain-specific video SFT corpora. Fine-tuning on PRISM reduces the error rate across all 20+ probes by 66.6% over the pre-trained baseline, with significant gains in embodied action understanding where the accuracy improves by 36.4%. Our results suggest that ontology-structured, domain specific SFT can meaningfully strengthen embodied VLMs for real-world settings. The PRISM dataset and more details are available at https://dreamvu.ai/prism

53.5ROMay 10
SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation

Narsimha Menga, Parikshit Sakurikar, Amirreza Rouhi et al.

Robotic deployment in real-world environments depends on rich, domain-specific action data as much as on strong model architecture. General-purpose robot foundation models show modest performance in complex unseen tasks such as manipulation in a retail domain when applied out of the box. The root cause is a data gap: retail environments are structurally absent from general robot pretraining distributions, and the path to filling that gap through teleoperation is prohibitively expensive, logistically constrained, and difficult to scale. We introduce SABER, a high-fidelity retail robotics action dataset built from over 100 hours of natural in-store capture across multiple real grocery environments. Egocentric footage from head-mounted cameras records fine-grained hand activity at the point of interaction, while exocentric 360-degree scene footage from DreamVu's ALIA camera simultaneously observes all actors and activities across the entire space. This combination yields a uniquely complete picture of human retail behavior: dexterous hand activity, whole-body motion, and scene dynamics, all captured without staging, scripting, or teleoperation overhead. The SABER corpus contains 44.8K training samples across three action representation streams: 25K latent action sequences via LAPA-style encoding, 18.6K dexterous hand-pose trajectories retargeted to robot joint space, and 1.2K whole-body synchronized motion sequences retargeted to a humanoid embodiment. When applied to GR00T N1.6 via a shared-backbone multi-task post-training recipe, SABER yields a mean success rate of 29.3% across ten retail manipulation tasks -- more than 2.19x over fine-tuning baselines (13.4%). SABER demonstrates that the path to capable retail robots runs through better data, which can be collected today, at scale, without a robot in the loop. The dataset and code are available at https://dreamvu.ai/saber

CVNov 18, 2023
Learning Scene Context Without Images

Amirreza Rouhi, David Han

Teaching machines of scene contextual knowledge would enable them to interact more effectively with the environment and to anticipate or predict objects that may not be immediately apparent in their perceptual field. In this paper, we introduce a novel transformer-based approach called $LMOD$ ( Label-based Missing Object Detection) to teach scene contextual knowledge to machines using an attention mechanism. A distinctive aspect of the proposed approach is its reliance solely on labels from image datasets to teach scene context, entirely eliminating the need for the actual image itself. We show how scene-wide relationships among different objects can be learned using a self-attention mechanism. We further show that the contextual knowledge gained from label based learning can enhance performance of other visual based object detection algorithm.

CVAug 5, 2025
LRDDv2: Enhanced Long-Range Drone Detection Dataset with Range Information and Comprehensive Real-World Challenges

Amirreza Rouhi, Sneh Patel, Noah McCarthy et al.

The exponential growth in Unmanned Aerial Vehicles (UAVs) usage underscores the critical need of detecting them at extended distances to ensure safe operations, especially in densely populated areas. Despite the tremendous advances made in computer vision through deep learning, the detection of these small airborne objects remains a formidable challenge. While several datasets have been developed specifically for drone detection, the need for a more extensive and diverse collection of drone image data persists, particularly for long-range detection under varying environmental conditions. We introduce here the Long Range Drone Detection (LRDD) Version 2 dataset, comprising 39,516 meticulously annotated images, as a second release of the LRDD dataset released previously. The LRDDv2 dataset enhances the LRDDv1 by incorporating a greater variety of images, providing a more diverse and comprehensive resource for drone detection research. What sets LRDDv2 apart is its inclusion of target range information for over 8,000 images, making it possible to develop algorithms for drone range estimation. Tailored for long-range aerial object detection, the majority of LRDDv2's dataset consists of images capturing drones with 50 or fewer pixels in 1080p resolution. For access to the complete Long-Range Drone Detection Dataset (LRDD)v2, please visit https://research.coe.drexel.edu/ece/imaple/lrddv2/ .

CVJun 10, 2025
ADAM: Autonomous Discovery and Annotation Model using LLMs for Context-Aware Annotations

Amirreza Rouhi, Solmaz Arezoomandan, Knut Peterson et al.

Object detection models typically rely on predefined categories, limiting their ability to identify novel objects in open-world scenarios. To overcome this constraint, we introduce ADAM: Autonomous Discovery and Annotation Model, a training-free, self-refining framework for open-world object labeling. ADAM leverages large language models (LLMs) to generate candidate labels for unknown objects based on contextual information from known entities within a scene. These labels are paired with visual embeddings from CLIP to construct an Embedding-Label Repository (ELR) that enables inference without category supervision. For a newly encountered unknown object, ADAM retrieves visually similar instances from the ELR and applies frequency-based voting and cross-modal re-ranking to assign a robust label. To further enhance consistency, we introduce a self-refinement loop that re-evaluates repository labels using visual cohesion analysis and k-nearest-neighbor-based majority re-labeling. Experimental results on the COCO and PASCAL datasets demonstrate that ADAM effectively annotates novel categories using only visual and contextual signals, without requiring any fine-tuning or retraining.