Aaron Parness

CVFeb 12

Visual Foresight for Robotic Stow: A Diffusion-Based World Model from Sparse Snapshots

Lijun Zhang, Nikhil Chacko, Petter Nilsson et al.

Automated warehouses execute millions of stow operations, where robots place objects into storage bins. For these systems it is valuable to anticipate how a bin will look from the current observations and the planned stow behavior before real execution. We propose FOREST, a stow-intent-conditioned world model that represents bin states as item-aligned instance masks and uses a latent diffusion transformer to predict the post-stow configuration from the observed context. Our evaluation shows that FOREST substantially improves the geometric agreement between predicted and true post-stow layouts compared with heuristic baselines. We further evaluate the predicted post-stow layouts in two downstream tasks, in which replacing the real post-stow masks with FOREST predictions causes only modest performance loss in load-quality assessment and multi-stow reasoning, indicating that our model can provide useful foresight signals for warehouse planning.

CVAug 27, 2019

Improving Visual Feature Extraction in Glacial Environments

Steven D. Morad, Jeremy Nash, Shoya Higa et al.

Glacial science could benefit tremendously from autonomous robots, but previous glacial robots have had perception issues in these colorless and featureless environments, specifically with visual feature extraction. This translates to failures in visual odometry and visual navigation. Glaciologists use near-infrared imagery to reveal the underlying heterogeneous spatial structure of snow and ice, and we theorize that this hidden near-infrared structure could produce more and higher quality features than available in visible light. We took a custom camera rig to Igloo Cave at Mt. St. Helens to test our theory. The camera rig contains two identical machine vision cameras, one which was outfitted with multiple filters to see only near-infrared light. We extracted features from short video clips taken inside Igloo Cave at Mt. St. Helens, using three popular feature extractors (FAST, SIFT, and SURF). We quantified the number of features and their quality for visual navigation by comparing the resulting orientation estimates to ground truth. Our main contribution is the use of NIR longpass filters to improve the quantity and quality of visual features in icy terrain, irrespective of the feature extractor used.

Aaron Parness

2 Papers