Josef Spjut

h-index25

11papers

102citations

Novelty39%

AI Score38

Ranked #108,636 of 201,326 authors (top 54%)#35,241 in CV (top 60%)

11 Papers

CVApr 16, 2021Code

Noise-Aware Video Saliency Prediction

Ekta Prashnani, Orazio Gallo, Joohwan Kim et al.

We tackle the problem of predicting saliency maps for videos of dynamic scenes. We note that the accuracy of the maps reconstructed from the gaze data of a fixed number of observers varies with the frame, as it depends on the content of the scene. This issue is particularly pressing when a limited number of observers are available. In such cases, directly minimizing the discrepancy between the predicted and measured saliency maps, as traditional deep-learning methods do, results in overfitting to the noisy data. We propose a noise-aware training (NAT) paradigm that quantifies and accounts for the uncertainty arising from frame-specific gaze data inaccuracy. We show that NAT is especially advantageous when limited training data is available, with experiments across different models, loss functions, and datasets. We also introduce a video game-based saliency dataset, with rich temporal semantics, and multiple gaze attractors per frame. The dataset and source code are available at https://github.com/NVlabs/NAT-saliency.

ROOct 20, 2024

GRS: Generating Robotic Simulation Tasks from Real-World Images

Alex Zook, Fan-Yun Sun, Josef Spjut et al.

We introduce GRS (Generating Robotic Simulation tasks), a system addressing real-to-sim for robotic simulations. GRS creates digital twin simulations from single RGB-D observations with solvable tasks for virtual agent training. Using vision-language models (VLMs), our pipeline operates in three stages: 1) scene comprehension with SAM2 for segmentation and object description, 2) matching objects with simulation-ready assets, and 3) generating appropriate tasks. We ensure simulation-task alignment through generated test suites and introduce a router that iteratively refines both simulation and test code. Experiments demonstrate our system's effectiveness in object correspondence and task environment generation through our novel router mechanism.

CVMay 1, 2024

Coherent 3D Portrait Video Reconstruction via Triplane Fusion

Shengze Wang, Xueting Li, Chao Liu et al.

Recent breakthroughs in single-image 3D portrait reconstruction have enabled telepresence systems to stream 3D portrait videos from a single camera in real-time, potentially democratizing telepresence. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user's appearance. On the other hand, self-reenactment methods can render coherent 3D portraits by driving a personalized 3D prior, but fail to faithfully reconstruct the user's per-frame appearance (e.g., facial expressions and lighting). In this work, we recognize the need to maintain both coherent identity and dynamic per-frame appearance to enable the best possible realism. To this end, we propose a new fusion-based method that fuses a personalized 3D subject prior with per-frame information, producing temporally stable 3D videos with faithful reconstruction of the user's per-frame appearances. Trained only using synthetic data produced by an expression-conditioned 3D GAN, our encoder-based method achieves both state-of-the-art 3D reconstruction accuracy and temporal consistency on in-studio and in-the-wild datasets.

CVSep 29, 2025

Editing Physiological Signals in Videos Using Latent Representations

Tianwen Zhou, Akshay Paruchuri, Josef Spjut et al. · stanford

Camera-based physiological signal estimation provides a non-contact and convenient means to monitor Heart Rate (HR). However, the presence of vital signals in facial videos raises significant privacy concerns, as they can reveal sensitive personal information related to the health and emotional states of an individual. To address this, we propose a learned framework that edits physiological signals in videos while preserving visual fidelity. First, we encode an input video into a latent space via a pretrained 3D Variational Autoencoder (3D VAE), while a target HR prompt is embedded through a frozen text encoder. We fuse them using a set of trainable spatio-temporal layers with Adaptive Layer Normalizations (AdaLN) to capture the strong temporal coherence of remote Photoplethysmography (rPPG) signals. We apply Feature-wise Linear Modulation (FiLM) in the decoder with a fine-tuned output layer to avoid the degradation of physiological signals during reconstruction, enabling accurate physiological modulation in the reconstructed video. Empirical results show that our method preserves visual quality with an average PSNR of 38.96 dB and SSIM of 0.98 on selected datasets, while achieving an average HR modulation error of 10.00 bpm MAE and 10.09% MAPE using a state-of-the-art rPPG estimator. Our design's controllable HR editing is useful for applications such as anonymizing biometric signals in real videos or synthesizing realistic videos with desired vital signs.

AIJul 16, 2025

Fly, Fail, Fix: Iterative Game Repair with Reinforcement Learning and Large Multimodal Models

Alex Zook, Josef Spjut, Jonathan Tremblay

Game design hinges on understanding how static rules and content translate into dynamic player behavior - something modern generative systems that inspect only a game's code or assets struggle to capture. We present an automated design iteration framework that closes this gap by pairing a reinforcement learning (RL) agent, which playtests the game, with a large multimodal model (LMM), which revises the game based on what the agent does. In each loop the RL player completes several episodes, producing (i) numerical play metrics and/or (ii) a compact image strip summarising recent video frames. The LMM designer receives a gameplay goal and the current game configuration, analyses the play traces, and edits the configuration to steer future behaviour toward the goal. We demonstrate results that LMMs can reason over behavioral traces supplied by RL agents to iteratively refine game mechanics, pointing toward practical, scalable tools for AI-assisted game design.

HCFeb 10, 2022

Experimental Augmented Reality User Experience

Josef Spjut, Fengyuan Zhu, Xiaolei Huang et al.

Augmented Reality (AR) is an emerging field ripe for experimentation, especially when it comes to developing the kinds of applications and experiences that will drive mass adoption of the technology. While we aren't aware of any current consumer product that realize a wearable, wide Field of View (FoV), AR Head Mounted Display (HMD), such devices will certainly come. In order for these sophisticated, likely high-cost hardware products to succeed, it is important they provide a high quality user experience. To that end, we prototyped 4 experimental applications for wide FoV displays that will likely exist in the future. Given current AR HMD limitations, we used a AR simulator built on web technology and VR headsets to demonstrate these applications, allowing users and designers to peer into the future.

HCFeb 10, 2022

FirstPersonScience: Quantifying Psychophysics for First Person Shooter Tasks

Josef Spjut, Ben Boudaoud, Kamran Binaee et al.

In the emerging field of esports research, there is an increasing demand for quantitative results that can be used by players, coaches and analysts to make decisions and present meaningful commentary for spectators. We present FirstPersonScience, a software application intended to fill this need in the esports community by allowing scientists to design carefully controlled experiments and capture accurate results in the First Person Shooter esports genre. An experiment designer can control a variety of parameters including target motion, weapon configuration, 3D scene, frame rate, and latency. Furthermore, we validate this application through careful end-to-end latency analysis and provide a case study showing how it can be used to demonstrate the training effect of one user given repeated task performance.

HCMay 5, 2021

A Case Study of First Person Aiming at Low Latency for Esports

Josef Spjut, Ben Boudaoud, Joohwan Kim

Lower computer system input-to-output latency substantially reduces many task completion times. In fact, literature shows that reduction in targeting task completion time from decreased latency often exceeds the decrease in latency alone. However, for aiming in first person shooter (FPS) games, some prior work has demonstrated diminishing returns below 40 ms of local input-to-output computer system latency. In this paper, we review this prior art and provide an additional case study with data demonstrating the importance of local system latency improvement, even at latency values below 20 ms. Though other factors may determine victory in a particular esports challenge, ensuring balanced local computer latency among competitors is essential to fair competition.

CVAug 12, 2020

SIDOD: A Synthetic Image Dataset for 3D Object Pose Recognition with Distractors

Mona Jalal, Josef Spjut, Ben Boudaoud et al.

We present a new, publicly-available image dataset generated by the NVIDIA Deep Learning Data Synthesizer intended for use in object detection, pose estimation, and tracking applications. This dataset contains 144k stereo image pairs that synthetically combine 18 camera viewpoints of three photorealistic virtual environments with up to 10 objects (chosen randomly from the 21 object models of the YCB dataset [1]) and flying distractors. Object and camera pose, scene lighting, and quantity of objects and distractors were randomized. Each provided view includes RGB, depth, segmentation, and surface normal images, all pixel level. We describe our approach for domain randomization and provide insight into the decisions that produced the dataset.

GRMay 3, 2019

Toward Standardized Classification of Foveated Displays

Josef Spjut, Ben Boudaoud, Jonghyun Kim et al.

Emergent in the field of head mounted display design is a desire to leverage the limitations of the human visual system to reduce the computation, communication, and display workload in power and form-factor constrained systems. Fundamental to this reduced workload is the ability to match display resolution to the acuity of the human visual system, along with a resulting need to follow the gaze of the eye as it moves, a process referred to as foveation. A display that moves its content along with the eye may be called a Foveated Display, though this term is also commonly used to describe displays with non-uniform resolution that attempt to mimic human visual acuity. We therefore recommend a definition for the term Foveated Display that accepts both of these interpretations. Furthermore, we include a simplified model for human visual Acuity Distribution Functions (ADFs) at various levels of visual acuity, across wide fields of view and propose comparison of this ADF with the Resolution Distribution Function of a foveated display for evaluation of its resolution at a particular gaze direction. We also provide a taxonomy to allow the field to meaningfully compare and contrast various aspects of foveated displays in a display and optical technology-agnostic manner.

HCJun 8, 2017

A Deformable Interface for Human Touch Recognition using Stretchable Carbon Nanotube Dielectric Elastomer Sensors and Deep Neural Networks

Chris Larson, Josef Spjut, Ross Knepper et al.

User interfaces provide an interactive window between physical and virtual environments. A new concept in the field of human-computer interaction is a soft user interface; a compliant surface that facilitates touch interaction through deformation. Despite the potential of these interfaces, they currently lack a signal processing framework that can efficiently extract information from their deformation. Here we present OrbTouch, a device that uses statistical learning algorithms, based on convolutional neural networks, to map deformations from human touch to categorical labels (i.e., gestures) and touch location using stretchable capacitor signals as inputs. We demonstrate this approach by using the device to control the popular game Tetris. OrbTouch provides a modular, robust framework to interpret deformation in soft media, laying a foundation for new modes of human computer interaction through shape changing solids.