QUANT-PH May 5, 2022
Quantum Extremal Learning
Savvas Varsamopoulos, Evan Philip, Herman W. T. van Vlijmen et al.
We propose a quantum algorithm for `extremal learning', which is the process of finding the input to a hidden function that extremizes the function output, without having direct access to the hidden function, given only partial input-output (training) data. The algorithm, called quantum extremal learning (QEL), consists of a parametric quantum circuit that is variationally trained to model data input-output relationships; a trainable quantum feature map, which encodes the input data, is then analytically differentiated to find the coordinate that extremizes the model. This enables the combination of established quantum machine learning modelling with established quantum optimization, on a single circuit/quantum computer. We have tested our algorithm on a range of classical datasets based on either discrete or continuous input variables, both of which are compatible with the algorithm. In the case of discrete variables, we test our algorithm on synthetic problems based on Max-Cut problem generators, including problems with higher-order correlations in the input-output relationships. In the case of continuous variables, we test our algorithm on synthetic datasets in 1D and simple ordinary differential functions. We find that the algorithm is able to successfully find the extremal value of such problems, even when the training dataset is sparse or covers only a small fraction of the input configuration space. We additionally show how the algorithm can be used for much more general cases of higher dimensionality, complex differential equations, and with full flexibility in the choice of both modeling and optimization ansatz. We envision that due to its general framework and simple construction, the QEL algorithm will be applicable to a wide variety of problems in different fields, opening up areas of further research.
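The two-stage idea in this abstract (fit a model to sparse input-output data, then differentiate the fitted model with respect to its input to locate the extremum) can be illustrated with a purely classical stand-in. The sketch below is not the QEL circuit: the quadratic model, the `hidden` function, the training points, and the step size are all assumptions chosen for illustration.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y ~ a*x^2 + b*x + c via the normal equations."""
    feats = [[x * x, x, 1.0] for x in xs]
    M = [[sum(f[i] * f[j] for f in feats) for j in range(3)] for i in range(3)]
    v = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        v[col], v[pivot] = v[pivot], v[col]
        for r in range(col + 1, 3):
            factor = M[r][col] / M[col][col]
            for k in range(col, 3):
                M[r][k] -= factor * M[col][k]
            v[r] -= factor * v[col]
    # Back substitution.
    w = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        w[i] = (v[i] - sum(M[i][j] * w[j] for j in range(i + 1, 3))) / M[i][i]
    return w


def hidden(x):
    """Hypothetical hidden function; used only to generate training data."""
    return 1.0 - (x - 0.3) ** 2


# Stage 1: fit the model on sparse data that does not contain the maximizer.
train_x = [-1.0, -0.5, 0.0, 0.8, 1.0]
train_y = [hidden(x) for x in train_x]
a, b, c = fit_quadratic(train_x, train_y)

# Stage 2: gradient ascent on the *input*, using the model's analytic derivative.
x_star = 0.9
for _ in range(200):
    x_star += 0.1 * (2.0 * a * x_star + b)

print(f"estimated maximizer: {x_star:.6f}")
```

In QEL the fitted model is a variational quantum circuit and the input derivative is taken through the feature map; here a polynomial plays both roles so that the structure of the procedure stays visible.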
44.2 CV May 25
RadarSim: Simulating Single-Chip Radar via Multimodal Neural Fields
Chuhan Chen, Tianshu Huang, Akarsh Prabhakara et al.
Radars are an ideal complement to cameras: both are inexpensive, solid-state sensors, with cameras offering fine angular resolution, while radars provide metric depth and robustness under adverse weather. However, radar data is more difficult to interpret than camera images and varies significantly between sensors, necessitating increased reliance on simulation for prototyping sensors and processing pipelines. Recent work treating radar reconstruction as a novel view synthesis problem has shown great promise in reconstructing radar-relevant geometry and simulating low-level radar data. However, such methods are constrained by the low spatial resolution of the underlying radar. To address this, we propose a unified differentiable renderer, RadarSim, which leverages the high angular resolution of RGB cameras to generate Doppler radar range images from a camera-initialized neural field. Using a novel data set of calibrated radar camera recordings from a custom hand-held rig, we demonstrate that RadarSim produces sharper geometry and Doppler range frames than radar-only reconstructions.
65.1 CV May 9
Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models
Sagar Bharadwaj, Ziyong Ma, Anurag Ghosh et al.
3D scene understanding spans reasoning about free space, object grounding, hypothetical object insertions, complex geometric relationships, and integrating all of these with external tools and data sources. Existing 3D understanding methods typically rely on large-scale 3D-language training or focus on object grounding and simple spatial relationships. We argue that the broad generalization that motivates 3D-language training can be achieved at inference time, without 3D-specific training. We propose Flame3D, a training-free framework that represents scenes as editable visual-textual 3D memories and exposes them to an off-the-shelf MLLM through composable spatial tools. Flame3D also lets the agent synthesize custom spatial programs at inference time, enabling open-ended reasoning over layouts, empty space, and objects not yet present in the scene. External data and corrections can be added to the memory without retraining. In addition to showing competitive performance to finetuned 3D-LMM methods on ScanQA, we study multi-hop 3D reasoning capabilities of Flame3D by evaluating it on a curated compositional spatial-reasoning benchmark, Compose3D. We find that fixed tools fall short and that the agent's ability to synthesize spatial operations at inference time is essential. These results invite the question: should future progress in 3D scene understanding focus on richer scene memories and expressive compositional abstractions?
52.3 RO Apr 15
UNRIO: Uncertainty-Aware Velocity Learning for Radar-Inertial Odometry
Jui-Te Huang, Tianshu Huang, Anthony Rowe et al.
We present UNRIO, an uncertainty-aware radar-inertial odometry system that estimates ego-velocity directly from raw mmWave radar IQ signals rather than processed point clouds. Existing radar-inertial odometry methods rely on handcrafted signal processing pipelines that discard latent information in the raw spectrum and require careful parameter tuning. To address this, we propose a transformer-based neural network built on the GRT architecture that processes the full 4-D spectral cube to predict body-frame velocity in two modes: a direct linear velocity estimate and a per-angle-bin Doppler velocity map. The network is trained in three stages: geometric pretraining on LiDAR-projected depth, velocity or Doppler fine-tuning, and uncertainty calibration via negative log-likelihood loss, enabling it to produce uncertainty estimates alongside its predictions. These uncertainty estimates are propagated into a sliding-window pose graph that fuses radar velocity factors with IMU preintegration measurements. We train and evaluate UNRIO on the IQ1M dataset across diverse indoor environments with both forward and lateral motion patterns unseen during training. Our method achieves the lowest relative pose error on the majority of sequences, with particularly strong gains over classical DSP baselines on lateral-motion trajectories where sparse point clouds degrade conventional velocity estimators.
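To make the uncertainty-calibration stage concrete, here is a minimal sketch of a negative log-likelihood loss for a single velocity component. The abstract only says "negative log-likelihood loss"; the Gaussian form, the log-variance parameterization, and the function name `gaussian_nll` are assumptions for illustration.

```python
import math


def gaussian_nll(v_true, v_pred, log_var):
    """NLL of v_true under N(v_pred, exp(log_var)); Gaussian form is an assumption."""
    var = math.exp(log_var)
    return 0.5 * (log_var + (v_true - v_pred) ** 2 / var + math.log(2.0 * math.pi))


# For a fixed prediction error of 0.5 m/s, compare three predicted variances.
err_true, err_pred = 1.0, 0.5
overconfident = gaussian_nll(err_true, err_pred, math.log(0.01))
calibrated = gaussian_nll(err_true, err_pred, math.log(0.25))  # variance = error^2
underconfident = gaussian_nll(err_true, err_pred, math.log(4.0))

print(overconfident, calibrated, underconfident)
```

The loss is smallest when the predicted variance matches the squared error, which is why training with it pushes the network toward uncertainty estimates that a downstream pose graph can use as factor weights.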
CV Mar 6, 2024
DART: Implicit Doppler Tomography for Radar Novel View Synthesis
Tianshu Huang, John Miller, Akarsh Prabhakara et al.
Simulation is an invaluable tool for radio-frequency system designers that enables rapid prototyping of various algorithms for imaging, target detection, classification, and tracking. However, simulating realistic radar scans is a challenging task that requires an accurate model of the scene, radio frequency material properties, and a corresponding radar synthesis function. Rather than specifying these models explicitly, we propose DART - Doppler Aided Radar Tomography, a Neural Radiance Field-inspired method which uses radar-specific physics to create a reflectance and transmittance-based rendering pipeline for range-Doppler images. We then evaluate DART by constructing a custom data collection platform and collecting a novel radar dataset together with accurate position and instantaneous velocity measurements from lidar-based localization. In comparison to state-of-the-art baselines, DART synthesizes superior radar range-Doppler images from novel views across all datasets and additionally can be used to generate high quality tomographic images.
CV Oct 4, 2025
OpenFLAME: Federated Visual Positioning System to Enable Large-Scale Augmented Reality Applications
Sagar Bharadwaj, Harrison Williams, Luke Wang et al.
World-scale augmented reality (AR) applications need a ubiquitous 6DoF localization backend to anchor content to the real world consistently across devices. Large organizations such as Google and Niantic are 3D scanning outdoor public spaces in order to build their own Visual Positioning Systems (VPS). These centralized VPS solutions fail to meet the needs of many future AR applications -- they do not cover private indoor spaces because of privacy concerns, regulations, and the labor bottleneck of updating and maintaining 3D scans. In this paper, we present OpenFLAME, a federated VPS backend that allows independent organizations to 3D scan and maintain a separate VPS service for their own spaces. This enables access control of indoor 3D scans, distributed maintenance of the VPS backend, and encourages larger coverage. Sharding of VPS services introduces several unique challenges -- coherency of localization results across spaces, quality control of VPS services, selection of the right VPS service for a location, and many others. We introduce the concept of federated image-based localization and provide reference solutions for managing and merging data across maps without sharing private data.
CV Sep 15, 2025
Towards Foundational Models for Single-Chip Radar
Tianshu Huang, Akarsh Prabhakara, Chuhan Chen et al.
mmWave radars are compact, inexpensive, and durable sensors that are robust to occlusions and work regardless of environmental conditions, such as weather and darkness. However, this comes at the cost of poor angular resolution, especially for inexpensive single-chip radars, which are typically used in automotive and indoor sensing applications. Although many have proposed learning-based methods to mitigate this weakness, no standardized foundational models or large datasets for the mmWave radar have emerged, and practitioners have largely trained task-specific models from scratch using relatively small datasets. In this paper, we collect (to our knowledge) the largest available raw radar dataset with 1M samples (29 hours) and train a foundational model for 4D single-chip radar, which can predict 3D occupancy and semantic segmentation with quality that is typically only possible with much higher resolution sensors. We demonstrate that our Generalizable Radar Transformer (GRT) generalizes across diverse settings, can be fine-tuned for different tasks, and shows logarithmic data scaling of 20\% per $10\times$ data. We also run extensive ablations on common design decisions, and find that using raw radar data significantly outperforms widely-used lossy representations, equivalent to a $10\times$ increase in training data. Finally, we roughly estimate that $\approx$100M samples (3000 hours) of data are required to fully exploit the potential of GRT.
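The dataset figures in this abstract can be checked with simple arithmetic. The sketch below extrapolates recording time linearly from the stated 1M samples in 29 hours, and, as a loudly labeled assumption, reads "20% per 10x data" as a multiplicative 20% reduction of some error metric per decade of data. The abstract does not define the metric, so the second function is only one possible interpretation.

```python
import math

REF_SAMPLES = 1_000_000  # stated dataset size
REF_HOURS = 29.0         # stated recording time for that dataset


def hours_for_samples(n_samples):
    """Linear extrapolation of recording time from the stated 1M samples / 29 h."""
    return n_samples * REF_HOURS / REF_SAMPLES


def relative_error(n_samples, gain_per_decade=0.20):
    """Error relative to the 1M-sample model, ASSUMING a 20% cut per 10x data."""
    decades = math.log10(n_samples / REF_SAMPLES)
    return (1.0 - gain_per_decade) ** decades


hours_100m = hours_for_samples(100_000_000)
rel_100m = relative_error(100_000_000)
print(f"{hours_100m:.0f} hours, relative error {rel_100m:.2f}")
```

The extrapolated 2900 hours is consistent with the abstract's rough figure of 3000 hours for 100M samples.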
LG Mar 9, 2025
Interference-Aware Edge Runtime Prediction with Conformal Matrix Completion
Tianshu Huang, Arjun Ramesh, Emily Ruppel et al.
Accurately estimating workload runtime is a longstanding goal in computer systems, and plays a key role in efficient resource provisioning, latency minimization, and various other system management tasks. Runtime prediction is particularly important for managing increasingly complex distributed systems in which more sophisticated processing is pushed to the edge in search of better latency. Previous approaches for runtime prediction in edge systems suffer from poor data efficiency or require intensive instrumentation; these challenges are compounded in heterogeneous edge computing environments, where historical runtime data may be sparsely available and instrumentation is often challenging. Moreover, edge computing environments often feature multi-tenancy due to limited resources at the network edge, potentially leading to interference between workloads and further complicating the runtime prediction problem. Drawing from insights across machine learning and computer systems, we design a matrix factorization-inspired method that generates accurate interference-aware predictions with tight provably-guaranteed uncertainty bounds. We validate our method on a novel WebAssembly runtime dataset collected from 24 unique devices, achieving a prediction error of 5.2% -- 2x better than a naive application of existing methods.
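As background for the "provably-guaranteed uncertainty bounds" mentioned here, the following is a minimal sketch of plain split conformal prediction, which turns held-out residuals of any point predictor into an interval with a coverage guarantee. It is not the paper's conformal matrix completion method; the residual values, the 75% coverage level, and the 100 ms point prediction are made up for illustration.

```python
import math


def conformal_quantile(residuals, alpha):
    """Split-conformal radius: the ceil((n+1)(1-alpha))-th smallest residual."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(residuals)[k - 1]


def predict_interval(point_prediction, radius):
    """Symmetric interval around a point prediction."""
    return (point_prediction - radius, point_prediction + radius)


# Hypothetical absolute runtime errors (ms) on a held-out calibration set.
calibration_residuals = [3.0, 1.0, 4.0, 1.5, 5.0, 9.0, 2.0]
radius = conformal_quantile(calibration_residuals, alpha=0.25)
interval = predict_interval(100.0, radius)
print(radius, interval)
```

Under exchangeability of calibration and test points, the resulting interval contains the true runtime with probability at least 1 - alpha, regardless of how good the underlying predictor is.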
CV Jun 15, 2021
A Hybrid mmWave and Camera System for Long-Range Depth Imaging
Akarsh Prabhakara, Diana Zhang, Chao Li et al.
mmWave radars offer excellent depth resolution even at very long ranges owing to their high bandwidth. But their angular resolution is at least an order-of-magnitude worse than that of camera and lidar systems. Hence, mmWave radar is not a capable 3-D imaging solution in isolation. We propose Metamoran, a system that combines the complementary strengths of radar and camera to obtain accurate, high resolution depth images over long ranges even in high clutter environments, all from a single fixed vantage point. Metamoran enables rich long-range depth imaging with applications in security and surveillance, roadside safety infrastructure and wide-area mapping. Our approach leverages the high angular resolution from cameras using computer vision techniques, including image segmentation and monocular depth estimation, to obtain object shape. Our core contribution is a method to convert this object shape into an RF I/Q equivalent, which we use in a novel radar processing pipeline to help declutter the scene and capture extremely weak reflections from objects at long distances. We perform a detailed evaluation of Metamoran's depth imaging capabilities in 400 diverse scenes. Our evaluation shows that Metamoran estimates the depth of static objects up to 90 m and moving objects up to 305 m with a median error of 28 cm, an improvement of 13$\times$ compared to a naive radar+camera baseline and 23$\times$ compared to monocular depth estimation.
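The link between bandwidth and depth resolution follows from the standard radar range-resolution relation, delta_R = c / (2B). The sketch below evaluates it for a 4 GHz sweep; that bandwidth is an assumed example value, not a figure from the abstract.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0


def range_resolution_m(bandwidth_hz):
    """Standard radar range resolution: c / (2 * B)."""
    return SPEED_OF_LIGHT_M_S / (2.0 * bandwidth_hz)


# Assumed example: a 4 GHz chirp bandwidth.
resolution = range_resolution_m(4e9)
print(f"range resolution: {resolution * 100:.2f} cm")
```

Because this resolution depends only on bandwidth and not on distance, depth can stay fine-grained at long range, while angular resolution, which depends on antenna aperture, does not.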
CV Nov 25, 2015
Tracking Motion and Proxemics using Thermal-sensor Array
Chandrayee Basu, Anthony Rowe
Indoor tracking has pervasive applications beyond mere surveillance, for example in education, health monitoring, marketing, and energy management. Image- and video-based tracking systems are intrusive. Thermal array sensors, on the other hand, can provide coarse-grained tracking while preserving the privacy of the subjects. The goal of the project is to facilitate motion detection and group proxemics modeling using an 8 x 8 infrared sensor array. Each of the 8 x 8 pixels is a temperature reading in Fahrenheit. We refer to each 8 x 8 matrix as a scene. We collected approximately 902 scenes with different configurations of human groups and different walking directions. We infer the direction of motion of a subject across a set of scenes as left-to-right, right-to-left, up-to-down or down-to-up using cross-correlation analysis. We used features from connected component analysis of each background-subtracted scene and performed Support Vector Machine classification to estimate the number of human subjects in the scene.
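One way cross-correlation can reveal horizontal direction of motion between two 8 x 8 scenes is sketched below. This is an illustration under assumptions, not the authors' pipeline: the scenes are synthetic and already background-subtracted, each scene is collapsed to a per-column profile, and column index is assumed to increase from left to right.

```python
def make_scene(hot_row, hot_col, size=8, heat=5.0):
    """Synthetic background-subtracted scene with one warm pixel (hypothetical)."""
    scene = [[0.0] * size for _ in range(size)]
    scene[hot_row][hot_col] = heat
    return scene


def column_profile(scene):
    """Sum each column so that horizontal position becomes a 1-D signal."""
    return [sum(row[c] for row in scene) for c in range(len(scene[0]))]


def best_lag(p, q):
    """Lag that maximizes the cross-correlation sum of p[i] * q[i + lag]."""
    n = len(p)

    def corr(lag):
        return sum(p[i] * q[i + lag] for i in range(n) if 0 <= i + lag < n)

    return max(range(-(n - 1), n), key=corr)


earlier = make_scene(hot_row=4, hot_col=2)
later = make_scene(hot_row=4, hot_col=5)
lag = best_lag(column_profile(earlier), column_profile(later))
direction = "left-to-right" if lag > 0 else "right-to-left" if lag < 0 else "static"
print(lag, direction)
```

Applying the same procedure to row profiles would distinguish up-to-down from down-to-up motion.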