Ming C. Lin

CV
h-index60
33papers
652citations
Novelty59%
AI Score59

33 Papers

LGMay 3, 2022
Differentiable Simulation of Soft Multi-body Systems

Yi-Ling Qiao, Junbang Liang, Vladlen Koltun et al.

We present a method for differentiable simulation of soft articulated bodies. Our work enables the integration of differentiable physical dynamics into gradient-based pipelines. We develop a top-down matrix assembly algorithm within Projective Dynamics and derive a generalized dry friction model for soft continuum using a new matrix splitting strategy. We derive a differentiable control framework for soft articulated bodies driven by muscles, joint torques, or pneumatic tubes. The experiments demonstrate that our designs make soft body simulation more stable and realistic compared to other frameworks. Our method accelerates the solution of system identification problems by more than an order of magnitude, and enables efficient gradient-based learning of motion control with soft robots.

CVOct 11, 2022
Human Body Measurement Estimation with Adversarial Augmentation

Nataniel Ruiz, Miriam Bellver, Timo Bolkart et al. · amazon-science

We present a Body Measurement network (BMnet) for estimating 3D anthropomorphic measurements of the human body shape from silhouette images. Training of BMnet is performed on data from real human subjects, and augmented with a novel adversarial body simulator (ABS) that finds and synthesizes challenging body shapes. ABS is based on the skinned multiperson linear (SMPL) body model, and aims to maximize BMnet measurement prediction error with respect to latent SMPL shape parameters. ABS is fully differentiable with respect to these parameters, and trained end-to-end via backpropagation with BMnet in the loop. Experiments show that ABS effectively discovers adversarial examples, such as bodies with extreme body mass indices (BMI), consistent with the rarity of extreme-BMI bodies in BMnet's training set. Thus ABS is able to reveal gaps in training data and potential failures in predicting under-represented body shapes. Results show that training BMnet with ABS improves measurement prediction accuracy on real bodies by up to 10%, when compared to no augmentation or random body shape sampling. Furthermore, our method significantly outperforms SOTA measurement estimation methods by as much as 3x. Finally, we release BodyM, the first challenging, large-scale dataset of photo silhouettes and body measurements of real human subjects, to further promote research in this area. Project website: https://adversarialbodysim.github.io

CVOct 22, 2022
NeuPhysics: Editable Neural Geometry and Physics from Monocular Videos

Yi-Ling Qiao, Alexander Gao, Ming C. Lin

We present a method for learning 3D geometry and physics parameters of a dynamic scene from only a monocular RGB video input. To decouple the learning of underlying scene geometry from dynamic motion, we represent the scene as a time-invariant signed distance function (SDF) which serves as a reference frame, along with a time-conditioned deformation field. We further bridge this neural geometry representation with a differentiable physics simulator by designing a two-way conversion between the neural field and its corresponding hexahedral mesh, enabling us to estimate physics parameters from the source video by minimizing a cycle consistency loss. Our method also allows a user to interactively edit 3D objects from the source video by modifying the recovered hexahedral mesh, and propagating the operation back to the neural field representation. Experiments show that our method achieves superior mesh and video reconstruction of dynamic scenes compared to competing Neural Field approaches, and we provide extensive examples which demonstrate its ability to extract useful 3D representations from videos captured with consumer-grade cameras.

GROct 14, 2022
Differentiable Hybrid Traffic Simulation

Sanghyun Son, Yi-Ling Qiao, Jason Sewall et al.

We introduce a novel differentiable hybrid traffic simulator, which simulates traffic using a hybrid model of both macroscopic and microscopic models and can be directly integrated into a neural network for traffic control and flow optimization. This is the first differentiable traffic simulator for macroscopic and hybrid models that can compute gradients for traffic states across time steps and inhomogeneous lanes. To compute the gradient flow between two types of traffic models in a hybrid framework, we present a novel intermediate conversion component that bridges the lanes in a differentiable manner as well. We also show that we can use analytical gradients to accelerate the overall process and enhance scalability. Thanks to these gradients, our simulator can provide more efficient and scalable solutions for complex learning and control problems posed in traffic engineering than other existing algorithms. Refer to https://sites.google.com/umd.edu/diff-hybrid-traffic-sim for our project.

GRSep 8, 2023
Dynamic Mesh-Aware Radiance Fields

Yi-Ling Qiao, Alexander Gao, Yiran Xu et al.

Embedding polygonal mesh assets within photorealistic Neural Radience Fields (NeRF) volumes, such that they can be rendered and their dynamics simulated in a physically consistent manner with the NeRF, is under-explored from the system perspective of integrating NeRF into the traditional graphics pipeline. This paper designs a two-way coupling between mesh and NeRF during rendering and simulation. We first review the light transport equations for both mesh and NeRF, then distill them into an efficient algorithm for updating radiance and throughput along a cast ray with an arbitrary number of bounces. To resolve the discrepancy between the linear color space that the path tracer assumes and the sRGB color space that standard NeRF uses, we train NeRF with High Dynamic Range (HDR) images. We also present a strategy to estimate light sources and cast shadows on the NeRF. Finally, we consider how the hybrid surface-volumetric formulation can be efficiently integrated with a high-performance physics simulator that supports cloth, rigid and soft bodies. The full rendering and simulation system can be run on a GPU at interactive rates. We show that a hybrid system approach outperforms alternatives in visual realism for mesh insertion, because it allows realistic light transport from volumetric NeRF media onto surfaces, which affects the appearance of reflective/refractive surfaces and illumination of diffuse surfaces informed by the dynamic scene.

CVDec 7, 2025
MeshSplatting: Differentiable Rendering with Opaque Meshes

Jan Held, Sanghyun Son, Renaud Vandeghen et al.

Primitive-based splatting methods like 3D Gaussian Splatting have revolutionized novel view synthesis with real-time rendering. However, their point-based representations remain incompatible with mesh-based pipelines that power AR/VR and game engines. We present MeshSplatting, a mesh-based reconstruction approach that jointly optimizes geometry and appearance through differentiable rendering. By enforcing connectivity via restricted Delaunay triangulation and refining surface consistency, MeshSplatting creates end-to-end smooth, visually high-quality meshes that render efficiently in real-time 3D engines. On Mip-NeRF360, it boosts PSNR by +0.69 dB over the current state-of-the-art MiLo for mesh-based novel view synthesis, while training 2x faster and using 2x less memory, bridging neural rendering and interactive 3D graphics for seamless real-time scene interaction. The project page is available at https://meshsplatting.github.io/.

LGDec 14, 2023Code
Gradient Informed Proximal Policy Optimization

Sanghyun Son, Laura Yu Zheng, Ryan Sullivan et al.

We introduce a novel policy learning method that integrates analytical gradients from differentiable environments with the Proximal Policy Optimization (PPO) algorithm. To incorporate analytical gradients into the PPO framework, we introduce the concept of an α-policy that stands as a locally superior policy. By adaptively modifying the α value, we can effectively manage the influence of analytical policy gradients during learning. To this end, we suggest metrics for assessing the variance and bias of analytical gradients, reducing dependence on these gradients when high variance or bias is detected. Our proposed approach outperforms baseline algorithms in various scenarios, such as function optimization, physics simulations, and traffic control environments. Our code can be found online: https://github.com/SonSang/gippo.

CVAug 12, 2024
3D-free meets 3D priors: Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance

Taewon Kang, Divya Kothandaraman, Dinesh Manocha et al.

Recent 3D novel view synthesis (NVS) methods often require extensive 3D data for training, and also typically lack generalization beyond the training distribution. Moreover, they tend to be object centric and struggle with complex and intricate scenes. Conversely, 3D-free methods can generate text-controlled views of complex, in-the-wild scenes using a pretrained stable diffusion model without the need for a large amount of 3D-based training data, but lack camera control. In this paper, we introduce a method capable of generating camera-controlled viewpoints from a single input image, by combining the benefits of 3D-free and 3D-based approaches. Our method excels in handling complex and diverse scenes without extensive training or additional 3D and multiview data. It leverages widely available pretrained NVS models for weak guidance, integrating this knowledge into a 3D-free view synthesis style approach, along with enriching the CLIP vision-language space with 3D camera angle information, to achieve the desired results. Experimental results demonstrate that our method outperforms existing models in both qualitative and quantitative evaluations, achieving high-fidelity, consistent novel view synthesis at desired camera angles across a wide variety of scenes while maintaining accurate, natural detail representation and image clarity across various viewpoints. We also support our method with a comprehensive analysis of 2D image generation models and the 3D space, providing a solid foundation and rationale for our solution.

ROMar 14
GM3: A General Physical Model for Micro-Mobility Vehicles

Grace Cai, Nithin Parepally, Laura Zheng et al.

Modeling the dynamics of micro-mobility vehicles (MMV) is becoming increasingly important for training autonomous vehicle systems and building urban traffic simulations. However, mainstream tools rely on variants of the Kinematic Bicycle Model (KBM) or mode-specific physics that miss tire slip, load transfer, and rider/vehicle lean. To our knowledge, no unified, physics-based model captures these dynamics across the full range of common MMVs and wheel layouts. We propose the "Generalized Micro-mobility Model" (GM3), a tire-level formulation based on the tire brush representation that supports arbitrary wheel configurations, including single/double track and multi-wheel platforms. We introduce an interactive model-agnostic simulation framework that decouples vehicle/layout specification from dynamics to compare the GM3 with the KBM and other models, consisting of fixed step RK4 integration, human-in-the-loop and scripted control, real-time trajectory traces and logging for analysis. We also empirically validate the GM3 on the Stanford Drone Dataset's deathCircle (roundabout) scene for biker, skater, and cart classes.

CVJun 3, 2024Code
Adaptive Sensitivity Analysis for Robust Augmentation against Natural Corruptions in Image Segmentation

Laura Zheng, Wenjie Wei, Tony Wu et al.

Achieving robustness in image segmentation models is challenging due to the fine-grained nature of pixel-level classification. These models, which are crucial for many real-time perception applications, particularly struggle when faced with natural corruptions in the wild for autonomous systems. While sensitivity analysis can help us understand how input variables influence model outputs, its application to natural and uncontrollable corruptions in training data is computationally expensive. In this work, we present an adaptive, sensitivity-guided augmentation method to enhance robustness against natural corruptions. Our sensitivity analysis on average runs 10x faster and requires about 200x less storage than previous sensitivity analysis, enabling practical, on-the-fly estimation during training for a model-free augmentation policy. With minimal fine-tuning, our sensitivity-guided augmentation method achieves improved robustness on both real-world and synthetic datasets compared to state-of-the-art data augmentation techniques in image segmentation. Code implementation for this work can be found at: https://github.com/laurayuzheng/SensAug.

RODec 21, 2024Code
Gradient-based Trajectory Optimization with Parallelized Differentiable Traffic Simulation

Sanghyun Son, Laura Zheng, Brian Clipp et al.

We present a parallelized differentiable traffic simulator based on the Intelligent Driver Model (IDM), a car-following framework that incorporates driver behavior as key variables. Our vehicle simulator efficiently models vehicle motion, generating trajectories that can be supervised to fit real-world data. By leveraging its differentiable nature, IDM parameters are optimized using gradient-based methods. With the capability to simulate up to 2 million vehicles in real time, the system is scalable for large-scale trajectory optimization. We show that we can use the simulator to filter noise in the input trajectories (trajectory filtering), reconstruct dense trajectories from sparse ones (trajectory reconstruction), and predict future trajectories (trajectory prediction), with all generated trajectories adhering to physical laws. We validate our simulator and algorithm on several datasets including NGSIM and Waymo Open Dataset. The code is publicly available at: https://github.com/SonSang/diffidm.

CVMar 30
SonoWorld: From One Image to a 3D Audio-Visual Scene

Derong Jin, Xiyi Chen, Ming C. Lin et al.

Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation. Project website: https://humathe.github.io/sonoworld/

CVFeb 22
OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness

Phuc D. A. Nguyen, Anh N. Nhu, Ming C. Lin

We introduce OpenVO, a novel framework for Open-world Visual Odometry (VO) with temporal awareness under limited input conditions. OpenVO effectively estimates real-world-scale ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras, enabling robust trajectory dataset construction from rare driving events recorded in dashcam. Existing VO methods are trained on fixed observation frequency (e.g., 10Hz or 12Hz), completely overlooking temporal dynamics information. Many prior methods also require calibrated cameras with known intrinsic parameters. Consequently, their performance degrades when (1) deployed under unseen observation frequencies or (2) applied to uncalibrated cameras. These significantly limit their generalizability to many downstream tasks, such as extracting trajectories from dashcam footage. To address these challenges, OpenVO (1) explicitly encodes temporal dynamics information within a two-frame pose regression framework and (2) leverages 3D geometric priors derived from foundation models. We validate our method on three major autonomous-driving benchmarks - KITTI, nuScenes, and Argoverse 2 - achieving more than 20 performance improvement over state-of-the-art approaches. Under varying observation rate settings, our method is significantly more robust, achieving 46%-92% lower errors across all metrics. These results demonstrate the versatility of OpenVO for real-world 3D reconstruction and diverse downstream applications.

CVDec 19, 2025
Text-Conditioned Background Generation for Editable Multi-Layer Documents

Taewon Kang, Joseph K J, Chris Tensmeyer et al.

We present a framework for document-centric background generation with multi-page editing and thematic continuity. To ensure text regions remain readable, we employ a \emph{latent masking} formulation that softly attenuates updates in the diffusion space, inspired by smooth barrier functions in physics and numerical optimization. In addition, we introduce \emph{Automated Readability Optimization (ARO)}, which automatically places semi-transparent, rounded backing shapes behind text regions. ARO determines the minimal opacity needed to satisfy perceptual contrast standards (WCAG 2.2) relative to the underlying background, ensuring readability while maintaining aesthetic harmony without human intervention. Multi-page consistency is maintained through a summarization-and-instruction process, where each page is distilled into a compact representation that recursively guides subsequent generations. This design reflects how humans build continuity by retaining prior context, ensuring that visual motifs evolve coherently across an entire document. Our method further treats a document as a structured composition in which text, figures, and backgrounds are preserved or regenerated as separate layers, allowing targeted background editing without compromising readability. Finally, user-provided prompts allow stylistic adjustments in color and texture, balancing automated consistency with flexible customization. Our training-free framework produces visually coherent, text-preserving, and thematically aligned documents, bridging generative modeling with natural design workflows.

CVMar 8, 2025
Text2Story: Advancing Video Storytelling with Text Guidance

Taewon Kang, Divya Kothandaraman, Ming C. Lin

Generating coherent long-form video sequences from discrete input using only text prompts is a critical task in content creation. While diffusion-based models excel at short video synthesis, long-form storytelling from text remains largely unexplored and a challenge due to difficulties in temporal coherency, preserving semantic meaning, and maintaining both scene context and action continuity across the video. We introduce a novel storytelling framework that achieves this by integrating scene and action prompts through dynamics-inspired prompt mixing. Specifically, we first present a bidirectional time-weighted latent blending strategy to ensure temporal consistency between segments of the long-form video being generated. We then propose a dynamics-informed prompt weighting (DIPW) mechanism that adaptively balances the influence of scene and action prompts at each diffusion timestep by jointly considering CLIP-based alignment, narrative continuity, and temporal smoothness. To further enhance motion continuity, we incorporate a semantic action representation to encode high-level action semantics into the blending process, dynamically adjusting transitions based on action similarity and ensuring smooth yet adaptable motion changes. Latent space blending maintains spatial coherence between objects in a scene, while time-weighted blending enforces bidirectional constraints for temporal consistency. The resulting integrative system prevents abrupt transitions while ensuring fluid storytelling that faithfully reflects both scene and action cues. Extensive experiments demonstrate significant improvements over baselines, achieving temporally consistent and visually compelling video narratives without any additional training. This approach bridges the gap between short clips and extended video to establish a new paradigm in GenAI-driven video synthesis from text.

CVApr 20, 2024
DMesh: A Differentiable Mesh Representation

Sanghyun Son, Matheus Gadelha, Yang Zhou et al.

We present a differentiable representation, DMesh, for general 3D triangular meshes. DMesh considers both the geometry and connectivity information of a mesh. In our design, we first get a set of convex tetrahedra that compactly tessellates the domain based on Weighted Delaunay Triangulation (WDT), and select triangular faces on the tetrahedra to define the final mesh. We formulate probability of faces to exist on the actual surface in a differentiable manner based on the WDT. This enables DMesh to represent meshes of various topology in a differentiable way, and allows us to reconstruct the mesh under various observations, such as point cloud and multi-view images using gradient-based optimization. The source code and full paper is available at: https://sonsang.github.io/dmesh-project.

CVDec 21, 2024
DMesh++: An Efficient Differentiable Mesh for Complex Shapes

Sanghyun Son, Matheus Gadelha, Yang Zhou et al.

Recent probabilistic methods for 3D triangular meshes capture diverse shapes by differentiable mesh connectivity, but face high computational costs with increased shape details. We introduce a new differentiable mesh processing method that addresses this challenge and efficiently handles meshes with intricate structures. Our method reduces time complexity from O(N) to O(log N) and requires significantly less memory than previous approaches. Building on this innovation, we present a reconstruction algorithm capable of generating complex 2D and 3D shapes from point clouds or multi-view images. Visit our project page (https://sonsang.github.io/dmesh2-project) for source code and supplementary material.

QUANT-PHNov 3, 2024
Differentiable Quantum Computing for Large-scale Linear Control

Connor Clayton, Jiaqi Leng, Gengzhi Yang et al.

As industrial models and designs grow increasingly complex, the demand for optimal control of large-scale dynamical systems has significantly increased. However, traditional methods for optimal control incur significant overhead as problem dimensions grow. In this paper, we introduce an end-to-end quantum algorithm for linear-quadratic control with provable speedups. Our algorithm, based on a policy gradient method, incorporates a novel quantum subroutine for solving the matrix Lyapunov equation. Specifically, we build a quantum-assisted differentiable simulator for efficient gradient estimation that is more accurate and robust than classical methods relying on stochastic approximation. Compared to the classical approaches, our method achieves a super-quadratic speedup. To the best of our knowledge, this is the first end-to-end quantum application to linear control problems with provable quantum advantage.

CVMar 6
NEGATE: Constrained Semantic Guidance for Linguistic Negation in Text-to-Video Diffusion

Taewon Kang, Ming C. Lin

Negation is a fundamental linguistic operator, yet it remains inadequately modeled in diffusion-based generative systems. In this work, we present a formal treatment of linguistic negation in diffusion-based generative models by modeling it as a structured feasibility constraint on semantic guidance within diffusion dynamics. Rather than introducing heuristics or retraining model parameters, we reinterpret classifier-free guidance as defining a semantic update direction and enforce negation by projecting the update onto a convex constraint set derived from linguistic structure. This novel formulation provides a unified framework for handling diverse negation phenomena, including object absence, graded non-inversion semantics, multi-negation composition, and scope-sensitive disambiguation. Our approach is training-free, compatible with pretrained diffusion backbones, and naturally extends from image generation to temporally evolving video trajectories. In addition, we introduce a structured negation-centric benchmark suite that isolates distinct linguistic failure modes in generative systems, to further research in this area. Experiments demonstrate that our method achieves robust negation compliance while preserving visual fidelity and structural coherence, establishing the first unified formulation of linguistic negation in diffusion-based generative models beyond representation-level evaluation.

ROJan 28, 2025
DISC: Dataset for Analyzing Driving Styles In Simulated Crashes for Mixed Autonomy

Sandip Sharan Senthil Kumar, Sandeep Thalapanane, Guru Nandhan Appiya Dilipkumar Peethambari et al.

Handling pre-crash scenarios is still a major challenge for self-driving cars due to limited practical data and human-driving behavior datasets. We introduce DISC (Driving Styles In Simulated Crashes), one of the first datasets designed to capture various driving styles and behaviors in pre-crash scenarios for mixed autonomy analysis. DISC includes over 8 classes of driving styles/behaviors from hundreds of drivers navigating a simulated vehicle through a virtual city, encountering rare-event traffic scenarios. This dataset enables the classification of pre-crash human driving behaviors in unsafe conditions, supporting individualized trajectory prediction based on observed driving patterns. By utilizing a custom-designed VR-based in-house driving simulator, TRAVERSE, data was collected through a driver-centric study involving human drivers encountering twelve simulated accident scenarios. This dataset fills a critical gap in human-centric driving data for rare events involving interactions with autonomous vehicles. It enables autonomous systems to better react to human drivers and optimize trajectory prediction in mixed autonomy environments involving both human-driven and self-driving cars. In addition, individual driving behaviors are classified through a set of standardized questionnaires, carefully designed to identify and categorize driving behavior traits. We correlate data features with driving behaviors, showing that the simulated environment reflects real-world driving styles. DISC is the first dataset to capture how various driving styles respond to accident scenarios, offering significant potential to enhance autonomous vehicle safety and driving behavior analysis in mixed autonomy environments.

CVSep 29, 2025
Triangle Splatting+: Differentiable Rendering with Opaque Triangles

Jan Held, Renaud Vandeghen, Sanghyun Son et al.

Reconstructing 3D scenes and synthesizing novel views has seen rapid progress in recent years. Neural Radiance Fields demonstrated that continuous volumetric radiance fields can achieve high-quality image synthesis, but their long training and rendering times limit practicality. 3D Gaussian Splatting (3DGS) addressed these issues by representing scenes with millions of Gaussians, enabling real-time rendering and fast optimization. However, Gaussian primitives are not natively compatible with the mesh-based pipelines used in VR headsets, and real-time graphics applications. Existing solutions attempt to convert Gaussians into meshes through post-processing or two-stage pipelines, which increases complexity and degrades visual quality. In this work, we introduce Triangle Splatting+, which directly optimizes triangles, the fundamental primitive of computer graphics, within a differentiable splatting framework. We formulate triangle parametrization to enable connectivity through shared vertices, and we design a training strategy that enforces opaque triangles. The final output is immediately usable in standard graphics engines without post-processing. Experiments on the Mip-NeRF360 and Tanks & Temples datasets show that Triangle Splatting+achieves state-of-the-art performance in mesh-based novel view synthesis. Our method surpasses prior splatting approaches in visual fidelity while remaining efficient and fast to training. Moreover, the resulting semi-connected meshes support downstream applications such as physics-based simulation or interactive walkthroughs. The project page is https://trianglesplatting2.github.io/trianglesplatting2/.

LGJun 1, 2025
Accelerated Learning with Linear Temporal Logic using Differentiable Simulation

Alper Kamil Bozkurt, Calin Belta, Ming C. Lin

To ensure learned controllers comply with safety and reliability requirements for reinforcement learning in real-world settings remains challenging. Traditional safety assurance approaches, such as state avoidance and constrained Markov decision processes, often inadequately capture trajectory requirements or may result in overly conservative behaviors. To address these limitations, recent studies advocate the use of formal specification languages such as linear temporal logic (LTL), enabling the derivation of correct-by-construction learning objectives from the specified requirements. However, the sparse rewards associated with LTL specifications make learning extremely difficult, whereas dense heuristic-based rewards risk compromising correctness. In this work, we propose the first method, to our knowledge, that integrates LTL with differentiable simulators, facilitating efficient gradient-based learning directly from LTL specifications by coupling with differentiable paradigms. Our approach introduces soft labeling to achieve differentiable rewards and states, effectively mitigating the sparse-reward issue intrinsic to LTL without compromising objective correctness. We validate the efficacy of our method through experiments, demonstrating significant improvements in both reward attainment and training time compared to the discrete methods.

CVMay 22, 2025
Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts

Taewon Kang, Ming C. Lin

Recent advances in scene-based video generation have enabled systems to synthesize coherent visual narratives from structured prompts. However, a crucial dimension of storytelling -- character-driven dialogue and speech -- remains underexplored. In this paper, we present a modular pipeline that transforms action-level prompts into visually and auditorily grounded narrative dialogue, enriching visual storytelling with natural voice and character expression. Our method takes as input a pair of prompts per scene, where the first defines the setting and the second specifies a character's behavior. While a story generation model such as Text2Story produces the corresponding visual scene, we focus on generating expressive, character-consistent utterances grounded in both the prompts and the scene image. A pretrained vision-language encoder extracts high-level semantic features from a representative frame, capturing salient visual context. These features are then integrated with structured prompts to guide a large language model in synthesizing natural dialogue. To ensure contextual and emotional consistency across scenes, we introduce a Recursive Narrative Bank -- a speaker-aware, temporally structured memory that recursively accumulates each character's dialogue history. Inspired by Script Theory in cognitive psychology, this design enables characters to speak in ways that reflect their evolving goals, social context, and narrative roles throughout the story. Finally, we render each utterance as expressive, character-conditioned speech, resulting in fully-voiced, multimodal video narratives. Our training-free framework generalizes across diverse story settings -- from fantasy adventures to slice-of-life episodes -- offering a scalable solution for coherent, character-grounded audiovisual storytelling.

LGJun 14, 2024
An Intrinsic Vector Heat Network

Alexander Gao, Maurice Chu, Mubbasir Kapadia et al.

Vector fields are widely used to represent and model flows for many science and engineering applications. This paper introduces a novel neural network architecture for learning tangent vector fields that are intrinsically defined on manifold surfaces embedded in 3D. Previous approaches to learning vector fields on surfaces treat vectors as multi-dimensional scalar fields, using traditional scalar-valued architectures to process channels individually, thus fail to preserve fundamental intrinsic properties of the vector field. The core idea of this work is to introduce a trainable vector heat diffusion module to spatially propagate vector-valued feature data across the surface, which we incorporate into our proposed architecture that consists of vector-valued neurons. Our architecture is invariant to rigid motion of the input, isometric deformation, and choice of local tangent bases, and is robust to discretizations of the surface. We evaluate our Vector Heat Network on triangle meshes, and empirically validate its invariant properties. We also demonstrate the effectiveness of our method on the useful industrial application of quadrilateral mesh generation.

SDOct 5, 2021
Voice Aging with Audio-Visual Style Transfer

Justin Wilson, Sunyeong Park, Seunghye J. Wilson et al.

Face aging techniques have used generative adversarial networks (GANs) and style transfer learning to transform one's appearance to look younger/older. Identity is maintained by conditioning these generative networks on a learned vector representation of the source content. In this work, we apply a similar approach to age a speaker's voice, referred to as voice aging. We first analyze the classification of a speaker's age by training a convolutional neural network (CNN) on the speaker's voice and face data from Common Voice and VoxCeleb datasets. We generate aged voices from style transfer to transform an input spectrogram to various ages and demonstrate our method on a mobile app.

CVOct 5, 2021
Echo-Reconstruction: Audio-Augmented 3D Scene Reconstruction

Justin Wilson, Nicholas Rewkowski, Ming C. Lin et al.

Reflective and textureless surfaces such as windows, mirrors, and walls can be a challenge for object and scene reconstruction. These surfaces are often poorly reconstructed and filled with depth discontinuities and holes, making it difficult to cohesively reconstruct scenes that contain these planar discontinuities. We propose Echoreconstruction, an audio-visual method that uses the reflections of sound to aid in geometry and audio reconstruction for virtual conferencing, teleimmersion, and other AR/VR experience. The mobile phone prototype emits pulsed audio, while recording video for RGB-based 3D reconstruction and audio-visual classification. Reflected sound and images from the video are input into our audio (EchoCNN-A) and audio-visual (EchoCNN-AV) convolutional neural networks for surface and sound source detection, depth estimation, and material classification. The inferences from these classifications enhance scene 3D reconstructions containing open spaces and reflective surfaces by depth filtering, inpainting, and placement of unmixed sound sources in the scene. Our prototype, VR demo, and experimental results from real-world and virtual scenes with challenging surfaces and sound indicate high success rates on classification of material, depth estimation, and closed/open surfaces, leading to considerable visual and audio improvement in 3D scenes (see Figure 1).

CVOct 5, 2021
3D-MOV: Audio-Visual LSTM Autoencoder for 3D Reconstruction of Multiple Objects from Video

Justin Wilson, Ming C. Lin

3D object reconstructions of transparent and concave structured objects, with inferred material properties, remains an open research problem for robot navigation in unstructured environments. In this paper, we propose a multimodal single- and multi-frame neural network for 3D reconstructions using audio-visual inputs. Our trained reconstruction LSTM autoencoder 3D-MOV accepts multiple inputs to account for a variety of surface types and views. Our neural network produces high-quality 3D reconstructions using voxel representation. Based on Intersection-over-Union (IoU), we evaluate against other baseline methods using synthetic audio-visual datasets ShapeNet and Sound20K with impact sounds and bounding box annotations. To the best of our knowledge, our single- and multi-frame model is the first audio-visual reconstruction neural network for 3D geometry and material representation.

LGSep 16, 2021
Efficient Differentiable Simulation of Articulated Bodies

Yi-Ling Qiao, Junbang Liang, Vladlen Koltun et al.

We present a method for efficient differentiable simulation of articulated bodies. This enables integration of articulated body dynamics into deep learning frameworks, and gradient-based optimization of neural networks that operate on articulated bodies. We derive the gradients of the forward dynamics using spatial algebra and the adjoint method. Our approach is an order of magnitude faster than autodiff tools. By only saving the initial states throughout the simulation process, our method reduces memory requirements by two orders of magnitude. We demonstrate the utility of efficient differentiable dynamics for articulated bodies in a variety of applications. We show that reinforcement learning with articulated systems can be accelerated using gradients provided by our method. In applications to control and inverse problems, gradient-based optimization enabled by our work accelerates convergence by more than an order of magnitude.

CVFeb 26, 2021
Improving Robustness of Learning-based Autonomous Steering Using Adversarial Images

Yu Shen, Laura Zheng, Manli Shu et al.

For safety of autonomous driving, vehicles need to be able to drive under various lighting, weather, and visibility conditions in different environments. These external and environmental factors, along with internal factors associated with sensors, can pose significant challenges to perceptual data processing, hence affecting the decision-making and control of the vehicle. In this work, we address this critical issue by introducing a framework for analyzing robustness of the learning algorithm w.r.t varying quality in the image input for autonomous driving. Using the results of sensitivity analysis, we further propose an algorithm to improve the overall performance of the task of "learning to steer". The results show that our approach is able to enhance the learning outcomes up to 48%. A comparative study drawn between our approach and other related techniques, such as data augmentation and adversarial training, confirms the effectiveness of our algorithm as a way to improve the robustness and generalization of neural network training for autonomous driving.

LGJul 4, 2020
Scalable Differentiable Physics for Learning and Control

Yi-Ling Qiao, Junbang Liang, Vladlen Koltun et al.

Differentiable physics is a powerful approach to learning and control problems that involve physical objects and environments. While notable progress has been made, the capabilities of differentiable physics solvers remain limited. We develop a scalable framework for differentiable physics that can support a large number of objects and their interactions. To accommodate objects with arbitrary geometry and topology, we adopt meshes as our representation and leverage the sparsity of contacts for scalable differentiable collision handling. Collisions are resolved in localized regions to minimize the number of optimization variables even when the number of simulated objects is high. We further accelerate implicit differentiation of optimization with nonlinear constraints. Experiments demonstrate that the presented framework requires up to two orders of magnitude less memory and computation in comparison to recent particle-based methods. We further validate the approach on inverse problems and control scenarios, where it outperforms derivative-free and model-free baselines by at least an order of magnitude.

CVAug 26, 2019
Shape-Aware Human Pose and Shape Reconstruction Using Multi-View Images

Junbang Liang, Ming C. Lin

We propose a scalable neural network framework to reconstruct the 3D mesh of a human body from multi-view images, in the subspace of the SMPL model. Use of multi-view images can significantly reduce the projection ambiguity of the problem, increasing the reconstruction accuracy of the 3D human body under clothing. Our experiments show that this method benefits from the synthetic dataset generated from our pipeline since it has good flexibility of variable control and can provide ground-truth for validation. Our method outperforms existing methods on real-world images, especially on shape estimations.

ROJul 20, 2019
ADAPS: Autonomous Driving Via Principled Simulations

Weizi Li, David Wolinski, Ming C. Lin

Autonomous driving has gained significant advancements in recent years. However, obtaining a robust control policy for driving remains challenging as it requires training data from a variety of scenarios, including rare situations (e.g., accidents), an effective policy architecture, and an efficient learning mechanism. We propose ADAPS for producing robust control policies for autonomous vehicles. ADAPS consists of two simulation platforms in generating and analyzing accidents to automatically produce labeled training data, and a memory-enabled hierarchical control policy. Additionally, ADAPS offers a more efficient online learning mechanism that reduces the number of iterations required in learning compared to existing methods such as DAGGER. We present both theoretical and experimental results. The latter are produced in simulated environments, where qualitative and quantitative results are generated to demonstrate the benefits of ADAPS.

CVAug 3, 2016
Detailed Garment Recovery from a Single-View Image

Shan Yang, Tanya Ambert, Zherong Pan et al.

Most recent garment capturing techniques rely on acquiring multiple views of clothing, which may not always be readily available, especially in the case of pre-existing photographs from the web. As an alternative, we pro- pose a method that is able to compute a rich and realistic 3D model of a human body and its outfits from a single photograph with little human in- teraction. Our algorithm is not only able to capture the global shape and geometry of the clothing, it can also extract small but important details of cloth, such as occluded wrinkles and folds. Unlike previous methods using full 3D information (i.e. depth, multi-view images, or sampled 3D geom- etry), our approach achieves detailed garment recovery from a single-view image by using statistical, geometric, and physical priors and a combina- tion of parameter estimation, semantic parsing, shape recovery, and physics- based cloth simulation. We demonstrate the effectiveness of our algorithm by re-purposing the reconstructed garments for virtual try-on and garment transfer applications, as well as cloth animation for digital characters.