Youjia Wang

CV
h-index: 11
Papers: 6
Citations: 20
Novelty: 64%
AI Score: 44

6 Papers

CV | Jul 3, 2022
NARRATE: A Normal Assisted Free-View Portrait Stylizer

Youjia Wang, Teng Xu, Yiwen Wu et al.

In this work, we propose NARRATE, a novel pipeline that enables simultaneous editing of portrait lighting and perspective in a photorealistic manner. As a hybrid neural-physical face model, NARRATE leverages the complementary benefits of geometry-aware generative approaches and normal-assisted physical face models. In a nutshell, NARRATE first inverts the input portrait to a coarse geometry and employs neural rendering to generate images that resemble the input and exhibit convincing pose changes. However, the inversion step introduces a mismatch, yielding low-quality images with fewer facial details. We therefore estimate the portrait normal to enhance the coarse geometry, creating a high-fidelity physical face model. In particular, we fuse the neural and physical renderings to compensate for the imperfect inversion, resulting in novel-perspective images that are both realistic and view-consistent. In the relighting stage, previous works focus on single-view portrait relighting and ignore consistency between different perspectives, leading to unstable and inconsistent lighting effects under view changes. We extend Total Relighting to fix this problem by unifying its multi-view input normal maps with the physical face model. NARRATE conducts relighting with consistent normal maps, imposing cross-view constraints and exhibiting stable and coherent illumination effects. We experimentally demonstrate that NARRATE achieves more photorealistic and reliable results than prior works. We further bridge NARRATE with animation and style transfer tools, supporting pose change, light change, facial animation, and style transfer, either separately or in combination, all at photographic quality. We showcase vivid free-view facial animations as well as 3D-aware relightable stylization, which help facilitate various AR/VR applications such as virtual cinematography, 3D video conferencing, and post-production.

PR | Feb 8, 2023
Improved Langevin Monte Carlo for stochastic optimization via landscape modification

Michael C. H. Choi, Youjia Wang

Given a target function $H$ to minimize or a target Gibbs distribution $π_β^0 \propto e^{-βH}$ to sample from at low temperature, in this paper we propose and analyze Langevin Monte Carlo (LMC) algorithms that run on an alternative landscape as specified by $H^f_{β,c,1}$ and target a modified Gibbs distribution $π^f_{β,c,1} \propto e^{-βH^f_{β,c,1}}$, where the landscape of $H^f_{β,c,1}$ is a transformed version of that of $H$ which depends on the parameters $f,β$ and $c$. While the original Log-Sobolev constant associated with $π^0_β$ exhibits exponential dependence on both $β$ and the energy barrier $M$ in the low-temperature regime, with appropriate tuning of these parameters and subject to assumptions on $H$, we prove that the energy barrier of the transformed landscape is reduced, which consequently leads to polynomial dependence on both $β$ and $M$ in the modified Log-Sobolev constant associated with $π^f_{β,c,1}$. This yields improved total variation mixing time bounds and improved convergence toward a global minimum of $H$. We stress that the technique developed in this paper is not limited to LMC and is broadly applicable to other gradient-based optimization or sampling algorithms.
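As a rough illustration of the setting described in this abstract, the sketch below runs the standard unadjusted Langevin iteration $x_{k+1} = x_k - η∇H(x_k) + \sqrt{2η/β}\,ξ_k$ with a pluggable gradient. The abstract does not give the formula for $H^f_{β,c,1}$, so the modified gradient used here, $∇H / (β\,(H - c)_+ + 1)$, is an assumption (it corresponds to taking $f(u) = u$ in one common form of landscape modification) and may differ from the paper's definition. The double-well function, step size, and parameter values are hypothetical choices for demonstration only.

```python
import math
import random


def langevin_monte_carlo(grad, x0, step, beta, n_steps, rng):
    """Unadjusted Langevin iteration in one dimension.

    Update: x <- x - step * grad(x) + sqrt(2 * step / beta) * N(0, 1).
    Returns the list of visited states.
    """
    x = x0
    noise_scale = math.sqrt(2.0 * step / beta)
    path = []
    for _ in range(n_steps):
        x = x - step * grad(x) + noise_scale * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


def H(x):
    # Hypothetical tilted double well: the left well (near x = -1) is deeper.
    return (x * x - 1.0) ** 2 + 0.3 * x


def grad_H(x):
    return 4.0 * x * (x * x - 1.0) + 0.3


def make_modified_grad(grad, func, beta, c):
    """ASSUMPTION (not stated in the abstract): gradient of the transformed
    landscape with f(u) = u, i.e. grad H / (beta * max(H - c, 0) + 1).
    Above the threshold c the gradient is damped; below c it is unchanged.
    """
    def modified(x):
        return grad(x) / (beta * max(func(x) - c, 0.0) + 1.0)
    return modified


if __name__ == "__main__":
    beta, c = 5.0, -0.25  # illustrative values only
    rng = random.Random(0)
    plain = langevin_monte_carlo(grad_H, 1.0, 0.01, beta, 20000, rng)
    modified = langevin_monte_carlo(
        make_modified_grad(grad_H, H, beta, c), 1.0, 0.01, beta, 20000, rng
    )
    for name, path in (("plain", plain), ("modified", modified)):
        frac_left = sum(1 for x in path if x < 0.0) / len(path)
        print(name, "fraction of time in left well:", round(frac_left, 3))
```

Both chains start in the shallower right well, so the fraction of time spent in the left well gives an informal sense of how readily each one crosses the barrier. A single run like this is not evidence for the mixing-time bounds claimed in the abstract.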

7.2 | PR | Mar 18
Geometry and factorization of multivariate Markov chains with applications to MCMC acceleration and approximate inference

Michael C. H. Choi, Youjia Wang, Geoffrey Wolfer

This paper analyzes the factorizability and geometry of transition matrices of multivariate Markov chains. Specifically, we demonstrate that the induced chains on factors of a product space can be regarded as information projections with respect to the Kullback-Leibler divergence. This perspective yields Han-Shearer type inequalities and submodularity of the entropy rate of Markov chains, as well as applications in the context of large deviations and mixing time comparison. As concrete algorithmic applications in Markov chain Monte Carlo (MCMC) and approximate inference, we provide three illustrations based on lifted MCMC, the swapping algorithm, and factored filtering to demonstrate that projection samplers improve mixing over the original samplers. The projection sampler based on the swapping algorithm resamples the highest-temperature coordinate at stationarity at each step, and we prove that this practice accelerates the mixing time by multiplicative factors related to the number of temperatures and the dimension of the underlying state space when compared with the original swapping algorithm. Through simple numerical experiments on a bimodal target distribution, we show that the projection samplers mix effectively, in contrast to lifted MCMC and the swapping algorithm, which mix less well. In filtering, our proposed factored filtering scheme scales to high dimensions with linear-in-dimension computational cost per step, at the price of an approximation error that can be tracked using the distance to independence, compared with the exponential-in-dimension cost per step of the exact filter.
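To make the notion of an induced chain on one factor of a product space more concrete, here is a minimal sketch. The abstract does not state the formula, so the construction below is an assumption: the induced transition matrix on the first coordinate is taken to be the stationary-weighted marginalization $Q(x, x') = \sum_{y, y'} π(x, y)\, P((x, y), (x', y')) / π_1(x)$. The 4-state matrix `P` on $\{0,1\} \times \{0,1\}$ and the function names are hypothetical.

```python
def stationary(P, iters=2000):
    """Approximate the stationary distribution of P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def induced_chain_on_first_factor(P, pi, nx, ny):
    """ASSUMED construction: pi-weighted marginal chain on the first coordinate.

    State (x, y) is stored at index x * ny + y.
    Returns the nx-by-nx matrix Q and the marginal pi_x of pi.
    """
    pi_x = [sum(pi[x * ny + y] for y in range(ny)) for x in range(nx)]
    Q = [[0.0] * nx for _ in range(nx)]
    for x in range(nx):
        for y in range(ny):
            weight = pi[x * ny + y] / pi_x[x]  # conditional law of y given x
            for x2 in range(nx):
                Q[x][x2] += weight * sum(
                    P[x * ny + y][x2 * ny + y2] for y2 in range(ny)
                )
    return Q, pi_x


# Hypothetical ergodic chain on {0,1} x {0,1}; each row sums to 1.
P = [
    [0.5, 0.2, 0.2, 0.1],
    [0.1, 0.6, 0.1, 0.2],
    [0.3, 0.1, 0.4, 0.2],
    [0.2, 0.2, 0.1, 0.5],
]

if __name__ == "__main__":
    pi = stationary(P)
    Q, pi_x = induced_chain_on_first_factor(P, pi, 2, 2)
    print("induced chain:", Q)
    print("marginal stationary distribution:", pi_x)
```

Under this construction each row of `Q` sums to one and the marginal `pi_x` is stationary for `Q`, which is the basic consistency property one would expect of a chain induced on a factor.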

CV | Mar 13, 2025
MouseGPT: A Large-scale Vision-Language Model for Mouse Behavior Analysis

Teng Xu, Taotao Zhou, Youjia Wang et al.

Analyzing animal behavior is crucial in advancing neuroscience, yet quantifying and deciphering its intricate dynamics remains a significant challenge. Traditional machine vision approaches, despite their ability to detect spontaneous behaviors, fall short due to limited interpretability and reliance on manual labeling, which restricts the exploration of the full behavioral spectrum. Here, we introduce MouseGPT, a Vision-Language Model (VLM) that integrates visual cues with natural language to revolutionize mouse behavior analysis. Built upon our first-of-its-kind dataset, which incorporates pose dynamics and open-vocabulary behavioral annotations across over 42 million frames of diverse psychiatric conditions, MouseGPT provides a novel, context-rich method for comprehensive behavior interpretation. Our holistic analysis framework enables detailed behavior profiling, clustering, and novel behavior discovery, offering deep insights without the need for labor-intensive manual annotation. Evaluations reveal that MouseGPT surpasses existing models in precision, adaptability, and descriptive richness, positioning it as a transformative tool for ethology and for unraveling complex behavioral dynamics in animal models.

CV | Feb 3, 2024
Capturing the Unseen: Vision-Free Facial Motion Capture Using Inertial Measurement Units

Youjia Wang, Yiwen Wu, Hengan Zhou et al.

We present Capturing the Unseen (CAPUS), a novel facial motion capture (MoCap) technique that operates without visual signals. CAPUS leverages miniaturized Inertial Measurement Units (IMUs) as a new sensing modality for facial motion capture. While IMUs have become essential in full-body MoCap for their portability and independence from environmental conditions, their application in facial MoCap remains underexplored. We address this by customizing micro-IMUs, small enough to be placed on the face, and strategically positioning them in alignment with key facial muscles to capture expression dynamics. CAPUS introduces the first facial IMU dataset, encompassing both IMU and visual signals from participants engaged in diverse activities such as multilingual speech, facial expressions, and emotionally intoned auditions. We train a Transformer Diffusion-based neural network to infer Blendshape parameters directly from IMU data. Our experimental results demonstrate that CAPUS reliably captures facial motion in conditions where vision-based methods struggle, including facial occlusions, rapid movements, and low-light environments. Additionally, by eliminating the need for visual inputs, CAPUS offers enhanced privacy protection, making it a robust solution for vision-free facial MoCap.

CV | Jul 30, 2021
Neural Relighting and Expression Transfer On Video Portraits

Youjia Wang, Taotao Zhou, Minzhang Li et al.

Photo-realistic video portrait reenactment benefits virtual production and numerous VR/AR experiences. The task remains challenging because the reenacted expression should match the source while the lighting should be adjustable to new environments. We present a neural relighting and expression transfer technique that transfers the facial expressions from a source performer to a portrait video of a target performer while enabling dynamic relighting. Our approach employs 4D reflectance field learning, model-based facial performance capture, and target-aware neural rendering. Specifically, given a short one-light-at-a-time (OLAT) sequence of the target performer, we apply a rendering-to-video translation network to first synthesize the OLAT result of new sequences with unseen expressions. We then design a semantic-aware facial normalization scheme along with a multi-frame multi-task learning strategy to encode the content, segmentation, and motion flows for reliably inferring the reflectance field. This allows us to simultaneously control facial expression and apply virtual relighting. Extensive experiments demonstrate that our technique can robustly handle challenging expressions and lighting environments and produce results at cinematographic quality.
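For readers unfamiliar with OLAT data, the sketch below shows the general image-based relighting principle that such captures rely on: because light transport is linear, an image under a new lighting environment can be formed as a weighted sum of the one-light-at-a-time images. This is a generic illustration, not the paper's neural pipeline; the function name `relight`, the nested-list image layout, and the toy values are all assumptions.

```python
def relight(olat_images, light_weights):
    """Weighted sum of one-light-at-a-time images.

    olat_images: list of grayscale images, each a nested list [row][col],
                 one image per light source.
    light_weights: one intensity per light, sampled from the target lighting.
    """
    if len(olat_images) != len(light_weights):
        raise ValueError("need exactly one weight per OLAT image")
    height = len(olat_images[0])
    width = len(olat_images[0][0])
    out = [[0.0] * width for _ in range(height)]
    for image, weight in zip(olat_images, light_weights):
        for r in range(height):
            for c in range(width):
                out[r][c] += weight * image[r][c]
    return out


if __name__ == "__main__":
    # Two toy 1x2 "images", each lit by a single hypothetical light.
    olat = [
        [[1.0, 0.0]],
        [[0.0, 2.0]],
    ]
    print(relight(olat, [0.5, 0.25]))
```

In practice the weights would come from sampling an environment map at each light direction, and the images would be per-channel arrays rather than small nested lists.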