Darren Cosker

h-index24

14papers

687citations

Novelty50%

AI Score34

Ranked #111,320 of 194,257 authors (top 57%)#37,183 in CV (top 63%)

14 Papers

19.5CVApr 12, 2023Code

Probabilistic Human Mesh Recovery in 3D Scenes from Egocentric Views

Siwei Zhang, Qianli Ma, Yan Zhang et al.

Automatic perception of human behaviors during social interactions is crucial for AR/VR applications, and an essential component is estimation of plausible 3D human pose and shape of our social partners from the egocentric view. One of the biggest challenges of this task is severe body truncation due to close social distances in egocentric scenarios, which brings large pose ambiguities for unseen body parts. To tackle this challenge, we propose a novel scene-conditioned diffusion method to model the body pose distribution. Conditioned on the 3D scene geometry, the diffusion model generates bodies in plausible human-scene interactions, with the sampling guided by a physics-based collision score to further resolve human-scene inter-penetrations. The classifier-free training enables flexible sampling with different conditions and enhanced diversity. A visibility-aware graph convolution model guided by per-joint visibility serves as the diffusion denoiser to incorporate inter-joint dependencies and per-body-part control. Extensive evaluations show that our method generates bodies in plausible interactions with 3D scenes, achieving both superior accuracy for visible joints and diversity for invisible body parts. The code is available at https://sanweiliti.github.io/egohmr/egohmr.html.

26.3CVDec 30, 2022

Imitator: Personalized Speech-driven 3D Facial Animation

Balamurugan Thambiraja, Ikhsanul Habibie, Sadegh Aliakbarian et al.

Speech-driven 3D facial animation has been widely explored, with applications in gaming, character animation, virtual reality, and telepresence systems. State-of-the-art methods deform the face topology of the target actor to sync the input audio without considering the identity-specific speaking style and facial idiosyncrasies of the target actor, thus, resulting in unrealistic and inaccurate lip movements. To address this, we present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video and produces novel facial expressions matching the identity-specific speaking style and facial idiosyncrasies of the target actor. Specifically, we train a style-agnostic transformer on a large facial expression dataset which we use as a prior for audio-driven facial expressions. Based on this prior, we optimize for identity-specific speaking style based on a short reference video. To train the prior, we introduce a novel loss function based on detected bilabial consonants to ensure plausible lip closures and consequently improve the realism of the generated expressions. Through detailed experiments and a user study, we show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.

14.1CVAug 22, 2023

HMD-NeMo: Online 3D Avatar Motion Generation From Sparse Observations

Sadegh Aliakbarian, Fatemeh Saleh, David Collier et al.

Generating both plausible and accurate full body avatar motion is the key to the quality of immersive experiences in mixed reality scenarios. Head-Mounted Devices (HMDs) typically only provide a few input signals, such as head and hands 6-DoF. Recently, different approaches achieved impressive performance in generating full body motion given only head and hands signal. However, to the best of our knowledge, all existing approaches rely on full hand visibility. While this is the case when, e.g., using motion controllers, a considerable proportion of mixed reality experiences do not involve motion controllers and instead rely on egocentric hand tracking. This introduces the challenge of partial hand visibility owing to the restricted field of view of the HMD. In this paper, we propose the first unified approach, HMD-NeMo, that addresses plausible and accurate full body motion generation even when the hands may be only partially visible. HMD-NeMo is a lightweight neural network that predicts the full body motion in an online and real-time fashion. At the heart of HMD-NeMo is the spatio-temporal encoder with novel temporally adaptable mask tokens that encourage plausible motion in the absence of hand observations. We perform extensive analysis of the impact of different components in HMD-NeMo and introduce a new state-of-the-art on AMASS dataset through our evaluation.

18.6CVOct 15, 2024

Look Ma, no markers: holistic performance capture without the hassle

Charlie Hewitt, Fatemeh Saleh, Sadegh Aliakbarian et al.

We tackle the problem of highly-accurate, holistic performance capture for the face, body and hands simultaneously. Motion-capture technologies used in film and game production typically focus only on face, body or hand capture independently, involve complex and expensive hardware and a high degree of manual intervention from skilled operators. While machine-learning-based approaches exist to overcome these problems, they usually only support a single camera, often operate on a single part of the body, do not produce precise world-space results, and rarely generalize outside specific contexts. In this work, we introduce the first technique for marker-free, high-quality reconstruction of the complete human body, including eyes and tongue, without requiring any calibration, manual intervention or custom hardware. Our approach produces stable world-space results from arbitrary camera rigs as well as supporting varied capture environments and clothing. We achieve this through a hybrid approach that leverages machine learning models trained exclusively on synthetic data and powerful parametric models of human shape and motion. We evaluate our method on a number of body, face and hand reconstruction benchmarks and demonstrate state-of-the-art results that generalize on diverse datasets.

10.5CVDec 10, 2024

GASP: Gaussian Avatars with Synthetic Priors

Jack Saunders, Charlie Hewitt, Yanan Jian et al.

Gaussian Splatting has changed the game for real-time photo-realistic rendering. One of the most popular applications of Gaussian Splatting is to create animatable avatars, known as Gaussian Avatars. Recent works have pushed the boundaries of quality and rendering efficiency but suffer from two main limitations. Either they require expensive multi-camera rigs to produce avatars with free-view rendering, or they can be trained with a single camera but only rendered at high quality from this fixed viewpoint. An ideal model would be trained using a short monocular video or image from available hardware, such as a webcam, and rendered from any view. To this end, we propose GASP: Gaussian Avatars with Synthetic Priors. To overcome the limitations of existing datasets, we exploit the pixel-perfect nature of synthetic data to train a Gaussian Avatar prior. By fitting this prior model to a single photo or video and fine-tuning it, we get a high-quality Gaussian Avatar, which supports 360$^\circ$ rendering. Our prior is only required for fitting, not inference, enabling real-time application. Through our method, we obtain high-quality, animatable Avatars from limited data which can be animated and rendered at 70fps on commercial hardware. See our project page (https://microsoft.github.io/GASP/) for results.

1.4CVJul 15, 2021

DynaDog+T: A Parametric Animal Model for Synthetic Canine Image Generation

Jake Deane, Sinead Kearney, Kwang In Kim et al.

Synthetic data is becoming increasingly common for training computer vision models for a variety of tasks. Notably, such data has been applied in tasks related to humans such as 3D pose estimation where data is either difficult to create or obtain in realistic settings. Comparatively, there has been less work into synthetic animal data and it's uses for training models. Consequently, we introduce a parametric canine model, DynaDog+T, for generating synthetic canine images and data which we use for a common computer vision task, binary segmentation, which would otherwise be difficult due to the lack of available data.

1.2GRJul 1, 2021

EmoGen: Quantifiable Emotion Generation and Analysis for Experimental Psychology

Nadejda Roubtsova, Martin Parsons, Nicola Binetti et al.

3D facial modelling and animation in computer vision and graphics traditionally require either digital artist's skill or complex pipelines with objective-function-based solvers to fit models to motion capture. This inaccessibility of quality modelling to a non-expert is an impediment to effective quantitative study of facial stimuli in experimental psychology. The EmoGen methodology we present in this paper solves the issue democratising facial modelling technology. EmoGen is a robust and configurable framework letting anyone author arbitrary quantifiable facial expressions in 3D through a user-guided genetic algorithm search. Beyond sample generation, our methodology is made complete with techniques to analyse distributions of these expressions in a principled way. This paper covers the technical aspects of expression generation, specifically our production-quality facial blendshape model, automatic corrective mechanisms of implausible facial configurations in the absence of artist's supervision and the genetic algorithm implementation employed in the model space search. Further, we provide a comparative evaluation of ways to quantify generated facial expressions in the blendshape and geometric domains and compare them theoretically and empirically. The purpose of this analysis is 1. to define a similarity cost function to simulate model space search for convergence and parameter dependence assessment of the genetic algorithm and 2. to inform the best practices in the data distribution analysis for experimental psychology.

14.7CVApr 16, 2020Code

RGBD-Dog: Predicting Canine Pose from RGBD Sensors

Sinead Kearney, Wenbin Li, Martin Parsons et al.

The automatic extraction of animal \reb{3D} pose from images without markers is of interest in a range of scientific fields. Most work to date predicts animal pose from RGB images, based on 2D labelling of joint positions. However, due to the difficult nature of obtaining training data, no ground truth dataset of 3D animal motion is available to quantitatively evaluate these approaches. In addition, a lack of 3D animal pose data also makes it difficult to train 3D pose-prediction methods in a similar manner to the popular field of body-pose prediction. In our work, we focus on the problem of 3D canine pose estimation from RGBD images, recording a diverse range of dog breeds with several Microsoft Kinect v2s, simultaneously obtaining the 3D ground truth skeleton via a motion capture system. We generate a dataset of synthetic RGBD images from this data. A stacked hourglass network is trained to predict 3D joint locations, which is then constrained using prior models of shape and pose. We evaluate our model on both synthetic and real RGBD images and compare our results to previously published work fitting canine models to images. Finally, despite our training set consisting only of dog data, visual inspection implies that our network can produce good predictions for images of other quadrupeds -- e.g. horses or cats -- when their pose is similar to that contained in our training set.

18.1CVJun 6, 2018Code

Unsupervised Attention-guided Image to Image Translation

Youssef A. Mejjati, Christian Richardt, James Tompkin et al.

Current unsupervised image-to-image translation techniques struggle to focus their attention on individual objects without altering the background or the way multiple objects interact within a scene. Motivated by the important role of attention in human perception, we tackle this limitation by introducing unsupervised attention mechanisms that are jointly adversarialy trained with the generators and discriminators. We demonstrate qualitatively and quantitatively that our approach is able to attend to relevant regions in the image without requiring supervision, and that by doing so it achieves more realistic mappings compared to recent approaches.

0.9CVApr 19, 2017

Learn to Model Motion from Blurry Footages

Wenbin Li, Da Chen, Zhihan Lv et al.

It is difficult to recover the motion field from a real-world footage given a mixture of camera shake and other photometric effects. In this paper we propose a hybrid framework by interleaving a Convolutional Neural Network (CNN) and a traditional optical flow energy. We first conduct a CNN architecture using a novel learnable directional filtering layer. Such layer encodes the angle and distance similarity matrix between blur and camera motion, which is able to enhance the blur features of the camera-shake footages. The proposed CNNs are then integrated into an iterative optical flow framework, which enable the capability of modelling and solving both the blind deconvolution and the optical flow estimation problems simultaneously. Our framework is trained end-to-end on a synthetic dataset and yields competitive precision and performance against the state-of-the-art approaches.

9.4CVAug 2, 2016

Interactive Removal and Ground Truth for Difficult Shadow Scenes

Han Gong, Darren P. Cosker

A user-centric method for fast, interactive, robust and high-quality shadow removal is presented. Our algorithm can perform detection and removal in a range of difficult cases: such as highly textured and colored shadows. To perform detection an on-the-fly learning approach is adopted guided by two rough user inputs for the pixels of the shadow and the lit area. After detection, shadow removal is performed by registering the penumbra to a normalized frame which allows us efficient estimation of non-uniform shadow illumination changes, resulting in accurate and robust removal. Another major contribution of this work is the first validated and multi-scene category ground truth for shadow removal algorithms. This data set containing 186 images eliminates inconsistencies between shadow and shadow-free images and provides a range of different shadow types such as soft, textured, colored and broken shadow. Using this data, the most thorough comparison of state-of-the-art shadow removal methods to date is performed, showing our proposed new algorithm to outperform the state-of-the-art across several measures and shadow category. To complement our dataset, an online shadow removal benchmark website is also presented to encourage future open comparisons in this challenging field of research.

3.8CVMar 26, 2016

Video Interpolation using Optical Flow and Laplacian Smoothness

Wenbin Li, Darren Cosker

Non-rigid video interpolation is a common computer vision task. In this paper we present an optical flow approach which adopts a Laplacian Cotangent Mesh constraint to enhance the local smoothness. Similar to Li et al., our approach adopts a mesh to the image with a resolution up to one vertex per pixel and uses angle constraints to ensure sensible local deformations between image pairs. The Laplacian Mesh constraints are expressed wholly inside the optical flow optimization, and can be applied in a straightforward manner to a wide range of image tracking and registration problems. We evaluate our approach by testing on several benchmark datasets, including the Middlebury and Garg et al. datasets. In addition, we show application of our method for constructing 3D Morphable Facial Models from dynamic 3D data.

6.7CVMar 7, 2016

Blur Robust Optical Flow using Motion Channel

Wenbin Li, Yang Chen, JeeHang Lee et al.

It is hard to estimate optical flow given a realworld video sequence with camera shake and other motion blur. In this paper, we first investigate the blur parameterization for video footage using near linear motion elements. we then combine a commercial 3D pose sensor with an RGB camera, in order to film video footage of interest together with the camera motion. We illustrates that this additional camera motion/trajectory channel can be embedded into a hybrid framework by interleaving an iterative blind deconvolution and warping based optical flow scheme. Our method yields improved accuracy within three other state-of-the-art baselines given our proposed ground truth blurry sequences; and several other realworld sequences filmed by our imaging system.

6.7CVMar 7, 2016

Drift Robust Non-rigid Optical Flow Enhancement for Long Sequences

Wenbin Li, Darren Cosker, Matthew Brown

It is hard to densely track a nonrigid object in long term, which is a fundamental research issue in the computer vision community. This task often relies on estimating pairwise correspondences between images over time where the error is accumulated and leads to a drift issue. In this paper, we introduce a novel optimization framework with an Anchor Patch constraint. It is supposed to significantly reduce overall errors given long sequences containing non-rigidly deformable objects. Our framework can be applied to any dense tracking algorithm, e.g. optical flow. We demonstrate the success of our approach by showing significant error reduction on 6 popular optical flow algorithms applied to a range of real-world nonrigid benchmarks. We also provide quantitative analysis of our approach given synthetic occlusions and image noise.