CV Mar 5, 2023
Text2Face: A Multi-Modal 3D Face Model
Will Rowan, Patrik Huber, Nick Pears et al.
We present the first 3D morphable modelling approach in which 3D face shape can be directly and completely defined using a textual prompt. Building on work in multi-modal learning, we extend the FLAME head model to a common image-and-text latent space. This allows for direct 3D Morphable Model (3DMM) parameter generation and therefore shape manipulation from textual descriptions. Our method, Text2Face, has many applications, for example generating police photofits, where the input is already in natural language. It further enables multi-modal 3DMM image fitting to sketches and sculptures, as well as images.
CV Apr 20
HMR-Net: Hierarchical Modular Routing for Cross-Domain Object Detection in Aerial Images
Pourya Shamsolmoali, Masoumeh Zareapoor, Michael Felsberg et al.
Despite advances in object detection, aerial imagery remains a challenging domain, as models often fail to generalize across variations in spatial resolution, scene composition, and semantic label coverage. Differences in geographic context, sensor characteristics, and object distributions across datasets limit the capacity of conventional models to learn consistent and transferable representations. Shared methods trained on such data tend to impose a unified representation across fundamentally different domains, resulting in poor performance on region-specific content and less flexibility when dealing with novel object categories. To address this, we propose a novel modular learning framework that enables structured specialization in aerial detection. Our method introduces a hierarchical routing mechanism with two levels of modularity: a global expert assignment layer that uses latent geographic embeddings to route datasets to specialized processing modules, and a local scene decomposition mechanism that allocates image subregions to region-specific sub-modules. This allows our method to specialize across datasets and within complex scenes. Additionally, the framework contains a conditional expert module that uses external semantic information (e.g., category names or textual descriptions) to enable detection of novel object categories during inference, without the need for retraining or fine-tuning. By moving beyond monolithic representations, our method offers an adaptive framework for remote sensing object detection. Comprehensive evaluations on four datasets highlight improvements in multi-dataset generalization, regional specialization, and open-category detection.
CV Feb 4, 2023
Laplacian ICP for Progressive Registration of 3D Human Head Meshes
Nick Pears, Hang Dai, Will Smith et al.
We present a progressive 3D registration framework that is a highly efficient variant of classical non-rigid Iterative Closest Points (N-ICP). Since it uses the Laplace-Beltrami operator for deformation regularisation, we view the overall process as Laplacian ICP (L-ICP). This exploits a 'small deformation per iteration' assumption and is progressively coarse-to-fine, employing an increasingly flexible deformation model, an increasing number of correspondence sets, and increasingly sophisticated correspondence estimation. Correspondence matching is only permitted within predefined vertex subsets derived from domain-specific feature extractors. Additionally, we present a new benchmark and a pair of evaluation metrics for 3D non-rigid registration, based on annotation transfer. We use this to evaluate our framework on a publicly available dataset of 3D human head scans (Headspace). The method is robust and only requires a small fraction of the computation time compared to the most popular classical approach, yet has comparable registration performance.
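The idea of a small, Laplacian-regularised deformation per iteration can be illustrated with a toy sketch. This is not the authors' algorithm: it assumes a uniform graph Laplacian in place of the Laplace-Beltrami operator, brute-force closest-point matching, and an explicit smoothing pass rather than a linear solve. The function names and the `step` and `weight` parameters are hypothetical.

```python
import math

def closest_point(p, targets):
    """Return the target point nearest to p (brute-force search)."""
    return min(targets, key=lambda t: math.dist(p, t))

def smooth_displacements(disp, neighbours, weight):
    """One explicit smoothing pass: blend each displacement with the mean
    of its graph neighbours' displacements (uniform graph Laplacian)."""
    smoothed = []
    for i, d in enumerate(disp):
        nbrs = neighbours[i]
        mean = [sum(disp[j][k] for j in nbrs) / len(nbrs) for k in range(3)]
        smoothed.append([(1 - weight) * d[k] + weight * mean[k] for k in range(3)])
    return smoothed

def laplacian_icp_step(vertices, neighbours, targets, step=0.5, weight=0.3):
    """Move each vertex a fraction of the way along its smoothed
    closest-point displacement (a small deformation per iteration)."""
    disp = []
    for v in vertices:
        c = closest_point(v, targets)
        disp.append([c[k] - v[k] for k in range(3)])
    disp = smooth_displacements(disp, neighbours, weight)
    return [[v[k] + step * d[k] for k in range(3)] for v, d in zip(vertices, disp)]

def mean_closest_distance(vertices, targets):
    """Average distance from each vertex to its nearest target point."""
    return sum(math.dist(v, closest_point(v, targets)) for v in vertices) / len(vertices)

# Toy data (assumed): a straight chain of 5 vertices and a bent target chain.
template = [[float(i), 0.0, 0.0] for i in range(5)]
neighbours = [[1], [0, 2], [1, 3], [2, 4], [3]]
target = [[float(i), 0.3 + 0.1 * i * i, 0.0] for i in range(5)]

before = mean_closest_distance(template, target)
current = template
for _ in range(10):
    current = laplacian_icp_step(current, neighbours, target)
after = mean_closest_distance(current, target)
```

Smoothing the displacement field, rather than the vertex positions, keeps neighbouring vertices moving coherently while still letting the template bend towards the target.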
CV May 13, 2022
The Effectiveness of Temporal Dependency in Deepfake Video Detection
Will Rowan, Nick Pears
Deepfakes are a form of synthetic image generation used to generate fake videos of individuals for malicious purposes. The resulting videos may be used to spread misinformation, reduce trust in media, or as a form of blackmail. These threats necessitate automated methods of deepfake video detection. This paper investigates whether temporal information can improve the deepfake detection performance of deep learning models. To investigate this, we propose a framework that classifies new and existing approaches by their defining characteristics. These are the type of feature extraction (automatic or manual) and the temporal relationship between frames (dependent or independent). We apply this framework to investigate the effect of temporal dependency on a model's deepfake detection performance. We find that temporal dependency produces a statistically significant (p < 0.05) increase in performance in classifying real images for the model using automatic feature selection, demonstrating that spatio-temporal information can increase the performance of deepfake video detection models.
CV Jul 25, 2023
Fake It Without Making It: Conditioned Face Generation for Accurate 3D Face Reconstruction
Will Rowan, Patrik Huber, Nick Pears et al.
Accurate 3D face reconstruction from 2D images is an enabling technology with applications in healthcare, security, and creative industries. However, current state-of-the-art methods either rely on supervised training with very limited 3D data or self-supervised training with 2D image data. To bridge this gap, we present a method to generate a large-scale synthesised dataset of 250K photorealistic images and their corresponding shape parameters and depth maps, which we call SynthFace. Our synthesis method conditions Stable Diffusion on depth maps sampled from the FLAME 3D Morphable Model (3DMM) of the human face, allowing us to generate a diverse set of shape-consistent facial images that is designed to be balanced in race and gender. We further propose ControlFace, a deep neural network, trained on SynthFace, which achieves competitive performance on the NoW benchmark, without requiring 3D supervision or manual 3D asset creation. The complete SynthFace dataset will be made publicly available upon publication.
CV Jan 30, 2023
Accurate Gaze Estimation using an Active-gaze Morphable Model
Hao Sun, Nick Pears
Rather than regressing gaze direction directly from images, we show that adding a 3D shape model can: i) improve gaze estimation accuracy, ii) perform well with lower resolution inputs and iii) provide a richer understanding of the eye-region and its constituent gaze system. Specifically, we use an 'eyes and nose' 3D morphable model (3DMM) to capture the eye-region 3D facial geometry and appearance and we equip this with a geometric vergence model of gaze to give an 'active-gaze 3DMM'. We show that our approach achieves state-of-the-art results on the Eyediap dataset and we present an ablation study. Our method can learn with only the ground truth gaze target point and the camera parameters, without access to the ground truth gaze origin points, thus widening the applicability of our approach compared to other methods.
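A geometric vergence model can be pictured as both eyes rotating to fixate one shared 3D target point. The sketch below is a minimal illustration of that geometry only, not the paper's model; the eye positions, the target, and the function names are assumed for the example.

```python
import math

def unit(v):
    """Normalise a 3D vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def gaze_directions(left_eye, right_eye, target):
    """Unit gaze vectors from each eyeball centre to a shared 3D target."""
    left = unit([t - e for t, e in zip(target, left_eye)])
    right = unit([t - e for t, e in zip(target, right_eye)])
    return left, right

def vergence_angle_deg(left_dir, right_dir):
    """Angle between the two gaze rays, in degrees."""
    dot = sum(a * b for a, b in zip(left_dir, right_dir))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))

# Assumed toy geometry in metres: eyes 6 cm apart, target 50 cm straight ahead.
left_eye = [-0.03, 0.0, 0.0]
right_eye = [0.03, 0.0, 0.0]
target = [0.0, 0.0, 0.5]

left_dir, right_dir = gaze_directions(left_eye, right_eye, target)
angle = vergence_angle_deg(left_dir, right_dir)
```

Because both rays are tied to one target point, supervising the target alone constrains both gaze directions, which is consistent with the abstract's point about learning from the gaze target and camera parameters.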
CV May 18, 2019
SAWNet: A Spatially Aware Deep Neural Network for 3D Point Cloud Processing
Chaitanya Kaul, Nick Pears, Suresh Manandhar
Deep neural networks have established themselves as the state-of-the-art methodology in almost all computer vision tasks to date. But their application to processing data lying on non-Euclidean domains is still a very active area of research. One such area is the analysis of point cloud data, which poses a challenge due to its lack of order. Many recent techniques have been proposed, spearheaded by the PointNet architecture. These techniques use either global or local information from the point clouds to extract a latent representation for the points, which is then used for the task at hand (classification/segmentation). In our work, we introduce a neural network layer that combines both global and local information to produce better embeddings of these points. We enhance our architecture with residual connections, to pass information between the layers, which also makes the network easier to train. We achieve state-of-the-art results on the ModelNet40 dataset with our architecture, and our results are also highly competitive with the state-of-the-art on the ShapeNet part segmentation dataset and the indoor scene segmentation dataset. We plan to open-source our pre-trained models on GitHub to encourage the research community to test our networks on their data, or simply use them for benchmarking purposes.
CV Apr 7, 2021
FatNet: A Feature-attentive Network for 3D Point Cloud Processing
Chaitanya Kaul, Nick Pears, Suresh Manandhar
The application of deep learning to 3D point clouds is challenging due to their lack of order. Inspired by the point embeddings of PointNet and the edge embeddings of DGCNNs, we propose three improvements to the task of point cloud analysis. First, we introduce a novel feature-attentive neural network layer, a FAT layer, that combines both global point-based features and local edge-based features in order to generate better embeddings. Second, we find that applying the same attention mechanism across two different forms of feature map aggregation, max pooling and average pooling, gives better performance than either alone. Third, we observe that residual feature reuse in this setting propagates information more effectively between the layers, and makes the network easier to train. Our architecture achieves state-of-the-art results on the task of point cloud classification, as demonstrated on the ModelNet40 dataset, and an extremely competitive performance on the ShapeNet part segmentation challenge.
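Max pooling and average pooling suit unordered point sets because both are symmetric functions: reordering the points leaves the result unchanged. The sketch below shows only this aggregation step, without the attention mechanism of the FAT layer; the `aggregate` helper and the feature values are assumptions made for illustration.

```python
def aggregate(point_features):
    """Concatenate channel-wise max and mean over an unordered set of points.

    point_features: list of per-point feature vectors of equal length.
    Returns [max per channel..., mean per channel...].
    """
    channels = list(zip(*point_features))          # transpose to per-channel tuples
    max_pool = [max(c) for c in channels]
    avg_pool = [sum(c) / len(c) for c in channels]
    return max_pool + avg_pool

# Assumed toy input: three points with two feature channels each.
features = [[1.0, -2.0], [0.5, 4.0], [3.0, 0.0]]

pooled = aggregate(features)
pooled_reordered = aggregate(list(reversed(features)))
```

Here `pooled` and `pooled_reordered` agree, which is the permutation invariance that lets a network consume point clouds without imposing an ordering.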
CV Oct 7, 2020
A Human Ear Reconstruction Autoencoder
Hao Sun, Nick Pears, Hang Dai
The ear, as an important part of the human head, has received much less attention compared to the human face in the area of computer vision. Inspired by previous work on monocular 3D face reconstruction using an autoencoder structure to achieve self-supervised learning, we aim to utilise such a framework to tackle the 3D ear reconstruction task, where more subtle and difficult curves and features are present on the 2D ear input images. Our Human Ear Reconstruction Autoencoder (HERA) system predicts 3D ear poses and shape parameters for 3D ear meshes, without any supervision of these parameters. To make our approach cover the variance of in-the-wild images, even grayscale images, we propose an in-the-wild ear colour model. The constructed end-to-end self-supervised model is then evaluated both with 2D landmark localisation performance and the appearance of the reconstructed 3D ears.
IV Dec 4, 2019
FocusNet++: Attentive Aggregated Transformations for Efficient and Accurate Medical Image Segmentation
Chaitanya Kaul, Nick Pears, Hang Dai et al.
We propose a new residual block for convolutional neural networks and demonstrate its state-of-the-art performance in medical image segmentation. We combine attention mechanisms with group convolutions to create our group attention mechanism, which forms the fundamental building block of our network, FocusNet++. We employ a hybrid loss based on balanced cross entropy, Tversky loss and the adaptive logarithmic loss to enhance the performance along with fast convergence. Our results show that FocusNet++ achieves state-of-the-art results across various benchmark metrics for the ISIC 2018 melanoma segmentation and the cell nuclei segmentation datasets with fewer parameters and FLOPs.
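Two of the three terms in the hybrid loss have widely used standard forms, sketched below for binary segmentation over a flat list of pixels. The adaptive logarithmic term is omitted because the abstract does not give its form, and the unweighted sum in `partial_hybrid_loss`, the `alpha`/`beta` defaults, and all function names are assumptions rather than the paper's settings.

```python
import math

def tversky_loss(probs, labels, alpha=0.3, beta=0.7, eps=1e-7):
    """1 - Tversky index, where alpha weights false positives and beta
    weights false negatives (alpha = beta = 0.5 gives the Dice loss)."""
    tp = sum(p * y for p, y in zip(probs, labels))
    fp = sum(p * (1 - y) for p, y in zip(probs, labels))
    fn = sum((1 - p) * y for p, y in zip(probs, labels))
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index

def balanced_cross_entropy(probs, labels, eps=1e-7):
    """Cross entropy with the foreground term weighted by the background
    fraction, so the rarer class is not swamped by the common one."""
    w_pos = 1.0 - sum(labels) / len(labels)   # fraction of background pixels
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)       # clip to keep log() finite
        total -= w_pos * y * math.log(p) + (1 - w_pos) * (1 - y) * math.log(1 - p)
    return total / len(labels)

def partial_hybrid_loss(probs, labels):
    """Assumed combination: plain sum of the two terms above."""
    return tversky_loss(probs, labels) + balanced_cross_entropy(probs, labels)

# Assumed toy image: one foreground pixel among four.
labels = [1, 0, 0, 0]
good = partial_hybrid_loss([1.0, 0.0, 0.0, 0.0], labels)
bad = partial_hybrid_loss([0.0, 1.0, 1.0, 1.0], labels)
```

Choosing `beta` larger than `alpha` penalises missed foreground pixels more than false alarms, a common choice when the foreground class is small.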
CV Nov 18, 2019
Towards a complete 3D morphable model of the human head
Stylianos Ploumpis, Evangelos Ververas, Eimear O'Sullivan et al.
Three-dimensional Morphable Models (3DMMs) are powerful statistical tools for representing the 3D shapes and textures of an object class. Here we present the most complete 3DMM of the human head to date that includes face, cranium, ears, eyes, teeth and tongue. To achieve this, we propose two methods for combining existing 3DMMs of different overlapping head parts: i. use a regressor to complete missing parts of one model using the other, ii. use the Gaussian Process framework to blend covariance matrices from multiple models. Thus we build a new combined face-and-head shape model that blends the variability and facial detail of an existing face model (the LSFM) with the full head modelling capability of an existing head model (the LYHM). Then we construct and fuse a highly-detailed ear model to extend the variation of the ear shape. Eye and eye region models are incorporated into the head model, along with basic models of the teeth, tongue and inner mouth cavity. The new model achieves state-of-the-art performance. We use our model to reconstruct full head representations from single, unconstrained images allowing us to parameterize craniofacial shape and texture, along with the ear shape, eye gaze and eye color.
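As background, a linear 3DMM commonly generates a shape as a mean shape plus a weighted sum of principal components. The sketch below illustrates that generic formulation only and does not reproduce the combined head model; the two-vertex mesh, the components, and the variances are toy values, and `synthesise_shape` is a hypothetical helper.

```python
import math

def synthesise_shape(mean_shape, components, variances, coeffs):
    """Return mean + sum_i coeffs[i] * sqrt(variances[i]) * components[i].

    Shapes are flattened vertex lists [x0, y0, z0, x1, y1, z1, ...];
    coeffs are expressed in units of standard deviation per component.
    """
    shape = list(mean_shape)
    for comp, var, c in zip(components, variances, coeffs):
        scale = c * math.sqrt(var)
        for k, value in enumerate(comp):
            shape[k] += scale * value
    return shape

# Toy model (assumed, not normalised): two 3D vertices, two components.
mean_shape = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
components = [
    [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],    # moves both vertices along y
    [1.0, 0.0, 0.0, -1.0, 0.0, 0.0],   # moves the vertices apart along x
]
variances = [4.0, 1.0]

neutral = synthesise_shape(mean_shape, components, variances, [0.0, 0.0])
varied = synthesise_shape(mean_shape, components, variances, [1.0, -0.5])
```

Setting every coefficient to zero returns the mean shape, and each coefficient then moves the shape along one learned mode of variation.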
IV Oct 22, 2019
Penalizing small errors using an Adaptive Logarithmic Loss
Chaitanya Kaul, Nick Pears, Hang Dai et al.
Loss functions are error metrics that quantify the difference between a prediction and its corresponding ground truth. Fundamentally, they define a functional landscape for traversal by gradient descent. Although numerous loss functions have been proposed to date in order to handle various machine learning problems, little attention has been given to enhancing these functions to better traverse the loss landscape. In this paper, we simultaneously and significantly mitigate two prominent problems in medical image segmentation, namely i) class imbalance between foreground and background pixels and ii) poor loss function convergence. To this end, we propose an adaptive logarithmic loss function. We compare this loss function with the existing state-of-the-art on the ISIC 2018 dataset, the nuclei segmentation dataset as well as the DRIVE retinal vessel segmentation dataset. We measure the performance of our methodology on benchmark metrics and demonstrate state-of-the-art performance. More generally, we show that our system can be used as a framework for better training of deep neural networks.
CV Mar 9, 2019
Combining 3D Morphable Models: A Large scale Face-and-Head Model
Stylianos Ploumpis, Haoyang Wang, Nick Pears et al.
Three-dimensional Morphable Models (3DMMs) are powerful statistical tools for representing the 3D surfaces of an object class. In this context, we identify an interesting question that has previously not received research attention: is it possible to combine two or more 3DMMs that (a) are built using different templates that perhaps only partly overlap, (b) have different representation capabilities and (c) are built from different datasets that may not be publicly-available? In answering this question, we make two contributions. First, we propose two methods for solving this problem: i. use a regressor to complete missing parts of one model using the other, ii. use the Gaussian Process framework to blend covariance matrices from multiple models. Second, as an example application of our approach, we build a new face-and-head shape model that combines the variability and facial detail of the LSFM with the full head modelling of the LYHM. The resulting combined shape model achieves state-of-the-art performance and outperforms existing head models by a large margin. Finally, as an application experiment, we reconstruct full head representations from single, unconstrained images by utilizing our proposed large-scale model in conjunction with the FaceWarehouse blendshapes for handling expressions.
CV Feb 8, 2019
FocusNet: An attention-based Fully Convolutional Network for Medical Image Segmentation
Chaitanya Kaul, Suresh Manandhar, Nick Pears
We propose a novel technique to incorporate attention within convolutional neural networks using feature maps generated by a separate convolutional autoencoder. Our attention architecture is well suited for incorporation with deep convolutional networks. We evaluate our model on benchmark segmentation datasets in skin cancer segmentation and lung lesion segmentation. Results show highly competitive performance when compared with U-Net and its residual variant.
CV Oct 15, 2018
Vehicle classification using ResNets, localisation and spatially-weighted pooling
Rohan Watkins, Nick Pears, Suresh Manandhar
We investigate whether ResNet architectures can outperform more traditional Convolutional Neural Networks on the task of fine-grained vehicle classification. We train and test ResNet-18, ResNet-34 and ResNet-50 on the Comprehensive Cars dataset without pre-training on other datasets. We then modify the networks to use Spatially Weighted Pooling. Finally, we add a localisation step before the classification process, using a network based on ResNet-50. We find that Spatially Weighted Pooling and localisation both improve the classification accuracy of ResNet-50. Spatially Weighted Pooling increases accuracy by 1.5 percentage points and localisation increases accuracy by 3.4 percentage points. Using both increases accuracy by 3.7 percentage points, giving a top-1 accuracy of 96.351% on the Comprehensive Cars dataset. Our method achieves higher accuracy than a range of methods including those that use traditional CNNs. However, our method does not perform quite as well as pre-trained networks that use Spatially Weighted Pooling.
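Spatially Weighted Pooling is not defined in the abstract. A common reading, assumed here, is that learned spatial masks replace uniform averaging, so each feature map is pooled once per mask as a weighted sum over positions. The sketch below follows that assumption with fixed example masks instead of learned ones; `spatially_weighted_pool` is a hypothetical name.

```python
def spatially_weighted_pool(feature_maps, masks):
    """Pool every feature map with every spatial mask.

    feature_maps: C maps, each an H x W list of rows.
    masks: K masks, each H x W (learned in practice, fixed here).
    Returns C * K values ordered channel by channel, then mask by mask.
    """
    pooled = []
    for fmap in feature_maps:
        for mask in masks:
            pooled.append(sum(
                f * m
                for frow, mrow in zip(fmap, mask)
                for f, m in zip(frow, mrow)
            ))
    return pooled

# Assumed toy input: one 2 x 2 feature map and two masks.
feature_maps = [[[1.0, 2.0], [3.0, 4.0]]]
uniform_mask = [[0.25, 0.25], [0.25, 0.25]]   # reproduces average pooling
corner_mask = [[1.0, 0.0], [0.0, 0.0]]        # attends to the top-left cell only

pooled = spatially_weighted_pool(feature_maps, [uniform_mask, corner_mask])
```

A uniform mask recovers global average pooling, so under this reading the layer generalises average pooling by letting the network emphasise informative image regions.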
CV Mar 21, 2018
Non-rigid 3D Shape Registration using an Adaptive Template
Hang Dai, Nick Pears, William Smith
We present a new fully-automatic non-rigid 3D shape registration (morphing) framework comprising (1) a new 3D landmarking and pose normalisation method; (2) an adaptive shape template method to accelerate the convergence of registration algorithms and achieve a better final shape correspondence and (3) a new iterative registration method that combines Iterative Closest Points with Coherent Point Drift (CPD) to achieve a more stable and accurate correspondence establishment than standard CPD. We call this new morphing approach Iterative Coherent Point Drift (ICPD). Our proposed framework is evaluated qualitatively and quantitatively on three datasets and compared with several other methods. The proposed framework is shown to give state-of-the-art performance.
CV Jan 21, 2016
Automatic 3D modelling of craniofacial form
Nick Pears, Christian Duncan
Three-dimensional models of craniofacial variation over the general population are useful for assessing pre- and post-operative head shape when treating various craniofacial conditions, such as craniosynostosis. We present a new method of automatically building both sagittal profile models and full 3D surface models of the human head using a range of techniques in 3D surface image analysis; in particular, automatic facial landmarking using supervised machine learning, global and local symmetry plane detection using a variant of trimmed iterative closest points, locally-affine template warping (for full 3D models) and a novel pose normalisation using robust iterative ellipse fitting. The PCA-based models built using the new pose normalisation are more compact than those using Generalised Procrustes Analysis and we demonstrate their utility in a clinical case study.
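Compactness of a PCA model is usually measured as the fraction of total variance captured by the first k components: a more compact model reaches a given fraction with fewer components. The sketch below computes that curve from a list of eigenvalues; the eigenvalues are invented for illustration and are not results from the paper.

```python
def compactness(eigenvalues):
    """Cumulative fraction of variance explained by the first k components.

    eigenvalues: PCA variances sorted in descending order.
    Returns a list whose k-th entry (1-based) is sum(first k) / sum(all).
    """
    total = sum(eigenvalues)
    curve = []
    running = 0.0
    for value in eigenvalues:
        running += value
        curve.append(running / total)
    return curve

def components_needed(eigenvalues, fraction):
    """Smallest number of components whose cumulative variance reaches fraction."""
    for k, captured in enumerate(compactness(eigenvalues), start=1):
        if captured >= fraction:
            return k
    return len(eigenvalues)

# Hypothetical eigenvalue spectra for two models of the same data.
model_a = [4.0, 3.0, 2.0, 1.0]
model_b = [7.0, 2.0, 0.5, 0.5]

curve_a = compactness(model_a)
needed_a = components_needed(model_a, 0.9)
needed_b = components_needed(model_b, 0.9)
```

In this toy comparison `model_b` needs fewer components than `model_a` to reach 90% of the variance, which is what a more compact model means in practice.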