Adrian Hilton

CV
h-index21
40papers
575citations
Novelty51%
AI Score48

40 Papers

CVAug 23, 2022
Super-resolution 3D Human Shape from a Single Low-Resolution Image

Marco Pesavento, Marco Volino, Adrian Hilton

We propose a novel framework to reconstruct super-resolution human shape from a single low-resolution input image. The approach overcomes limitations of existing approaches that reconstruct 3D human shape from a single image, which require high-resolution images together with auxiliary data such as surface normal or a parametric model to reconstruct high-detail shape. The proposed framework represents the reconstructed shape with a high-detail implicit function. Analogous to the objective of 2D image super-resolution, the approach learns the mapping from a low-resolution shape to its high-resolution counterpart and it is applied to reconstruct 3D shape detail from low-resolution images. The approach is trained end-to-end employing a novel loss function which estimates the information lost between a low and high-resolution representation of the same 3D surface shape. Evaluation for single image reconstruction of clothed people demonstrates that our method achieves high-detail surface reconstruction from low-resolution images without auxiliary data. Extensive experiments show that the proposed approach can estimate super-resolution human geometries with a significantly higher level of detail than that obtained with previous approaches when applied to low-resolution images.

CVAug 9, 2023
PAT: Position-Aware Transformer for Dense Multi-Label Action Detection

Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson et al.

We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing methods, the self-attention mechanism in transformers loses the temporal positional information, which is essential for robust action detection. To address this issue, we (i) embed relative positional encoding in the self-attention mechanism and (ii) exploit multi-scale temporal relationships by designing a novel non hierarchical network, in contrast to the recent transformer-based approaches that use a hierarchical structure. We argue that joining the self-attention mechanism with multiple sub-sampling processes in the hierarchical approaches results in increased loss of positional information. We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets, and show that PAT improves the current state-of-the-art result by 1.1% and 0.6% mAP on the Charades and MultiTHUMOS datasets, respectively, thereby achieving the new state-of-the-art mAP at 26.5% and 44.6%, respectively. We also perform extensive ablation studies to examine the impact of the different components of our proposed network.

CVMar 26, 2023
SEM-POS: Grammatically and Semantically Correct Video Captioning

Asmar Nadeem, Adrian Hilton, Robert Dawes et al.

Generating grammatically and semantically correct captions in video captioning is a challenging task. The captions generated from the existing methods are either word-by-word that do not align with grammatical structure or miss key information from the input videos. To address these issues, we introduce a novel global-local fusion network, with a Global-Local Fusion Block (GLFB) that encodes and fuses features from different parts of speech (POS) components with visual-spatial features. We use novel combinations of different POS components - 'determinant + subject', 'auxiliary verb', 'verb', and 'determinant + object' for supervision of the POS blocks - Det + Subject, Aux Verb, Verb, and Det + Object respectively. The novel global-local fusion network together with POS blocks helps align the visual features with language description to generate grammatically and semantically correct captions. Extensive qualitative and quantitative experiments on benchmark MSVD and MSRVTT datasets demonstrate that the proposed approach generates more grammatically and semantically correct captions compared to the existing methods, achieving the new state-of-the-art. Ablations on the POS blocks and the GLFB demonstrate the impact of the contributions on the proposed method.

CVOct 25, 2023
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA

Asmar Nadeem, Adrian Hilton, Robert Dawes et al.

In the context of Audio Visual Question Answering (AVQA) tasks, the audio visual modalities could be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings; the audio-visual (AV) information passing through the network isn't aligned on Spatial and Temporal levels; and, inter-modal (audio and visual) Semantic information is often not balanced within a context; this results in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment (CAD) network that addresses the challenges in AVQA methods by i) introducing a parameter-free stochastic Contextual block that ensures robust audio and visual alignment on the Spatial level; ii) proposing a pre-training technique for dynamic audio and visual alignment on Temporal level in a self-supervised setting, and iii) introducing a cross-attention mechanism to balance audio and visual information on Semantic level. The proposed novel CAD network improves the overall performance over the state-of-the-art methods on average by 9.4% on the MUSIC-AVQA dataset. We also demonstrate that our proposed contributions to AVQA can be added to the existing methods to improve their performance without additional complexity requirements.

CVJul 15, 2024
COSMU: Complete 3D human shape from monocular unconstrained images

Marco Pesavento, Marco Volino, Adrian Hilton

We present a novel framework to reconstruct complete 3D human shapes from a given target image by leveraging monocular unconstrained images. The objective of this work is to reproduce high-quality details in regions of the reconstructed human body that are not visible in the input target. The proposed methodology addresses the limitations of existing approaches for reconstructing 3D human shapes from a single image, which cannot reproduce shape details in occluded body regions. The missing information of the monocular input can be recovered by using multiple views captured from multiple cameras. However, multi-view reconstruction methods necessitate accurately calibrated and registered images, which can be challenging to obtain in real-world scenarios. Given a target RGB image and a collection of multiple uncalibrated and unregistered images of the same individual, acquired using a single camera, we propose a novel framework to generate complete 3D human shapes. We introduce a novel module to generate 2D multi-view normal maps of the person registered with the target input image. The module consists of body part-based reference selection and body part-based registration. The generated 2D normal maps are then processed by a multi-view attention-based neural implicit model that estimates an implicit representation of the 3D shape, ensuring the reproduction of details in both observed and occluded regions. Extensive experiments demonstrate that the proposed approach estimates higher quality details in the non-visible regions of the 3D clothed human shapes compared to related methods, without using parametric models.

MMMar 11
Catalogue Grounded Multimodal Attribution for Museum Video under Resource and Regulatory Constraints

Minsak Nanang, Adrian Hilton, Armin Mustafa

Audiovisual (AV) archives in museums and galleries are growing rapidly, but much of this material remains effectively locked away because it lacks consistent, searchable metadata. Existing method for archiving requires extensive manual effort. We address this by automating the most labour intensive part of the workflow: catalogue style metadata curation for in gallery video, grounded in an existing collection database. Concretely, we propose catalogue-grounded multimodal attribution for museum AV content using an open, locally deployable video language model. We design a multi pass pipeline that (i) summarises artworks in a video, (ii) generates catalogue style descriptions and genre labels, and (iii) attempts to attribute title and artist via conservative similarity matching to the structured catalogue. Early deployments on a painting catalogue suggest that this framework can improve AV archive discoverability while respecting resource constraints, data sovereignty, and emerging regulation, offering a transferable template for application-driven machine learning in other high-stakes domains.

CVMar 15, 2024
ANIM: Accurate Neural Implicit Model for Human Reconstruction from a single RGB-D image

Marco Pesavento, Yuanlu Xu, Nikolaos Sarafianos et al.

Recent progress in human shape learning, shows that neural implicit models are effective in generating 3D human surfaces from limited number of views, and even from a single RGB image. However, existing monocular approaches still struggle to recover fine geometric details such as face, hands or cloth wrinkles. They are also easily prone to depth ambiguities that result in distorted geometries along the camera optical axis. In this paper, we explore the benefits of incorporating depth observations in the reconstruction process by introducing ANIM, a novel method that reconstructs arbitrary 3D human shapes from single-view RGB-D images with an unprecedented level of accuracy. Our model learns geometric details from both multi-resolution pixel-aligned and voxel-aligned features to leverage depth information and enable spatial relationships, mitigating depth ambiguities. We further enhance the quality of the reconstructed shape by introducing a depth-supervision strategy, which improves the accuracy of the signed distance field estimation of points that lie on the reconstructed surface. Experiments demonstrate that ANIM outperforms state-of-the-art works that use RGB, surface normals, point cloud or RGB-D data as input. In addition, we introduce ANIM-Real, a new multi-modal dataset comprising high-quality scans paired with consumer-grade RGB-D camera, and our protocol to fine-tune ANIM, enabling high-quality reconstruction from real-world human capture.

CVMay 17, 2024
CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing

Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson et al.

Weakly supervised audio-visual video parsing (AVVP) methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels. Existing approaches tackle this by leveraging unimodal and cross-modal contexts. However, we argue that while cross-modal learning is beneficial for detecting audible-visible events, in the weakly supervised scenario, it negatively impacts unaligned audible or visible events by introducing irrelevant modality information. In this paper, we propose CoLeaF, a novel learning framework that optimizes the integration of cross-modal context in the embedding space such that the network explicitly learns to combine cross-modal information for audible-visible events while filtering them out for unaligned events. Additionally, as videos often involve complex class relationships, modelling them improves performance. However, this introduces extra computational costs into the network. Our framework is designed to leverage cross-class relationships during training without incurring additional computations at inference. Furthermore, we propose new metrics to better evaluate a method's capabilities in performing AVVP. Our extensive experiments demonstrate that CoLeaF significantly improves the state-of-the-art results by an average of 1.9% and 2.4% F-score on the LLP and UnAV-100 datasets, respectively.

CVJun 17, 2025
Risk Estimation of Knee Osteoarthritis Progression via Predictive Multi-task Modelling from Efficient Diffusion Model using X-ray Images

David Butler, Adrian Hilton, Gustavo Carneiro

Medical imaging plays a crucial role in assessing knee osteoarthritis (OA) risk by enabling early detection and disease monitoring. Recent machine learning methods have improved risk estimation (i.e., predicting the likelihood of disease progression) and predictive modelling (i.e., the forecasting of future outcomes based on current data) using medical images, but clinical adoption remains limited due to their lack of interpretability. Existing approaches that generate future images for risk estimation are complex and impractical. Additionally, previous methods fail to localize anatomical knee landmarks, limiting interpretability. We address these gaps with a new interpretable machine learning method to estimate the risk of knee OA progression via multi-task predictive modelling that classifies future knee OA severity and predicts anatomical knee landmarks from efficiently generated high-quality future images. Such image generation is achieved by leveraging a diffusion model in a class-conditioned latent space to forecast disease progression, offering a visual representation of how particular health conditions may evolve. Applied to the Osteoarthritis Initiative dataset, our approach improves the state-of-the-art (SOTA) by 2\%, achieving an AUC of 0.71 in predicting knee OA progression while offering ~9% faster inference time.

GRMar 18, 2025
Optimized 3D Gaussian Splatting using Coarse-to-Fine Image Frequency Modulation

Umar Farooq, Jean-Yves Guillemaut, Adrian Hilton et al.

The field of Novel View Synthesis has been revolutionized by 3D Gaussian Splatting (3DGS), which enables high-quality scene reconstruction that can be rendered in real-time. 3DGS-based techniques typically suffer from high GPU memory and disk storage requirements which limits their practical application on consumer-grade devices. We propose Opti3DGS, a novel frequency-modulated coarse-to-fine optimization framework that aims to minimize the number of Gaussian primitives used to represent a scene, thus reducing memory and storage demands. Opti3DGS leverages image frequency modulation, initially enforcing a coarse scene representation and progressively refining it by modulating frequency details in the training images. On the baseline 3DGS, we demonstrate an average reduction of 62% in Gaussians, a 40% reduction in the training GPU memory requirements and a 20% reduction in optimization time without sacrificing the visual quality. Furthermore, we show that our method integrates seamlessly with many 3DGS-based techniques, consistently reducing the number of Gaussian primitives while maintaining, and often improving, visual quality. Additionally, Opti3DGS inherently produces a level-of-detail scene representation at no extra cost, a natural byproduct of the optimization pipeline. Results and code will be made publicly available.

CVMar 14, 2025
Enhanced Multi-View Pedestrian Detection Using Probabilistic Occupancy Volume

Reef Alturki, Adrian Hilton, Jean-Yves Guillemaut

Occlusion poses a significant challenge in pedestrian detection from a single view. To address this, multi-view detection systems have been utilized to aggregate information from multiple perspectives. Recent advances in multi-view detection utilized an early-fusion strategy that strategically projects the features onto the ground plane, where detection analysis is performed. A promising approach in this context is the use of 3D feature-pulling technique, which constructs a 3D feature volume of the scene by sampling the corresponding 2D features for each voxel. However, it creates a 3D feature volume of the whole scene without considering the potential locations of pedestrians. In this paper, we introduce a novel model that efficiently leverages traditional 3D reconstruction techniques to enhance deep multi-view pedestrian detection. This is accomplished by complementing the 3D feature volume with probabilistic occupancy volume, which is constructed using the visual hull technique. The probabilistic occupancy volume focuses the model's attention on regions occupied by pedestrians and improves detection accuracy. Our model outperforms state-of-the-art models on the MultiviewX dataset, with an MODA of 97.3%, while achieving competitive performance on the Wildtrack dataset.

CVOct 17, 2024
360U-Former: HDR Illumination Estimation with Panoramic Adapted Vision Transformers

Jack Hilliard, Adrian Hilton, Jean-Yves Guillemaut

Recent illumination estimation methods have focused on enhancing the resolution and improving the quality and diversity of the generated textures. However, few have explored tailoring the neural network architecture to the Equirectangular Panorama (ERP) format utilised in image-based lighting. Consequently, high dynamic range images (HDRI) results usually exhibit a seam at the side borders and textures or objects that are warped at the poles. To address this shortcoming we propose a novel architecture, 360U-Former, based on a U-Net style Vision-Transformer which leverages the work of PanoSWIN, an adapted shifted window attention tailored to the ERP format. To the best of our knowledge, this is the first purely Vision-Transformer model used in the field of illumination estimation. We train 360U-Former as a GAN to generate HDRI from a limited field of view low dynamic range image (LDRI). We evaluate our method using current illumination estimation evaluation protocols and datasets, demonstrating that our approach outperforms existing and state-of-the-art methods without the artefacts typically associated with the use of the ERP format.

CVJul 28, 2025
HDR Environment Map Estimation with Latent Diffusion Models

Jack Hilliard, Adrian Hilton, Jean-Yves Guillemaut

We advance the field of HDR environment map estimation from a single-view image by establishing a novel approach leveraging the Latent Diffusion Model (LDM) to produce high-quality environment maps that can plausibly light mirror-reflective surfaces. A common issue when using the ERP representation, the format used by the vast majority of approaches, is distortions at the poles and a seam at the sides of the environment map. We remove the border seam artefact by proposing an ERP convolutional padding in the latent autoencoder. Additionally, we investigate whether adapting the diffusion network architecture to the ERP format can improve the quality and accuracy of the estimated environment map by proposing a panoramically-adapted Diffusion Transformer architecture. Our proposed PanoDiT network reduces ERP distortions and artefacts, but at the cost of image quality and plausibility. We evaluate with standard benchmarks to demonstrate that our models estimate high-quality environment maps that perform competitively with state-of-the-art approaches in both image quality and lighting accuracy.

CVApr 3, 2025
Attention-Aware Multi-View Pedestrian Tracking

Reef Alturki, Adrian Hilton, Jean-Yves Guillemaut

In spite of the recent advancements in multi-object tracking, occlusion poses a significant challenge. Multi-camera setups have been used to address this challenge by providing a comprehensive coverage of the scene. Recent multi-view pedestrian detection models have highlighted the potential of an early-fusion strategy, projecting feature maps of all views to a common ground plane or the Bird's Eye View (BEV), and then performing detection. This strategy has been shown to improve both detection and tracking performance. However, the perspective transformation results in significant distortion on the ground plane, affecting the robustness of the appearance features of the pedestrians. To tackle this limitation, we propose a novel model that incorporates attention mechanisms in a multi-view pedestrian tracking scenario. Our model utilizes an early-fusion strategy for detection, and a cross-attention mechanism to establish robust associations between pedestrians in different frames, while efficiently propagating pedestrian features across frames, resulting in a more robust feature representation for each pedestrian. Extensive experiments demonstrate that our model outperforms state-of-the-art models, with an IDF1 score of $96.1\%$ on Wildtrack dataset, and $85.7\%$ on MultiviewX dataset.

CVFeb 25, 2025
Realistic Clothed Human and Object Joint Reconstruction from a Single Image

Ayushi Dutta, Marco Pesavento, Marco Volino et al.

Recent approaches to jointly reconstruct 3D humans and objects from a single RGB image represent 3D shapes with template-based or coarse models, which fail to capture details of loose clothing on human bodies. In this paper, we introduce a novel implicit approach for jointly reconstructing realistic 3D clothed humans and objects from a monocular view. For the first time, we model both the human and the object with an implicit representation, allowing to capture more realistic details such as clothing. This task is extremely challenging due to human-object occlusions and the lack of 3D information in 2D images, often leading to poor detail reconstruction and depth ambiguity. To address these problems, we propose a novel attention-based neural implicit model that leverages image pixel alignment from both the input human-object image for a global understanding of the human-object scene and from local separate views of the human and object images to improve realism with, for example, clothing details. Additionally, the network is conditioned on semantic features derived from an estimated human-object pose prior, which provides 3D spatial information about the shared space of humans and objects. To handle human occlusion caused by objects, we use a generative diffusion model that inpaints the occluded regions, recovering otherwise lost details. For training and evaluation, we introduce a synthetic dataset featuring rendered scenes of inter-occluded 3D human scans and diverse objects. Extensive evaluation on both synthetic and real-world datasets demonstrates the superior quality of the proposed human-object reconstructions over competitive methods.

CVJan 30, 2025
Reframing Dense Action Detection (RefDense): A Paradigm Shift in Problem Solving & a Novel Optimization Strategy

Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson et al.

Dense action detection involves detecting multiple co-occurring actions while action classes are often ambiguous and represent overlapping concepts. We argue that handling the dual challenge of temporal and class overlaps is too complex to effectively be tackled by a single network. To address this, we propose to decompose the task of detecting dense ambiguous actions into detecting dense, unambiguous sub-concepts that form the action classes (i.e., action entities and action motions), and assigning these sub-tasks to distinct sub-networks. By isolating these unambiguous concepts, the sub-networks can focus exclusively on resolving a single challenge, dense temporal overlaps. Furthermore, simultaneous actions in a video often exhibit interrelationships, and exploiting these relationships can improve the method performance. However, current dense action detection networks fail to effectively learn these relationships due to their reliance on binary cross-entropy optimization, which treats each class independently. To address this limitation, we propose providing explicit supervision on co-occurring concepts during network optimization through a novel language-guided contrastive learning loss. Our extensive experiments demonstrate the superiority of our approach over state-of-the-art methods, achieving substantial improvements of 3.8% and 1.7% on average across all metrics on the challenging benchmark datasets, Charades and MultiTHUMOS.

CVJun 20, 2024
Benchmarking Monocular 3D Dog Pose Estimation Using In-The-Wild Motion Capture Data

Moira Shooter, Charles Malleson, Adrian Hilton

We introduce a new benchmark analysis focusing on 3D canine pose estimation from monocular in-the-wild images. A multi-modal dataset 3DDogs-Lab was captured indoors, featuring various dog breeds trotting on a walkway. It includes data from optical marker-based mocap systems, RGBD cameras, IMUs, and a pressure mat. While providing high-quality motion data, the presence of optical markers and limited background diversity make the captured video less representative of real-world conditions. To address this, we created 3DDogs-Wild, a naturalised version of the dataset where the optical markers are in-painted and the subjects are placed in diverse environments, enhancing its utility for training RGB image-based pose detectors. We show that using the 3DDogs-Wild to train the models leads to improved performance when evaluating on in-the-wild data. Additionally, we provide a thorough analysis using various pose estimation models, revealing their respective strengths and weaknesses. We believe that our findings, coupled with the datasets provided, offer valuable insights for advancing 3D animal pose estimation.

CVJun 10, 2024
NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative

Asmar Nadeem, Faegheh Sardari, Robert Dawes et al.

Existing video captioning benchmarks and models lack causal-temporal narrative, which is sequences of events linked through cause and effect, unfolding over time and driven by characters or agents. This lack of narrative restricts models' ability to generate text descriptions that capture the causal and temporal dynamics inherent in video content. To address this gap, we propose NarrativeBridge, an approach comprising of: (1) a novel Causal-Temporal Narrative (CTN) captions benchmark generated using a large language model and few-shot prompting, explicitly encoding cause-effect temporal relationships in video descriptions; and (2) a Cause-Effect Network (CEN) with separate encoders for capturing cause and effect dynamics, enabling effective learning and generation of captions with causal-temporal narrative. Extensive experiments demonstrate that CEN significantly outperforms state-of-the-art models in articulating the causal and temporal aspects of video content: 17.88 and 17.44 CIDEr on the MSVD-CTN and MSRVTT-CTN datasets, respectively. Cross-dataset evaluations further showcase CEN's strong generalization capabilities. The proposed framework understands and generates nuanced text descriptions with intricate causal-temporal narrative structures present in videos, addressing a critical limitation in video captioning. For project details, visit https://narrativebridge.github.io/.

CVJun 10, 2024
An Effective-Efficient Approach for Dense Multi-Label Action Detection

Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson et al.

Unlike the sparse label action detection task, where a single action occurs in each timestamp of a video, in a dense multi-label scenario, actions can overlap. To address this challenging task, it is necessary to simultaneously learn (i) temporal dependencies and (ii) co-occurrence action relationships. Recent approaches model temporal information by extracting multi-scale features through hierarchical transformer-based networks. However, the self-attention mechanism in transformers inherently loses temporal positional information. We argue that combining this with multiple sub-sampling processes in hierarchical designs can lead to further loss of positional information. Preserving this information is essential for accurate action detection. In this paper, we address this issue by proposing a novel transformer-based network that (a) employs a non-hierarchical structure when modelling different ranges of temporal dependencies and (b) embeds relative positional encoding in its transformer layers. Furthermore, to model co-occurrence action relationships, current methods explicitly embed class relations into the transformer network. However, these approaches are not computationally efficient, as the network needs to compute all possible pair action class relations. We also overcome this challenge by introducing a novel learning paradigm that allows the network to benefit from explicitly modelling temporal co-occurrence action dependencies without imposing their additional computational costs during inference. We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets and show that our method improves the current state-of-the-art results.

CVJun 6, 2024
Improving Gaussian Splatting with Localized Points Management

Haosen Yang, Chenhao Zhang, Wenqing Wang et al.

Point management is critical for optimizing 3D Gaussian Splatting models, as point initiation (e.g., via structure from motion) is often distributionally inappropriate. Typically, Adaptive Density Control (ADC) algorithm is adopted, leveraging view-averaged gradient magnitude thresholding for point densification, opacity thresholding for pruning, and regular all-points opacity reset. We reveal that this strategy is limited in tackling intricate/special image regions (e.g., transparent) due to inability of identifying all 3D zones requiring point densification, and lacking an appropriate mechanism to handle ill-conditioned points with negative impacts (e.g., occlusion due to false high opacity). To address these limitations, we propose a Localized Point Management (LPM) strategy, capable of identifying those error-contributing zones in greatest need for both point addition and geometry calibration. Zone identification is achieved by leveraging the underlying multiview geometry constraints, subject to image rendering errors. We apply point densification in the identified zones and then reset the opacity of the points in front of these regions, creating a new opportunity to correct poorly conditioned points. Serving as a versatile plugin, LPM can be seamlessly integrated into existing static 3D and dynamic 4D Gaussian Splatting models with minimal additional cost. Experimental evaluations validate the efficacy of our LPM in boosting a variety of existing 3D/4D models both quantitatively and qualitatively. Notably, LPM improves both static 3DGS and dynamic SpaceTimeGS to achieve state-of-the-art rendering quality while retaining real-time speeds, excelling on challenging datasets such as Tanks & Temples and the Neural 3D Video dataset.

CVAug 31, 2021
Super-Resolution Appearance Transfer for 4D Human Performances

Marco Pesavento, Marco Volino, Adrian Hilton

A common problem in the 4D reconstruction of people from multi-view video is the quality of the captured dynamic texture appearance which depends on both the camera resolution and capture volume. Typically the requirement to frame cameras to capture the volume of a dynamic performance ($>50m^3$) results in the person occupying only a small proportion $<$ 10% of the field of view. Even with ultra high-definition 4k video acquisition this results in sampling the person at less-than standard definition 0.5k video resolution resulting in low-quality rendering. In this paper we propose a solution to this problem through super-resolution appearance transfer from a static high-resolution appearance capture rig using digital stills cameras ($> 8k$) to capture the person in a small volume ($<8m^3$). A pipeline is proposed for super-resolution appearance transfer from high-resolution static capture to dynamic video performance capture to produce super-resolution dynamic textures. This addresses two key problems: colour mapping between different camera systems; and dynamic texture map super-resolution using a learnt model. Comparative evaluation demonstrates a significant qualitative and quantitative improvement in rendering the 4D performance capture with super-resolution dynamic texture appearance. The proposed approach reproduces the high-resolution detail of the static capture whilst maintaining the appearance dynamics of the captured video.

CVAug 31, 2021
Attention-based Multi-Reference Learning for Image Super-Resolution

Marco Pesavento, Marco Volino, Adrian Hilton

This paper proposes a novel Attention-based Multi-Reference Super-resolution network (AMRSR) that, given a low-resolution image, learns to adaptively transfer the most similar texture from multiple reference images to the super-resolution output whilst maintaining spatial coherence. The use of multiple reference images together with attention-based sampling is demonstrated to achieve significantly improved performance over state-of-the-art reference super-resolution approaches on multiple benchmark datasets. Reference super-resolution approaches have recently been proposed to overcome the ill-posed problem of image super-resolution by providing additional information from a high-resolution reference image. Multi-reference super-resolution extends this approach by providing a more diverse pool of image features to overcome the inherent information deficit whilst maintaining memory efficiency. A novel hierarchical attention-based sampling approach is introduced to learn the similarity between low-resolution image features and multiple reference images based on a perceptual loss. Ablation demonstrates the contribution of both multi-reference and hierarchical attention-based sampling to overall performance. Perceptual and quantitative ground-truth evaluation demonstrates significant improvement in performance even when the reference images deviate significantly from the target image. The project website can be found at https://marcopesavento.github.io/AMRSR/

CVJul 31, 2021
SyDog: A Synthetic Dog Dataset for Improved 2D Pose Estimation

Moira Shooter, Charles Malleson, Adrian Hilton

Estimating the pose of animals can facilitate the understanding of animal motion which is fundamental in disciplines such as biomechanics, neuroscience, ethology, robotics and the entertainment industry. Human pose estimation models have achieved high performance due to the huge amount of training data available. Achieving the same results for animal pose estimation is challenging due to the lack of animal pose datasets. To address this problem we introduce SyDog: a synthetic dataset of dogs containing ground truth pose and bounding box coordinates which was generated using the game engine, Unity. We demonstrate that pose estimation models trained on SyDog achieve better performance than models trained purely on real data and significantly reduce the need for the labour intensive labelling of images. We release the SyDog dataset as a training and evaluation benchmark for research in animal motion.

CVApr 19, 2021
Multi-person Implicit Reconstruction from a Single Image

Armin Mustafa, Akin Caliskan, Lourdes Agapito et al.

We present a new end-to-end learning framework to obtain detailed and spatially coherent reconstructions of multiple people from a single image. Existing multi-person methods suffer from two main drawbacks: they are often model-based and therefore cannot capture accurate 3D models of people with loose clothing and hair; or they require manual intervention to resolve occlusions or interactions. Our method addresses both limitations by introducing the first end-to-end learning approach to perform model-free implicit reconstruction for realistic 3D capture of multiple clothed people in arbitrary poses (with occlusions) from a single image. Our network simultaneously estimates the 3D geometry of each person and their 6DOF spatial locations, to obtain a coherent multi-human reconstruction. In addition, we introduce a new synthetic dataset that depicts images with a varying number of inter-occluded humans and a variety of clothing and hair styles. We demonstrate robust, high-resolution reconstructions on images of multiple humans with complex occlusions, loose clothing and a large variety of poses and scenes. Our quantitative evaluation on both synthetic and real-world datasets demonstrates state-of-the-art performance with significant improvements in the accuracy and completeness of the reconstructions over competing approaches.

CVApr 19, 2021
Temporal Consistency Loss for High Resolution Textured and Clothed 3DHuman Reconstruction from Monocular Video

Akin Caliskan, Armin Mustafa, Adrian Hilton

We present a novel method to learn temporally consistent 3D reconstruction of clothed people from a monocular video. Recent methods for 3D human reconstruction from monocular video using volumetric, implicit or parametric human shape models, produce per frame reconstructions giving temporally inconsistent output and limited performance when applied to video. In this paper, we introduce an approach to learn temporally consistent features for textured reconstruction of clothed 3D human sequences from monocular video by proposing two advances: a novel temporal consistency loss function; and hybrid representation learning for implicit 3D reconstruction from 2D images and coarse 3D geometry. The proposed advances improve the temporal consistency and accuracy of both the 3D reconstruction and texture prediction from a monocular video. Comprehensive comparative performance evaluation on images of people demonstrates that the proposed method significantly outperforms the state-of-the-art learning-based single image 3D human shape estimation approaches achieving significant improvement of reconstruction accuracy, completeness, quality and temporal consistency.

CVSep 29, 2020
Multi-View Consistency Loss for Improved Single-Image 3D Reconstruction of Clothed People

Akin Caliskan, Armin Mustafa, Evren Imre et al.

We present a novel method to improve the accuracy of the 3D reconstruction of clothed human shape from a single image. Recent work has introduced volumetric, implicit and model-based shape learning frameworks for reconstruction of objects and people from one or more images. However, the accuracy and completeness for reconstruction of clothed people is limited due to the large variation in shape resulting from clothing, hair, body size, pose and camera viewpoint. This paper introduces two advances to overcome this limitation: firstly a new synthetic dataset of realistic clothed people, 3DVH; and secondly, a novel multiple-view loss function for training of monocular volumetric shape estimation, which is demonstrated to significantly improve generalisation and reconstruction accuracy. The 3DVH dataset of realistic clothed 3D human models rendered with diverse natural backgrounds is demonstrated to allows transfer to reconstruction from real images of people. Comprehensive comparative performance evaluation on both synthetic and real images of people demonstrates that the proposed method significantly outperforms the previous state-of-the-art learning-based single image 3D human shape estimation approaches achieving significant improvement of reconstruction accuracy, completeness, and quality. An ablation study shows that this is due to both the proposed multiple-view training and the new 3DVH dataset. The code and the dataset can be found at the project website: https://akincaliskan3d.github.io/MV3DH/.

CVSep 11, 2020
Spectral Analysis Network for Deep Representation Learning and Image Clustering

Jinghua Wang, Adrian Hilton, Jianmin Jiang

Deep representation learning is a crucial procedure in multimedia analysis and attracts increasing attention. Most of the popular techniques rely on convolutional neural network and require a large amount of labeled data in the training procedure. However, it is time consuming or even impossible to obtain the label information in some tasks due to cost limitation. Thus, it is necessary to develop unsupervised deep representation learning techniques. This paper proposes a new network structure for unsupervised deep representation learning based on spectral analysis, which is a popular technique with solid theory foundations. Compared with the existing spectral analysis methods, the proposed network structure has at least three advantages. Firstly, it can identify the local similarities among images in patch level and thus more robust against occlusion. Secondly, through multiple consecutive spectral analysis procedures, the proposed network can learn more clustering-friendly representations and is capable to reveal the deep correlations among data samples. Thirdly, it can elegantly integrate different spectral analysis procedures, so that each spectral analysis procedure can have their individual strengths in dealing with different data sample distributions. Extensive experimental results show the effectiveness of the proposed methods on various image clustering tasks.

ASMar 14, 2020
Audio-Visual Spatial Aligment Requirements of Central and Peripheral Object Events

Davide Berghi, Hanne Stenzel, Marco Volino et al.

Immersive audio-visual perception relies on the spatial integration of both auditory and visual information which are heterogeneous sensing modalities with different fields of reception and spatial resolution. This study investigates the perceived coherence of audiovisual object events presented either centrally or peripherally with horizontally aligned/misaligned sound. Various object events were selected to represent three acoustic feature classes. Subjective test results in a simulated virtual environment from 18 participants indicate a wider capture region in the periphery, with an outward bias favoring more lateral sounds. Centered stimulus results support previous findings for simpler scenes.

CVOct 2, 2019
Learning Dense Wide Baseline Stereo Matching for People

Akin Caliskan, Armin Mustafa, Evren Imre et al.

Existing methods for stereo work on narrow baseline image pairs giving limited performance between wide baseline views. This paper proposes a framework to learn and estimate dense stereo for people from wide baseline image pairs. A synthetic people stereo patch dataset (S2P2) is introduced to learn wide baseline dense stereo matching for people. The proposed framework not only learns human specific features from synthetic data but also exploits pooling layer and data augmentation to adapt to real data. The network learns from the human specific stereo patches from the proposed dataset for wide-baseline stereo estimation. In addition to patch match learning, a stereo constraint is introduced in the framework to solve wide baseline stereo reconstruction of humans. Quantitative and qualitative performance evaluation against state-of-the-art methods of proposed method demonstrates improved wide baseline stereo reconstruction on challenging datasets. We show that it is possible to learn stereo matching from synthetic people dataset and improve performance on real datasets for stereo reconstruction of people from narrow and wide baseline stereo data.

CVAug 8, 2019
Semantic Estimation of 3D Body Shape and Pose using Minimal Cameras

Andrew Gilbert, Matthew Trumble, Adrian Hilton et al.

We aim to simultaneously estimate the 3D articulated pose and high fidelity volumetric occupancy of human performance, from multiple viewpoint video (MVV) with as few as two views. We use a multi-channel symmetric 3D convolutional encoder-decoder with a dual loss to enforce the learning of a latent embedding that enables inference of skeletal joint positions and a volumetric reconstruction of the performance. The inference is regularised via a prior learned over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions, and show this to generalise well across unseen subjects and actions. We demonstrate improved reconstruction accuracy and lower pose estimation error relative to prior work on two MVV performance capture datasets: Human 3.6M and TotalCapture.

CVAug 8, 2019
EdgeNet: Semantic Scene Completion from a Single RGB-D Image

Aloisio Dourado, Teofilo Emidio de Campos, Hansung Kim et al.

Semantic scene completion is the task of predicting a complete 3D representation of volumetric occupancy with corresponding semantic labels for a scene from a single point of view. Previous works on Semantic Scene Completion from RGB-D data used either only depth or depth with colour by projecting the 2D image into the 3D volume resulting in a sparse data representation. In this work, we present a new strategy to encode colour information in 3D space using edge detection and flipped truncated signed distance. We also present EdgeNet, a new end-to-end neural network architecture capable of handling features generated from the fusion of depth and edge information. Experimental results show improvement of 6.9% over the state-of-the-art result on real data, for end-to-end approaches.

CVJul 23, 2019
U4D: Unsupervised 4D Dynamic Scene Understanding

Armin Mustafa, Chris Russell, Adrian Hilton

We introduce the first approach to solve the challenging problem of unsupervised 4D visual scene understanding for complex dynamic scenes with multiple interacting people from multi-view video. Our approach simultaneously estimates a detailed model that includes a per-pixel semantically and temporally coherent reconstruction, together with instance-level segmentation exploiting photo-consistency, semantic and motion information. We further leverage recent advances in 3D pose estimation to constrain the joint semantic instance segmentation and 4D temporally coherent reconstruction. This enables per person semantic instance segmentation of multiple interacting people in complex dynamic scenes. Extensive evaluation of the joint visual scene understanding framework against state-of-the-art methods on challenging indoor and outdoor sequences demonstrates a significant (approx 40%) improvement in semantic segmentation, reconstruction and scene flow accuracy.

CVJul 18, 2019
Temporally Coherent General Dynamic Scene Reconstruction

Armin Mustafa, Marco Volino, Hansung Kim et al.

Existing techniques for dynamic scene reconstruction from multiple wide-baseline cameras primarily focus on reconstruction in controlled environments, with fixed calibrated cameras and strong prior constraints. This paper introduces a general approach to obtain a 4D representation of complex dynamic scenes from multi-view wide-baseline static or moving cameras without prior knowledge of the scene structure, appearance, or illumination. Contributions of the work are: An automatic method for initial coarse reconstruction to initialize joint estimation; Sparse-to-dense temporal correspondence integrated with joint multi-view segmentation and reconstruction to introduce temporal coherence; and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes by introducing shape constraint. Comparison with state-of-the-art approaches on a variety of complex indoor and outdoor scenes, demonstrates improved accuracy in both multi-view segmentation and dense reconstruction. This paper demonstrates unsupervised reconstruction of complete temporally coherent 4D scene models with improved non-rigid object segmentation and shape reconstruction and its application to free-viewpoint rendering and virtual reality.

CVJul 5, 2018
Volumetric performance capture from minimal camera viewpoints

Andrew Gilbert, Marco Volino, John Collomosse et al.

We present a convolutional autoencoder that enables high fidelity volumetric reconstructions of human performance to be captured from multi-view video comprising only a small set of camera views. Our method yields similar end-to-end reconstruction error to that of a probabilistic visual hull computed using significantly more (double or more) viewpoints. We use a deep prior implicitly learned by the autoencoder trained over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions. This opens up the possibility of high-end volumetric performance capture in on-set and prosumer scenarios where time or cost prohibit a high witness camera count.

CVJul 4, 2018
Deep Autoencoder for Combined Human Pose Estimation and body Model Upscaling

Matthew Trumble, Andrew Gilbert, Adrian Hilton et al.

We present a method for simultaneously estimating 3D human pose and body shape from a sparse set of wide-baseline camera views. We train a symmetric convolutional autoencoder with a dual loss that enforces learning of a latent representation that encodes skeletal joint positions, and at the same time learns a deep representation of volumetric body shape. We harness the latter to up-scale input volumetric data by a factor of $4 \times$, whilst recovering a 3D estimate of joint positions with equal or greater accuracy than the state of the art. Inference runs in real-time (25 fps) and has the potential for passive human behaviour monitoring where there is a requirement for high fidelity estimation of human body shape and pose.

CVApr 30, 2018
4D Temporally Coherent Light-field Video

Armin Mustafa, Marco Volino, Jean-yves Guillemaut et al.

Light-field video has recently been used in virtual and augmented reality applications to increase realism and immersion. However, existing light-field methods are generally limited to static scenes due to the requirement to acquire a dense scene representation. The large amount of data and the absence of methods to infer temporal coherence pose major challenges in storage, compression and editing compared to conventional video. In this paper, we propose the first method to extract a spatio-temporally coherent light-field video representation. A novel method to obtain Epipolar Plane Images (EPIs) from a spare light-field camera array is proposed. EPIs are used to constrain scene flow estimation to obtain 4D temporally coherent representations of dynamic light-fields. Temporal coherence is achieved on a variety of light-field datasets. Evaluation of the proposed light-field scene flow against existing multi-view dense correspondence approaches demonstrates a significant improvement in accuracy of temporal coherence.

CVFeb 13, 2018
Semantic Scene Completion Combining Colour and Depth: preliminary experiments

Andre Bernardes Soares Guedes, Teofilo Emidio de Campos, Adrian Hilton

Semantic scene completion is the task of producing a complete 3D voxel representation of volumetric occupancy with semantic labels for a scene from a single-view observation. We built upon the recent work of Song et al. (CVPR 2017), who proposed SSCnet, a method that performs scene completion and semantic labelling in a single end-to-end 3D convolutional network. SSCnet uses only depth maps as input, even though depth maps are usually obtained from devices that also capture colour information, such as RGBD sensors and stereo cameras. In this work, we investigate the potential of the RGB colour channels to improve SSCnet.

SDAug 23, 2017
Object-Based Audio Rendering

Philip Jackson, Filippo Fazi, Frank Melchior et al.

Apparatus and methods are disclosed for performing object-based audio rendering on a plurality of audio objects which define a sound scene, each audio object comprising at least one audio signal and associated metadata. The apparatus comprises: a plurality of renderers each capable of rendering one or more of the audio objects to output rendered audio data; and object adapting means for adapting one or more of the plurality of audio objects for a current reproduction scenario, the object adapting means being configured to send the adapted one or more audio objects to one or more of the plurality of renderers.

CVMar 10, 2016
Temporally coherent 4D reconstruction of complex dynamic scenes

Armin Mustafa, Hansung Kim, Jean-Yves Guillemaut et al.

This paper presents an approach for reconstruction of 4D temporally coherent models of complex dynamic scenes. No prior knowledge is required of scene structure or camera calibration allowing reconstruction from multiple moving cameras. Sparse-to-dense temporal correspondence is integrated with joint multi-view segmentation and reconstruction to obtain a complete 4D representation of static and dynamic objects. Temporal coherence is exploited to overcome visual ambiguities resulting in improved reconstruction of complex scenes. Robust joint segmentation and reconstruction of dynamic objects is achieved by introducing a geodesic star convexity constraint. Comparative evaluation is performed on a variety of unstructured indoor and outdoor dynamic scenes with hand-held cameras and multiple people. This demonstrates reconstruction of complete temporally coherent 4D scene models with improved nonrigid object segmentation and shape reconstruction.

CVSep 30, 2015
General Dynamic Scene Reconstruction from Multiple View Video

Armin Mustafa, Hansung Kim, Jean-Yves Guillemaut et al.

This paper introduces a general approach to dynamic scene reconstruction from multiple moving cameras without prior knowledge or limiting constraints on the scene structure, appearance, or illumination. Existing techniques for dynamic scene reconstruction from multiple wide-baseline camera views primarily focus on accurate reconstruction in controlled environments, where the cameras are fixed and calibrated and background is known. These approaches are not robust for general dynamic scenes captured with sparse moving cameras. Previous approaches for outdoor dynamic scene reconstruction assume prior knowledge of the static background appearance and structure. The primary contributions of this paper are twofold: an automatic method for initial coarse dynamic scene segmentation and reconstruction without prior knowledge of background appearance or structure; and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes from multiple wide-baseline static or moving cameras. Evaluation is performed on a variety of indoor and outdoor scenes with cluttered backgrounds and multiple dynamic non-rigid objects such as people. Comparison with state-of-the-art approaches demonstrates improved accuracy in both multiple view segmentation and dense reconstruction. The proposed approach also eliminates the requirement for prior knowledge of scene structure and appearance.