IVJan 19, 2023
The role of noise in denoising models for anomaly detection in medical imagesAntanas Kascenas, Pedro Sanchez, Patrick Schrempf et al.
Pathological brain lesions exhibit diverse appearance in brain images, in terms of intensity, texture, shape, size, and location. Comprehensive sets of data and annotations are difficult to acquire. Therefore, unsupervised anomaly detection approaches have been proposed using only normal data for training, with the aim of detecting outlier anomalous voxels at test time. Denoising methods, for instance classical denoising autoencoders (DAEs) and more recently emerging diffusion models, are a promising approach, however naive application of pixelwise noise leads to poor anomaly detection performance. We show that optimization of the spatial resolution and magnitude of the noise improves the performance of different model training regimes, with similar noise parameter adjustments giving good performance for both DAEs and diffusion models. Visual inspection of the reconstructions suggests that the training noise influences the trade-off between the extent of the detail that is reconstructed and the extent of erasure of anomalies, both of which contribute to better anomaly detection performance. We validate our findings on two real-world datasets (tumor detection in brain MRI and hemorrhage/ischemia/tumor detection in brain CT), showing good detection on diverse anomaly appearances. Overall, we find that a DAE trained with coarse noise is a fast and simple method that gives state-of-the-art accuracy. Diffusion models applied to anomaly detection are as yet in their infancy and provide a promising avenue for further research.
CVJan 26Code
Splat-Portrait: Generalizing Talking Heads with Gaussian SplattingTong Shi, Melonie de Almeida, Daniela Ivanova et al.
Talking Head Generation aims at synthesizing natural-looking talking videos from speech and a single portrait image. Previous 3D talking head generation methods have relied on domain-specific heuristics such as warping-based facial motion representation priors to animate talking motions, yet still produce inaccurate 3D avatar reconstructions, thus undermining the realism of generated animations. We introduce Splat-Portrait, a Gaussian-splatting-based method that addresses the challenges of 3D head reconstruction and lip motion synthesis. Our approach automatically learns to disentangle a single portrait image into a static 3D reconstruction represented as static Gaussian Splatting, and a predicted whole-image 2D background. It then generates natural lip motion conditioned on input audio, without any motion driven priors. Training is driven purely by 2D reconstruction and score-distillation losses, without 3D supervision nor landmarks. Experimental results demonstrate that Splat-Portrait exhibits superior performance on talking head generation and novel view synthesis, achieving better visual quality compared to previous works. Our project code and supplementary documents are public available at https://github.com/stonewalking/Splat-portrait.
CVFeb 9Code
Modeling 3D Pedestrian-Vehicle Interactions for Vehicle-Conditioned Pose ForecastingGuangxun Zhu, Xuan Liu, Nicolas Pugeault et al.
Accurately predicting pedestrian motion is crucial for safe and reliable autonomous driving in complex urban environments. In this work, we present a 3D vehicle-conditioned pedestrian pose forecasting framework that explicitly incorporates surrounding vehicle information. To support this, we enhance the Waymo-3DSkelMo dataset with aligned 3D vehicle bounding boxes, enabling realistic modeling of multi-agent pedestrian-vehicle interactions. We introduce a sampling scheme to categorize scenes by pedestrian and vehicle count, facilitating training across varying interaction complexities. Our proposed network adapts the TBIFormer architecture with a dedicated vehicle encoder and pedestrian-vehicle interaction cross-attention module to fuse pedestrian and vehicle features, allowing predictions to be conditioned on both historical pedestrian motion and surrounding vehicles. Extensive experiments demonstrate substantial improvements in forecasting accuracy and validate different approaches for modeling pedestrian-vehicle interactions, highlighting the importance of vehicle-aware 3D pose prediction for autonomous driving. Code is available at: https://github.com/GuangxunZhu/VehCondPose3D
CVMar 6, 2019Code
Human Attention in Image Captioning: Dataset and AnalysisSen He, Hamed R. Tavakoli, Ali Borji et al.
In this work, we present a novel dataset consisting of eye movements and verbal descriptions recorded synchronously over images. Using this data, we study the differences in human attention during free-viewing and image captioning tasks. We look into the relationship between human attention and language constructs during perception and sentence articulation. We also analyse attention deployment mechanisms in the top-down soft attention approach that is argued to mimic human attention in captioning tasks, and investigate whether visual saliency can help image captioning. Our study reveals that (1) human attention behaviour differs in free-viewing and image description tasks. Humans tend to fixate on a greater variety of regions under the latter task, (2) there is a strong relationship between described objects and attended objects ($97\%$ of the described objects are being attended), (3) a convolutional neural network as feature encoder accounts for human-attended regions during image captioning to a great extent (around $78\%$), (4) soft-attention mechanism differs from human attention, both spatially and temporally, and there is low correlation between caption scores and attention consistency scores. These indicate a large gap between humans and machines in regards to top-down attention, and (5) by integrating the soft attention model with image saliency, we can significantly improve the model's performance on Flickr30k and MSCOCO benchmarks. The dataset can be found at: https://github.com/SenHe/Human-Attention-in-Image-Captioning.
LGMay 8
Enhancing Federated Quadruplet Learning: Stochastic Client Selection and Embedding Stability AnalysisOzgu Goksu, Nicolas Pugeault
Federated Learning (FL) enables decentralised model training across distributed clients without requiring data centralisation. However, the generalisation performance of the global model is usually degraded by data heterogeneity across clients, particularly under limited data availability and class imbalance. To address this challenge, we propose FedQuad, a novel method that explicitly enforces minimising intra-class representations while enabling inter-class splits across clients. By jointly minimising distances between positive pairs and maximising distances between negative pairs, the proposed approach mitigates representation misalignment introduced during model aggregation. We evaluate our method on CIFAR-10, CIFAR-100, and Tiny-ImageNet under diverse non-IID settings and varying numbers of clients, demonstrating consistent improvements over existing baselines. Additionally, we provide a comprehensive analysis of metric learning-based approaches in both centralised and federated environments, highlighting their effectiveness in alleviating representation collapse under heterogeneous data distributions.
CVDec 22, 2025
A Convolutional Neural Deferred Shader for Physics Based RenderingZhuo He, Yingdong Ru, Qianying Liu et al.
Recent advances in neural rendering have achieved impressive results on photorealistic shading and relighting, by using a multilayer perceptron (MLP) as a regression model to learn the rendering equation from a real-world dataset. Such methods show promise for photorealistically relighting real-world objects, which is difficult to classical rendering, as there is no easy-obtained material ground truth. However, significant challenges still remain the dense connections in MLPs result in a large number of parameters, which requires high computation resources, complicating the training, and reducing performance during rendering. Data driven approaches require large amounts of training data for generalization; unbalanced data might bias the model to ignore the unusual illumination conditions, e.g. dark scenes. This paper introduces pbnds+: a novel physics-based neural deferred shading pipeline utilizing convolution neural networks to decrease the parameters and improve the performance in shading and relighting tasks; Energy regularization is also proposed to restrict the model reflection during dark illumination. Extensive experiments demonstrate that our approach outperforms classical baselines, a state-of-the-art neural shading model, and a diffusion-based method.
LGApr 25
GIFT: Global stabilisation via Intrinsic Fine TuningRory Young, Nicolas Pugeault
Deep reinforcement learning policies achieve strong performance in complex continuous control environments with nonlinear contact forces. However, these policies often produce chaotic state dynamics, with trivially small changes to the initial conditions significantly impacting the long-term behaviour of the control system. This high sensitivity to initial conditions limits the application of Deep RL to real-world control systems where performance and stability guarantees are often required. To address this issue, we propose Global stabilisation via Intrinsic Fine Tuning (GIFT), a general-purpose training framework which directly optimises the global stability of existing high-performing deep RL policies using a custom reward function. We demonstrate that GIFT increase the stability of the control interaction while maintaining comparable task performance, thereby improving the suitability of deep RL policies for real-world control systems.
LGOct 14, 2024
Enhancing Robustness in Deep Reinforcement Learning: A Lyapunov Exponent ApproachRory Young, Nicolas Pugeault
Deep reinforcement learning agents achieve state-of-the-art performance in a wide range of simulated control tasks. However, successful applications to real-world problems remain limited. One reason for this dichotomy is because the learnt policies are not robust to observation noise or adversarial attacks. In this paper, we investigate the robustness of deep RL policies to a single small state perturbation in deterministic continuous control tasks. We demonstrate that RL policies can be deterministically chaotic, as small perturbations to the system state have a large impact on subsequent state and reward trajectories. This unstable non-linear behaviour has two consequences: first, inaccuracies in sensor readings, or adversarial attacks, can cause significant performance degradation; second, even policies that show robust performance in terms of rewards may have unpredictable behaviour in practice. These two facets of chaos in RL policies drastically restrict the application of deep RL to real-world problems. To address this issue, we propose an improvement on the successful Dreamer V3 architecture, implementing Maximal Lyapunov Exponent regularisation. This new approach reduces the chaotic state dynamics, rendering the learnt policies more resilient to sensor noise or adversarial attacks and thereby improving the suitability of deep reinforcement learning for real-world applications.
CVJan 17, 2025
On the Benefits of Instance Decomposition in Video Prediction ModelsEliyas Suleyman, Paul Henderson, Nicolas Pugeault
Video prediction is a crucial task for intelligent agents such as robots and autonomous vehicles, since it enables them to anticipate and act early on time-critical incidents. State-of-the-art video prediction methods typically model the dynamics of a scene jointly and implicitly, without any explicit decomposition into separate objects. This is challenging and potentially sub-optimal, as every object in a dynamic scene has their own pattern of movement, typically somewhat independent of others. In this paper, we investigate the benefit of explicitly modeling the objects in a dynamic scene separately within the context of latent-transformer video prediction models. We conduct detailed and carefully-controlled experiments on both synthetic and real-world datasets; our results show that decomposing a dynamic scene leads to higher quality predictions compared with models of a similar capacity that lack such decomposition.
LGDec 19, 2024
Hybrid-Regularized Magnitude Pruning for Robust Federated Learning under Covariate ShiftOzgu Goksu, Nicolas Pugeault
Federated Learning offers a solution for decentralised model training, addressing the difficulties associated with distributed data and privacy in machine learning. However, the fact of data heterogeneity in federated learning frequently hinders the global model's generalisation, leading to low performance and adaptability to unseen data. This problem is particularly critical for specialised applications such as medical imaging, where both the data and the number of clients are limited. In this paper, we empirically demonstrate that inconsistencies in client-side training distributions substantially degrade the performance of federated learning models across multiple benchmark datasets. We propose a novel FL framework using a combination of pruning and regularisation of clients' training to improve the sparsity, redundancy, and robustness of neural connections, and thereby the resilience to model aggregation. To address a relatively unexplored dimension of data heterogeneity, we further introduce a novel benchmark dataset, CelebA-Gender, specifically designed to control for within-class distributional shifts across clients based on attribute variations, thereby complementing the predominant focus on inter-class imbalance in prior federated learning research. Comprehensive experiments on many datasets like CIFAR-10, MNIST, and the newly introduced CelebA-Gender dataset demonstrate that our method consistently outperforms standard FL baselines, yielding more robust and generalizable models in heterogeneous settings.
CVNov 20, 2025
Flow and Depth Assisted Video Prediction with Latent TransformerEliyas Suleyman, Paul Henderson, Eksan Firkat et al.
Video prediction is a fundamental task for various downstream applications, including robotics and world modeling. Although general video prediction models have achieved remarkable performance in standard scenarios, occlusion is still an inherent challenge in video prediction. We hypothesize that providing explicit information about motion (via point-flow) and geometric structure (via depth-maps) will enable video prediction models to perform better in situations with occlusion and the background motion. To investigate this, we present the first systematic study dedicated to occluded video prediction. We use a standard multi-object latent transformer architecture to predict future frames, but modify this to incorporate information from depth and point-flow. We evaluate this model in a controlled setting on both synthetic and real-world datasets with not only appearance-based metrics but also Wasserstein distances on object masks, which can effectively measure the motion distribution of the prediction. We find that when the prediction model is assisted with point flow and depth, it performs better in occluded scenarios and predicts more accurate background motion compared to models without the help of these modalities.
LGSep 4, 2025
FedQuad: Federated Stochastic Quadruplet Learning to Mitigate Data HeterogeneityOzgu Goksu, Nicolas Pugeault
Federated Learning (FL) provides decentralised model training, which effectively tackles problems such as distributed data and privacy preservation. However, the generalisation of global models frequently faces challenges from data heterogeneity among clients. This challenge becomes even more pronounced when datasets are limited in size and class imbalance. To address data heterogeneity, we propose a novel method, \textit{FedQuad}, that explicitly optimises smaller intra-class variance and larger inter-class variance across clients, thereby decreasing the negative impact of model aggregation on the global model over client representations. Our approach minimises the distance between similar pairs while maximising the distance between negative pairs, effectively disentangling client data in the shared feature space. We evaluate our method on the CIFAR-10 and CIFAR-100 datasets under various data distributions and with many clients, demonstrating superior performance compared to existing approaches. Furthermore, we provide a detailed analysis of metric learning-based strategies within both supervised and federated learning paradigms, highlighting their efficacy in addressing representational learning challenges in federated settings.
CVApr 24, 2025
Generative Fields: Uncovering Hierarchical Feature Control for StyleGAN via Inverted Receptive FieldsZhuo He, Paul Henderson, Nicolas Pugeault
StyleGAN has demonstrated the ability of GANs to synthesize highly-realistic faces of imaginary people from random noise. One limitation of GAN-based image generation is the difficulty of controlling the features of the generated image, due to the strong entanglement of the low-dimensional latent space. Previous work that aimed to control StyleGAN with image or text prompts modulated sampling in W latent space, which is more expressive than Z latent space. However, W space still has restricted expressivity since it does not control the feature synthesis directly; also the feature embedding in W space requires a pre-training process to reconstruct the style signal, limiting its application. This paper introduces the concept of "generative fields" to explain the hierarchical feature synthesis in StyleGAN, inspired by the receptive fields of convolution neural networks (CNNs). Additionally, we propose a new image editing pipeline for StyleGAN using generative field theory and the channel-wise style latent space S, utilizing the intrinsic structural feature of CNNs to achieve disentangled control of feature synthesis at synthesis time.
CVApr 16, 2025
Beyond Reconstruction: A Physics Based Neural Deferred Shader for Photo-realistic RenderingZhuo He, Paul Henderson, Nicolas Pugeault
Deep learning based rendering has achieved major improvements in photo-realistic image synthesis, with potential applications including visual effects in movies and photo-realistic scene building in video games. However, a significant limitation is the difficulty of decomposing the illumination and material parameters, which limits such methods to reconstructing an input scene, without any possibility to control these parameters. This paper introduces a novel physics based neural deferred shading pipeline to decompose the data-driven rendering process, learn a generalizable shading function to produce photo-realistic results for shading and relighting tasks; we also propose a shadow estimator to efficiently mimic shadowing effects. Our model achieves improved performance compared to classical models and a state-of-art neural shading model, and enables generalizable photo-realistic shading from arbitrary illumination input.
CVMar 28, 2024
The Bad Batches: Enhancing Self-Supervised Learning in Image Classification Through Representative Batch CurationOzgu Goksu, Nicolas Pugeault
The pursuit of learning robust representations without human supervision is a longstanding challenge. The recent advancements in self-supervised contrastive learning approaches have demonstrated high performance across various representation learning challenges. However, current methods depend on the random transformation of training examples, resulting in some cases of unrepresentative positive pairs that can have a large impact on learning. This limitation not only impedes the convergence of the learning process but the robustness of the learnt representation as well as requiring larger batch sizes to improve robustness to such bad batches. This paper attempts to alleviate the influence of false positive and false negative pairs by employing pairwise similarity calculations through the Fréchet ResNet Distance (FRD), thereby obtaining robust representations from unlabelled data. The effectiveness of the proposed method is substantiated by empirical results, where a linear classifier trained on self-supervised contrastive representations achieved an impressive 87.74\% top-1 accuracy on STL10 and 99.31\% on the Flower102 dataset. These results emphasize the potential of the proposed approach in pushing the boundaries of the state-of-the-art in self-supervised contrastive learning, particularly for image classification tasks.
MLJan 25, 2022
A deep mixture density network for outlier-corrected interpolation of crowd-sourced weather dataCharlie Kirkwood, Theo Economou, Henry Odbert et al.
As the costs of sensors and associated IT infrastructure decreases - as exemplified by the Internet of Things - increasing volumes of observational data are becoming available for use by environmental scientists. However, as the number of available observation sites increases, so too does the opportunity for data quality issues to emerge, particularly given that many of these sensors do not have the benefit of official maintenance teams. To realise the value of crowd sourced 'Internet of Things' type observations for environmental modelling, we require approaches that can automate the detection of outliers during the data modelling process so that they do not contaminate the true distribution of the phenomena of interest. To this end, here we present a Bayesian deep learning approach for spatio-temporal modelling of environmental variables with automatic outlier detection. Our approach implements a Gaussian-uniform mixture density network whose dual purposes - modelling the phenomenon of interest, and learning to classify and ignore outliers - are achieved simultaneously, each by specifically designed branches of our neural network. For our example application, we use the Met Office's Weather Observation Website data, an archive of observations from around 1900 privately run and unofficial weather stations across the British Isles. Using data on surface air temperature, we demonstrate how our deep mixture model approach enables the modelling of a highly skilled spatio-temporal temperature distribution without contamination from spurious observations. We hope that adoption of our approach will help unlock the potential of incorporating a wider range of observation sources, including from crowd sourcing, into future environmental models.
CVNov 13, 2020
Transformer-Encoder Detector Module: Using Context to Improve Robustness to Adversarial Attacks on Object DetectionFaisal Alamri, Sinan Kalkan, Nicolas Pugeault
Deep neural network approaches have demonstrated high performance in object recognition (CNN) and detection (Faster-RCNN) tasks, but experiments have shown that such architectures are vulnerable to adversarial attacks (FFF, UAP): low amplitude perturbations, barely perceptible by the human eye, can lead to a drastic reduction in labeling performance. This article proposes a new context module, called \textit{Transformer-Encoder Detector Module}, that can be applied to an object detector to (i) improve the labeling of object instances; and (ii) improve the detector's robustness to adversarial attacks. The proposed model achieves higher mAP, F1 scores and AUC average score of up to 13\% compared to the baseline Faster-RCNN detector, and an mAP score 8 points higher on images subjected to FFF or UAP attacks due to the inclusion of both contextual and visual features extracted from scene and encoded into the model. The result demonstrates that a simple ad-hoc context module can improve the reliability of object detectors significantly.
MLAug 17, 2020
Bayesian deep learning for mapping via auxiliary information: a new era for geostatistics?Charlie Kirkwood, Theo Economou, Nicolas Pugeault
For geospatial modelling and mapping tasks, variants of kriging - the spatial interpolation technique developed by South African mining engineer Danie Krige - have long been regarded as the established geostatistical methods. However, kriging and its variants (such as regression kriging, in which auxiliary variables or derivatives of these are included as covariates) are relatively restrictive models and lack capabilities that have been afforded to us in the last decade by deep neural networks. Principal among these is feature learning - the ability to learn filters to recognise task-specific patterns in gridded data such as images. Here we demonstrate the power of feature learning in a geostatistical context, by showing how deep neural networks can automatically learn the complex relationships between point-sampled target variables and gridded auxiliary variables (such as those provided by remote sensing), and in doing so produce detailed maps of chosen target variables. At the same time, in order to cater for the needs of decision makers who require well-calibrated probabilities, we obtain uncertainty estimates via a Bayesian approximation known as Monte Carlo dropout. In our example, we produce a national-scale probabilistic geochemical map from point-sampled assay data, with auxiliary information provided by a terrain elevation grid. Unlike traditional geostatistical approaches, auxiliary variable grids are fed into our deep neural network raw. There is no need to provide terrain derivatives (e.g. slope angles, roughness, etc) because the deep neural network is capable of learning these and arbitrarily more complex derivatives as necessary to maximise predictive performance. We hope our results will raise awareness of the suitability of Bayesian deep learning - and its feature learning capabilities - for large-scale geostatistical applications where uncertainty matters.
CVMay 12, 2020
Real-time Facial Expression Recognition "In The Wild'' by Disentangling 3D Expression from IdentityMohammad Rami Koujan, Luma Alharbawee, Giorgos Giannakakis et al.
Human emotions analysis has been the focus of many studies, especially in the field of Affective Computing, and is important for many applications, e.g. human-computer intelligent interaction, stress analysis, interactive games, animations, etc. Solutions for automatic emotion analysis have also benefited from the development of deep learning approaches and the availability of vast amount of visual facial data on the internet. This paper proposes a novel method for human emotion recognition from a single RGB image. We construct a large-scale dataset of facial videos (\textbf{FaceVid}), rich in facial dynamics, identities, expressions, appearance and 3D pose variations. We use this dataset to train a deep Convolutional Neural Network for estimating expression parameters of a 3D Morphable Model and combine it with an effective back-end emotion classifier. Our proposed framework runs at 50 frames per second and is capable of robustly estimating parameters of 3D expression variation and accurately recognizing facial expressions from in-the-wild images. We present extensive experimental evaluation that shows that the proposed method outperforms the compared techniques in estimating the 3D expression parameters and achieves state-of-the-art performance in recognising the basic emotions from facial images, as well as recognising stress from facial videos. %compared to the current state of the art in emotion recognition from facial images.
APMay 6, 2020
A framework for probabilistic weather forecast post-processing across models and lead times using machine learningCharlie Kirkwood, Theo Economou, Henry Odbert et al.
Forecasting the weather is an increasingly data intensive exercise. Numerical Weather Prediction (NWP) models are becoming more complex, with higher resolutions, and there are increasing numbers of different models in operation. While the forecasting skill of NWP models continues to improve, the number and complexity of these models poses a new challenge for the operational meteorologist: how should the information from all available models, each with their own unique biases and limitations, be combined in order to provide stakeholders with well-calibrated probabilistic forecasts to use in decision making? In this paper, we use a road surface temperature example to demonstrate a three-stage framework that uses machine learning to bridge the gap between sets of separate forecasts from NWP models and the 'ideal' forecast for decision support: probabilities of future weather outcomes. First, we use Quantile Regression Forests to learn the error profile of each numerical model, and use these to apply empirically-derived probability distributions to forecasts. Second, we combine these probabilistic forecasts using quantile averaging. Third, we interpolate between the aggregate quantiles in order to generate a full predictive distribution, which we demonstrate has properties suitable for decision support. Our results suggest that this approach provides an effective and operationally viable framework for the cohesive post-processing of weather forecasts across multiple models and lead times to produce a well-calibrated probabilistic output.
CVApr 29, 2020
Image Captioning through Image TransformerSen He, Wentong Liao, Hamed R. Tavakoli et al.
Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect in captioning is the notion of attention: How to decide what to describe and in which order. Inspired by the successes in text analysis and translation, previous work have proposed the \textit{transformer} architecture for image captioning. However, the structure between the \textit{semantic units} in images (usually the detected regions from object detection model) and sentences (each single word) is different. Limited work has been done to adapt the transformer's internal architecture to images. In this work, we introduce the \textbf{\textit{image transformer}}, which consists of a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationship between image regions. Our design widen the original transformer layer's inner architecture to adapt to the structure of images. With only regions feature as inputs, our model achieves new state-of-the-art performance on both MSCOCO offline and online testing benchmarks.
CVJun 6, 2019
Contextual Relabelling of Detected ObjectsFaisal Alamri, Nicolas Pugeault
Contextual information, such as the co-occurrence of objects and the spatial and relative size among objects provides deep and complex information about scenes. It also can play an important role in improving object detection. In this work, we present two contextual models (rescoring and re-labeling models) that leverage contextual information (16 contextual relationships are applied in this paper) to enhance the state-of-the-art RCNN-based object detection (Faster RCNN). We experimentally demonstrate that our models lead to enhancement in detection performance using the most common dataset used in this field (MSCOCO).
CVMar 6, 2019
Understanding and Visualizing Deep Visual Saliency ModelsSen He, Hamed R. Tavakoli, Ali Borji et al.
Recently, data-driven deep saliency models have achieved high performance and have outperformed classical saliency models, as demonstrated by results on datasets such as the MIT300 and SALICON. Yet, there remains a large gap between the performance of these models and the inter-human baseline. Some outstanding questions include what have these models learned, how and where they fail, and how they can be improved. This article attempts to answer these questions by analyzing the representations learned by individual neurons located at the intermediate layers of deep saliency models. To this end, we follow the steps of existing deep saliency models, that is borrowing a pre-trained model of object recognition to encode the visual features and learning a decoder to infer the saliency. We consider two cases when the encoder is used as a fixed feature extractor and when it is fine-tuned, and compare the inner representations of the network. To study how the learned representations depend on the task, we fine-tune the same network using the same image set but for two different tasks: saliency prediction versus scene classification. Our analyses reveal that: 1) some visual regions (e.g. head, text, symbol, vehicle) are already encoded within various layers of the network pre-trained for object recognition, 2) using modern datasets, we find that fine-tuning pre-trained models for saliency prediction makes them favor some categories (e.g. head) over some others (e.g. text), 3) although deep models of saliency outperform classical models on natural images, the converse is true for synthetic stimuli (e.g. pop-out search arrays), an evidence of significant difference between human and data-driven saliency models, and 4) we confirm that, after-fine tuning, the change in inner-representations is mostly due to the task and not the domain shift in the data.
LGJan 18, 2019
On-Policy Trust Region Policy Optimisation with Replay BuffersDmitry Kangin, Nicolas Pugeault
Building upon the recent success of deep reinforcement learning methods, we investigate the possibility of on-policy reinforcement learning improvement by reusing the data from several consecutive policies. On-policy methods bring many benefits, such as ability to evaluate each resulting policy. However, they usually discard all the information about the policies which existed before. In this work, we propose adaptation of the replay buffer concept, borrowed from the off-policy learning setting, to create the method, combining advantages of on- and off-policy learning. To achieve this, the proposed algorithm generalises the $Q$-, value and advantage functions for data from multiple policies. The method uses trust region optimisation, while avoiding some of the common problems of the algorithms such as TRPO or ACKTR: it uses hyperparameters to replace the trust region selection heuristics, as well as the trainable covariance matrix instead of the fixed one. In many cases, the method not only improves the results comparing to the state-of-the-art trust region on-policy learning algorithms such as PPO, ACKTR and TRPO, but also with respect to their off-policy counterpart DDPG.
CVMar 15, 2018
Aggregated Sparse Attention for Steering Angle PredictionSen He, Dmitry Kangin, Yang Mi et al.
In this paper, we apply the attention mechanism to autonomous driving for steering angle prediction. We propose the first model, applying the recently introduced sparse attention mechanism to visual domain, as well as the aggregated extension for this model. We show the improvement of the proposed method, comparing to no attention as well as to different types of attention.
CVMar 15, 2018
Salient Region SegmentationSen He, Nicolas Pugeault
Saliency prediction is a well studied problem in computer vision. Early saliency models were based on low-level hand-crafted feature derived from insights gained in neuroscience and psychophysics. In the wake of deep learning breakthrough, a new cohort of models were proposed based on neural network architectures, allowing significantly higher gaze prediction than previous shallow models, on all metrics. However, most models treat the saliency prediction as a \textit{regression} problem, and accurate regression of high-dimensional data is known to be a hard problem. Furthermore, it is unclear that intermediate levels of saliency (ie, neither very high, nor very low) are meaningful: Something is either salient, or it is not. Drawing from those two observations, we reformulate the saliency prediction problem as a salient region \textit{segmentation} problem. We demonstrate that the reformulation allows for faster convergence than the classical regression problem, while performance is comparable to state-of-the-art. We also visualise the general features learned by the model, which are showed to be consistent with insights from psychophysics.
CVMar 15, 2018
What Catches the Eye? Visualizing and Understanding Deep Saliency ModelsSen He, Ali Borji, Yang Mi et al.
Deep convolutional neural networks have demonstrated high performances for fixation prediction in recent years. How they achieve this, however, is less explored and they remain to be black box models. Here, we attempt to shed light on the internal structure of deep saliency models and study what features they extract for fixation prediction. Specifically, we use a simple yet powerful architecture, consisting of only one CNN and a single resolution input, combined with a new loss function for pixel-wise fixation prediction during free viewing of natural scenes. We show that our simple method is on par or better than state-of-the-art complicated saliency models. Furthermore, we propose a method, related to saliency model evaluation metrics, to visualize deep models for fixation prediction. Our method reveals the inner representations of deep models for fixation prediction and provides evidence that saliency, as experienced by humans, is likely to involve high-level semantic knowledge in addition to low-level perceptual cues. Our results can be useful to measure the gap between current saliency models and the human inter-observer model and to build new models to close this gap.
CVJan 12, 2018
Deep saliency: What is learnt by a deep network about saliency?Sen He, Nicolas Pugeault
Deep convolutional neural networks have achieved impressive performance on a broad range of problems, beating prior art on established benchmarks, but it often remains unclear what are the representations learnt by those systems and how they achieve such performance. This article examines the specific problem of saliency detection, where benchmarks are currently dominated by CNN-based approaches, and investigates the properties of the learnt representation by visualizing the artificial neurons' receptive fields. We demonstrate that fine tuning a pre-trained network on the saliency detection task lead to a profound transformation of the network's deeper layers. Moreover we argue that this transformation leads to the emergence of receptive fields conceptually similar to the centre-surround filters hypothesized by early research on visual saliency.
ROSep 5, 2017
SeDAR - Semantic Detection and Ranging: Humans can localise without LiDAR, can robots?Oscar Mendez, Simon Hadfield, Nicolas Pugeault et al.
How does a person work out their location using a floorplan? It is probably safe to say that we do not explicitly measure depths to every visible surface and try to match them against different pose estimates in the floorplan. And yet, this is exactly how most robotic scan-matching algorithms operate. Similarly, we do not extrude the 2D geometry present in the floorplan into 3D and try to align it to the real-world. And yet, this is how most vision-based approaches localise. Humans do the exact opposite. Instead of depth, we use high level semantic cues. Instead of extruding the floorplan up into the third dimension, we collapse the 3D world into a 2D representation. Evidence of this is that many of the floorplans we use in everyday life are not accurate, opting instead for high levels of discriminative landmarks. In this work, we use this insight to present a global localisation approach that relies solely on the semantic labels present in the floorplan and extracted from RGB images. While our approach is able to use range measurements if available, we demonstrate that they are unnecessary as we can achieve results comparable to state-of-the-art without them.