Andrew Duncan

ML
h-index25
14papers
126citations
Novelty49%
AI Score54

14 Papers

CVNov 3, 2022
Expanding Accurate Person Recognition to New Altitudes and Ranges: The BRIAR Dataset

David Cornett, Joel Brogan, Nell Barber et al.

Face recognition technology has advanced significantly in recent years due largely to the availability of large and increasingly complex training datasets for use in deep learning models. These datasets, however, typically comprise images scraped from news sites or social media platforms and, therefore, have limited utility in more advanced security, forensics, and military applications. These applications require lower resolution, longer ranges, and elevated viewpoints. To meet these critical needs, we collected and curated the first and second subsets of a large multi-modal biometric dataset designed for use in the research and development (R&D) of biometric recognition technologies under extremely challenging conditions. Thus far, the dataset includes more than 350,000 still images and over 1,300 hours of video footage of approximately 1,000 subjects. To collect this data, we used Nikon DSLR cameras, a variety of commercial surveillance cameras, specialized long-rage R&D cameras, and Group 1 and Group 2 UAV platforms. The goal is to support the development of algorithms capable of accurately recognizing people at ranges up to 1,000 m and from high angles of elevation. These advances will include improvements to the state of the art in face recognition and will support new research in the area of whole-body recognition using methods based on gait and anthropometry. This paper describes methods used to collect and curate the dataset, and the dataset's characteristics at the current stage.

FLMay 24
A substitution lemma for multiple context-free languages

Andrew Duncan, Murray Elder, Lisa Frenkel et al.

We present a necessary condition for an infinite language to be multiple context-free, which we call a Substitution Lemma. We apply it to show a sample selection of languages are not multiple context-free, including the word problem of the group $F_2\times F_2$. We also show that groups with multiple context-free word problem have decidable rational subset membership problem. Our result contrasts with previous work showing that the standard pumping lemma for context-free languages cannot be generalised to multiple context-free languages, and that weak variants of generalised Ogden's lemma do not apply to multiple context-free languages.

OCSep 10, 2024
Modelling Global Trade with Optimal Transport

Thomas Gaskin, Guven Demirel, Marie-Therese Wolfram et al.

Global trade is shaped by a complex mix of factors beyond supply and demand, including tangible variables like transport costs and tariffs, as well as less quantifiable influences such as political and economic relations. Traditionally, economists model trade using gravity models, which rely on explicit covariates that might struggle to capture these subtler drivers of trade. In this work, we employ optimal transport and a deep neural network to learn a time-dependent cost function from data, without imposing a specific functional form. This approach consistently outperforms traditional gravity models in accuracy and has similar performance to three-way gravity models, while providing natural uncertainty quantification. Applying our framework to global food and agricultural trade, we show that the Global South suffered disproportionately from the war in Ukraine's impact on wheat markets. We also analyse the effects of free-trade agreements and trade disputes with China, as well as Brexit's impact on British trade with Europe, uncovering hidden patterns that trade volumes alone cannot reveal.

MLJan 29
Diffusion Path Samplers via Sequential Monte Carlo

James Matthew Young, Paula Cordero-Encinar, Sebastian Reich et al.

We develop a diffusion-based sampler for target distributions known up to a normalising constant. To this end, we rely on the well-known diffusion path that smoothly interpolates between a (simple) base distribution and the target distribution, widely used in diffusion models. Our approach is based on a practical implementation of diffusion-annealed Langevin Monte Carlo, which approximates the diffusion path with convergence guarantees. We tackle the score estimation problem by developing an efficient sequential Monte Carlo sampler that evolves auxiliary variables from conditional distributions along the path, which provides principled score estimates for time-varying distributions. We further develop novel control variate schedules that minimise the variance of these score estimates. Finally, we provide theoretical guarantees and empirically demonstrate the effectiveness of our method on several synthetic and real-world datasets.

MLOct 2, 2025
Uniform-in-time convergence bounds for Persistent Contrastive Divergence Algorithms

Paul Felix Valsecchi Oliva, O. Deniz Akyildiz, Andrew Duncan

We propose a continuous-time formulation of persistent contrastive divergence (PCD) for maximum likelihood estimation (MLE) of unnormalised densities. Our approach expresses PCD as a coupled, multiscale system of stochastic differential equations (SDEs), which perform optimisation of the parameter and sampling of the associated parametrised density, simultaneously. From this novel formulation, we are able to derive explicit bounds for the error between the PCD iterates and the MLE solution for the model parameter. This is made possible by deriving uniform-in-time (UiT) bounds for the difference in moments between the multiscale system and the averaged regime. An efficient implementation of the continuous-time scheme is introduced, leveraging a class of explicit, stable intregators, stochastic orthogonal Runge-Kutta Chebyshev (S-ROCK), for which we provide explicit error estimates in the long-time regime. This leads to a novel method for training energy-based models (EBMs) with explicit error guarantees.

CVJan 23, 2025
Expanding on the BRIAR Dataset: A Comprehensive Whole Body Biometric Recognition Resource at Extreme Distances and Real-World Scenarios (Collections 1-4)

Gavin Jager, David Cornett, Gavin Glenn et al.

The state-of-the-art in biometric recognition algorithms and operational systems has advanced quickly in recent years providing high accuracy and robustness in more challenging collection environments and consumer applications. However, the technology still suffers greatly when applied to non-conventional settings such as those seen when performing identification at extreme distances or from elevated cameras on buildings or mounted to UAVs. This paper summarizes an extension to the largest dataset currently focused on addressing these operational challenges, and describes its composition as well as methodologies of collection, curation, and annotation.

CLAug 15, 2025
Retrieval-augmented reasoning with lean language models

Ryan Sze-Yin Chan, Federico Nanni, Tomas Lazauskas et al.

This technical report details a novel approach to combining reasoning and retrieval augmented generation (RAG) within a single, lean language model architecture. While existing RAG systems typically rely on large-scale models and external APIs, our work addresses the increasing demand for performant and privacy-preserving solutions deployable in resource-constrained or secure environments. Building on recent developments in test-time scaling and small-scale reasoning models, we develop a retrieval augmented conversational agent capable of interpreting complex, domain-specific queries using a lightweight backbone model. Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces derived from frontier models (e.g., DeepSeek-R1) over a curated corpus, in this case, the NHS A-to-Z condition pages. We explore the impact of summarisation-based document compression, synthetic data design, and reasoning-aware fine-tuning on model performance. Evaluation against both non-reasoning and general-purpose lean models demonstrates that our domain-specific fine-tuning approach yields substantial gains in answer accuracy and consistency, approaching frontier-level performance while remaining feasible for local deployment. All implementation details and code are publicly released to support reproducibility and adaptation across domains.

LGNov 18, 2025
How to Train Private Clinical Language Models: A Comparative Study of Privacy-Preserving Pipelines for ICD-9 Coding

Mathieu Dufour, Andrew Duncan

Large language models trained on clinical text risk exposing sensitive patient information, yet differential privacy (DP) methods often severely degrade the diagnostic accuracy needed for deployment. Despite rapid progress in DP optimisation and text generation, it remains unclear which privacy-preserving strategy actually works best for clinical language tasks. We present the first systematic head-to-head comparison of four training pipelines for automated diagnostic coding from hospital discharge summaries. All pipelines use identical 1B-parameter models and matched privacy budgets to predict ICD-9 codes. At moderate and relaxed privacy budgets ($\varepsilon \in \{4, 6\}$), knowledge distillation from DP-trained teachers outperforms both direct DP-SGD and DP-synthetic data training, recovering up to 63\% of the non-private performance whilst maintaining strong empirical privacy (membership-inference AUC $\approx$ 0.5). These findings expose large differences in the privacy-utility trade-off across architectures and identify knowledge distillation as the most practical route to privacy-preserving clinical NLP.

MESep 1, 2025
Sampling as Bandits: Evaluation-Efficient Design for Black-Box Densities

Takuo Matsubara, Andrew Duncan, Simon Cotter et al.

We introduce bandit importance sampling (BIS), a new class of importance sampling methods designed for settings where the target density is expensive to evaluate. In contrast to adaptive importance sampling, which optimises a proposal distribution, BIS directly designs the samples through a sequential strategy that combines space-filling designs with multi-armed bandits. Our method leverages Gaussian process surrogates to guide sample selection, enabling efficient exploration of the parameter space with minimal target evaluations. We establish theoretical guarantees on convergence and demonstrate the effectiveness of the method across a broad range of sampling tasks. BIS delivers accurate approximations with fewer target evaluations, outperforming competing approaches across multimodal, heavy-tailed distributions, and real-world applications to Bayesian inference of computationally expensive models.

MLOct 15, 2024
Deep Optimal Sensor Placement for Black Box Stochastic Simulations

Paula Cordero-Encinar, Tobias Schröder, Peter Yatsyshin et al.

Selecting cost-effective optimal sensor configurations for subsequent inference of parameters in black-box stochastic systems faces significant computational barriers. We propose a novel and robust approach, modelling the joint distribution over input parameters and solution with a joint energy-based model, trained on simulation data. Unlike existing simulation-based inference approaches, which must be tied to a specific set of point evaluations, we learn a functional representation of parameters and solution. This is used as a resolution-independent plug-and-play surrogate for the joint distribution, which can be conditioned over any set of points, permitting an efficient approach to sensor placement. We demonstrate the validity of our framework on a variety of stochastic problems, showing that our method provides highly informative sensor locations at a lower computational cost compared to conventional approaches.

MLMay 15, 2023
Meta-models for transfer learning in source localisation

Lawrence A. Bull, Matthew R. Jones, Elizabeth J. Cross et al.

In practice, non-destructive testing (NDT) procedures tend to consider experiments (and their respective models) as distinct, conducted in isolation and associated with independent data. In contrast, this work looks to capture the interdependencies between acoustic emission (AE) experiments (as meta-models) and then use the resulting functions to predict the model hyperparameters for previously unobserved systems. We utilise a Bayesian multilevel approach (similar to deep Gaussian Processes) where a higher level meta-model captures the inter-task relationships. Our key contribution is how knowledge of the experimental campaign can be encoded between tasks as well as within tasks. We present an example of AE time-of-arrival mapping for source localisation, to illustrate how multilevel models naturally lend themselves to representing aggregate systems in engineering. We constrain the meta-model based on domain knowledge, then use the inter-task functions for transfer learning, predicting hyperparameters for models of previously unobserved experiments (for a specific design).

MLFeb 7, 2022
Grassmann Stein Variational Gradient Descent

Xing Liu, Harrison Zhu, Jean-François Ton et al.

Stein variational gradient descent (SVGD) is a deterministic particle inference algorithm that provides an efficient alternative to Markov chain Monte Carlo. However, SVGD has been found to suffer from variance underestimation when the dimensionality of the target distribution is high. Recent developments have advocated projecting both the score function and the data onto real lines to sidestep this issue, although this can severely overestimate the epistemic (model) uncertainty. In this work, we propose Grassmann Stein variational gradient descent (GSVGD) as an alternative approach, which permits projections onto arbitrary dimensional subspaces. Compared with other variants of SVGD that rely on dimensionality reduction, GSVGD updates the projectors simultaneously for the score function and the data, and the optimal projectors are determined through a coupled Grassmann-valued diffusion process which explores favourable subspaces. Both our theoretical and experimental results suggest that GSVGD enjoys efficient state-space exploration in high-dimensional problems that have an intrinsic low-dimensional structure.

LGFeb 4, 2022
Energy-Based Models for Functional Data using Path Measure Tilting

Jen Ning Lim, Sebastian Vollmer, Lorenz Wolf et al.

Energy-Based Models (EBMs) have proven to be a highly effective approach for modelling densities on finite-dimensional spaces. Their ability to incorporate domain-specific choices and constraints into the structure of the model through composition make EBMs an appealing candidate for applications in physics, biology and computer vision and various other fields. Recently, Energy-Based Processes (EBP) for modelling stochastic processes was proposed for \textit{unconditional} exchangeable data (e.g., point clouds). In this work, we present a novel subclass of EBPs, called $\mathcal{F}$-EBM for \textit{conditional} exchangeable data, which is able to learn distributions of functions (such as curves or surfaces) from functional samples evaluated at finitely many points. Two unique challenges arise in the functional context. Firstly, training data is often not evaluated along a fixed set of points. Secondly, steps must be taken to control the behaviour of the model between evaluation points, to mitigate overfitting. The proposed model is an energy based model on function space that is decomposed spectrally, where a Gaussian Process path measure is used to reweight the distribution to capture smoothness properties of the underlying process being modelled. The resulting model has the ability to utilize irregularly sampled training data and can output predictions at any resolution, providing an effective approach to up-scaling functional data. We demonstrate the efficacy of our proposed approach for modelling a range of datasets, including data collected from Standard and Poor's 500 (S\&P) and UK National grid.

FLU-DYNJan 13, 2022
Density reconstruction from schlieren images through Bayesian nonparametric models

Bryn Noel Ubald, Pranay Seshadri, Andrew Duncan

This study proposes a radically alternate approach for extracting quantitative information from schlieren images. The method uses a scaled, derivative enhanced Gaussian process model to obtain true density estimates from two corresponding schlieren images with the knife-edge at horizontal and vertical orientations. We illustrate our approach on schlieren images taken from a wind tunnel sting model, a supersonic aircraft in flight, and a high-order numerical shock tube simulation.