Marcos Matabuena

ML
h-index53
12papers
22citations
Novelty56%
AI Score53

12 Papers

MLSep 4, 2024
Conformal Prediction in Dynamic Biological Systems

Alberto Portela, Julio R. Banga, Marcos Matabuena

Uncertainty quantification (UQ) is the process of systematically determining and characterizing the degree of confidence in computational model predictions. In the context of systems biology, especially with dynamic models, UQ is crucial because it addresses the challenges posed by nonlinearity and parameter sensitivity, allowing us to properly understand and extrapolate the behavior of complex biological systems. Here, we focus on dynamic models represented by deterministic nonlinear ordinary differential equations. Many current UQ approaches in this field rely on Bayesian statistical methods. While powerful, these methods often require strong prior specifications and make parametric assumptions that may not always hold in biological systems. Additionally, these methods face challenges in domains where sample sizes are limited, and statistical inference becomes constrained, with computational speed being a bottleneck in large models of biological systems. As an alternative, we propose the use of conformal inference methods, introducing two novel algorithms that, in some instances, offer non-asymptotic guarantees, enhancing robustness and scalability across various applications. We demonstrate the efficacy of our proposed algorithms through several scenarios, highlighting their advantages over traditional Bayesian approaches. The proposed methods show promising results for diverse biological data structures and scenarios, offering a general framework to quantify uncertainty for dynamic models of biological systems.The software for the methodology and the reproduction of the results is available at https://zenodo.org/doi/10.5281/zenodo.13644870.

MLJun 14, 2022
Neural interval-censored survival regression with feature selection

Carlos García Meixide, Marcos Matabuena, Louis Abraham et al.

Survival analysis is a fundamental area of focus in biomedical research, particularly in the context of personalized medicine. This prominence is due to the increasing prevalence of large and high-dimensional datasets, such as omics and medical image data. However, the literature on non-linear regression algorithms and variable selection techniques for interval-censoring is either limited or non-existent, particularly in the context of neural networks. Our objective is to introduce a novel predictive framework tailored for interval-censored regression tasks, rooted in Accelerated Failure Time (AFT) models. Our strategy comprises two key components: i) a variable selection phase leveraging recent advances on sparse neural network architectures, ii) a regression model targeting prediction of the interval-censored response. To assess the performance of our novel algorithm, we conducted a comprehensive evaluation through both numerical experiments and real-world applications that encompass scenarios related to diabetes and physical activity. Our results outperform traditional AFT algorithms, particularly in scenarios featuring non-linear relationships.

8.4MLMay 7
Gaussian mixture models in Hilbert spaces via kernel methods

Daniel López-Montero, Antonio Álvarez-López, Marcos Matabuena

Modern datasets across many disciplines increasingly consist of time-evolving, potentially infinite-dimensional random objects, such as dynamic functional data, which are naturally modeled in Hilbert spaces. In these settings, characterizing probability measures, for example, through densities, can be ill-defined or technically challenging. Motivated by clustering applications, we propose a Gaussian mixture framework for Hilbert-space-valued data based on kernel mean embeddings and develop efficient optimization algorithms for estimation. We establish theoretical guarantees showing that the proposed algorithm is well defined and that the model yields a dense class of approximations in infinite-dimensional spaces. We evaluate the framework through extensive experiments on diverse structures and data geometries, including $L^2$-functional data and random graphs in Laplacian spaces arising in modern medical applications.

13.0MLMar 25
Continuous-Time Learning of Probability Distributions: A Case Study in a Digital Trial of Young Children with Type 1 Diabetes

Antonio Álvarez-López, Marcos Matabuena

Understanding how biomarker distributions evolve over time is a central challenge in digital health and chronic disease monitoring. In diabetes, changes in the distribution of glucose measurements can reveal patterns of disease progression and treatment response that conventional summary measures miss. Motivated by a 26-week clinical trial comparing the closed-loop insulin delivery system t:slim X2 with standard therapy in children with type 1 diabetes, we propose a probabilistic framework to model the continuous-time evolution of time-indexed distributions using continuous glucose monitoring data (CGM) collected every five minutes. We represent the glucose distribution as a Gaussian mixture, with time-varying mixture weights governed by a neural ODE. We estimate the model parameter using a distribution-matching criterion based on the maximum mean discrepancy. The resulting framework is interpretable, computationally efficient, and sensitive to subtle temporal distributional changes. Applied to CGM trial data, the method detects treatment-related improvements in glucose dynamics that are difficult to capture with traditional analytical approaches.

35.0MLMay 4
Random-Effects Algorithm for Random Objects in Metric Spaces

Marcos Matabuena, Mateo Cámara

Across many scientific disciplines, multiple observations are collected from the same experimental units, and in modern datasets these observations often arise as non-Euclidean random objects. In such settings, the incorporation of random effects is a critical modeling step for efficient estimation and personalized prediction. Although mixed-effects models are well established for scalar outcomes and, more recently, for functional data in Hilbert spaces, general random-effects frameworks for objects in metric spaces remain underdeveloped. In this paper, we propose a nonlinear Fréchet-based algorithm for random-effects modeling of arbitrary random objects defined on a metric space. Using M-estimation theory, we establish conditions under which the proposed metric-space prediction target is consistently estimated under a working random-effects formulation. We then evaluate the empirical performance of the proposed method using both synthetic data and digital health datasets that require practical tools for analyzing random objects in metric spaces, such as multivariate probability distributions and random graphs. We show that, although our method is developed beyond Hilbert spaces, it can outperform existing Hilbert space-based methods.

MEFeb 2, 2024
Conditional Mean and Variance Estimation via \textit{k}-NN Algorithm with Automated Variance Selection

Marcos Matabuena, Juan C. Vidal, Oscar Hernan Madrid Padilla et al.

We introduce a novel \textit{k}-nearest neighbor (\textit{k}-NN) regression method for joint estimation of the conditional mean and variance. The proposed algorithm preserves the computational efficiency and manifold-learning capabilities of classical non-parametric \textit{k}-NN models, while integrating a data-driven variable selection step that improves empirical performance. By accurately estimating both conditional mean and variance regression functions, the method effectively reconstructs the conditional distribution and density functions for multiple families of scale-and-localization generative models. We show that our estimator can achieve fast convergence rates, and we derive practical rules for selecting the smoothing parameter~$k$ that enhance the precision of the algorithm in finite sample regimes. Extensive simulations for low, moderate and large-dimensional covariate spaces, together with a real-world biomedical application, demonstrate that the proposed method can consistently outperform the conventional \textit{k-NN} regression algorithm while being more interpretable in the model output.

MLJul 21, 2025
Conformal and kNN Predictive Uncertainty Quantification Algorithms in Metric Spaces

Gábor Lugosi, Marcos Matabuena

This paper introduces a framework for uncertainty quantification in regression models defined in metric spaces. Leveraging a newly defined notion of homoscedasticity, we develop a conformal prediction algorithm that offers finite-sample coverage guarantees and fast convergence rates of the oracle estimator. In heteroscedastic settings, we forgo these non-asymptotic guarantees to gain statistical efficiency, proposing a local $k$--nearest--neighbor method without conformal calibration that is adaptive to the geometry of each particular nonlinear space. Both procedures work with any regression algorithm and are scalable to large data sets, allowing practitioners to plug in their preferred models and incorporate domain expertise. We prove consistency for the proposed estimators under minimal conditions. Finally, we demonstrate the practical utility of our approach in personalized--medicine applications involving random response objects such as probability distributions and graph Laplacians.

MLJun 10, 2025
Model-Free Kernel Conformal Depth Measures Algorithm for Uncertainty Quantification in Regression Models in Separable Hilbert Spaces

Marcos Matabuena, Rahul Ghosal, Pavlo Mozharovskyi et al.

Depth measures are powerful tools for defining level sets in emerging, non--standard, and complex random objects such as high-dimensional multivariate data, functional data, and random graphs. Despite their favorable theoretical properties, the integration of depth measures into regression modeling to provide prediction regions remains a largely underexplored area of research. To address this gap, we propose a novel, model-free uncertainty quantification algorithm based on conditional depth measures--specifically, conditional kernel mean embeddings and an integrated depth measure. These new algorithms can be used to define prediction and tolerance regions when predictors and responses are defined in separable Hilbert spaces. The use of kernel mean embeddings ensures faster convergence rates in prediction region estimation. To enhance the practical utility of the algorithms with finite samples, we also introduce a conformal prediction variant that provides marginal, non-asymptotic guarantees for the derived prediction regions. Additionally, we establish both conditional and unconditional consistency results, as well as fast convergence rates in certain homoscedastic settings. We evaluate the finite--sample performance of our model in extensive simulation studies involving various types of functional data and traditional Euclidean scenarios. Finally, we demonstrate the practical relevance of our approach through a digital health application related to physical activity, aiming to provide personalized recommendations

MLMay 13, 2025
Continuous Temporal Learning of Probability Distributions via Neural ODEs with Applications in Continuous Glucose Monitoring Data

Antonio Álvarez-López, Marcos Matabuena

Modeling the dynamics of probability distributions from time-dependent data samples is a fundamental problem in many fields, including digital health. The goal is to analyze how the distribution of a biomarker, such as glucose, changes over time and how these changes may reflect the progression of chronic diseases such as diabetes. We introduce a probabilistic model based on a Gaussian mixture that captures the evolution of a continuous-time stochastic process. Our approach combines a nonparametric estimate of the distribution, obtained with Maximum Mean Discrepancy (MMD), and a Neural Ordinary Differential Equation (Neural ODE) that governs the temporal evolution of the mixture weights. The model is highly interpretable, detects subtle distribution shifts, and remains computationally efficient. We illustrate the broad utility of our approach in a 26-week clinical trial that treats all continuous glucose monitoring (CGM) time series as the primary outcome. This method enables rigorous longitudinal comparisons between the treatment and control arms and yields characterizations that conventional summary-based clinical trials analytical methods typically do not capture.

MLJan 12, 2025
Variable Selection Methods for Multivariate, Functional, and Complex Biomedical Data in the AI Age

Marcos Matabuena

Many problems within personalized medicine and digital health rely on the analysis of continuous-time functional biomarkers and other complex data structures emerging from high-resolution patient monitoring. In this context, this work proposes new optimization-based variable selection methods for multivariate, functional, and even more general outcomes in metrics spaces based on best-subset selection. Our framework applies to several types of regression models, including linear, quantile, or non parametric additive models, and to a broad range of random responses, such as univariate, multivariate Euclidean data, functional, and even random graphs. Our analysis demonstrates that our proposed methodology outperforms state-of-the-art methods in accuracy and, especially, in speed-achieving several orders of magnitude improvement over competitors across various type of statistical responses as the case of mathematical functions. While our framework is general and is not designed for a specific regression and scientific problem, the article is self-contained and focuses on biomedical applications. In the clinical areas, serves as a valuable resource for professionals in biostatistics, statistics, and artificial intelligence interested in variable selection problem in this new technological AI-era.

MLDec 11, 2020
Glucose values prediction five years ahead with a new framework of missing responses in reproducing kernel Hilbert spaces, and the use of continuous glucose monitoring technology

Marcos Matabuena, Paulo Félix, Carlos Meijide-Garcia et al.

AEGIS study possesses unique information on longitudinal changes in circulating glucose through continuous glucose monitoring technology (CGM). However, as usual in longitudinal medical studies, there is a significant amount of missing data in the outcome variables. For example, 40 percent of glycosylated hemoglobin (A1C) biomarker data are missing five years ahead. With the purpose to reduce the impact of this issue, this article proposes a new data analysis framework based on learning in reproducing kernel Hilbert spaces (RKHS) with missing responses that allows to capture non-linear relations between variable studies in different supervised modeling tasks. First, we extend the Hilbert-Schmidt dependence measure to test statistical independence in this context introducing a new bootstrap procedure, for which we prove consistency. Next, we adapt or use existing models of variable selection, regression, and conformal inference to obtain new clinical findings about glucose changes five years ahead with the AEGIS data. The most relevant findings are summarized below: i) We identify new factors associated with long-term glucose evolution; ii) We show the clinical sensibility of CGM data to detect changes in glucose metabolism; iii) We can improve clinical interventions based on our algorithms' expected glucose changes according to patients' baseline characteristics.

AIFeb 6, 2019
The FA Quantifier Fuzzification Mechanism: analysis of convergence and efficient implementations

Félix Díaz-Hermida, Marcos Matabuena, Juan C. Vidal

The fuzzy quantification model FA has been identified as one of the best behaved quantification models in several revisions of the field of fuzzy quantification. This model is, to our knowledge, the unique one fulfilling the strict Determiner Fuzzification Scheme axiomatic framework that does not induce the standard min and max operators. The main contribution of this paper is the proof of a convergence result that links this quantification model with the Zadeh's model when the size of the input sets tends to infinite. The convergence proof is, in any case, more general than the convergence to the Zadeh's model, being applicable to any quantitative quantifier. In addition, recent revisions papers have presented some doubts about the existence of suitable computational implementations to evaluate the FA model in practical applications. In order to prove that this model is not only a theoretical approach, we show exact algorithmic solutions for the most common linguistic quantifiers as well as an approximate implementation by means of Monte Carlo. Additionally, we will also give a general overview of the main properties fulfilled by the FA model, as a single compendium integrating the whole set of properties fulfilled by it has not been previously published.