J. Ramirez

ML
h-index31
5papers
20citations
Novelty41%
AI Score22

5 Papers

LGNov 14, 2012
Spectral Clustering: An empirical study of Approximation Algorithms and its Application to the Attrition Problem

B. Cung, T. Jin, J. Ramirez et al.

Clustering is the problem of separating a set of objects into groups (called clusters) so that objects within the same cluster are more similar to each other than to those in different clusters. Spectral clustering is a now well-known method for clustering which utilizes the spectrum of the data similarity matrix to perform this separation. Since the method relies on solving an eigenvector problem, it is computationally expensive for large datasets. To overcome this constraint, approximation methods have been developed which aim to reduce running time while maintaining accurate classification. In this article, we summarize and experimentally evaluate several approximation methods for spectral clustering. From an applications standpoint, we employ spectral clustering to solve the so-called attrition problem, where one aims to identify from a set of employees those who are likely to voluntarily leave the company from those who are not. Our study sheds light on the empirical performance of existing approximate spectral clustering methods and shows the applicability of these methods in an important business optimization related problem.

MLFeb 23, 2024
Statistical Agnostic Regression: a machine learning method to validate regression models

Juan M Gorriz, J. Ramirez, F. Segovia et al.

Regression analysis is a central topic in statistical modeling, aimed at estimating the relationships between a dependent variable, commonly referred to as the response variable, and one or more independent variables, i.e., explanatory variables. Linear regression is by far the most popular method for performing this task in various fields of research, such as data integration and predictive modeling when combining information from multiple sources. Classical methods for solving linear regression problems, such as Ordinary Least Squares (OLS), Ridge, or Lasso regressions, often form the foundation for more advanced machine learning (ML) techniques, which have been successfully applied, though without a formal definition of statistical significance. At most, permutation or analyses based on empirical measures (e.g., residuals or accuracy) have been conducted, leveraging the greater sensitivity of ML estimations for detection. In this paper, we introduce Statistical Agnostic Regression (SAR) for evaluating the statistical significance of ML-based linear regression models. This is achieved by analyzing concentration inequalities of the actual risk (expected loss) and considering the worst-case scenario. To this end, we define a threshold that ensures there is sufficient evidence, with a probability of at least $1-η$, to conclude the existence of a linear relationship in the population between the explanatory (feature) and the response (label) variables. Simulations demonstrate the ability of the proposed agnostic (non-parametric) test to provide an analysis of variance similar to the classical multivariate $F$-test for the slope parameter, without relying on the underlying assumptions of classical methods. Moreover, the residuals computed from this method represent a trade-off between those obtained from ML approaches and the classical OLS.

QMApr 22, 2024
Unraveling the Autism spectrum heterogeneity: Insights from ABIDE I Database using data/model-driven permutation testing approaches

F. J. Alcaide, I. A. Illan, J. Ramirez et al.

Autism Spectrum Condition (ASC) is a neurodevelopmental condition characterized by impairments in communication, social interaction and restricted or repetitive behaviors. Extensive research has been conducted to identify distinctions between individuals with ASC and neurotypical individuals. However, limited attention has been given to comprehensively evaluating how variations in image acquisition protocols across different centers influence these observed differences. This analysis focuses on structural magnetic resonance imaging (sMRI) data from the Autism Brain Imaging Data Exchange I (ABIDE I) database, evaluating subjects' condition and individual centers to identify disparities between ASC and control groups. Statistical analysis, employing permutation tests, utilizes two distinct statistical mapping methods: Statistical Agnostic Mapping (SAM) and Statistical Parametric Mapping (SPM). Results reveal the absence of statistically significant differences in any brain region, attributed to factors such as limited sample sizes within certain centers, noise effects and the problem of multicentrism in a heterogeneous condition such as autism. This study indicates limitations in using the ABIDE I database to detect structural differences in the brain between neurotypical individuals and those diagnosed with ASC. Furthermore, results from the SAM mapping method show greater consistency with existing literature.

MLFeb 9, 2022
A hypothesis-driven method based on machine learning for neuroimaging data analysis

JM Gorriz, R. Martin-Clemente, C. G. Puntonet et al.

There remains an open question about the usefulness and the interpretation of Machine learning (MLE) approaches for discrimination of spatial patterns of brain images between samples or activation states. In the last few decades, these approaches have limited their operation to feature extraction and linear classification tasks for between-group inference. In this context, statistical inference is assessed by randomly permuting image labels or by the use of random effect models that consider between-subject variability. These multivariate MLE-based statistical pipelines, whilst potentially more effective for detecting activations than hypotheses-driven methods, have lost their mathematical elegance, ease of interpretation, and spatial localization of the ubiquitous General linear Model (GLM). Recently, the estimation of the conventional GLM has been demonstrated to be connected to an univariate classification task when the design matrix is expressed as a binary indicator matrix. In this paper we explore the complete connection between the univariate GLM and MLE \emph{regressions}. To this purpose we derive a refined statistical test with the GLM based on the parameters obtained by a linear Support Vector Regression (SVR) in the \emph{inverse} problem (SVR-iGLM). Subsequently, random field theory (RFT) is employed for assessing statistical significance following a conventional GLM benchmark. Experimental results demonstrate how parameter estimations derived from each model (mainly GLM and SVR) result in different experimental design estimates that are significantly related to the predefined functional task. Moreover, using real data from a multisite initiative the proposed MLE-based inference demonstrates statistical power and the control of false positives, outperforming the regular GLM.

IVNov 27, 2020
Uncertainty-driven ensembles of deep architectures for multiclass classification. Application to COVID-19 diagnosis in chest X-ray images

Juan E. Arco, A. Ortiz, J. Ramirez et al.

Respiratory diseases kill million of people each year. Diagnosis of these pathologies is a manual, time-consuming process that has inter and intra-observer variability, delaying diagnosis and treatment. The recent COVID-19 pandemic has demonstrated the need of developing systems to automatize the diagnosis of pneumonia, whilst Convolutional Neural Network (CNNs) have proved to be an excellent option for the automatic classification of medical images. However, given the need of providing a confidence classification in this context it is crucial to quantify the reliability of the model's predictions. In this work, we propose a multi-level ensemble classification system based on a Bayesian Deep Learning approach in order to maximize performance while quantifying the uncertainty of each classification decision. This tool combines the information extracted from different architectures by weighting their results according to the uncertainty of their predictions. Performance of the Bayesian network is evaluated in a real scenario where simultaneously differentiating between four different pathologies: control vs bacterial pneumonia vs viral pneumonia vs COVID-19 pneumonia. A three-level decision tree is employed to divide the 4-class classification into three binary classifications, yielding an accuracy of 98.06% and overcoming the results obtained by recent literature. The reduced preprocessing needed for obtaining this high performance, in addition to the information provided about the reliability of the predictions evidence the applicability of the system to be used as an aid for clinicians.