LGAug 29, 2024Code
Flexible framework for generating synthetic electrocardiograms and photoplethysmogramsKatri Karhinoja, Antti Vasankari, Jukka-Pekka Sirkiä et al.
By generating synthetic biosignals, the quantity and variety of health data can be increased. This is especially useful when training machine learning models by enabling data augmentation and introduction of more physiologically plausible variation to the data. For these purposes, we have developed a synthetic biosignal model for two signal modalities, electrocardiography (ECG) and photoplethysmography (PPG). The model produces realistic signals that account for physiological effects such as breathing modulation and changes in heart rate due to physical stress. Arrhythmic signals can be generated with beat intervals extracted from real measurements. The model also includes a flexible approach to adding different kinds of noise and signal artifacts. The noise is generated from power spectral densities extracted from both measured noisy signals and modeled power spectra. Importantly, the model also automatically produces labels for noise, segmentation (e.g. P and T waves, QRS complex, for electrocardiograms), and artifacts. We assessed how this comprehensive model can be used in practice to improve the performance of models trained on ECG or PPG data. For example, we trained an LSTM to detect ECG R-peaks using both real ECG signals from the MIT-BIH arrythmia set and our new generator. The F1 score of the model was 0.83 using real data, in comparison to 0.98 using our generator. In addition, the model can be used for example in signal segmentation, quality detection and bench-marking detection algorithms. The model code has been released in \url{https://github.com/UTU-Health-Research/framework_for_synthetic_biosignals}
LGFeb 11
Evaluation metrics for temporal preservation in synthetic longitudinal patient dataKatariina Perkonoja, Parisa Movahedi, Antti Airola et al.
This study introduces a set of metrics for evaluating temporal preservation in synthetic longitudinal patient data, defined as artificially generated data that mimic real patients' repeated measurements over time. The proposed metrics assess how synthetic data reproduces key temporal characteristics, categorized into marginal, covariance, individual-level and measurement structures. We show that strong marginal-level resemblance may conceal distortions in covariance and disruptions in individual-level trajectories. Temporal preservation is influenced by factors such as original data quality, measurement frequency, and preprocessing strategies, including binning, variable encoding and precision. Variables with sparse or highly irregular measurement times provide limited information for learning temporal dependencies, resulting in reduced resemblance between the synthetic and original data. No single metric adequately captures temporal preservation; instead, a multidimensional evaluation across all characteristics provides a more comprehensive assessment of synthetic data quality. Overall, the proposed metrics clarify how and why temporal structures are preserved or degraded, enabling more reliable evaluation and improvement of generative models and supporting the creation of temporally realistic synthetic longitudinal patient data.
LGMar 22, 2024
Empirical investigation of multi-source cross-validation in clinical ECG classificationTuija Leinonen, David Wong, Antti Vasankari et al.
Traditionally, machine learning-based clinical prediction models have been trained and evaluated on patient data from a single source, such as a hospital. Cross-validation methods can be used to estimate the accuracy of such models on new patients originating from the same source, by repeated random splitting of the data. However, such estimates tend to be highly overoptimistic when compared to accuracy obtained from deploying models to sources not represented in the dataset, such as a new hospital. The increasing availability of multi-source medical datasets provides new opportunities for obtaining more comprehensive and realistic evaluations of expected accuracy through source-level cross-validation designs. In this study, we present a systematic empirical evaluation of standard K-fold cross-validation and leave-source-out cross-validation methods in a multi-source setting. We consider the task of electrocardiogram based cardiovascular disease classification, combining and harmonizing the openly available PhysioNet CinC Challenge 2021 and the Shandong Provincial Hospital datasets for our study. Our results show that K-fold cross-validation, both on single-source and multi-source data, systemically overestimates prediction performance when the end goal is to generalize to new sources. Leave-source-out cross-validation provides more reliable performance estimates, having close to zero bias though larger variability. The evaluation highlights the dangers of obtaining misleading cross-validation results on medical data and demonstrates how these issues can be mitigated when having access to multi-source data.
LGMar 20, 2024
Does Differentially Private Synthetic Data Lead to Synthetic Discoveries?Ileana Montoya Perez, Parisa Movahedi, Valtteri Nieminen et al.
Background: Synthetic data has been proposed as a solution for sharing anonymized versions of sensitive biomedical datasets. Ideally, synthetic data should preserve the structure and statistical properties of the original data, while protecting the privacy of the individual subjects. Differential privacy (DP) is currently considered the gold standard approach for balancing this trade-off. Objectives: To investigate the reliability of group differences identified by independent sample tests on DP-synthetic data. The evaluation is conducted in terms of the tests' Type I and Type II errors. The former quantifies the tests' validity i.e. whether the probability of false discoveries is indeed below the significance level, and the latter indicates the tests' power in making real discoveries. Methods: We evaluate the Mann-Whitney U test, Student's t-test, chi-squared test and median test on DP-synthetic data. The private synthetic datasets are generated from real-world data, including a prostate cancer dataset (n=500) and a cardiovascular dataset (n=70 000), as well as on bivariate and multivariate simulated data. Five different DP-synthetic data generation methods are evaluated, including two basic DP histogram release methods and MWEM, Private-PGM, and DP GAN algorithms. Conclusion: A large portion of the evaluation results expressed dramatically inflated Type I errors, especially at privacy budget levels of $ε\leq 1$. This result calls for caution when releasing and analyzing DP-synthetic data: low p-values may be obtained in statistical tests simply as a byproduct of the noise added to protect privacy. A DP smoothed histogram-based synthetic data generation method was shown to produce valid Type I error for all privacy levels tested but required a large original dataset size and a modest privacy budget ($ε\geq 5$) in order to have reasonable Type II error.
LGOct 16, 2025
Interaction Concordance Index: Performance Evaluation for Interaction Prediction MethodsTapio Pahikkala, Riikka Numminen, Parisa Movahedi et al.
Consider two sets of entities and their members' mutual affinity values, say drug-target affinities (DTA). Drugs and targets are said to interact in their effects on DTAs if drug's effect on it depends on the target. Presence of interaction implies that assigning a drug to a target and another drug to another target does not provide the same aggregate DTA as the reversed assignment would provide. Accordingly, correctly capturing interactions enables better decision-making, for example, in allocation of limited numbers of drug doses to their best matching targets. Learning to predict DTAs is popularly done from either solely from known DTAs or together with side information on the entities, such as chemical structures of drugs and targets. In this paper, we introduce interaction directions' prediction performance estimator we call interaction concordance index (IC-index), for both fixed predictors and machine learning algorithms aimed for inferring them. IC-index complements the popularly used DTA prediction performance estimators by evaluating the ratio of correctly predicted directions of interaction effects in data. First, we show the invariance of IC-index on predictors unable to capture interactions. Secondly, we show that learning algorithm's permutation equivariance regarding drug and target identities implies its inability to capture interactions when either drug, target or both are unseen during training. In practical applications, this equivariance is remedied via incorporation of appropriate side information on drugs and targets. We make a comprehensive empirical evaluation over several biomedical interaction data sets with various state-of-the-art machine learning algorithms. The experiments demonstrate how different types of affinity strength prediction methods perform in terms of IC-index complementing existing prediction performance estimators.
CRSep 3, 2025
A Comprehensive Guide to Differential Privacy: From Theory to User ExpectationsNapsu Karmitsa, Antti Airola, Tapio Pahikkala et al.
The increasing availability of personal data has enabled significant advances in fields such as machine learning, healthcare, and cybersecurity. However, this data abundance also raises serious privacy concerns, especially in light of powerful re-identification attacks and growing legal and ethical demands for responsible data use. Differential privacy (DP) has emerged as a principled, mathematically grounded framework for mitigating these risks. This review provides a comprehensive survey of DP, covering its theoretical foundations, practical mechanisms, and real-world applications. It explores key algorithmic tools and domain-specific challenges - particularly in privacy-preserving machine learning and synthetic data generation. The report also highlights usability issues and the need for improved communication and transparency in DP systems. Overall, the goal is to support informed adoption of DP by researchers and practitioners navigating the evolving landscape of data privacy.
LGNov 11, 2021
Training neural networks with synthetic electrocardiogramsMatti Kaisti, Juho Laitala, Antti Airola
We present a method for training neural networks with synthetic electrocardiograms that mimic signals produced by a wearable single lead electrocardiogram monitor. We use domain randomization where the synthetic signal properties such as the waveform shape, RR-intervals and noise are varied for every training example. Models trained with synthetic data are compared to their counterparts trained with real data. Detection of r-waves in electrocardiograms recorded during different physical activities and in atrial fibrillation is used to compare the models. By allowing the randomization to increase beyond what is typically observed in the real-world data the performance is on par or superseding the performance of networks trained with real data. Experiments show robust performance with different seeds and training examples on different test sets without any test set specific tuning. The method makes possible to train neural networks using practically free-to-collect data with accurate labels without the need for manual annotations and it opens up the possibility of extending the use of synthetic data on cardiac disease classification when disease specific a priori information is used in the electrocardiogram generation. Additionally the distribution of data can be controlled eliminating class imbalances that are typically observed in health related data and additionally the generated data is inherently private.
LGMar 22, 2021
A Link between Coding Theory and Cross-Validation with ApplicationsTapio Pahikkala, Parisa Movahedi, Ileana Montoya et al.
How many different binary classification problems a single learning algorithm can solve on a fixed data with exactly zero or at most a given number of cross-validation errors? While the number in the former case is known to be limited by the no-free-lunch theorem, we show that the exact answers are given by the theory of error detecting codes. As a case study, we focus on the AUC performance measure and leave-pair-out cross-validation (LPOCV), in which every possible pair of data with different class labels is held out at a time. We show that the maximal number of classification problems with fixed class proportion, for which a learning algorithm can achieve zero LPOCV error, equals the maximal number of code words in a constant weight code (CWC), with certain technical properties. We then generalize CWCs by introducing light CWCs, and prove an analogous result for nonzero LPOCV errors and light CWCs. Moreover, we prove both upper and lower bounds on the maximal numbers of code words in light CWCs. Finally, as an immediate practical application, we develop new LPOCV based randomization tests for learning algorithms that generalize the classical Wilcoxon-Mann-Whitney U test.
LGSep 2, 2020
Generalized vec trick for fast learning of pairwise kernel modelsMarkus Viljanen, Antti Airola, Tapio Pahikkala
Pairwise learning corresponds to the supervised learning setting where the goal is to make predictions for pairs of objects. Prominent applications include predicting drug-target or protein-protein interactions, or customer-product preferences. In this work, we present a comprehensive review of pairwise kernels, that have been proposed for incorporating prior knowledge about the relationship between the objects. Specifically, we consider the standard, symmetric and anti-symmetric Kronecker product kernels, metric-learning, Cartesian, ranking, as well as linear, polynomial and Gaussian kernels. Recently, a O(nm + nq) time generalized vec trick algorithm, where n, m, and q denote the number of pairs, drugs and targets, was introduced for training kernel methods with the Kronecker product kernel. This was a significant improvement over previous O(n^2) training methods, since in most real-world applications m,q << n. In this work we show how all the reviewed kernels can be expressed as sums of Kronecker products, allowing the use of generalized vec trick for speeding up their computation. In the experiments, we demonstrate how the introduced approach allows scaling pairwise kernels to much larger data sets than previously feasible, and provide an extensive comparison of the kernels on a number of biological interaction prediction tasks.
MLMar 5, 2018
A Comparative Study of Pairwise Learning Methods based on Kernel Ridge RegressionMichiel Stock, Tapio Pahikkala, Antti Airola et al.
Many machine learning problems can be formulated as predicting labels for a pair of objects. Problems of that kind are often referred to as pairwise learning, dyadic prediction or network inference problems. During the last decade kernel methods have played a dominant role in pairwise learning. They still obtain a state-of-the-art predictive performance, but a theoretical analysis of their behavior has been underexplored in the machine learning literature. In this work we review and unify existing kernel-based algorithms that are commonly used in different pairwise learning settings, ranging from matrix filtering to zero-shot learning. To this end, we focus on closed-form efficient instantiations of Kronecker kernel ridge regression. We show that independent task kernel ridge regression, two-step kernel ridge regression and a linear matrix filter arise naturally as a special case of Kronecker kernel ridge regression, implying that all these methods implicitly minimize a squared loss. In addition, we analyze universality, consistency and spectral filtering properties. Our theoretical results provide valuable insights in assessing the advantages and limitations of existing pairwise learning methods.
MLJan 29, 2018
Tournament Leave-pair-out Cross-validation for Receiver Operating Characteristic (ROC) AnalysisIleana Montoya Perez, Antti Airola, Peter J. Boström et al.
Receiver operating characteristic (ROC) analysis is widely used for evaluating diagnostic systems. Recent studies have shown that estimating an area under ROC curve (AUC) with standard cross-validation methods suffers from a large bias. The leave-pair-out (LPO) cross-validation has been shown to correct this bias. However, while LPO produces an almost unbiased estimate of AUC, it does not provide a ranking of the data needed for plotting and analyzing the ROC curve. In this study, we propose a new method called tournament leave-pair-out (TLPO) cross-validation. This method extends LPO by creating a tournament from pair comparisons to produce a ranking for the data. TLPO preserves the advantage of LPO for estimating AUC, while it also allows performing ROC analyses. We have shown using both synthetic and real world data that TLPO is as reliable as LPO for AUC estimation, and confirmed the bias in leave-one-out cross-validation on low-dimensional data. As a case study on ROC analysis, we also evaluate how reliably sensitivity and specificity can be estimated from TLPO ROC curves.
APJan 4, 2017
Playtime Measurement with Survival AnalysisMarkus Viljanen, Antti Airola, Jukka Heikkonen et al.
Maximizing product use is a central goal of many businesses, which makes retention and monetization two central analytics metrics in games. Player retention may refer to various duration variables quantifying product use: total playtime or session playtime are popular research targets, and active playtime is well-suited for subscription games. Such research often has the goal of increasing player retention or conversely decreasing player churn. Survival analysis is a framework of powerful tools well suited for retention type data. This paper contributes new methods to game analytics on how playtime can be analyzed using survival analysis without covariates. Survival and hazard estimates provide both a visual and an analytic interpretation of the playtime phenomena as a funnel type nonparametric estimate. Metrics based on the survival curve can be used to aggregate this playtime information into a single statistic. Comparison of survival curves between cohorts provides a scientific AB-test. All these methods work on censored data and enable computation of confidence intervals. This is especially important in time and sample limited data which occurs during game development. Throughout this paper, we illustrate the application of these methods to real world game development problems on the Hipster Sheep mobile game.
LGJun 14, 2016
Efficient Pairwise Learning Using Kernel Ridge Regression: an Exact Two-Step MethodMichiel Stock, Tapio Pahikkala, Antti Airola et al.
Pairwise learning or dyadic prediction concerns the prediction of properties for pairs of objects. It can be seen as an umbrella covering various machine learning problems such as matrix completion, collaborative filtering, multi-task learning, transfer learning, network prediction and zero-shot learning. In this work we analyze kernel-based methods for pairwise learning, with a particular focus on a recently-suggested two-step method. We show that this method offers an appealing alternative for commonly-applied Kronecker-based methods that model dyads by means of pairwise feature representations and pairwise kernels. In a series of theoretical results, we establish correspondences between the two types of methods in terms of linear algebra and spectral filtering, and we analyze their statistical consistency. In addition, the two-step method allows us to establish novel algorithmic shortcuts for efficient training and validation on very large datasets. Putting those properties together, we believe that this simple, yet powerful method can become a standard tool for many problems. Extensive experimental results for a range of practical settings are reported.
MLJan 7, 2016
Fast Kronecker product kernel methods via generalized vec trickAntti Airola, Tapio Pahikkala
Kronecker product kernel provides the standard approach in the kernel methods literature for learning from graph data, where edges are labeled and both start and end vertices have their own feature representations. The methods allow generalization to such new edges, whose start and end vertices do not appear in the training data, a setting known as zero-shot or zero-data learning. Such a setting occurs in numerous applications, including drug-target interaction prediction, collaborative filtering and information retrieval. Efficient training algorithms based on the so-called vec trick, that makes use of the special structure of the Kronecker product, are known for the case where the training data is a complete bipartite graph. In this work we generalize these results to non-complete training graphs. This allows us to derive a general framework for training Kronecker product kernel methods, as specific examples we implement Kronecker ridge regression and support vector machine algorithms. Experimental results demonstrate that the proposed approach leads to accurate models, while allowing order of magnitude improvements in training and prediction time.
LGJun 19, 2015
Spectral Analysis of Symmetric and Anti-Symmetric Pairwise KernelsTapio Pahikkala, Markus Viljanen, Antti Airola et al.
We consider the problem of learning regression functions from pairwise data when there exists prior knowledge that the relation to be learned is symmetric or anti-symmetric. Such prior knowledge is commonly enforced by symmetrizing or anti-symmetrizing pairwise kernel functions. Through spectral analysis, we show that these transformations reduce the kernel's effective dimension. Further, we provide an analysis of the approximation properties of the resulting kernels, and bound the regularization bias of the kernels in terms of the corresponding bias of the original kernel.
LGMay 17, 2014
A two-step learning approach for solving full and almost full cold start problems in dyadic predictionTapio Pahikkala, Michiel Stock, Antti Airola et al.
Dyadic prediction methods operate on pairs of objects (dyads), aiming to infer labels for out-of-sample dyads. We consider the full and almost full cold start problem in dyadic prediction, a setting that occurs when both objects in an out-of-sample dyad have not been observed during training, or if one of them has been observed, but very few times. A popular approach for addressing this problem is to train a model that makes predictions based on a pairwise feature representation of the dyads, or, in case of kernel methods, based on a tensor product pairwise kernel. As an alternative to such a kernel approach, we introduce a novel two-step learning algorithm that borrows ideas from the fields of pairwise learning and spectral filtering. We show theoretically that the two-step method is very closely related to the tensor product kernel approach, and experimentally that it yields a slightly better predictive performance. Moreover, unlike existing tensor product kernel methods, the two-step method allows closed-form solutions for training and parameter selection via cross-validation estimates both in the full and almost full cold start settings, making the approach much more efficient and straightforward to implement.
LGMay 17, 2014
Identification of functionally related enzymes by learning-to-rank methodsMichiel Stock, Thomas Fober, Eyke Hüllermeier et al.
Enzyme sequences and structures are routinely used in the biological sciences as queries to search for functionally related enzymes in online databases. To this end, one usually departs from some notion of similarity, comparing two enzymes by looking for correspondences in their sequences, structures or surfaces. For a given query, the search operation results in a ranking of the enzymes in the database, from very similar to dissimilar enzymes, while information about the biological function of annotated database enzymes is ignored. In this work we show that rankings of that kind can be substantially improved by applying kernel-based learning algorithms. This approach enables the detection of statistical dependencies between similarities of the active cleft and the biological function of annotated enzymes. This is in contrast to search-based approaches, which do not take annotated training data into account. Similarity measures based on the active cleft are known to outperform sequence-based or structure-based measures under certain conditions. We consider the Enzyme Commission (EC) classification hierarchy for obtaining annotated enzymes during the training phase. The results of a set of sizeable experiments indicate a consistent and significant improvement for a set of similarity measures that exploit information about small cavities in the surface of enzymes.
LGSep 21, 2012
Efficient Regularized Least-Squares Algorithms for Conditional Ranking on Relational DataTapio Pahikkala, Antti Airola, Michiel Stock et al.
In domains like bioinformatics, information retrieval and social network analysis, one can find learning tasks where the goal consists of inferring a ranking of objects, conditioned on a particular target object. We present a general kernel framework for learning conditional rankings from various types of relational data, where rankings can be conditioned on unseen data objects. We propose efficient algorithms for conditional ranking by optimizing squared regression and ranking loss functions. We show theoretically, that learning with the ranking loss is likely to generalize better than with the regression loss. Further, we prove that symmetry or reciprocity properties of relations can be efficiently enforced in the learned models. Experiments on synthetic and real-world data illustrate that the proposed methods deliver state-of-the-art performance in terms of predictive power and computational efficiency. Moreover, we also show empirically that incorporating symmetry or reciprocity properties can improve the generalization performance.