LGMay 24, 2022
Federated singular value decomposition for high dimensional dataAnne Hartebrodt, Richard Röttger, David B. Blumenthal
Federated learning (FL) is emerging as a privacy-aware alternative to classical cloud-based machine learning. In FL, the sensitive data remains in data silos and only aggregated parameters are exchanged. Hospitals and research institutions which are not willing to share their data can join a federated study without breaching confidentiality. In addition to the extreme sensitivity of biomedical data, the high dimensionality poses a challenge in the context of federated genome-wide association studies (GWAS). In this article, we present a federated singular value decomposition (SVD) algorithm, suitable for the privacy-related and computational requirements of GWAS. Notably, the algorithm has a transmission cost independent of the number of samples and is only weakly dependent on the number of features, because the singular vectors associated with the samples are never exchanged and the vectors associated with the features only for a fixed number of iterations. Although motivated by GWAS, the algorithm is generically applicable for both horizontally and vertically partitioned data.
CYFeb 27
How Meta-research Can Pave the Road Towards Trustworthy AI In Healthcare: Catalogue of Ideas and Roadmap for Future ResearchValerie Bürger, Marlie Besouw, Jana Fehr et al.
Meta-research and Trustworthy AI (TAI) share common goals, namely improving evidence, robustness, and transparency, yet there is very little interplay between the two fields. To investigate the potential benefits of closer collaboration between the domains of TAI in healthcare and meta-research, we convened an interdisciplinary workshop funded by the Volkswagen Foundation in February 2025. The workshop aimed to collaboratively examine key tensions in translating AI ethics principles into practice and to identify potential solutions informed by meta-research approaches. A Design Thinking-informed co-creation approach was followed by an inductive descriptive analysis of the outputs. Our results demonstrate how meta-research can offer concrete contributions to address pressing challenges of TAI in healthcare. These challenges include achieving robustness, reproducibility, and replicability; late-stage development and the integration of AI into clinical practice; the selection of appropriate evaluation metrics; specific AI-related challenges in preclinical and biomedical research; gaps of transparency in medical AI, as well as the need for improved conceptual clarity and AI literacy among stakeholders. Finally, we offer a catalog of ideas and roadmap for future research to inform scholars in both fields on existing interconnections and serve as a foundation for guiding future interdisciplinary efforts.
LGJul 31, 2024
UnPaSt: unsupervised patient stratification by biclustering of omics dataMichael Hartung, Andreas Maier, Yuliya Burankova et al.
Unsupervised patient stratification is essential for disease subtype discovery, yet, despite growing evidence of molecular heterogeneity of non-oncological diseases, popular methods are benchmarked primarily using cancers with mutually exclusive molecular subtypes well-differentiated by numerous biomarkers. Evaluating 22 unsupervised methods, including clustering and biclustering, using simulated and real transcriptomics data revealed their inefficiency in scenarios with non-mutually exclusive subtypes or subtypes discriminated only by few biomarkers. To address these limitations and advance precision medicine, we developed UnPaSt, a novel biclustering algorithm for unsupervised patient stratification based on differentially expressed biclusters. UnPaSt outperformed widely used patient stratification approaches in the de novo identification of known subtypes of breast cancer and asthma. In addition, it detected many biologically insightful patterns across bulk transcriptomics, proteomics, single-cell, spatial transcriptomics, and multi-omics datasets, enabling a more nuanced and interpretable view of high-throughput data heterogeneity than traditionally used methods.
CVJul 30, 2021
The Minimum Edit Arborescence Problem and Its Use in Compressing Graph Collections [Extended Version]Lucas Gnecco, Nicolas Boria, Sébastien Bougleux et al.
The inference of minimum spanning arborescences within a set of objects is a general problem which translates into numerous application-specific unsupervised learning tasks. We introduce a unified and generic structure called edit arborescence that relies on edit paths between data in a collection, as well as the Min Edit Arborescence Problem, which asks for an edit arborescence that minimizes the sum of costs of its inner edit paths. Through the use of suitable cost functions, this generic framework allows to model a variety of problems. In particular, we show that by introducing encoding size preserving edit costs, it can be used as an efficient method for compressing collections of labeled graphs. Experiments on various graph datasets, with comparisons to standard compression tools, show the potential of our method.
LGNov 13, 2020
Federated Multi-Mini-Batch: An Efficient Training Approach to Federated Learning in Non-IID EnvironmentsReza Nasirigerdeh, Mohammad Bakhtiari, Reihaneh Torkzadehmahani et al.
Federated learning has faced performance and network communication challenges, especially in the environments where the data is not independent and identically distributed (IID) across the clients. To address the former challenge, we introduce the federated-centralized concordance property and show that the federated single-mini-batch training approach can achieve comparable performance as the corresponding centralized training in the Non-IID environments. To deal with the latter, we present the federated multi-mini-batch approach and illustrate that it can establish a trade-off between the performance and communication efficiency and outperforms federated averaging in the Non-IID settings.
CRJul 22, 2020
Privacy-preserving Artificial Intelligence Techniques in BiomedicineReihaneh Torkzadehmahani, Reza Nasirigerdeh, David B. Blumenthal et al.
Artificial intelligence (AI) has been successfully applied in numerous scientific domains. In biomedicine, AI has already shown tremendous potential, e.g. in the interpretation of next-generation sequencing data and in the design of clinical decision support systems. However, training an AI model on sensitive data raises concerns about the privacy of individual participants. For example, summary statistics of a genome-wide association study can be used to determine the presence or absence of an individual in a given dataset. This considerable privacy risk has led to restrictions in accessing genomic and other biomedical data, which is detrimental for collaborative research and impedes scientific progress. Hence, there has been a substantial effort to develop AI methods that can learn from sensitive data while protecting individuals' privacy. This paper provides a structured overview of recent advances in privacy-preserving AI techniques in biomedicine. It places the most important state-of-the-art approaches within a unified taxonomy and discusses their strengths, limitations, and open problems. As the most promising direction, we suggest combining federated machine learning as a more scalable approach with other additional privacy preserving techniques. This would allow to merge the advantages to provide privacy guarantees in a distributed way for biomedical applications. Nonetheless, more research is necessary as hybrid approaches pose new challenges such as additional network or computation overhead.
DSAug 1, 2019
New Techniques for Graph Edit Distance ComputationDavid B. Blumenthal
Due to their capacity to encode rich structural information, labeled graphs are often used for modeling various kinds of objects such as images, molecules, and chemical compounds. If pattern recognition problems such as clustering and classification are to be solved on these domains, a (dis-)similarity measure for labeled graphs has to be defined. A widely used measure is the graph edit distance (GED), which, intuitively, is defined as the minimum amount of distortion that has to be applied to a source graph in order to transform it into a target graph. The main advantage of GED is its flexibility and sensitivity to small differences between the input graphs. Its main drawback is that it is hard to compute. In this thesis, new results and techniques for several aspects of computing GED are presented. Firstly, theoretical aspects are discussed: competing definitions of GED are harmonized, the problem of computing GED is characterized in terms of complexity, and several reductions from GED to the quadratic assignment problem (QAP) are presented. Secondly, solvers for the linear sum assignment problem with error-correction (LSAPE) are discussed. LSAPE is a generalization of the well-known linear sum assignment problem (LSAP), and has to be solved as a subproblem by many GED algorithms. In particular, a new solver is presented that efficiently reduces LSAPE to LSAP. Thirdly, exact algorithms for computing GED are presented in a systematic way, and improvements of existing algorithms as well as a new mixed integer programming (MIP) based approach are introduced. Fourthly, a detailed overview of heuristic algorithms that approximate GED via upper and lower bounds is provided, and eight new heuristics are described. Finally, a new easily extensible C++ library for exactly or approximately computing GED is presented.
DSJul 5, 2019
Improved local search for graph edit distanceNicolas Boria, David B. Blumenthal, Sébastien Bougleux et al.
The graph edit distance (GED) measures the dissimilarity between two graphs as the minimal cost of a sequence of elementary operations transforming one graph into another. This measure is fundamental in many areas such as structural pattern recognition or classification. However, exactly computing GED is NP-hard. Among different classes of heuristic algorithms that were proposed to compute approximate solutions, local search based algorithms provide the tightest upper bounds for GED. In this paper, we present K-REFINE and RANDPOST. K-REFINE generalizes and improves an existing local search algorithm and performs particularly well on small graphs. RANDPOST is a general warm start framework that stochastically generates promising initial solutions to be used by any local search based GED algorithm. It is particularly efficient on large graphs. An extensive empirical evaluation demonstrates that both K-REFINE and RANDPOST perform excellently in practice.