Jan Ewald

h-index3
2papers

2 Papers

85.4LGJun 3Code
BBOmix: A Tabular Benchmark for Hyperparameter Optimization of Unsupervised Biological Representation Learning

Luca Thale-Bombien, Jan Ewald, Ralf König et al.

The rapid advancement of high-throughput sequencing has led to large, high-dimensional omics datasets. Deep unsupervised learning architectures, particularly Autoencoders (AEs), are increasingly used for dimensionality reduction and representation learning in this domain. However, AEs are highly sensitive to architectural choices and hyperparameters, and unsupervised optimization typically relies on reconstruction loss, which may be a poor proxy for downstream utility. Exhaustive hyperparameter optimization (HPO) is computationally expensive, leading researchers to frequently rely on suboptimal default configurations. To democratize access to large-scale unsupervised HPO research, we introduce $\textbf{BBOmix}$, the first open-source tabular benchmark for unsupervised representation learning on real-world biological data. Our benchmark includes 105,000 evaluations across four AE architectures and seven multi-omics modalities from the TCGA and SCHC datasets. We quantify the correlation between reconstruction loss and downstream task performance and provide an extensive evaluation of state-of-the-art single-fidelity, multi-fidelity, and transfer learning HPO methods, establishing a rigorous baseline for future research in unsupervised biological representation learning.

LGFeb 22, 2024
Federated Learning in Genetics: Extended Analysis of Accuracy, Performance and Privacy Trade-offs

Anika Hannemann, Jan Ewald, Leo Seeger et al.

Machine learning on large-scale genomic or transcriptomic data is important for many novel health applications. For example, precision medicine tailors medical treatments to patients on the basis of individual biomarkers, cellular and molecular states, etc. However, the data required is sensitive, voluminous, heterogeneous, and typically distributed across locations where dedicated machine learning hardware is not available. Due to privacy and regulatory reasons, it is also problematic to aggregate all data at a trusted third party. Federated learning is a promising solution to this dilemma, because it enables decentralized, collaborative machine learning without exchanging raw data. In this paper, we perform comparative experiments with the federated learning frameworks TensorFlow Federated and Flower. Our test case is the training of disease prognosis and cell type classification models. We train the models with distributed transcriptomic data, considering both data heterogeneity and architectural heterogeneity. We measure model quality, robustness against privacy-enhancing noise and computational performance. We evaluate the resource overhead of a federated system from both client and global perspectives and assess benefits and limitations. Each of the federated learning frameworks has different strengths. However, our experiments confirm that both frameworks can readily build models on transcriptomic data, without transferring personal raw data to a third party with abundant computational resources. This paper is the extended version of https://link.springer.com/chapter/10.1007/978-3-031-63772-8_26.