LGAug 31, 2022Code
A Fair Experimental Comparison of Neural Network Architectures for Latent Representations of Multi-Omics for Drug Response PredictionTony Hauptmann, Stefan Kramer
Recent years have seen a surge of novel neural network architectures for the integration of multi-omics data for prediction. Most of the architectures include either encoders alone or encoders and decoders, i.e., autoencoders of various sorts, to transform multi-omics data into latent representations. One important parameter is the depth of integration: the point at which the latent representations are computed or merged, which can be either early, intermediate, or late. The literature on integration methods is growing steadily, however, close to nothing is known about the relative performance of these methods under fair experimental conditions and under consideration of different use cases. We developed a comparison framework that trains and optimizes multi-omics integration methods under equal conditions. We incorporated early integration and four recently published deep learning methods: MOLI, Super.FELT, OmiEmbed, and MOMA. Further, we devised a novel method, Omics Stacking, that combines the advantages of intermediate and late integration. Experiments were conducted on a public drug response data set with multiple omics data (somatic point mutations, somatic copy number profiles and gene expression profiles) that was obtained from cell lines, patient-derived xenografts, and patient samples. Our experiments confirmed that early integration has the lowest predictive performance. Overall, architectures that integrate triplet loss achieved the best results. Statistical differences can, overall, rarely be observed, however, in terms of the average ranks of methods, Super.FELT is consistently performing best in a cross-validation setting and Omics Stacking best in an external test set setting. The source code of all experiments is available under \url{https://github.com/kramerlab/Multi-Omics_analysis}
CLSep 1, 2024Code
Harnessing the Power of Semi-Structured Knowledge and LLMs with Triplet-Based Prefiltering for Question AnsweringDerian Boer, Fabian Koch, Stefan Kramer
Large Language Models (LLMs) frequently lack domain-specific knowledge and even fine-tuned models tend to hallucinate. Hence, more reliable models that can include external knowledge are needed. We present a pipeline, 4StepFocus, and specifically a preprocessing step, that can substantially improve the answers of LLMs. This is achieved by providing guided access to external knowledge making use of the model's ability to capture relational context and conduct rudimentary reasoning by themselves. The method narrows down potentially correct answers by triplets-based searches in a semi-structured knowledge base in a direct, traceable fashion, before switching to latent representations for ranking those candidates based on unstructured data. This distinguishes it from related methods that are purely based on latent representations. 4StepFocus consists of the steps: 1) Triplet generation for extraction of relational data by an LLM, 2) substitution of variables in those triplets to narrow down answer candidates employing a knowledge graph, 3) sorting remaining candidates with a vector similarity search involving associated non-structured data, 4) reranking the best candidates by the LLM with background data provided. Experiments on a medical, a product recommendation, and an academic paper search test set demonstrate that this approach is indeed a powerful augmentation. It not only adds relevant traceable background information from information retrieval, but also improves performance considerably in comparison to state-of-the-art methods. This paper presents a novel, largely unexplored direction and therefore provides a wide range of future work opportunities. Used source code is available at https://github.com/kramerlab/4StepFocus.
LGAug 4, 2022
Invariant Representations with Stochastically Quantized Neural NetworksMattia Cerrato, Marius Köppel, Roberto Esposito et al.
Representation learning algorithms offer the opportunity to learn invariant representations of the input data with regard to nuisance factors. Many authors have leveraged such strategies to learn fair representations, i.e., vectors where information about sensitive attributes is removed. These methods are attractive as they may be interpreted as minimizing the mutual information between a neural layer's activations and a sensitive attribute. However, the theoretical grounding of such methods relies either on the computation of infinitely accurate adversaries or on minimizing a variational upper bound of a mutual information estimate. In this paper, we propose a methodology for direct computation of the mutual information between a neural layer and a sensitive attribute. We employ stochastically-activated binary neural networks, which lets us treat neurons as random variables. We are then able to compute (not bound) the mutual information between a layer and a sensitive attribute and use this information as a regularization factor during gradient descent. We show that this method compares favorably with the state of the art in fair representation learning and that the learned representations display a higher level of invariance compared to full-precision neural networks.
LGMar 1Code
Feature-Weighted Maximum Representative SubsamplingTony Hauptmann, Stefan Kramer
In the social sciences, it is often necessary to debias studies and surveys before valid conclusions can be drawn. Debiasing algorithms enable the computational removal of bias using sample weights. However, an issue arises when only a subset of features is highly biased, while the rest is already representative. Algorithms need to strongly alter the sample distribution to manage a few highly biased features, which can in turn introduce bias into already representative variables. To address this issue, we developed a method that uses feature weights to minimize the impact of highly biased features on the computation of sample weights. Our algorithm is based on Maximum Representative Subsampling (MRS), which debiases datasets by aligning a non-representative sample with a representative one through iterative removal of elements to create a representative subsample. The new algorithm, named feature-weighted MRS (FW-MRS), decreases the emphasis on highly biased features, allowing it to retain more instances for downstream tasks. The feature weights are derived from the feature importance of a domain classifier trained to differentiate between the representative and non-representative datasets. We validated FW-MRS using eight tabular datasets, each of which we artificially biased. Biased features can be important for downstream tasks, and focusing less on them could lead to a decline in generalization. For this reason, we assessed the generalization performance of FW-MRS on downstream tasks and found no statistically significant differences. Additionally, FW-MRS was applied to a real-world dataset from the social sciences. The source code is available at https://github.com/kramerlab/FeatureWeightDebiasing.
LGJul 4, 2024
10 Years of Fair Representations: Challenges and OpportunitiesMattia Cerrato, Marius Köppel, Philipp Wolf et al.
Fair Representation Learning (FRL) is a broad set of techniques, mostly based on neural networks, that seeks to learn new representations of data in which sensitive or undesired information has been removed. Methodologically, FRL was pioneered by Richard Zemel et al. about ten years ago. The basic concepts, objectives and evaluation strategies for FRL methodologies remain unchanged to this day. In this paper, we look back at the first ten years of FRL by i) revisiting its theoretical standing in light of recent work in deep learning theory that shows the hardness of removing information in neural network representations and ii) presenting the results of a massive experimentation (225.000 model fits and 110.000 AutoML fits) we conducted with the objective of improving on the common evaluation scenario for FRL. More specifically, we use automated machine learning (AutoML) to adversarially "mine" sensitive information from supposedly fair representations. Our theoretical and experimental analysis suggests that deterministic, unquantized FRL methodologies have serious issues in removing sensitive information, which is especially troubling as they might seem "fair" at first glance.
LGNov 4, 2022
Neural RELAGGSLukas Pensel, Stefan Kramer
Multi-relational databases are the basis of most consolidated data collections in science and industry today. Most learning and mining algorithms, however, require data to be represented in a propositional form. While there is a variety of specialized machine learning algorithms that can operate directly on multi-relational data sets, propositionalization algorithms transform multi-relational databases into propositional data sets, thereby allowing the application of traditional machine learning and data mining algorithms without their modification. One prominent propositionalization algorithm is RELAGGS by Krogel and Wrobel, which transforms the data by nested aggregations. We propose a new neural network based algorithm in the spirit of RELAGGS that employs trainable composite aggregate functions instead of the static aggregate functions used in the original approach. In this way, we can jointly train the propositionalization with the prediction model, or, alternatively, use the learned aggegrations as embeddings in other algorithms. We demonstrate the increased predictive performance by comparing N-RELAGGS with RELAGGS and multiple other state-of-the-art algorithms.
IRMay 14, 2025Code
Focus, Merge, Rank: Improved Question Answering Based on Semi-structured Knowledge BasesDerian Boer, Stephen Roth, Stefan Kramer
In many real-world settings, machine learning models and interactive systems have access to both structured knowledge, e.g., knowledge graphs or tables, and unstructured content, e.g., natural language documents. However, most rely on either. Semi-Structured Knowledge Bases (SKBs) bridge this gap by linking unstructured content to nodes within structured data, thereby enabling new strategies for knowledge access and use. In this work, we present FocusedRetriever, a modular SKB-based framework for multi-hop question answering. It integrates components (VSS-based entity search, LLM-based generation of Cypher queries and pairwise re-ranking) in a way that enables it to outperform state-of-the-art methods across all three STaRK benchmark test sets, covering diverse domains and multiple performance metrics. The average first-hit rate exceeds that of the second-best method by 25.7%. FocusedRetriever leverages (1) the capacity of Large Language Models (LLMs) to extract relational facts and entity attributes from unstructured text, (2) node set joins to filter answer candidates based on these extracted triplets and constraints, (3) vector similarity search to retrieve and rank relevant unstructured content, and (4) the contextual capabilities of LLMs to finally rank the top-k answers. For generality, we only incorporate base LLMs in FocusedRetriever in our evaluation. However, our analysis of intermediate results highlights several opportunities for further upgrades including finetuning. The source code is publicly available at https://github.com/kramerlab/FocusedRetriever .
LGNov 5, 2025
Prompting Neural-Guided Equation Discovery Based on ResidualsJannis Brugger, Viktor Pfanschilling, David Richter et al.
Neural-guided equation discovery systems use a data set as prompt and predict an equation that describes the data set without extensive search. However, if the equation does not meet the user's expectations, there are few options for getting other equation suggestions without intensive work with the system. To fill this gap, we propose Residuals for Equation Discovery (RED), a post-processing method that improves a given equation in a targeted manner, based on its residuals. By parsing the initial equation to a syntax tree, we can use node-based calculation rules to compute the residual for each subequation of the initial equation. It is then possible to use this residual as new target variable in the original data set and generate a new prompt. If, with the new prompt, the equation discovery system suggests a subequation better than the old subequation on a validation set, we replace the latter by the former. RED is usable with any equation discovery system, is fast to calculate, and is easy to extend for new mathematical operations. In experiments on 53 equations from the Feynman benchmark, we show that it not only helps to improve all tested neural-guided systems, but also all tested classical genetic programming systems.
LGNov 7, 2024
Soft Hoeffding Tree: A Transparent and Differentiable Model on Data StreamsKirsten Köbschall, Lisa Hartung, Stefan Kramer
We propose soft Hoeffding trees (SoHoT) as a new differentiable and transparent model for possibly infinite and changing data streams. Stream mining algorithms such as Hoeffding trees grow based on the incoming data stream, but they currently lack the adaptability of end-to-end deep learning systems. End-to-end learning can be desirable if a feature representation is learned by a neural network and used in a tree, or if the outputs of trees are further processed in a deep learning model or workflow. Different from Hoeffding trees, soft trees can be integrated into such systems due to their differentiability, but are neither transparent nor explainable. Our novel model combines the extensibility and transparency of Hoeffding trees with the differentiability of soft trees. We introduce a new gating function to regulate the balance between univariate and multivariate splits in the tree. Experiments are performed on 20 data streams, comparing SoHoT to standard Hoeffding trees, Hoeffding trees with limited complexity, and soft trees applying a sparse activation function for sample routing. The results show that soft Hoeffding trees outperform Hoeffding trees in estimating class probabilities and, at the same time, maintain transparency compared to soft trees, with relatively small losses in terms of AUROC and cross-entropy. We also demonstrate how to trade off transparency against performance using a hyperparameter, obtaining univariate splits at one end of the spectrum and multivariate splits at the other.
AIFeb 13, 2024
Amplifying Exploration in Monte-Carlo Tree Search by Focusing on the UnknownCedric Derstroff, Jannis Brugger, Jannis Blüml et al.
Monte-Carlo tree search (MCTS) is an effective anytime algorithm with a vast amount of applications. It strategically allocates computational resources to focus on promising segments of the search tree, making it a very attractive search algorithm in large search spaces. However, it often expends its limited resources on reevaluating previously explored regions when they remain the most promising path. Our proposed methodology, denoted as AmEx-MCTS, solves this problem by introducing a novel MCTS formulation. Central to AmEx-MCTS is the decoupling of value updates, visit count updates, and the selected path during the tree search, thereby enabling the exclusion of already explored subtrees or leaves. This segregation preserves the utility of visit counts for both exploration-exploitation balancing and quality metrics within MCTS. The resultant augmentation facilitates in a considerably broader search using identical computational resources, preserving the essential characteristics of MCTS. The expanded coverage not only yields more precise estimations but also proves instrumental in larger and more complex problems. Our empirical evaluation demonstrates the superior performance of AmEx-MCTS, surpassing classical MCTS and related approaches by a substantial margin.
CROct 7, 2025
N-Parties Private Structure and Parameter Learning for Sum-Product NetworksXenia Heilmann, Ernst Althaus, Mattia Cerrato et al.
A sum-product network (SPN) is a graphical model that allows several types of probabilistic inference to be performed efficiently. In this paper, we propose a privacy-preserving protocol which tackles structure generation and parameter learning of SPNs. Additionally, we provide a protocol for private inference on SPNs, subsequent to training. To preserve the privacy of the participants, we derive our protocol based on secret sharing, which guarantees privacy in the honest-but-curious setting even when at most half of the parties cooperate to disclose the data. The protocol makes use of a forest of randomly generated SPNs, which is trained and weighted privately and can then be used for private inference on data points. Our experiments indicate that preserving the privacy of all participants does not decrease log-likelihood performance on both homogeneously and heterogeneously partitioned data. We furthermore show that our protocol's performance is comparable to current state-of-the-art SPN learners in homogeneously partitioned data settings. In terms of runtime and memory usage, we demonstrate that our implementation scales well when increasing the number of parties, comparing favorably to protocols for neural networks, when they are trained to reproduce the input-output behavior of SPNs.
LGSep 23, 2025
Lift What You Can: Green Online Learning with Heterogeneous EnsemblesKirsten Köbschall, Sebastian Buschjäger, Raphael Fischer et al.
Ensemble methods for stream mining necessitate managing multiple models and updating them as data distributions evolve. Considering the calls for more sustainability, established methods are however not sufficiently considerate of ensemble members' computational expenses and instead overly focus on predictive capabilities. To address these challenges and enable green online learning, we propose heterogeneous online ensembles (HEROS). For every training step, HEROS chooses a subset of models from a pool of models initialized with diverse hyperparameter choices under resource constraints to train. We introduce a Markov decision process to theoretically capture the trade-offs between predictive performance and sustainability constraints. Based on this framework, we present different policies for choosing which models to train on incoming data. Most notably, we propose the novel $ζ$-policy, which focuses on training near-optimal models at reduced costs. Using a stochastic model, we theoretically prove that our $ζ$-policy achieves near optimal performance while using fewer resources compared to the best performing policy. In our experiments across 11 benchmark datasets, we find empiric evidence that our $ζ$-policy is a strong contribution to the state-of-the-art, demonstrating highly accurate performance, in some cases even outperforming competitors, and simultaneously being much more resource-friendly.
LGSep 3, 2025
Exploring the Design Space of Fair Tree Learning AlgorithmsKiara Stempel, Mattia Cerrato, Stefan Kramer
Decision trees have been studied extensively in the context of fairness, aiming to maximize prediction performance while ensuring non-discrimination against different groups. Techniques in this space usually focus on imposing constraints at training time, constraining the search space so that solutions which display unacceptable values of relevant metrics are not considered, discarded, or discouraged. If we assume one target variable y and one sensitive attribute s, the design space of tree learning algorithms can be spanned as follows: (i) One can have one tree T that is built using an objective function that is a function of y, s, and T. For instance, one can build a tree based on the weighted information gain regarding y (maximizing) and s (minimizing). (ii) The second option is to have one tree model T that uses an objective function in y and T and a constraint on s and T. Here, s is no longer part of the objective, but part of a constraint. This can be achieved greedily by aborting a further split as soon as the condition that optimizes the objective in y fails to satisfy the constraint on s. A simple way to explore other splits is to backtrack during tree construction once a fairness constraint is violated. (iii) The third option is to have two trees T_y and T_s, one for y and one for s, such that the tree structure for y and s does not have to be shared. In this way, information regarding y and regarding s can be used independently, without having to constrain the choices in tree construction by the mutual information between the two variables. Quite surprisingly, of the three options, only the first one and the greedy variant of the second have been studied in the literature so far. In this paper, we introduce the above two additional options from that design space and characterize them experimentally on multiple datasets.
LGJul 28, 2025
Deep Generative Models of Evolution: SNP-level Population Adaptation by Genomic Linkage IncorporationJulia Siekiera, Christian Schlötterer, Stefan Kramer
The investigation of allele frequency trajectories in populations evolving under controlled environmental pressures has become a popular approach to study evolutionary processes on the molecular level. Statistical models based on well-defined evolutionary concepts can be used to validate different hypotheses about empirical observations. Despite their popularity, classic statistical models like the Wright-Fisher model suffer from simplified assumptions such as the independence of selected loci along a chromosome and uncertainty about the parameters. Deep generative neural networks offer a powerful alternative known for the integration of multivariate dependencies and noise reduction. Due to their high data demands and challenging interpretability they have, so far, not been widely considered in the area of population genomics. To address the challenges in the area of Evolve and Resequencing experiments (E&R) based on pooled sequencing (Pool-Seq) data, we introduce a deep generative neural network that aims to model a concept of evolution based on empirical observations over time. The proposed model estimates the distribution of allele frequency trajectories by embedding the observations from single nucleotide polymorphisms (SNPs) with information from neighboring loci. Evaluation on simulated E&R experiments demonstrates the model's ability to capture the distribution of allele frequency trajectories and illustrates the representational power of deep generative models on the example of linkage disequilibrium (LD) estimation. Inspecting the internally learned representations enables estimating pairwise LD, which is typically inaccessible in Pool-Seq data. Our model provides competitive LD estimation in Pool-Seq data high degree of LD when compared to existing methods.
LGJul 25, 2025
Counterfactual Explanations in Medical Imaging: Exploring SPN-Guided Latent Space ManipulationJulia Siekiera, Stefan Kramer
Artificial intelligence is increasingly leveraged across various domains to automate decision-making processes that significantly impact human lives. In medical image analysis, deep learning models have demonstrated remarkable performance. However, their inherent complexity makes them black box systems, raising concerns about reliability and interpretability. Counterfactual explanations provide comprehensible insights into decision processes by presenting hypothetical "what-if" scenarios that alter model classifications. By examining input alterations, counterfactual explanations provide patterns that influence the decision-making process. Despite their potential, generating plausible counterfactuals that adhere to similarity constraints providing human-interpretable explanations remains a challenge. In this paper, we investigate this challenge by a model-specific optimization approach. While deep generative models such as variational autoencoders (VAEs) exhibit significant generative power, probabilistic models like sum-product networks (SPNs) efficiently represent complex joint probability distributions. By modeling the likelihood of a semi-supervised VAE's latent space with an SPN, we leverage its dual role as both a latent space descriptor and a classifier for a given discrimination task. This formulation enables the optimization of latent space counterfactuals that are both close to the original data distribution and aligned with the target class distribution. We conduct experimental evaluation on the cheXpert dataset. To evaluate the effectiveness of the integration of SPNs, our SPN-guided latent space manipulation is compared against a neural network baseline. Additionally, the trade-off between latent variable regularization and counterfactual quality is analyzed.
AIJun 17, 2025
Enhancing Symbolic Machine Learning by Subsymbolic RepresentationsStephen Roth, Lennart Baur, Derian Boer et al.
The goal of neuro-symbolic AI is to integrate symbolic and subsymbolic AI approaches, to overcome the limitations of either. Prominent systems include Logic Tensor Networks (LTN) or DeepProbLog, which offer neural predicates and end-to-end learning. The versatility of systems like LTNs and DeepProbLog, however, makes them less efficient in simpler settings, for instance, for discriminative machine learning, in particular in domains with many constants. Therefore, we follow a different approach: We propose to enhance symbolic machine learning schemes by giving them access to neural embeddings. In the present paper, we show this for TILDE and embeddings of constants used by TILDE in similarity predicates. The approach can be fine-tuned by further refining the embeddings depending on the symbolic theory. In experiments in three real-world domain, we show that this simple, yet effective, approach outperforms all other baseline methods in terms of the F1 score. The approach could be useful beyond this setting: Enhancing symbolic learners in this way could be extended to similarities between instances (effectively working like kernels within a logical language), for analogical reasoning, or for propositionalization.
AIMar 21, 2025
Neural-Guided Equation DiscoveryJannis Brugger, Mattia Cerrato, David Richter et al.
Deep learning approaches are becoming increasingly attractive for equation discovery. We show the advantages and disadvantages of using neural-guided equation discovery by giving an overview of recent papers and the results of experiments using our modular equation discovery system MGMT ($\textbf{M}$ulti-Task $\textbf{G}$rammar-Guided $\textbf{M}$onte-Carlo $\textbf{T}$ree Search for Equation Discovery). The system uses neural-guided Monte-Carlo Tree Search (MCTS) and supports both supervised and reinforcement learning, with a search space defined by a context-free grammar. We summarize seven desirable properties of equation discovery systems, emphasizing the importance of embedding tabular data sets for such learning approaches. Using the modular structure of MGMT, we compare seven architectures (among them, RNNs, CNNs, and Transformers) for embedding tabular datasets on the auxiliary task of contrastive learning for tabular data sets on an equation discovery task. For almost all combinations of modules, supervised learning outperforms reinforcement learning. Moreover, our experiments indicate an advantage of using grammar rules as action space instead of tokens. Two adaptations of MCTS -- risk-seeking MCTS and AmEx-MCTS -- can improve equation discovery with that kind of search.
LGFeb 21, 2025
Human Guided Learning of Transparent Regression ModelsLukas Pensel, Stefan Kramer
We present a human-in-the-loop (HIL) approach to permutation regression, the novel task of predicting a continuous value for a given ordering of items. The model is a gradient boosted regression model that incorporates simple human-understandable constraints of the form x < y, i.e. item x has to be before item y, as binary features. The approach, HuGuR (Human Guided Regression), lets a human explore the search space of such transparent regression models. Interacting with HuGuR, users can add, remove, and refine order constraints interactively, while the coefficients are calculated on the fly. We evaluate HuGuR in a user study and compare the performance of user-built models with multiple baselines on 9 data sets. The results show that the user-built models outperform the compared methods on small data sets and in general perform on par with the other methods, while being in principle understandable for humans. On larger datasets from the same domain, machine-induced models begin to outperform the user-built models. Further work will study the trust users have in models when constructed by themselves and how the scheme can be transferred to other pattern domains, such as strings, sequences, trees, or graphs.
LGFeb 19, 2025
Integrating Inverse and Forward Modeling for Sparse Temporal Data from Sensor NetworksJulian Vexler, Björn Vieten, Martin Nelke et al.
We present CavePerception, a framework for the analysis of sparse data from sensor networks that incorporates elements of inverse modeling and forward modeling. By integrating machine learning with physical modeling in a hypotheses space, we aim to improve the interpretability of sparse, noisy, and potentially incomplete sensor data. The framework assumes data from a two-dimensional sensor network laid out in a graph structure that detects certain objects, with certain motion patterns. Examples of such sensors are magnetometers. Given knowledge about the objects and the way they act on the sensors, one can develop a data generator that produces data from simulated motions of the objects across the sensor field. The framework uses the simulated data to infer object behaviors across the sensor network. The approach is experimentally tested on real-world data, where magnetometers are used on an airport to detect and identify aircraft motions. Experiments demonstrate the value of integrating inverse and forward modeling, enabling intelligent systems to better understand and predict complex, sensor-driven events.
LGDec 15, 2023
Peer Learning: Learning Complex Policies in Groups from Scratch via Action RecommendationsCedric Derstroff, Mattia Cerrato, Jannis Brugger et al.
Peer learning is a novel high-level reinforcement learning framework for agents learning in groups. While standard reinforcement learning trains an individual agent in trial-and-error fashion, all on its own, peer learning addresses a related setting in which a group of agents, i.e., peers, learns to master a task simultaneously together from scratch. Peers are allowed to communicate only about their own states and actions recommended by others: "What would you do in my situation?". Our motivation is to study the learning behavior of these agents. We formalize the teacher selection process in the action advice setting as a multi-armed bandit problem and therefore highlight the need for exploration. Eventually, we analyze the learning behavior of the peers and observe their ability to rank the agents' performance within the study group and understand which agents give reliable advice. Further, we compare peer learning with single agent learning and a state-of-the-art action advice baseline. We show that peer learning is able to outperform single-agent learning and the baseline in several challenging discrete and continuous OpenAI Gym domains. Doing so, we also show that within such a framework complex policies from action recommendations beyond discrete action spaces can evolve.
AIMay 3, 2023
Automated Scientific Discovery: From Equation Discovery to Autonomous Discovery SystemsStefan Kramer, Mattia Cerrato, Jannis Brugger et al.
The paper surveys automated scientific discovery, from equation discovery and symbolic regression to autonomous discovery systems and agents. It discusses the individual approaches from a "big picture" perspective and in context, but also discusses open issues and recent topics like the various roles of deep neural networks in this area, aiding in the discovery of human-interpretable knowledge. Further, we will present closed-loop scientific discovery systems, starting with the pioneering work on the Adam system up to current efforts in fields from material science to astronomy. Finally, we will elaborate on autonomy from a machine learning perspective, but also in analogy to the autonomy levels in autonomous driving. The maximal level, level five, is defined to require no human intervention at all in the production of scientific knowledge. Achieving this is one step towards solving the Nobel Turing Grand Challenge to develop AI Scientists: AI systems capable of making Nobel-quality scientific discoveries highly autonomously at a level comparable, and possibly superior, to the best human scientists by 2050.
LGFeb 7, 2022
Fair Interpretable Representation Learning with Correction VectorsMattia Cerrato, Alesia Vallenas Coronel, Marius Köppel et al.
Neural network architectures have been extensively employed in the fair representation learning setting, where the objective is to learn a new representation for a given vector which is independent of sensitive information. Various representation debiasing techniques have been proposed in the literature. However, as neural networks are inherently opaque, these methods are hard to comprehend, which limits their usefulness. We propose a new framework for fair representation learning that is centered around the learning of "correction vectors", which have the same dimensionality as the given data vectors. Correction vectors may be computed either explicitly via architectural constraints or implicitly by training an invertible model based on Normalizing Flows. We show experimentally that several fair representation learning models constrained in such a way do not exhibit losses in ranking or classification performance. Furthermore, we demonstrate that state-of-the-art results can be achieved by the invertible model. Finally, we discuss the law standing of our methodology in light of recent legislation in the European Union.
LGJan 17, 2022
Fair Interpretable Learning via Correction VectorsMattia Cerrato, Marius Köppel, Alexander Segner et al.
Neural network architectures have been extensively employed in the fair representation learning setting, where the objective is to learn a new representation for a given vector which is independent of sensitive information. Various "representation debiasing" techniques have been proposed in the literature. However, as neural networks are inherently opaque, these methods are hard to comprehend, which limits their usefulness. We propose a new framework for fair representation learning which is centered around the learning of "correction vectors", which have the same dimensionality as the given data vectors. The corrections are then simply summed up to the original features, and can therefore be analyzed as an explicit penalty or bonus to each feature. We show experimentally that a fair representation learning problem constrained in such a way does not impact performance.
LGJan 17, 2022
Fair Group-Shared Representations with Normalizing FlowsMattia Cerrato, Marius Köppel, Alexander Segner et al.
The issue of fairness in machine learning stems from the fact that historical data often displays biases against specific groups of people which have been underprivileged in the recent past, or still are. In this context, one of the possible approaches is to employ fair representation learning algorithms which are able to remove biases from data, making groups statistically indistinguishable. In this paper, we instead develop a fair representation learning algorithm which is able to map individuals belonging to different groups in a single group. This is made possible by training a pair of Normalizing Flow models and constraining them to not remove information about the ground truth by training a ranking or classification model on top of them. The overall, ``chained'' model is invertible and has a tractable Jacobian, which allows to relate together the probability densities for different groups and ``translate'' individuals from one group to another. We show experimentally that our methodology is competitive with other fair representation learning algorithms. Furthermore, our algorithm achieves stronger invariance w.r.t. the sensitive attribute.
LGApr 15, 2021
Fast Private Parameter Learning and Inference for Sum-Product NetworksErnst Althaus, Mohammad Sadeq Dousti, Stefan Kramer et al.
A sum-product network (SPN) is a graphical model that allows several types of inferences to be drawn efficiently. There are two types of learning for SPNs: Learning the architecture of the model, and learning the parameters. In this paper, we tackle the second problem: We show how to learn the weights for the sum nodes, assuming the architecture is fixed, and the data is horizontally partitioned between multiple parties. The computations will preserve the privacy of each participant. Furthermore, we will use secret sharing instead of (homomorphic) encryption, which allows fast computations and requires little computational resources. To this end, we use a novel integer division to compute approximate real divisions. We also show how simple and private inferences can be performed using the learned SPN.
LGFeb 16, 2021
Pattern Sampling for Shapelet-based Time Series ClassificationAtif Raza, Stefan Kramer
Subsequence-based time series classification algorithms provide accurate and interpretable models, but training these models is extremely computation intensive. The asymptotic time complexity of subsequence-based algorithms remains a higher-order polynomial, because these algorithms are based on exhaustive search for highly discriminative subsequences. Pattern sampling has been proposed as an effective alternative to mitigate the pattern explosion phenomenon. Therefore, we employ pattern sampling to extract discriminative features from discretized time series data. A weighted trie is created based on the discretized time series data to sample highly discriminative patterns. These sampled patterns are used to identify the shapelets which are used to transform the time series classification problem into a feature-based classification problem. Finally, a classification model can be trained using any off-the-shelf algorithm. Creating a pattern sampler requires a small number of patterns to be evaluated compared to an exhaustive search as employed by previous approaches. Compared to previously proposed algorithms, our approach requires considerably less computational and memory resources. Experiments demonstrate how the proposed approach fares in terms of classification accuracy and runtime performance.
IRFeb 3, 2021
Focusing Knowledge-based Graph Argument Mining via Topic ModelingPatrick Abels, Zahra Ahmadi, Sophie Burkhardt et al.
Decision-making usually takes five steps: identifying the problem, collecting data, extracting evidence, identifying pro and con arguments, and making decisions. Focusing on extracting evidence, this paper presents a hybrid model that combines latent Dirichlet allocation and word embeddings to obtain external knowledge from structured and unstructured data. We study the task of sentence-level argument mining, as arguments mostly require some degree of world knowledge to be identified and understood. Given a topic and a sentence, the goal is to classify whether a sentence represents an argument in regard to the topic. We use a topic model to extract topic- and sentence-specific evidence from the structured knowledge base Wikidata, building a graph based on the cosine similarity between the entity word vectors of Wikidata and the vector of the given sentence. Also, we build a second graph based on topic-specific articles found via Google to tackle the general incompleteness of structured knowledge bases. Combining these graphs, we obtain a graph-based model which, as our evaluation shows, successfully capitalizes on both structured and unstructured data.
LGJan 11, 2021
Deep Neural Networks to Recover Unknown Physical Parameters from Oscillating Time SeriesAntoine Garcon, Julian Vexler, Dmitry Budker et al.
Deep neural networks (DNNs) are widely used in pattern-recognition tasks for which a human comprehensible, quantitative description of the data-generating process, e.g., in the form of equations, cannot be achieved. While doing so, DNNs often produce an abstract (entangled and non-interpretable) representation of the data-generating process. This is one of the reasons why DNNs are not extensively used in physics-signal processing: physicists generally require their analyses to yield quantitative information about the studied systems. In this article we use DNNs to disentangle components of oscillating time series, and recover meaningful information. We show that, because DNNs can find useful abstract feature representations, they can be used when prior knowledge about the signal-generating process exists, but is not complete, as it is particularly the case in "new-physics" searches. To this aim, we train our DNN on synthetic oscillating time series to perform two tasks: a regression of the signal latent parameters and signal denoising by an Autoencoder-like architecture. We show that the regression and denoising performance is similar to those of least-square curve fittings (LS-fit) with true latent parameters' initial guesses, in spite of the DNN needing no initial guesses at all. We then explore applications in which we believe our architecture could prove useful for time-series processing in physics, when prior knowledge is incomplete. As an example, we employ DNNs as a tool to inform LS-fits when initial guesses are unknown. We show that the regression can be performed on some latent parameters, while ignoring the existence of others. Because the Autoencoder needs no prior information about the physical model, the remaining unknown latent parameters can still be captured, thus making use of partial prior knowledge, while leaving space for data exploration and discoveries.
GNDec 28, 2020
Deep Unsupervised Identification of Selected SNPs between Adapted Populations on Pool-seq DataJulia Siekiera, Stefan Kramer
The exploration of selected single nucleotide polymorphisms (SNPs) to identify genetic diversity between different sequencing population pools (Pool-seq) is a fundamental task in genetic research. As underlying sequence reads and their alignment are error-prone and univariate statistical solutions only take individual positions of the genome into account, the identification of selected SNPs remains a challenging process. Deep learning models like convolutional neural networks (CNNs) are able to consider large input areas in their decisions. We suggest an unsupervised pipeline to be independent of a rarely known ground truth. We train a supervised discriminator CNN to distinguish alignments from different populations and utilize the model for unsupervised SNP calling by applying explainable artificial intelligence methods. Our proposed multivariate method is based on two main assumptions: We assume (i) that instances having a high predictive certainty of being distinguishable are likely to contain genetic variants, and (ii) that selected SNPs are located at regions with input features having the highest influence on the model's decision process. We directly compare our method with statistical results on two different Pool-seq datasets and show that our solution is able to extend statistical results.
LGDec 15, 2020
Rule Extraction from Binary Neural Networks with Convolutional Rules for Model ValidationSophie Burkhardt, Jannis Brugger, Nicolas Wagner et al.
Most deep neural networks are considered to be black boxes, meaning their output is hard to interpret. In contrast, logical expressions are considered to be more comprehensible since they use symbols that are semantically close to natural language instead of distributed representations. However, for high-dimensional input data such as images, the individual symbols, i.e. pixels, are not easily interpretable. We introduce the concept of first-order convolutional rules, which are logical rules that can be extracted using a convolutional neural network (CNN), and whose complexity depends on the size of the convolutional filter and not on the dimensionality of the input. Our approach is based on rule extraction from binary neural networks with stochastic local search. We show how to extract rules that are not necessarily short, but characteristic of the input, and easy to visualize. Our experiments show that the proposed approach is able to model the functionality of the neural network while at the same time producing interpretable logical rules.
CLOct 23, 2020
Ranking Creative Language Characteristics in Small Data ScenariosJulia Siekiera, Marius Köppel, Edwin Simpson et al.
The ability to rank creative natural language provides an important general tool for downstream language understanding and generation. However, current deep ranking models require substantial amounts of labeled data that are difficult and expensive to obtain for different domains, languages and creative characteristics. A recent neural approach, the DirectRanker, promises to reduce the amount of training data needed but its application to text isn't fully explored. We therefore adapt the DirectRanker to provide a new deep model for ranking creative language with small data. We compare DirectRanker with a Bayesian approach, Gaussian process preference learning (GPPL), which has previously been shown to work well with sparse data. Our experiments with sparse training data show that while the performance of standard neural ranking approaches collapses with small training datasets, DirectRanker remains effective. We find that combining DirectRanker with GPPL increases performance across different settings by leveraging the complementary benefits of both models. Our combined approach outperforms the previous state-of-the-art on humor and metaphor novelty tasks, increasing Spearman's $ρ$ by 14% and 16% on average.
CRJun 2, 2020
Secure Sum Outperforms Homomorphic Encryption in (Current) Collaborative Deep LearningDerian Boer, Stefan Kramer
Deep learning (DL) approaches are achieving extraordinary results in a wide range of domains, but often require a massive collection of private data. Hence, methods for training neural networks on the joint data of different data owners, that keep each party's input confidential, are called for. We address a specific setting in federated learning, namely that of deep learning from horizontally distributed data with a limited number of parties, where their vulnerable intermediate results have to be processed in a privacy-preserving manner. This setting can be found in medical and healthcare as well as industrial applications. The predominant scheme for this is based on homomorphic encryption (HE), and it is widely considered to be without alternative. In contrast to this, we demonstrate that a carefully chosen, less complex and computationally less expensive secure sum protocol in conjunction with default secure channels exhibits superior properties in terms of both collusion-resistance and runtime. Finally, we discuss several open research questions in the context of collaborative DL, especially regarding privacy risks caused by joint intermediate results.
SEMar 2, 2020
Towards Probability-based Safety Verification of Systems with Components from Machine LearningHermann Kaindl, Stefan Kramer
Machine learning (ML) has recently created many new success stories. Hence, there is a strong motivation to use ML technology in software-intensive systems, including safety-critical systems. This raises the issue of safety verification of MLbased systems, which is currently thought to be infeasible or, at least, very hard. We think that it requires taking into account specific properties of ML technology such as: (i) Most ML approaches are inductive, which is both their power and their source of error. (ii) Neural networks (NN) resulting from deep learning are at the current state of the art not transparent. Consequently, there will always be errors remaining and, at least for deep NNs (DNNs), verification of their internal structure is extremely hard. In general, safety engineering cannot provide full guarantees that no harm will ever occur. That is why probabilities are used, e.g., for specifying a risk or a Tolerable Hazard Rate (THR). In this vision paper, we propose verification based on probabilities of errors both estimated by controlled experiments and output by the inductively learned classifier itself. Generalization error bounds may propagate to the probabilities of a hazard, which must not exceed a THR. As a result, the quantitatively determined bound on the probability of a classification error of an ML component in a safety-critical system contributes in a well-defined way to the latter's overall safety verification.
IRSep 6, 2019
Pairwise Learning to Rank by Neural Networks Revisited: Reconstruction, Theoretical Analysis and Practical PerformanceMarius Köppel, Alexander Segner, Martin Wagener et al.
We present a pairwise learning to rank approach based on a neural net, called DirectRanker, that generalizes the RankNet architecture. We show mathematically that our model is reflexive, antisymmetric, and transitive allowing for simplified training and improved performance. Experimental results on the LETOR MSLR-WEB10K, MQ2007 and MQ2008 datasets show that our model outperforms numerous state-of-the-art methods, while being inherently simpler in structure and using a pairwise approach only.
LGApr 4, 2018
Online Multi-Label Classification: A Label Compression MethodZahra Ahmadi, Stefan Kramer
Many modern applications deal with multi-label data, such as functional categorizations of genes, image labeling and text categorization. Classification of such data with a large number of labels and latent dependencies among them is a challenging task, and it becomes even more challenging when the data is received online and in chunks. Many of the current multi-label classification methods require a lot of time and memory, which make them infeasible for practical real-world applications. In this paper, we propose a fast linear label space dimension reduction method that transforms the labels into a reduced encoded space and trains models on the obtained pseudo labels. Additionally, it provides an analytical method to update the decoding matrix which maps the labels into the original space and is used during the test phase. Experimental results show the effectiveness of this approach in terms of running times and the prediction performance over different measures.
LGFeb 22, 2017
Ensembles of Randomized Time Series Shapelets Provide Improved Accuracy while Reducing Computational CostsAtif Raza, Stefan Kramer
Shapelets are discriminative time series subsequences that allow generation of interpretable classification models, which provide faster and generally better classification than the nearest neighbor approach. However, the shapelet discovery process requires the evaluation of all possible subsequences of all time series in the training set, making it extremely computation intensive. Consequently, shapelet discovery for large time series datasets quickly becomes intractable. A number of improvements have been proposed to reduce the training time. These techniques use approximation or discretization and often lead to reduced classification accuracy compared to the exact method. We are proposing the use of ensembles of shapelet-based classifiers obtained using random sampling of the shapelet candidates. Using random sampling reduces the number of evaluated candidates and consequently the required computational cost, while the classification accuracy of the resulting models is also not significantly different than that of the exact algorithm. The combination of randomized classifiers rectifies the inaccuracies of individual models because of the diversity of the solutions. Based on the experiments performed, it is shown that the proposed approach of using an ensemble of inexpensive classifiers provides better classification accuracy compared to the exact method at a significantly lesser computational cost.