LGFeb 20, 2023
Efficient Generator of Mathematical Expressions for Symbolic RegressionSebastian Mežnar, Sašo Džeroski, Ljupčo Todorovski
We propose an approach to symbolic regression based on a novel variational autoencoder for generating hierarchical structures, HVAE. It combines simple atomic units with shared weights to recursively encode and decode the individual nodes in the hierarchy. Encoding is performed bottom-up and decoding top-down. We empirically show that HVAE can be trained efficiently with small corpora of mathematical expressions and can accurately encode expressions into a smooth low-dimensional latent space. The latter can be efficiently explored with various optimization methods to address the task of symbolic regression. Indeed, random search through the latent space of HVAE performs better than random search through expressions generated by manually crafted probabilistic grammars for mathematical expressions. Finally, EDHiE system for symbolic regression, which applies an evolutionary algorithm to the latent space of HVAE, reconstructs equations from a standard symbolic regression benchmark better than a state-of-the-art system based on a similar combination of deep learning and evolutionary algorithms.ž
NEApr 15, 2022
The Importance of Landscape Features for Performance Prediction of Modular CMA-ES VariantsAna Kostovska, Diederick Vermetten, Sašo Džeroski et al.
Selecting the most suitable algorithm and determining its hyperparameters for a given optimization problem is a challenging task. Accurately predicting how well a certain algorithm could solve the problem is hence desirable. Recent studies in single-objective numerical optimization show that supervised machine learning methods can predict algorithm performance using landscape features extracted from the problem instances. Existing approaches typically treat the algorithms as black-boxes, without consideration of their characteristics. To investigate in this work if a selection of landscape features that depends on algorithms properties could further improve regression accuracy, we regard the modular CMA-ES framework and estimate how much each landscape feature contributes to the best algorithm performance regression models. Exploratory data analysis performed on this data indicate that the set of most relevant features does not depend on the configuration of individual modules, but the influence that these features have on regression accuracy does. In addition, we have shown that by using classifiers that take the features relevance on the model accuracy, we are able to predict the status of individual modules in the CMA-ES configurations.
NEJan 24, 2023
Using Knowledge Graphs for Performance Prediction of Modular Optimization AlgorithmsAna Kostovska, Diederick Vermetten, Sašo Džeroski et al.
Empirical data plays an important role in evolutionary computation research. To make better use of the available data, ontologies have been proposed in the literature to organize their storage in a structured way. However, the full potential of these formal methods to capture our domain knowledge has yet to be demonstrated. In this work, we evaluate a performance prediction model built on top of the extension of the recently proposed OPTION ontology. More specifically, we first extend the OPTION ontology with the vocabulary needed to represent modular black-box optimization algorithms. Then, we use the extended OPTION ontology, to create knowledge graphs with fixed-budget performance data for two modular algorithm frameworks, modCMA, and modDE, for the 24 noiseless BBOB benchmark functions. We build the performance prediction model using a knowledge graph embedding-based methodology. Using a number of different evaluation scenarios, we show that a triple classification approach, a fairly standard predictive modeling task in the context of knowledge graphs, can correctly predict whether a given algorithm instance will be able to achieve a certain target precision for a given problem instance. This approach requires feature representation of algorithms and problems. While the latter is already well developed, we hope that our work will motivate the community to collaborate on appropriate algorithm representations.
LGJul 19, 2022
Semi-supervised Predictive Clustering Trees for (Hierarchical) Multi-label ClassificationJurica Levatić, Michelangelo Ceci, Dragi Kocev et al.
Semi-supervised learning (SSL) is a common approach to learning predictive models using not only labeled examples, but also unlabeled examples. While SSL for the simple tasks of classification and regression has received a lot of attention from the research community, this is not properly investigated for complex prediction tasks with structurally dependent variables. This is the case of multi-label classification and hierarchical multi-label classification tasks, which may require additional information, possibly coming from the underlying distribution in the descriptive space provided by unlabeled examples, to better face the challenging task of predicting simultaneously multiple class labels. In this paper, we investigate this aspect and propose a (hierarchical) multi-label classification method based on semi-supervised learning of predictive clustering trees. We also extend the method towards ensemble learning and propose a method based on the random forest approach. Extensive experimental evaluation conducted on 23 datasets shows significant advantages of the proposed method and its extension with respect to their supervised counterparts. Moreover, the method preserves interpretability and reduces the time complexity of classical tree-based models.
QUANT-PHMar 4, 2022
Boosting the Performance of Quantum Annealers using Machine LearningJure Brence, Dragan Mihailović, Viktor Kabanov et al.
Noisy intermediate-scale quantum (NISQ) devices are spearheading the second quantum revolution. Of these, quantum annealers are the only ones currently offering real world, commercial applications on as many as 5000 qubits. The size of problems that can be solved by quantum annealers is limited mainly by errors caused by environmental noise and intrinsic imperfections of the processor. We address the issue of intrinsic imperfections with a novel error correction approach, based on machine learning methods. Our approach adjusts the input Hamiltonian to maximize the probability of finding the solution. In our experiments, the proposed error correction method improved the performance of annealing by up to three orders of magnitude and enabled the solving of a previously intractable, maximally complex problem.
LGNov 21, 2022
Explainable Model-specific Algorithm Selection for Multi-Label ClassificationAna Kostovska, Carola Doerr, Sašo Džeroski et al.
Multi-label classification (MLC) is an ML task of predictive modeling in which a data instance can simultaneously belong to multiple classes. MLC is increasingly gaining interest in different application domains such as text mining, computer vision, and bioinformatics. Several MLC algorithms have been proposed in the literature, resulting in a meta-optimization problem that the user needs to address: which MLC approach to select for a given dataset? To address this algorithm selection problem, we investigate in this work the quality of an automated approach that uses characteristics of the datasets - so-called features - and a trained algorithm selector to choose which algorithm to apply for a given task. For our empirical evaluation, we use a portfolio of 38 datasets. We consider eight MLC algorithms, whose quality we evaluate using six different performance metrics. We show that our automated algorithm selector outperforms any of the single MLC algorithms, and this is for all evaluated performance measures. Our selection approach is explainable, a characteristic that we exploit to investigate which meta-features have the largest influence on the decisions made by the algorithm selector. Finally, we also quantify the importance of the most significant meta-features for various domains.
NEJan 7
Quantifying the Impact of Modules and Their Interactions in the PSO-X FrameworkChristian L. Camacho-Villalón, Ana Nikolikj, Katharina Dost et al.
The PSO-X framework incorporates dozens of modules that have been proposed for solving single-objective continuous optimization problems using particle swarm optimization. While modular frameworks enable users to automatically generate and configure algorithms tailored to specific optimization problems, the complexity of this process increases with the number of modules in the framework and the degrees of freedom defined for their interaction. Understanding how modules affect the performance of algorithms for different problems is critical to making the process of finding effective implementations more efficient and identifying promising areas for further investigation. Despite their practical applications and scientific relevance, there is a lack of empirical studies investigating which modules matter most in modular optimization frameworks and how they interact. In this paper, we analyze the performance of 1424 particle swarm optimization algorithms instantiated from the PSO-X framework on the 25 functions in the CEC'05 benchmark suite with 10 and 30 dimensions. We use functional ANOVA to quantify the impact of modules and their combinations on performance in different problem classes. In practice, this allows us to identify which modules have greater influence on PSO-X performance depending on problem features such as multimodality, mathematical transformations and varying dimensionality. We then perform a cluster analysis to identify groups of problem classes that share similar module effect patterns. Our results show low variability in the importance of modules in all problem classes, suggesting that particle swarm optimization performance is driven by a few influential modules.
LGJan 30
Uncertainty-Aware Extrapolation in Bayesian Oblique TreesViktor Andonovikj, Sašo Džeroski, Pavle Boškoski
Decision trees are widely used due to their interpretability and efficiency, but they struggle in regression tasks that require reliable extrapolation and well-calibrated uncertainty. Piecewise-constant leaf predictions are bounded by the training targets and often become overconfident under distribution shift. We propose a single-tree Bayesian model that extends VSPYCT by equipping each leaf with a GP predictor. Bayesian oblique splits provide uncertainty-aware partitioning of the input space, while GP leaves model local functional behaviour and enable principled extrapolation beyond the observed target range. We present an efficient inference and prediction scheme that combines posterior sampling of split parameters with \gls{gp} posterior predictions, and a gating mechanism that activates GP-based extrapolation when inputs fall outside the training support of a leaf. Experiments on benchmark regression tasks show improvements in the predictive performance compared to standard variational oblique trees, and substantial performance gains in extrapolation scenarios.
LGNov 23, 2022
FAIRification of MLC dataAna Kostovska, Jasmin Bogatinovski, Andrej Treven et al.
The multi-label classification (MLC) task has increasingly been receiving interest from the machine learning (ML) community, as evidenced by the growing number of papers and methods that appear in the literature. Hence, ensuring proper, correct, robust, and trustworthy benchmarking is of utmost importance for the further development of the field. We believe that this can be achieved by adhering to the recently emerged data management standards, such as the FAIR (Findable, Accessible, Interoperable, and Reusable) and TRUST (Transparency, Responsibility, User focus, Sustainability, and Technology) principles. To FAIRify the MLC datasets, we introduce an ontology-based online catalogue of MLC datasets that follow these principles. The catalogue extensively describes many MLC datasets with comprehensible meta-features, MLC-specific semantic descriptions, and different data provenance information. The MLC data catalogue is extensively described in our recent publication in Nature Scientific Reports, Kostovska & Bogatinovski et al., and available at: http://semantichub.ijs.si/MLCdatasets. In addition, we provide an ontology-based system for easy access and querying of performance/benchmark data obtained from a comprehensive MLC benchmark study. The system is available at: http://semantichub.ijs.si/MLCbenchmark.
CVMar 12
HELM: Hierarchical and Explicit Label Modeling with Graph Learning for Multi-Label Image ClassificationMarjan Stoimchev, Boshko Koloski, Jurica Levatić et al.
Hierarchical multi-label classification (HMLC) is essential for modeling complex label dependencies in remote sensing. Existing methods, however, struggle with multi-path hierarchies where instances belong to multiple branches, and they rarely exploit unlabeled data. We introduce HELM (\textit{Hierarchical and Explicit Label Modeling}), a novel framework that overcomes these limitations. HELM: (i) uses hierarchy-specific class tokens within a Vision Transformer to capture nuanced label interactions; (ii) employs graph convolutional networks to explicitly encode the hierarchical structure and generate hierarchy-aware embeddings; and (iii) integrates a self-supervised branch to effectively leverage unlabeled imagery. We perform a comprehensive evaluation on four remote sensing image (RSI) datasets (UCM, AID, DFC-15, MLRSNet). HELM achieves state-of-the-art performance, consistently outperforming strong baselines in both supervised and semi-supervised settings, demonstrating particular strength in low-label scenarios.
AIApr 19
On Solving the Multiple Variable Gapped Longest Common Subsequence ProblemMarko Djukanović, Nikola Balaban, Christian Blum et al.
This paper addresses the Variable Gapped Longest Common Subsequence (VGLCS) problem, a generalization of the classical LCS problem involving flexible gap constraints between consecutive solutions' characters. The problem arises in molecular sequence comparison, where structural distance constraints between residues must be respected, and in time-series analysis where events are required to occur within specified temporal delays. We propose a search framework based on the root-based state graph representation, in which the state space comprises a generally large number of rooted state subgraphs. To cope with the resulting combinatorial explosion, an iterative beam search strategy is employed, dynamically maintaining a global pool of promising candidate root nodes, enabling effective control of diversification across iterations. To exploit the search for high-quality solutions, several known heuristics from the LCS literature are utilized into the standalone beam search procedure. To the best of our knowledge, this is the first comprehensive computational study on the VGLCS problem comprising 320 synthetic instances with up to 10 input sequences and up to 500 characters. Experimental results show robustness of the designed approach over the baseline beam search in comparable runtimes.
CLSep 26, 2025Code
FoodSEM: Large Language Model Specialized in Food Named-Entity LinkingAna Gjorgjevikj, Matej Martinc, Gjorgjina Cenikj et al.
This paper introduces FoodSEM, a state-of-the-art fine-tuned open-source large language model (LLM) for named-entity linking (NEL) to food-related ontologies. To the best of our knowledge, food NEL is a task that cannot be accurately solved by state-of-the-art general-purpose (large) language models or custom domain-specific models/systems. Through an instruction-response (IR) scenario, FoodSEM links food-related entities mentioned in a text to several ontologies, including FoodOn, SNOMED-CT, and the Hansard taxonomy. The FoodSEM model achieves state-of-the-art performance compared to related models/systems, with F1 scores even reaching 98% on some ontologies and datasets. The presented comparative analyses against zero-shot, one-shot, and few-shot LLM prompting baselines further highlight FoodSEM's superior performance over its non-fine-tuned version. By making FoodSEM and its related resources publicly available, the main contributions of this article include (1) publishing a food-annotated corpora into an IR format suitable for LLM fine-tuning/evaluation, (2) publishing a robust model to advance the semantic understanding of text in the food domain, and (3) providing a strong baseline on food NEL for future benchmarking.
AIAug 21, 2024
Quantifying Behavioral Dissimilarity Between Mathematical ExpressionsSebastian Mežnar, Sašo Džeroski, Ljupčo Todorovski
Quantifying the similarity between mathematical expressions is a fundamental problem in computational mathematics, symbolic reasoning, and scientific discovery. While behavioral notions of similarity have previously been explored in the context of software and program analysis, existing measures for mathematical expressions rely primarily on syntactic form, assessing similarity through symbolic structure rather than actual behavior. Yet syntactically distinct expressions can exhibit nearly identical outputs, while structurally similar ones may behave very differently-especially when the expressions contain free parameters that define families of functions. To address these limitations, we introduce Behavior-aware Expression Dissimilarity (BED), a principled framework for quantifying behavioral distance between mathematical expressions with free parameters. BED represents expressions as joint probability distributions over their input-output pairs and applies the Wasserstein distance to measure behavioral dissimilarity. A computationally efficient stochastic approximation is proposed and shown to be consistent, robust, and capable of inducing a smoother, more meaningful structure over the space of expressions than syntax-based measures. The approach provides a foundation for behavior-based comparison, clustering, and learning of mathematical expressions, with potential direct applications in equation discovery, symbolic regression, and neuro-symbolic modeling.
AIJun 19, 2025
Geometric Learning in Black-Box Optimization: A GNN Framework for Algorithm Performance PredictionAna Kostovska, Carola Doerr, Sašo Džeroski et al.
Automated algorithm performance prediction in numerical blackbox optimization often relies on problem characterizations, such as exploratory landscape analysis features. These features are typically used as inputs to machine learning models and are represented in a tabular format. However, such approaches often overlook algorithm configurations, a key factor influencing performance. The relationships between algorithm operators, parameters, problem characteristics, and performance outcomes form a complex structure best represented as a graph. This work explores the use of heterogeneous graph data structures and graph neural networks to predict the performance of optimization algorithms by capturing the complex dependencies between problems, algorithm configurations, and performance outcomes. We focus on two modular frameworks, modCMA-ES and modDE, which decompose two widely used derivative-free optimization algorithms: the covariance matrix adaptation evolution strategy (CMA-ES) and differential evolution (DE). We evaluate 324 modCMA-ES and 576 modDE variants on 24 BBOB problems across six runtime budgets and two problem dimensions. Achieving up to 36.6% improvement in MSE over traditional tabular-based methods, this work highlights the potential of geometric learning in black-box optimization.
CVMay 15, 2023
Artificial intelligence to advance Earth observation: : A review of models, recent trends, and pathways forwardDevis Tuia, Konrad Schindler, Begüm Demir et al.
Earth observation (EO) is a prime instrument for monitoring land and ocean processes, studying the dynamics at work, and taking the pulse of our planet. This article gives a bird's eye view of the essential scientific tools and approaches informing and supporting the transition from raw EO data to usable EO-based information. The promises, as well as the current challenges of these developments, are highlighted under dedicated sections. Specifically, we cover the impact of (i) Computer vision; (ii) Machine learning; (iii) Advanced processing and computing; (iv) Knowledge-based AI; (v) Explainable AI and causal inference; (vi) Physics-aware models; (vii) User-centric approaches; and (viii) the much-needed discussion of ethical and societal issues related to the massive use of ML technologies in EO.
AIMay 3, 2023
Automated Scientific Discovery: From Equation Discovery to Autonomous Discovery SystemsStefan Kramer, Mattia Cerrato, Jannis Brugger et al.
The paper surveys automated scientific discovery, from equation discovery and symbolic regression to autonomous discovery systems and agents. It discusses the individual approaches from a "big picture" perspective and in context, but also discusses open issues and recent topics like the various roles of deep neural networks in this area, aiding in the discovery of human-interpretable knowledge. Further, we will present closed-loop scientific discovery systems, starting with the pioneering work on the Adam system up to current efforts in fields from material science to astronomy. Finally, we will elaborate on autonomy from a machine learning perspective, but also in analogy to the autonomy levels in autonomous driving. The maximal level, level five, is defined to require no human intervention at all in the production of scientific knowledge. Achieving this is one step towards solving the Nobel Turing Grand Challenge to develop AI Scientists: AI systems capable of making Nobel-quality scientific discoveries highly autonomously at a level comparable, and possibly superior, to the best human scientists by 2050.
LGNov 25, 2021
Unsupervised Feature Ranking via Attribute NetworksUrh Primožič, Blaž Škrlj, Sašo Džeroski et al.
The need for learning from unlabeled data is increasing in contemporary machine learning. Methods for unsupervised feature ranking, which identify the most important features in such data are thus gaining attention, and so are their applications in studying high throughput biological experiments or user bases for recommender systems. We propose FRANe (Feature Ranking via Attribute Networks), an unsupervised algorithm capable of finding key features in given unlabeled data set. FRANe is based on ideas from network reconstruction and network analysis. FRANe performs better than state-of-the-art competitors, as we empirically demonstrate on a large collection of benchmarks. Moreover, we provide the time complexity analysis of FRANe further demonstrating its scalability. Finally, FRANe offers as the result the interpretable relational structures used to derive the feature importances.
LGAug 3, 2021
GalaxAI: Machine learning toolbox for interpretable analysis of spacecraft telemetry dataAna Kostovska, Matej Petković, Tomaž Stepišnik et al.
We present GalaxAI - a versatile machine learning toolbox for efficient and interpretable end-to-end analysis of spacecraft telemetry data. GalaxAI employs various machine learning algorithms for multivariate time series analyses, classification, regression and structured output prediction, capable of handling high-throughput heterogeneous data. These methods allow for the construction of robust and accurate predictive models, that are in turn applied to different tasks of spacecraft monitoring and operations planning. More importantly, besides the accurate building of models, GalaxAI implements a visualisation layer, providing mission specialists and operators with a full, detailed and interpretable view of the data analysis process. We show the utility and versatility of GalaxAI on two use-cases concerning two different spacecraft: i) analysis and planning of Mars Express thermal power consumption and ii) predicting of INTEGRAL's crossings through Van Allen belts.
LGJun 28, 2021
Explaining the Performance of Multi-label Classification Methods with Data Set PropertiesJasmin Bogatinovski, Ljupčo Todorovski, Sašo Džeroski et al.
Meta learning generalizes the empirical experience with different learning tasks and holds promise for providing important empirical insight into the behaviour of machine learning algorithms. In this paper, we present a comprehensive meta-learning study of data sets and methods for multi-label classification (MLC). MLC is a practically relevant machine learning task where each example is labelled with multiple labels simultaneously. Here, we analyze 40 MLC data sets by using 50 meta features describing different properties of the data. The main findings of this study are as follows. First, the most prominent meta features that describe the space of MLC data sets are the ones assessing different aspects of the label space. Second, the meta models show that the most important meta features describe the label space, and, the meta features describing the relationships among the labels tend to occur a bit more often than the meta features describing the distributions between and within the individual labels. Third, the optimization of the hyperparameters can improve the predictive performance, however, quite often the extent of the improvements does not always justify the resource utilization.
NEApr 24, 2021
OPTION: OPTImization Algorithm Benchmarking ONtologyAna Kostovska, Diederick Vermetten, Carola Doerr et al.
Many platforms for benchmarking optimization algorithms offer users the possibility of sharing their experimental data with the purpose of promoting reproducible and reusable research. However, different platforms use different data models and formats, which drastically inhibits identification of relevant data sets, their interpretation, and their interoperability. Consequently, a semantically rich, ontology-based, machine-readable data model is highly desired. We report in this paper on the development of such an ontology, which we name OPTION (OPTImization algorithm benchmarking ONtology). Our ontology provides the vocabulary needed for semantic annotation of the core entities involved in the benchmarking process, such as algorithms, problems, and evaluation measures. It also provides means for automated data integration, improved interoperability, powerful querying capabilities and reasoning, thereby enriching the value of the benchmark data. We demonstrate the utility of OPTION by annotating and querying a corpus of benchmark performance data from the BBOB workshop data - a use case which can be easily extended to cover other benchmarking data collections.
LGFeb 14, 2021
Comprehensive Comparative Study of Multi-Label Classification MethodsJasmin Bogatinovski, Ljupčo Todorovski, Sašo Džeroski et al.
Multi-label classification (MLC) has recently received increasing interest from the machine learning community. Several studies provide reviews of methods and datasets for MLC and a few provide empirical comparisons of MLC methods. However, they are limited in the number of methods and datasets considered. This work provides a comprehensive empirical study of a wide range of MLC methods on a plethora of datasets from various domains. More specifically, our study evaluates 26 methods on 42 benchmark datasets using 20 evaluation measures. The adopted evaluation methodology adheres to the highest literature standards for designing and executing large scale, time-budgeted experimental studies. First, the methods are selected based on their usage by the community, assuring representation of methods across the MLC taxonomy of methods and different base learners. Second, the datasets cover a wide range of complexity and domains of application. The selected evaluation measures assess the predictive performance and the efficiency of the methods. The results of the analysis identify RFPCT, RFDTBR, ECCJ48, EBRJ48 and AdaBoostMH as best performing methods across the spectrum of performance measures. Whenever a new method is introduced, it should be compared to different subsets of MLC methods, determined on the basis of the different evaluation criteria.
LGJan 23, 2021
ReliefE: Feature Ranking in High-dimensional Spaces via Manifold EmbeddingsBlaž Škrlj, Sašo Džeroski, Nada Lavrač et al.
Feature ranking has been widely adopted in machine learning applications such as high-throughput biology and social sciences. The approaches of the popular Relief family of algorithms assign importances to features by iteratively accounting for nearest relevant and irrelevant instances. Despite their high utility, these algorithms can be computationally expensive and not-well suited for high-dimensional sparse input spaces. In contrast, recent embedding-based methods learn compact, low-dimensional representations, potentially facilitating down-stream learning capabilities of conventional learners. This paper explores how the Relief branch of algorithms can be adapted to benefit from (Riemannian) manifold-based embeddings of instance and target spaces, where a given embedding's dimensionality is intrinsic to the dimensionality of the considered data set. The developed ReliefE algorithm is faster and can result in better feature rankings, as shown by our evaluation on 20 real-life data sets for multi-class and multi-label classification tasks. The utility of ReliefE for high-dimensional data sets is ensured by its implementation that utilizes sparse matrix algebraic operations. Finally, the relation of ReliefE to other ranking algorithms is studied via the Fuzzy Jaccard Index.
LGDec 1, 2020
Probabilistic Grammars for Equation DiscoveryJure Brence, Ljupčo Todorovski, Sašo Džeroski
Equation discovery, also known as symbolic regression, is a type of automated modeling that discovers scientific laws, expressed in the form of equations, from observed data and expert knowledge. Deterministic grammars, such as context-free grammars, have been used to limit the search spaces in equation discovery by providing hard constraints that specify which equations to consider and which not. In this paper, we propose the use of probabilistic context-free grammars in equation discovery. Such grammars encode soft constraints, specifying a prior probability distribution on the space of possible equations. We show that probabilistic grammars can be used to elegantly and flexibly formulate the parsimony principle, that favors simpler equations, through probabilities attached to the rules in the grammars. We demonstrate that the use of probabilistic, rather than deterministic grammars, in the context of a Monte-Carlo algorithm for grammar-based equation discovery, leads to more efficient equation discovery. Finally, by specifying prior probability distributions over equation spaces, the foundations are laid for Bayesian approaches to equation discovery.
LGNov 23, 2020
Ensemble- and Distance-Based Feature Ranking for Unsupervised LearningMatej Petković, Dragi Kocev, Blaž Škrlj et al.
In this work, we propose two novel (groups of) methods for unsupervised feature ranking and selection. The first group includes feature ranking scores (Genie3 score, RandomForest score) that are computed from ensembles of predictive clustering trees. The second method is URelief, the unsupervised extension of the Relief family of feature ranking algorithms. Using 26 benchmark data sets and 5 baselines, we show that both the Genie3 score (computed from the ensemble of extra trees) and the URelief method outperform the existing methods and that Genie3 performs best overall, in terms of predictive power of the top-ranked features. Additionally, we analyze the influence of the hyper-parameters of the proposed methods on their performance, and show that for the Genie3 score the highest quality is achieved by the most efficient parameter configuration. Finally, we propose a way of discovering the location of the features in the ranking, which are the most relevant in reality.
LGAug 10, 2020
Feature Ranking for Semi-supervised LearningMatej Petković, Sašo Džeroski, Dragi Kocev
The data made available for analysis are becoming more and more complex along several directions: high dimensionality, number of examples and the amount of labels per example. This poses a variety of challenges for the existing machine learning methods: coping with dataset with a large number of examples that are described in a high-dimensional space and not all examples have labels provided. For example, when investigating the toxicity of chemical compounds there are a lot of compounds available, that can be described with information rich high-dimensional representations, but not all of the compounds have information on their toxicity. To address these challenges, we propose semi-supervised learning of feature ranking. The feature rankings are learned in the context of classification and regression as well as in the context of structured output prediction (multi-label classification, hierarchical multi-label classification and multi-target regression). To the best of our knowledge, this is the first work that treats the task of feature ranking within the semi-supervised structured output prediction context. More specifically, we propose two approaches that are based on tree ensembles and the Relief family of algorithms. The extensive evaluation across 38 benchmark datasets reveals the following: Random Forests perform the best for the classification-like tasks, while for the regression-like tasks Extra-PCTs perform the best, Random Forests are the most efficient method considering induction times across all tasks, and semi-supervised feature rankings outperform their supervised counterpart across a majority of the datasets from the different tasks.
LGFeb 11, 2020
Feature Importance Estimation with Self-Attention NetworksBlaž Škrlj, Sašo Džeroski, Nada Lavrač et al.
Black-box neural network models are widely used in industry and science, yet are hard to understand and interpret. Recently, the attention mechanism was introduced, offering insights into the inner workings of neural language models. This paper explores the use of attention-based neural networks mechanism for estimating feature importance, as means for explaining the models learned from propositional (tabular) data. Feature importance estimates, assessed by the proposed Self-Attention Network (SAN) architecture, are compared with the established ReliefF, Mutual Information and Random Forest-based estimates, which are widely used in practice for model interpretation. For the first time we conduct scale-free comparisons of feature importance estimates across algorithms on ten real and synthetic data sets to study the similarities and differences of the resulting feature importance estimates, showing that SANs identify similar high-ranked features as the other methods. We demonstrate that SANs identify feature interactions which in some cases yield better predictive performance than the baselines, suggesting that attention extends beyond interactions of just a few key features and detects larger feature subsets relevant for the considered learning task.
LGJul 1, 2019
Equation Discovery for Nonlinear System IdentificationNikola Simidjievski, Ljupčo Todorovski, Juš Kocijan et al.
Equation discovery methods enable modelers to combine domain-specific knowledge and system identification to construct models most suitable for a selected modeling task. The method described and evaluated in this paper can be used as a nonlinear system identification method for gray-box modeling. It consists of two interlaced parts of modeling that are computer-aided. The first performs computer-aided identification of a model structure composed of elements selected from user-specified domain-specific modeling knowledge, while the second part performs parameter estimation. In this paper, recent developments of the equation discovery method called process-based modeling, suited for nonlinear system identification, are elaborated and illustrated on two continuous-time case studies. The first case study illustrates the use of the process-based modeling on synthetic data while the second case-study evaluates on measured data for a standard system-identification benchmark. The experimental results clearly demonstrate the ability of process-based modeling to reconstruct both model structure and parameters from measured data.
LGJun 21, 2019
Meta-Model Framework for Surrogate-Based Parameter Estimation in Dynamical SystemsŽiga Lukšič, Jovan Tanevski, Sašo Džeroski et al.
The central task in modeling complex dynamical systems is parameter estimation. This task involves numerous evaluations of a computationally expensive objective function. Surrogate-based optimization introduces a computationally efficient predictive model that approximates the value of the objective function. The standard approach involves learning a surrogate from training examples that correspond to past evaluations of the objective function. Current surrogate-based optimization methods use static, predefined substitution strategies that decide when to use the surrogate and when the true objective. We introduce a meta-model framework where the substitution strategy is dynamically adapted to the solution space of the given optimization problem. The meta model encapsulates the objective function, the surrogate model and the model of the substitution strategy, as well as components for learning them. The framework can be seamlessly coupled with an arbitrary optimization algorithm without any modification: it replaces the objective function and autonomously decides how to evaluate a given candidate solution. We test the utility of the framework on three tasks of estimating parameters of real-world models of dynamical systems. The results show that the meta model significantly improves the efficiency of optimization, reducing the total number of evaluations of the objective function up to an average of 77%.
LGSep 3, 2018
Machine learning for predicting thermal power consumption of the Mars Express SpacecraftMatej Petković, Redouane Boumghar, Martin Breskvar et al.
The thermal subsystem of the Mars Express (MEX) spacecraft keeps the on-board equipment within its pre-defined operating temperatures range. To plan and optimize the scientific operations of MEX, its operators need to estimate in advance, as accurately as possible, the power consumption of the thermal subsystem. The remaining power can then be allocated for scientific purposes. We present a machine learning pipeline for efficiently constructing accurate predictive models for predicting the power of the thermal subsystem on board MEX. In particular, we employ state-of-the-art feature engineering approaches for transforming raw telemetry data, in turn used for constructing accurate models with different state-of-the-art machine learning methods. We show that the proposed pipeline considerably improve our previous (competition-winning) work in terms of time efficiency and predictive performance. Moreover, while achieving superior predictive performance, the constructed models also provide important insight into the spacecraft's behavior, allowing for further analyses and optimal planning of MEX's operation.
QMFeb 20, 2017
Using Redescription Mining to Relate Clinical and Biological Characteristics of Cognitively Impaired and Alzheimer's Disease PatientsMatej Mihelčić, Goran Šimić, Mirjana Babić Leko et al.
We used redescription mining to find interpretable rules revealing associations between those determinants that provide insights about the Alzheimer's disease (AD). We extended the CLUS-RM redescription mining algorithm to a constraint-based redescription mining (CBRM) setting, which enables several modes of targeted exploration of specific, user-constrained associations. Redescription mining enabled finding specific constructs of clinical and biological attributes that describe many groups of subjects of different size, homogeneity and levels of cognitive impairment. We confirmed some previously known findings. However, in some instances, as with the attributes: testosterone, the imaging attribute Spatial Pattern of Abnormalities for Recognition of Early AD, as well as the levels of leptin and angiopoietin-2 in plasma, we corroborated previously debatable findings or provided additional information about these variables and their association with AD pathogenesis. Applying redescription mining on ADNI data resulted with the discovery of one largely unknown attribute: the Pregnancy-Associated Protein-A (PAPP-A), which we found highly associated with cognitive impairment in AD. Statistically significant correlations (p <= 0.01) were found between PAPP-A and various different clinical tests. The high importance of this finding lies in the fact that PAPP-A is a metalloproteinase, known to cleave insulin-like growth factor binding proteins. Since it also shares similar substrates with A Disintegrin and the Metalloproteinase family of enzymes that act as α-secretase to physiologically cleave amyloid precursor protein (APP) in the non-amyloidogenic pathway, it could be directly involved in the metabolism of APP very early during the disease course. Therefore, further studies should investigate the role of PAPP-A in the development of AD more thoroughly.
AIJun 13, 2016
A framework for redescription set constructionMatej Mihelčić, Sašo Džeroski, Nada Lavrač et al.
Redescription mining is a field of knowledge discovery that aims at finding different descriptions of similar subsets of instances in the data. These descriptions are represented as rules inferred from one or more disjoint sets of attributes, called views. As such, they support knowledge discovery process and help domain experts in formulating new hypotheses or constructing new knowledge bases and decision support systems. In contrast to previous approaches that typically create one smaller set of redescriptions satisfying a pre-defined set of constraints, we introduce a framework that creates large and heterogeneous redescription set from which user/expert can extract compact sets of differing properties, according to its own preferences. Construction of large and heterogeneous redescription set relies on CLUS-RM algorithm and a novel, conjunctive refinement procedure that facilitates generation of larger and more accurate redescription sets. The work also introduces the variability of redescription accuracy when missing values are present in the data, which significantly extends applicability of the method. Crucial part of the framework is the redescription set extraction based on heuristic multi-objective optimization procedure that allows user to define importance levels towards one or more redescription quality criteria. We provide both theoretical and empirical comparison of the novel framework against current state of the art redescription mining algorithms and show that it represents more efficient and versatile approach for mining redescriptions from data.