LGOct 25, 2022
Predicting Survival Outcomes in the Presence of Unlabeled DataFateme Nateghi Haredasht, Celine Vens
Many clinical studies require the follow-up of patients over time. This is challenging: apart from frequently observed drop-out, there are often also organizational and financial challenges, which can lead to reduced data collection and, in turn, can complicate subsequent analyses. In contrast, there is often plenty of baseline data available of patients with similar characteristics and background information, e.g., from patients that fall outside the study time window. In this article, we investigate whether we can benefit from the inclusion of such unlabeled data instances to predict accurate survival times. In other words, we introduce a third level of supervision in the context of survival analysis, apart from fully observed and censored instances, we also include unlabeled instances. We propose three approaches to deal with this novel setting and provide an empirical comparison over fifteen real-life clinical and gene expression survival datasets. Our results demonstrate that all approaches are able to increase the predictive performance over independent test data. We also show that integrating the partial supervision provided by censored data in a semi-supervised wrapper approach generally provides the best results, often achieving high improvements, compared to not using unlabeled data.
LGMar 29, 2022
BELLATREX: Building Explanations through a LocaLly AccuraTe Rule EXtractorKlest Dedja, Felipe Kenji Nakano, Konstantinos Pliakos et al.
Tree-ensemble algorithms, such as random forest, are effective machine learning methods popular for their flexibility, high performance, and robustness to overfitting. However, since multiple learners are combined, they are not as interpretable as a single decision tree. In this work we propose a novel method that is Building Explanations through a LocalLy AccuraTe Rule EXtractor (Bellatrex), and is able to explain the forest prediction for a given test instance with only a few diverse rules. Starting from the decision trees generated by a random forest, our method 1) pre-selects a subset of the rules used to make the prediction, 2) creates a vector representation of such rules, 3) projects them to a low-dimensional space, 4) clusters such representations to pick a rule from each cluster to explain the instance prediction. We test the effectiveness of Bellatrex on 89 real-world datasets and we demonstrate the validity of our method for binary classification, regression, multi-label classification and time-to-event tasks. To the best of our knowledge, it is the first time that an interpretability toolbox can handle all these tasks within the same framework. We also show that our extracted surrogate model can approximate the performance of the corresponding ensemble model in all considered tasks, while selecting only few trees from the whole forest. We also show that our proposed approach substantially outperforms other explainable methods in terms of predictive performance.
LGJul 13, 2022
Hierarchy exploitation to detect missing annotations on hierarchical multi-label classificationMiguel Romero, Felipe Kenji Nakano, Jorge Finke et al.
The availability of genomic data has grown exponentially in the last decade, mainly due to the development of new sequencing technologies. Based on the interactions between genes (and gene products) extracted from the increasing genomic data, numerous studies have focused on the identification of associations between genes and functions. While these studies have shown great promise, the problem of annotating genes with functions remains an open challenge. In this work, we present a method to detect missing annotations in hierarchical multi-label classification datasets. We propose a method that exploits the class hierarchy by computing aggregated probabilities to the paths of classes from the leaves to the root for each instance. The proposed method is presented in the context of predicting missing gene function annotations, where these aggregated probabilities are further used to select a set of annotations to be verified through in vivo experiments. The experiments on Oriza sativa Japonica, a variety of rice, showcase that incorporating the hierarchy of classes into the method often improves the predictive performance and our proposed method yields superior results when compared to competitor methods from the literature.
IROct 31, 2025
Pairwise and Attribute-Aware Decision Tree-Based Preference Elicitation for Cold-Start RecommendationAlireza Gharahighehi, Felipe Kenji Nakano, Xuehua Yang et al.
Recommender systems (RSs) are intelligent filtering methods that suggest items to users based on their inferred preferences, derived from their interaction history on the platform. Collaborative filtering-based RSs rely on users past interactions to generate recommendations. However, when a user is new to the platform, referred to as a cold-start user, there is no historical data available, making it difficult to provide personalized recommendations. To address this, rating elicitation techniques can be used to gather initial ratings or preferences on selected items, helping to build an early understanding of the user's tastes. Rating elicitation approaches are generally categorized into two types: non-personalized and personalized. Decision tree-based rating elicitation is a personalized method that queries users about their preferences at each node of the tree until sufficient information is gathered. In this paper, we propose an extension to the decision tree approach for rating elicitation in the context of music recommendation. Our method: (i) elicits not only item ratings but also preferences on attributes such as genres to better cluster users, and (ii) uses item pairs instead of single items at each node to more effectively learn user preferences. Experimental results demonstrate that both proposed enhancements lead to improved performance, particularly with a reduced number of queries.
IRJun 22, 2023
HypeRS: Building a Hypergraph-driven ensemble Recommender SystemAlireza Gharahighehi, Celine Vens, Konstantinos Pliakos
Recommender systems are designed to predict user preferences over collections of items. These systems process users' previous interactions to decide which items should be ranked higher to satisfy their desires. An ensemble recommender system can achieve great recommendation performance by effectively combining the decisions generated by individual models. In this paper, we propose a novel ensemble recommender system that combines predictions made by different models into a unified hypergraph ranking framework. This is the first time that hypergraph ranking has been employed to model an ensemble of recommender systems. Hypergraphs are generalizations of graphs where multiple vertices can be connected via hyperedges, efficiently modeling high-order relations. We differentiate real and predicted connections between users and items by assigning different hyperedge weights to individual recommender systems. We perform experiments using four datasets from the fields of movie, music and news media recommendation. The obtained results show that the ensemble hypergraph ranking method generates more accurate recommendations compared to the individual models and a weighted hybrid approach. The assignment of different hyperedge weights to the ensemble hypergraph further improves the performance compared to a setting with identical hyperedge weights.
LGNov 16, 2025
Oxytrees: Model Trees for Bipartite LearningPedro Ilídio, Felipe Kenji Nakano, Alireza Gharahighehi et al.
Bipartite learning is a machine learning task that aims to predict interactions between pairs of instances. It has been applied to various domains, including drug-target interactions, RNA-disease associations, and regulatory network inference. Despite being widely investigated, current methods still present drawbacks, as they are often designed for a specific application and thus do not generalize to other problems or present scalability issues. To address these challenges, we propose Oxytrees: proxy-based biclustering model trees. Oxytrees compress the interaction matrix into row- and column-wise proxy matrices, significantly reducing training time without compromising predictive performance. We also propose a new leaf-assignment algorithm that significantly reduces the time taken for prediction. Finally, Oxytrees employ linear models using the Kronecker product kernel in their leaves, resulting in shallower trees and thus even faster training. Using 15 datasets, we compared the predictive performance of ensembles of Oxytrees with that of the current state-of-the-art. We achieved up to 30-fold improvement in training times compared to state-of-the-art biclustering forests, while demonstrating competitive or superior performance in most evaluation settings, particularly in the inductive setting. Finally, we provide an intuitive Python API to access all datasets, methods and evaluation measures used in this work, thus enabling reproducible research in this field.
CYFeb 27, 2025
Enhancing Collaborative Filtering-Based Course Recommendations by Exploiting Time-to-Event Information with Survival AnalysisAlireza Gharahighehi, Achilleas Ghinis, Michela Venturini et al.
Massive Open Online Courses (MOOCs) are emerging as a popular alternative to traditional education, offering learners the flexibility to access a wide range of courses from various disciplines, anytime and anywhere. Despite this accessibility, a significant number of enrollments in MOOCs result in dropouts. To enhance learner engagement, it is crucial to recommend courses that align with their preferences and needs. Course Recommender Systems (RSs) can play an important role in this by modeling learners' preferences based on their previous interactions within the MOOC platform. Time-to-dropout and time-to-completion in MOOCs, like other time-to-event prediction tasks, can be effectively modeled using survival analysis (SA) methods. In this study, we apply SA methods to improve collaborative filtering recommendation performance by considering time-to-event in the context of MOOCs. Our proposed approach demonstrates superior performance compared to collaborative filtering methods trained based on learners' interactions with MOOCs, as evidenced by two performance measures on three publicly available datasets. The findings underscore the potential of integrating SA methods with RSs to enhance personalization in MOOCs.
LGNov 1, 2024
Label Cluster Chains for Multi-Label ClassificationElaine Cecília Gatto, Felipe Nakano Kenji, Jesse Read et al.
Multi-label classification is a type of supervised machine learning that can simultaneously assign multiple labels to an instance. To solve this task, some methods divide the original problem into several sub-problems (local approach), others learn all labels at once (global approach), and others combine several classifiers (ensemble approach). Regardless of the approach used, exploring and learning label correlations is important to improve the classifier predictions. Ensemble of Classifier Chains (ECC) is a well-known multi-label method that considers label correlations and can achieve good overall performance on several multi-label datasets and evaluation measures. However, one of the challenges when working with ECC is the high dimensionality of the label space, which can impose limitations for fully-cascaded chains as the complexity increases regarding feature space expansion. To improve classifier chains, we propose a method to chain disjoint correlated label clusters obtained by applying a partition method in the label space. During the training phase, the ground truth labels of each cluster are used as new features for all of the following clusters. During the test phase, the predicted labels of clusters are used as new features for all the following clusters. Our proposal, called Label Cluster Chains for Multi-Label Classification (LCC-ML), uses multi-label Random Forests as base classifiers in each cluster, combining their predictions to obtain a final multi-label classification. Our proposal obtained better results compared to the original ECC. This shows that learning and chaining disjoint correlated label clusters can better explore and learn label correlations.
IRFeb 5, 2021
Diversification in Session-based News Recommender SystemsAlireza Gharahighehi, Celine Vens
Recommender systems are widely applied in digital platforms such as news websites to personalize services based on user preferences. In news websites most of users are anonymous and the only available data is sequences of items in anonymous sessions. Due to this, typical collaborative filtering methods, which are highly applied in many applications, are not effective in news recommendations. In this context, session-based recommenders are able to recommend next items given the sequence of previous items in the active session. Neighborhood-based session-based recommenders has been shown to be highly effective compared to more sophisticated approaches. In this study we propose scenarios to make these session-based recommender systems diversity-aware and to address the filter bubble phenomenon. The filter bubble phenomenon is a common concern in news recommendation systems and it occurs when the system narrows the information and deprives users of diverse information. The results of applying the proposed scenarios show that these diversification scenarios improve the diversity measures in these session-based recommender systems based on four news datasets.
LGDec 22, 2020
Drug-Target Interaction Prediction via an Ensemble of Weighted Nearest Neighbors with Interaction RecoveryBin Liu, Konstantinos Pliakos, Celine Vens et al.
Predicting drug-target interactions (DTI) via reliable computational methods is an effective and efficient way to mitigate the enormous costs and time of the drug discovery process. Structure-based drug similarities and sequence-based target protein similarities are the commonly used information for DTI prediction. Among numerous computational methods, neighborhood-based chemogenomic approaches that leverage drug and target similarities to perform predictions directly are simple but promising ones. However, existing similarity-based methods need to be re-trained to predict interactions for any new drugs or targets and cannot directly perform predictions for both new drugs, new targets, and new drug-target pairs. Furthermore, a large amount of missing (undetected) interactions in current DTI datasets hinders most DTI prediction methods. To address these issues, we propose a new method denoted as Weighted k-Nearest Neighbor with Interaction Recovery (WkNNIR). Not only can WkNNIR estimate interactions of any new drugs and/or new targets without any need of re-training, but it can also recover missing interactions (false negatives). In addition, WkNNIR exploits local imbalance to promote the influence of more reliable similarities on the interaction recovery and prediction processes. We also propose a series of ensemble methods that employ diverse sampling strategies and could be coupled with WkNNIR as well as any other DTI prediction method to improve performance. Experimental results over five benchmark datasets demonstrate the effectiveness of our approaches in predicting drug-target interactions. Lastly, we confirm the practical prediction ability of proposed methods to discover reliable interactions that were not reported in the original benchmark datasets.
IRDec 1, 2020
Fair Multi-Stakeholder News Recommender System with Hypergraph rankingAlireza Gharahighehi, Celine Vens, Konstantinos Pliakos
Recommender systems are typically designed to fulfill end user needs. However, in some domains the users are not the only stakeholders in the system. For instance, in a news aggregator website users, authors, magazines as well as the platform itself are potential stakeholders. Most of the collaborative filtering recommender systems suffer from popularity bias. Therefore, if the recommender system only considers users' preferences, presumably it over-represents popular providers and under-represents less popular providers. To address this issue one should consider other stakeholders in the generated ranked lists. In this paper we demonstrate that hypergraph learning has the natural capability of handling a multi-stakeholder recommendation task. A hypergraph can model high order relations between different types of objects and therefore is naturally inclined to generate recommendation lists considering multiple stakeholders. We form the recommendations in time-wise rounds and learn to adapt the weights of stakeholders to increase the coverage of low-covered stakeholders over time. The results show that the proposed approach counters popularity bias and produces fairer recommendations with respect to authors in two news datasets, at a low cost in precision.
LGNov 3, 2020
Deep tree-ensembles for multi-output predictionFelipe Kenji Nakano, Konstantinos Pliakos, Celine Vens
Recently, deep neural networks have expanded the state-of-art in various scientific fields and provided solutions to long standing problems across multiple application domains. Nevertheless, they also suffer from weaknesses since their optimal performance depends on massive amounts of training data and the tuning of an extended number of parameters. As a countermeasure, some deep-forest methods have been recently proposed, as efficient and low-scale solutions. Despite that, these approaches simply employ label classification probabilities as induced features and primarily focus on traditional classification and regression tasks, leaving multi-output prediction under-explored. Moreover, recent work has demonstrated that tree-embeddings are highly representative, especially in structured output prediction. In this direction, we propose a novel deep tree-ensemble (DTE) model, where every layer enriches the original feature set with a representation learning component based on tree-embeddings. In this paper, we specifically focus on two structured output prediction tasks, namely multi-label classification and multi-target regression. We conducted experiments using multiple benchmark datasets and the obtained results confirm that our method provides superior results to state-of-the-art methods in both tasks.