CYSep 26, 2024
Geospatial Road Cycling Race Results Data SetBram Janssens, Luca Pappalardo, Jelle De Bock et al.
The field of cycling analytics has only recently started to develop due to limited access to open data sources. Accordingly, research and data sources are very divergent, with large differences in information used across studies. To improve this, and facilitate further research in the field, we propose the publication of a data set which links thousands of professional race results from the period 2017-2023 to detailed geographic information about the courses, an essential aspect in road cycling analytics. Initial use cases are proposed, showcasing the usefulness in linking these two data sources.
LGSep 1, 2025
Evaluating the stability of model explanations in instance-dependent cost-sensitive credit scoringMatteo Ballegeer, Matthias Bogaert, Dries F. Benoit
Instance-dependent cost-sensitive (IDCS) classifiers offer a promising approach to improving cost-efficiency in credit scoring by tailoring loss functions to instance-specific costs. However, the impact of such loss functions on the stability of model explanations remains unexplored in literature, despite increasing regulatory demands for transparency. This study addresses this gap by evaluating the stability of Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) when applied to IDCS models. Using four publicly available credit scoring datasets, we first assess the discriminatory power and cost-efficiency of IDCS classifiers, introducing a novel metric to enhance cross-dataset comparability. We then investigate the stability of SHAP and LIME feature importance rankings under varying degrees of class imbalance through controlled resampling. Our results reveal that while IDCS classifiers improve cost-efficiency, they produce significantly less stable explanations compared to traditional models, particularly as class imbalance increases, highlighting a critical trade-off between cost optimization and interpretability in credit scoring. Amid increasing regulatory scrutiny on explainability, this research underscores the pressing need to address stability issues in IDCS classifiers to ensure that their cost advantages are not undermined by unstable or untrustworthy explanations.
CLSep 18, 2025
LLM-Assisted Topic Reduction for BERTopic on Social Media DataWannes Janssens, Matthias Bogaert, Dirk Van den Poel
The BERTopic framework leverages transformer embeddings and hierarchical clustering to extract latent topics from unstructured text corpora. While effective, it often struggles with social media data, which tends to be noisy and sparse, resulting in an excessive number of overlapping topics. Recent work explored the use of large language models for end-to-end topic modelling. However, these approaches typically require significant computational overhead, limiting their scalability in big data contexts. In this work, we propose a framework that combines BERTopic for topic generation with large language models for topic reduction. The method first generates an initial set of topics and constructs a representation for each. These representations are then provided as input to the language model, which iteratively identifies and merges semantically similar topics. We evaluate the approach across three Twitter/X datasets and four different language models. Our method outperforms the baseline approach in enhancing topic diversity and, in many cases, coherence, with some sensitivity to dataset characteristics and initial parameter selection.
GNJun 28, 2025
Predicting and Explaining Customer Data Sharing in the Open BankingJoão B. G. de Brito, Rodrigo Heldt, Cleo S. Silveira et al.
The emergence of Open Banking represents a significant shift in financial data management, influencing financial institutions' market dynamics and marketing strategies. This increased competition creates opportunities and challenges, as institutions manage data inflow to improve products and services while mitigating data outflow that could aid competitors. This study introduces a framework to predict customers' propensity to share data via Open Banking and interprets this behavior through Explanatory Model Analysis (EMA). Using data from a large Brazilian financial institution with approximately 3.2 million customers, a hybrid data balancing strategy incorporating ADASYN and NEARMISS techniques was employed to address the infrequency of data sharing and enhance the training of XGBoost models. These models accurately predicted customer data sharing, achieving 91.39% accuracy for inflow and 91.53% for outflow. The EMA phase combined the Shapley Additive Explanations (SHAP) method with the Classification and Regression Tree (CART) technique, revealing the most influential features on customer decisions. Key features included the number of transactions and purchases in mobile channels, interactions within these channels, and credit-related features, particularly credit card usage across the national banking system. These results highlight the critical role of mobile engagement and credit in driving customer data-sharing behaviors, providing financial institutions with strategic insights to enhance competitiveness and innovation in the Open Banking environment.
LGMay 17, 2023
Bike2Vec: Vector Embedding Representations of Road Cycling Riders and RacesEthan Baron, Bram Janssens, Matthias Bogaert
Vector embeddings have been successfully applied in several domains to obtain effective representations of non-numeric data which can then be used in various downstream tasks. We present a novel application of vector embeddings in professional road cycling by demonstrating a method to learn representations for riders and races based on historical results. We use unsupervised learning techniques to validate that the resultant embeddings capture interesting features of riders and races. These embeddings could be used for downstream prediction tasks such as early talent identification and race outcome prediction.