LGJan 19, 2023
A Machine Learning Approach for Player and Position Adjusted Expected Goals in Football (Soccer)James H. Hewitt, Oktay Karakuş
Football is a very result-driven industry, with goals being rarer than in most sports, so having further parameters to judge the performance of teams and individuals is key. Expected Goals (xG) allow further insight than just a scoreline. To tackle the need for further analysis in football, this paper uses machine learning applications that are developed and applied to Football Event data. From the concept, a Binary Classification problem is created whereby a probabilistic valuation is outputted using Logistic Regression and Gradient Boosting based approaches. The model successfully predicts xGs probability values for football players based on 15,575 shots. The proposed solution utilises StatsBomb as the data provider and an industry benchmark to tune the models in the right direction. The proposed ML solution for xG is further used to tackle the age-old cliche of: 'the ball has fallen to the wrong guy there'. The development of the model is used to adjust and gain more realistic values of expected goals than the general models show. To achieve this, this paper tackles Positional Adjusted xG, splitting the training data into Forward, Midfield, and Defence with the aim of providing insight into player qualities based on their positional sub-group. Positional Adjusted xG successfully predicts and proves that more attacking players are better at accumulating xG. The highest value belonged to Forwards followed by Midfielders and Defenders. Finally, this study has further developments into Player Adjusted xG with the aim of proving that Messi is statistically at a higher efficiency level than the average footballer. This is achieved by using Messi subset samples to quantify his qualities in comparison to the average xG models finding that Messi xG performs 347 xG higher than the general model outcome.
LGNov 1, 2022
Predicting air quality via multimodal AI and satellite imageryAndrew Rowley, Oktay Karakuş
Climate change may be classified as the most important environmental problem that the Earth is currently facing, and affects all living species on Earth. Given that air-quality monitoring stations are typically ground-based their abilities to detect pollutant distributions are often restricted to wide areas. Satellites however have the potential for studying the atmosphere at large; the European Space Agency (ESA) Copernicus project satellite, "Sentinel-5P" is a newly launched satellite capable of measuring a variety of pollutant information with publicly available data outputs. This paper seeks to create a multi-modal machine learning model for predicting air-quality metrics where monitoring stations do not exist. The inputs of this model will include a fusion of ground measurements and satellite data with the goal of highlighting pollutant distribution and motivating change in societal and industrial behaviors. A new dataset of European pollution monitoring station measurements is created with features including $\textit{altitude, population, etc.}$ from the ESA Copernicus project. This dataset is used to train a multi-modal ML model, Air Quality Network (AQNet) capable of fusing these various types of data sources to output predictions of various pollutants. These predictions are then aggregated to create an "air-quality index" that could be used to compare air quality over different regions. Three pollutants, NO$_2$, O$_3$, and PM$_{10}$, are predicted successfully by AQNet and the network was found to be useful compared to a model only using satellite imagery. It was also found that the addition of supporting data improves predictions. When testing the developed AQNet on out-of-sample data of the UK and Ireland, we obtain satisfactory estimates though on average pollution metrics were roughly overestimated by around 20\%.
CLJan 19, 2023
Sentiment Analysis for Measuring Hope and Fear from Reddit Posts During the 2022 Russo-Ukrainian ConflictAlessio Guerra, Oktay Karakuş
This paper proposes a novel lexicon-based unsupervised sentimental analysis method to measure the $``\textit{hope}"$ and $``\textit{fear}"$ for the 2022 Ukrainian-Russian Conflict. $\textit{Reddit.com}$ is utilised as the main source of human reactions to daily events during nearly the first three months of the conflict. The top 50 $``hot"$ posts of six different subreddits about Ukraine and news (Ukraine, worldnews, Ukraina, UkrainianConflict, UkraineWarVideoReport, UkraineWarReports) and their relative comments are scraped and a data set is created. On this corpus, multiple analyses such as (1) public interest, (2) hope/fear score, (3) stock price interaction are employed. We promote using a dictionary approach, which scores the hopefulness of every submitted user post. The Latent Dirichlet Allocation (LDA) algorithm of topic modelling is also utilised to understand the main issues raised by users and what are the key talking points. Experimental analysis shows that the hope strongly decreases after the symbolic and strategic losses of Azovstal (Mariupol) and Severodonetsk. Spikes in hope/fear, both positives and negatives, are present after important battles, but also some non-military events, such as Eurovision and football games.
APNov 22, 2023
Bayes-xG: Player and Position Correction on Expected Goals (xG) using Bayesian Hierarchical ApproachAlexander Scholtes, Oktay Karakuş
This study employs Bayesian methodologies to explore the influence of player or positional factors in predicting the probability of a shot resulting in a goal, measured by the expected goals (xG) metric. Utilising publicly available data from StatsBomb, Bayesian hierarchical logistic regressions are constructed, analysing approximately 10,000 shots from the English Premier League to ascertain whether positional or player-level effects impact xG. The findings reveal positional effects in a basic model that includes only distance to goal and shot angle as predictors, highlighting that strikers and attacking midfielders exhibit a higher likelihood of scoring. However, these effects diminish when more informative predictors are introduced. Nevertheless, even with additional predictors, player-level effects persist, indicating that certain players possess notable positive or negative xG adjustments, influencing their likelihood of scoring a given chance. The study extends its analysis to data from Spain's La Liga and Germany's Bundesliga, yielding comparable results. Additionally, the paper assesses the impact of prior distribution choices on outcomes, concluding that the priors employed in the models provide sound results but could be refined to enhance sampling efficiency for constructing more complex and extensive models feasibly.
SPFeb 23
The Sim-to-Real Gap in MRS Quantification: A Systematic Deep Learning Validation for GABAZien Ma, S. M. Shermer, Oktay Karakuş et al.
Magnetic resonance spectroscopy (MRS) is used to quantify metabolites in vivo and estimate biomarkers for conditions ranging from neurological disorders to cancers. Quantifying low-concentration metabolites such as GABA ($γ$-aminobutyric acid) is challenging due to low signal-to-noise ratio (SNR) and spectral overlap. We investigate and validate deep learning for quantifying complex, low-SNR, overlapping signals from MEGA-PRESS spectra, devise a convolutional neural network (CNN) and a Y-shaped autoencoder (YAE), and select the best models via Bayesian optimisation on 10,000 simulated spectra from slice-profile-aware MEGA-PRESS simulations. The selected models are trained on 100,000 simulated spectra. We validate their performance on 144 spectra from 112 experimental phantoms containing five metabolites of interest (GABA, Glu, Gln, NAA, Cr) with known ground truth concentrations across solution and gel series acquired at 3 T under varied bandwidths and implementations. These models are further assessed against the widely used LCModel quantification tool. On simulations, both models achieve near-perfect agreement (small MAEs; regression slopes $\approx 1.00$, $R^2 \approx 1.00$). On experimental phantom data, errors initially increased substantially. However, modelling variable linewidths in the training data significantly reduced this gap. The best augmented deep learning models achieved a mean MAE for GABA over all phantom spectra of 0.151 (YAE) and 0.160 (FCNN) in max-normalised relative concentrations, outperforming the conventional baseline LCModel (0.220). A sim-to-real gap remains, but physics-informed data augmentation substantially reduced it. Phantom ground truth is needed to judge whether a method will perform reliably on real data.
CVMay 2, 2025Code
Core-Set Selection for Data-efficient Land Cover SegmentationKeiller Nogueira, Akram Zaytar, Wanli Ma et al.
The increasing accessibility of remotely sensed data and the potential of such data to inform large-scale decision-making has driven the development of deep learning models for many Earth Observation tasks. Traditionally, such models must be trained on large datasets. However, the common assumption that broadly larger datasets lead to better outcomes tends to overlook the complexities of the data distribution, the potential for introducing biases and noise, and the computational resources required for processing and storing vast datasets. Therefore, effective solutions should consider both the quantity and quality of data. In this paper, we propose six novel core-set selection methods for selecting important subsets of samples from remote sensing image segmentation datasets that rely on imagery only, labels only, and a combination of each. We benchmark these approaches against a random-selection baseline on three commonly used land cover classification datasets: DFC2022, Vaihingen, and Potsdam. In each of the datasets, we demonstrate that training on a subset of samples outperforms the random baseline, and some approaches outperform training on all available data. This result shows the importance and potential of data-centric learning for the remote sensing domain. The code is available at https://github.com/keillernogueira/data-centric-rs-classification/.
IVJan 4Code
UniCrop: A Universal, Multi-Source Data Engineering Pipeline for Scalable Crop Yield PredictionEmiliya Khidirova, Oktay Karakuş
Accurate crop yield prediction relies on diverse data streams, including satellite, meteorological, soil, and topographic information. However, despite rapid advances in machine learning, existing approaches remain crop- or region-specific and require data engineering efforts. This limits scalability, reproducibility, and operational deployment. This study introduces UniCrop, a universal and reusable data pipeline designed to automate the acquisition, cleaning, harmonisation, and engineering of multi-source environmental data for crop yield prediction. For any given location, crop type, and temporal window, UniCrop automatically retrieves, harmonises, and engineers over 200 environmental variables (Sentinel-1/2, MODIS, ERA5-Land, NASA POWER, SoilGrids, and SRTM), reducing them to a compact, analysis-ready feature set utilising a structured feature reduction workflow with minimum redundancy maximum relevance (mRMR). To validate, UniCrop was applied to a rice yield dataset comprising 557 field observations. Using only the selected 15 features, four baseline machine learning models (LightGBM, Random Forest, Support Vector Regression, and Elastic Net) were trained. LightGBM achieved the best single-model performance (RMSE = 465.1 kg/ha, $R^2 = 0.6576$), while a constrained ensemble of all baselines further improved accuracy (RMSE = 463.2 kg/ha, $R^2 = 0.6604$). UniCrop contributes a scalable and transparent data-engineering framework that addresses the primary bottleneck in operational crop yield modelling: the preparation of consistent and harmonised multi-source data. By decoupling data specification from implementation and supporting any crop, region, and time frame through simple configuration updates, UniCrop provides a practical foundation for scalable agricultural analytics. The code and implementation documentation are shared in https://github.com/CoDIS-Lab/UniCrop.
IVFeb 9, 2025
Multi-modal Data Fusion and Deep Ensemble Learning for Accurate Crop Yield PredictionAkshay Dagadu Yewle, Laman Mirzayeva, Oktay Karakuş
This study introduces RicEns-Net, a novel Deep Ensemble model designed to predict crop yields by integrating diverse data sources through multimodal data fusion techniques. The research focuses specifically on the use of synthetic aperture radar (SAR), optical remote sensing data from Sentinel 1, 2, and 3 satellites, and meteorological measurements such as surface temperature and rainfall. The initial field data for the study were acquired through Ernst & Young's (EY) Open Science Challenge 2023. The primary objective is to enhance the precision of crop yield prediction by developing a machine-learning framework capable of handling complex environmental data. A comprehensive data engineering process was employed to select the most informative features from over 100 potential predictors, reducing the set to 15 features from 5 distinct modalities. This step mitigates the ``curse of dimensionality" and enhances model performance. The RicEns-Net architecture combines multiple machine learning algorithms in a deep ensemble framework, integrating the strengths of each technique to improve predictive accuracy. Experimental results demonstrate that RicEns-Net achieves a mean absolute error (MAE) of 341 kg/Ha (roughly corresponds to 5-6\% of the lowest average yield in the region), significantly exceeding the performance of previous state-of-the-art models, including those developed during the EY challenge.
SIFeb 13
Semantic Communities and Boundary-Spanning Lyrics in K-pop: A Graph-Based Unsupervised AnalysisOktay Karakuş
Large-scale lyric corpora present unique challenges for data-driven analysis, including the absence of reliable annotations, multilingual content, and high levels of stylistic repetition. Most existing approaches rely on supervised classification, genre labels, or coarse document-level representations, limiting their ability to uncover latent semantic structure. We present a graph-based framework for unsupervised discovery and evaluation of semantic communities in K-pop lyrics using line-level semantic representations. By constructing a similarity graph over lyric texts and applying community detection, we uncover stable micro-theme communities without genre, artist, or language supervision. We further identify boundary-spanning songs via graph-theoretic bridge metrics and analyse their structural properties. Across multiple robustness settings, boundary-spanning lyrics exhibit higher lexical diversity and lower repetition compared to core community members, challenging the assumption that hook intensity or repetition drives cross-theme connectivity. Our framework is language-agnostic and applicable to unlabeled cultural text corpora.
SPNov 28, 2025
What If They Took the Shot? A Hierarchical Bayesian Framework for Counterfactual Expected GoalsMikayil Mahmudlu, Oktay Karakuş, Hasan Arkadaş
This study develops a hierarchical Bayesian framework that integrates expert domain knowledge to quantify player-specific effects in expected goals (xG) estimation, addressing a limitation of standard models that treat all players as identical finishers. Using 9,970 shots from StatsBomb's 2015-16 data and Football Manager 2017 ratings, we combine Bayesian logistic regression with informed priors to stabilise player-level estimates, especially for players with few shots. The hierarchical model reduces posterior uncertainty relative to weak priors and achieves strong external validity: hierarchical and baseline predictions correlate at R2 = 0.75, while an XGBoost benchmark validated against StatsBomb xG reaches R2 = 0.833. The model uncovers interpretable specialisation profiles, including one-on-one finishing (Aguero, Suarez, Belotti, Immobile, Martial), long-range shooting (Pogba), and first-touch execution (Insigne, Salah, Gameiro). It also identifies latent ability in underperforming players such as Immobile and Belotti. The framework supports counterfactual "what-if" analysis by reallocating shots between players under identical contexts. Case studies show that Sansone would generate +2.2 xG from Berardi's chances, driven largely by high-pressure situations, while Vardy-Giroud substitutions reveal strong asymmetry: replacing Vardy with Giroud results in a large decline (about -7 xG), whereas the reverse substitution has only a small effect (about -1 xG). This work provides an uncertainty-aware tool for player evaluation, recruitment, and tactical planning, and offers a general approach for domains where individual skill and contextual factors jointly shape performance.
LGOct 2, 2025
Adaptive Heterogeneous Mixtures of Normalising Flows for Robust Variational InferenceBenjamin Wiriyapong, Oktay Karakuş, Kirill Sidorov
Normalising-flow variational inference (VI) can approximate complex posteriors, yet single-flow models often behave inconsistently across qualitatively different distributions. We propose Adaptive Mixture Flow Variational Inference (AMF-VI), a heterogeneous mixture of complementary flows (MAF, RealNVP, RBIG) trained in two stages: (i) sequential expert training of individual flows, and (ii) adaptive global weight estimation via likelihood-driven updates, without per-sample gating or architectural changes. Evaluated on six canonical posterior families of banana, X-shape, two-moons, rings, a bimodal, and a five-mode mixture, AMF-VI achieves consistently lower negative log-likelihood than each single-flow baseline and delivers stable gains in transport metrics (Wasserstein-2) and maximum mean discrepancy (MDD), indicating improved robustness across shapes and modalities. The procedure is efficient and architecture-agnostic, incurring minimal overhead relative to standard flow training, and demonstrates that adaptive mixtures of diverse flows provide a reliable route to robust VI across diverse posterior families whilst preserving each expert's inductive bias.
SPJun 10, 2025
A Multi-Modal Spatial Risk Framework for EV Charging Infrastructure Using Remote SensingOktay Karakuş, Padraig Corcoran
Electric vehicle (EV) charging infrastructure is increasingly critical to sustainable transport systems, yet its resilience under environmental and infrastructural stress remains underexplored. In this paper, we introduce RSERI-EV, a spatially explicit and multi-modal risk assessment framework that combines remote sensing data, open infrastructure datasets, and spatial graph analytics to evaluate the vulnerability of EV charging stations. RSERI-EV integrates diverse data layers, including flood risk maps, land surface temperature (LST) extremes, vegetation indices (NDVI), land use/land cover (LULC), proximity to electrical substations, and road accessibility to generate a composite Resilience Score. We apply this framework to the country of Wales EV charger dataset to demonstrate its feasibility. A spatial $k$-nearest neighbours ($k$NN) graph is constructed over the charging network to enable neighbourhood-based comparisons and graph-aware diagnostics. Our prototype highlights the value of multi-source data fusion and interpretable spatial reasoning in supporting climate-resilient, infrastructure-aware EV deployment.
SPJan 11, 2025
TopoFormer: Integrating Transformers and ConvLSTMs for Coastal Topography PredictionSantosh Munian, Oktay Karakuş, William Russell et al.
This paper presents \textit{TopoFormer}, a novel hybrid deep learning architecture that integrates transformer-based encoders with convolutional long short-term memory (ConvLSTM) layers for the precise prediction of topographic beach profiles referenced to elevation datums, with a particular focus on Mean Low Water Springs (MLWS) and Mean Low Water Neaps (MLWN). Accurate topographic estimation down to MLWS is critical for coastal management, navigation safety, and environmental monitoring. Leveraging a comprehensive dataset from the Wales Coastal Monitoring Centre (WCMC), consisting of over 2000 surveys across 36 coastal survey units, TopoFormer addresses key challenges in topographic prediction, including temporal variability and data gaps in survey measurements. The architecture uniquely combines multi-head attention mechanisms and ConvLSTM layers to capture both long-range dependencies and localized temporal patterns inherent in beach profiles data. TopoFormer's predictive performance was rigorously evaluated against state-of-the-art models, including DenseNet, 1D/2D CNNs, and LSTMs. While all models demonstrated strong performance, \textit{TopoFormer} achieved the lowest mean absolute error (MAE), as low as 2 cm, and provided superior accuracy in both in-distribution (ID) and out-of-distribution (OOD) evaluations.
IVAug 8, 2020
Representation Learning via Cauchy Convolutional Sparse CodingPerla Mayo, Oktay Karakuş, Robin Holmes et al.
In representation learning, Convolutional Sparse Coding (CSC) enables unsupervised learning of features by jointly optimising both an \(\ell_2\)-norm fidelity term and a sparsity enforcing penalty. This work investigates using a regularisation term derived from an assumed Cauchy prior for the coefficients of the feature maps of a CSC generative model. The sparsity penalty term resulting from this prior is solved via its proximal operator, which is then applied iteratively, element-wise, on the coefficients of the feature maps to optimise the CSC cost function. The performance of the proposed Iterative Cauchy Thresholding (ICT) algorithm in reconstructing natural images is compared against the common choice of \(\ell_1\)-norm optimised via soft and hard thresholding. ICT outperforms IHT and IST in most of these reconstruction experiments across various datasets, with an average PSNR of up to 11.30 and 7.04 above ISTA and IHT respectively.
IVMay 6, 2020
Detection of Line Artefacts in Lung Ultrasound Images of COVID-19 Patients via Non-Convex RegularizationOktay Karakuş, Nantheera Anantrasirichai, Amazigh Aguersif et al.
In this paper, we present a novel method for line artefacts quantification in lung ultrasound (LUS) images of COVID-19 patients. We formulate this as a non-convex regularisation problem involving a sparsity-enforcing, Cauchy-based penalty function, and the inverse Radon transform. We employ a simple local maxima detection technique in the Radon transform domain, associated with known clinical definitions of line artefacts. Despite being non-convex, the proposed technique is guaranteed to convergence through our proposed Cauchy proximal splitting (CPS) method and accurately identifies both horizontal and vertical line artefacts in LUS images. In order to reduce the number of false and missed detection, our method includes a two-stage validation mechanism, which is performed in both Radon and image domains. We evaluate the performance of the proposed method in comparison to the current state-of-the-art B-line identification method and show a considerable performance gain with 87% correctly detected B-lines in LUS images of nine COVID-19 patients. In addition, owing to its fast convergence, our proposed method is readily applicable for processing LUS image sequences.