Marcell T. Kurbucz

LG
h-index: 13
13 papers
38 citations
Novelty: 43%
AI Score: 42

13 Papers

LG · Jun 21, 2022
BiometricBlender: Ultra-high dimensional, multi-class synthetic data generator to imitate biometric feature space

Marcell Stippinger, Dávid Hanák, Marcell T. Kurbucz et al.

The lack of freely available (real-life or synthetic) high or ultra-high dimensional, multi-class datasets may hamper the rapidly growing research on feature screening, especially in the field of biometrics, where the usage of such datasets is common. This paper reports a Python package called BiometricBlender, which is an ultra-high dimensional, multi-class synthetic data generator to benchmark a wide range of feature screening methods. During the data generation process, the user can control the overall usefulness and the intercorrelations of the blended features, so the synthetic feature space can imitate the key properties of a real biometric dataset.

LG · Apr 27, 2023
LLT: An R package for Linear Law-based Feature Space Transformation

Marcell T. Kurbucz, Péter Pósfay, Antal Jakovác

The goal of the linear law-based feature space transformation (LLT) algorithm is to assist with the classification of univariate and multivariate time series. The presented R package, called LLT, implements this algorithm in a flexible yet user-friendly way. This package first splits the instances into training and test sets. It then utilizes time-delay embedding and spectral decomposition techniques to identify the governing patterns (called linear laws) of each input sequence (initial feature) within the training set. Finally, it applies the linear laws of the training set to transform the initial features of the test set. These steps are performed by three separate functions called trainTest, trainLaw, and testTrans. Their application requires a predefined data structure; however, for fast calculation, they use only built-in functions. The LLT R package and a sample dataset with the appropriate data structure are publicly available on GitHub.
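The core idea of a linear law can be illustrated in a few lines. The following is a minimal Python sketch, not the LLT package's own code (which is written in R): it assumes a linear law is the eigenvector belonging to the smallest eigenvalue of the Gram matrix of a time-delay embedding, and the helper names `delay_embed` and `linear_law` are hypothetical.

```python
import numpy as np

def delay_embed(series, dim):
    """Stack all length-`dim` sliding windows of a 1-D series as rows."""
    series = np.asarray(series, dtype=float)
    n_rows = len(series) - dim + 1
    return np.stack([series[i:i + dim] for i in range(n_rows)])

def linear_law(series, dim):
    """Return the unit vector v that minimises ||E v|| for the embedding E.

    Assumption: the law is taken as the eigenvector of E^T E with the
    smallest eigenvalue (np.linalg.eigh returns eigenvalues in ascending order).
    """
    emb = delay_embed(series, dim)
    gram = emb.T @ emb                       # symmetric dim x dim matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    return eigvecs[:, 0], eigvals[0]

# "Training" sequence: a sinusoid obeys y[t+2] - 2*cos(w)*y[t+1] + y[t] = 0.
t = np.arange(200)
law, smallest = linear_law(np.sin(0.3 * t), dim=3)

# "Test" sequence from the same class (same frequency, different phase):
# applying the training law to its embedding gives residuals near zero.
residual = delay_embed(np.sin(0.3 * t + 1.0), 3) @ law
```

Sequences governed by the same pattern are mapped close to zero by the learned law, while sequences from another class generally are not, which is what makes the transformed features useful to a downstream classifier.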

LG · Jul 4, 2023
Learning ECG Signal Features Without Backpropagation Using Linear Laws

Péter Pósfay, Marcell T. Kurbucz, Péter Kovács et al.

This paper introduces LLT-ECG, a novel method for electrocardiogram (ECG) signal classification that leverages concepts from theoretical physics to automatically generate features from time series data. Unlike traditional deep learning approaches, LLT-ECG operates in a forward manner, eliminating the need for backpropagation and hyperparameter tuning. By identifying linear laws that capture shared patterns within specific classes, the proposed method constructs a compact and verifiable representation, enhancing the effectiveness of downstream classifiers. We demonstrate LLT-ECG's state-of-the-art performance on real-world ECG datasets from PhysioNet, underscoring its potential for medical applications where speed and verifiability are crucial.

ST · Apr 27, 2023
Predicting the Price Movement of Cryptocurrencies Using Linear Law-based Transformation

Marcell T. Kurbucz, Péter Pósfay, Antal Jakovác

The aim of this paper is to investigate the effect of a novel method called linear law-based feature space transformation (LLT) on the accuracy of intraday price movement prediction of cryptocurrencies. To do this, the 1-minute interval price data of Bitcoin, Ethereum, Binance Coin, and Ripple between 1 January 2019 and 22 October 2022 were collected from the Binance cryptocurrency exchange. Then, 14-hour nonoverlapping time windows were applied to sample the price data. The classification was based on the first 12 hours, and the two classes were determined based on whether the closing price rose or fell after the next 2 hours. These price data were first transformed with the LLT, then they were classified by traditional machine learning algorithms with 10-fold cross-validation. Based on the results, LLT greatly increased the accuracy for all cryptocurrencies, which emphasizes the potential of the LLT algorithm in predicting price movements.
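The sampling scheme described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: it assumes that the label compares the last price of the 14-hour window with the last price of the 12-hour feature segment, and the function name `make_windows` and its parameters are hypothetical.

```python
def make_windows(prices, feature_len=720, horizon=120):
    """Cut 1-minute prices into nonoverlapping windows of feature_len + horizon.

    Each sample is (features, label): the first `feature_len` prices, and
    label 1 if the price at the end of the window is higher than the price
    at the end of the feature segment, else 0.
    """
    window = feature_len + horizon
    samples = []
    for start in range(0, len(prices) - window + 1, window):
        chunk = prices[start:start + window]
        features = chunk[:feature_len]
        label = 1 if chunk[-1] > features[-1] else 0
        samples.append((features, label))
    return samples

# Tiny example with shortened windows (6 feature steps, 4-step horizon).
toy_prices = list(range(10)) + list(range(10, 0, -1))
toy_samples = make_windows(toy_prices, feature_len=6, horizon=4)
```

With the defaults, 720 + 120 minutes correspond to the 12-hour feature segment and the 2-hour prediction horizon; any trailing prices that do not fill a complete window are dropped.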

ME · May 12
When to Trust Confidence Thresholding: Calibration Diagnostics for Pseudo-Labelled Regression

Marcell T. Kurbucz

Calibrated probability outputs of trained classifiers are increasingly used as inputs to downstream regression estimands such as effects, prevalences, or disparities for a latent group observed only on a small labelled subset. A standard practice is to threshold the calibrated score at a confidence cutoff and treat the hard label as the truth. Building on a recent identification result for the underlying moment equation, we develop a calibration-aware diagnostic apparatus for pseudo-labelling pipelines. We derive a closed-form expression for the attenuation bias that confidence thresholding induces in the downstream regression coefficient, and show that the bias can be predicted, before any inference is run, from the residual score variance $V^{*}=\mathbb{E}[\operatorname{Var}(p\mid X)]$ on the unlabelled set after partialling out the downstream controls $X$. We further obtain a sharp sensitivity bound under bounded calibration drift, and identify the boundary $V^{*}=0$, which holds iff $p$ is a deterministic function of $X$; this motivates a structural separation between classifier features $W$ and downstream controls $X\subsetneq W$. Five controlled simulations and a UCI Adult illustration trace the predictions. The contribution is operational: a $(V^{*}, \kappa)$ decision rule that practitioners can compute from any classifier output to decide whether confidence thresholding is safe.
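To make the quantity $V^{*}=\mathbb{E}[\operatorname{Var}(p\mid X)]$ concrete, the sketch below computes a plug-in estimate for the special case of discrete controls, where the conditional variance is simply the within-group variance of the scores. This is an assumption made for illustration; the paper's own estimator (which partials out $X$) may differ, and the function name `residual_score_variance` is hypothetical.

```python
from collections import defaultdict
from statistics import pvariance

def residual_score_variance(scores, controls):
    """Plug-in estimate of E[Var(p | X)] for discrete controls X.

    Scores are grouped by their control value; each group's population
    variance is weighted by the group's share of the sample.
    """
    groups = defaultdict(list)
    for p, x in zip(scores, controls):
        groups[x].append(p)
    n = len(scores)
    return sum(len(g) / n * pvariance(g) for g in groups.values())

# Scores vary within group "a" but not within group "b".
v_star = residual_score_variance([0.2, 0.8, 0.5, 0.5], ["a", "a", "b", "b"])

# If the score is a deterministic function of X, the estimate is zero.
v_zero = residual_score_variance([0.3, 0.3, 0.9, 0.9], ["a", "a", "b", "b"])
```

The second call illustrates the boundary case named in the abstract: when $p$ carries no information beyond $X$, the residual score variance vanishes.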

LG · Apr 17, 2025 · Code
ALT: A Python Package for Lightweight Feature Representation in Time Series Classification

Balázs P. Halmos, Balázs Hajós, Vince Á. Molnár et al.

We introduce ALT, an open-source Python package created for efficient and accurate time series classification (TSC). The package implements the adaptive law-based transformation (ALT) algorithm, which transforms raw time series data into a linearly separable feature space using variable-length shifted time windows. This adaptive approach enhances its predecessor, the linear law-based transformation (LLT), by effectively capturing patterns of varying temporal scales. The software is implemented for scalability, interpretability, and ease of use, achieving state-of-the-art performance with minimal computational overhead. Extensive benchmarking on real-world datasets demonstrates the utility of ALT for diverse TSC tasks in physics and related domains.

LG · Jan 16, 2025
Adaptive Law-Based Transformation (ALT): A Lightweight Feature Representation for Time Series Classification

Marcell T. Kurbucz, Balázs Hajós, Balázs P. Halmos et al.

Time series classification (TSC) is fundamental in numerous domains, including finance, healthcare, and environmental monitoring. However, traditional TSC methods often struggle with the inherent complexity and variability of time series data. Building on our previous work with the linear law-based transformation (LLT), which improved classification accuracy by transforming the feature space based on key data patterns, we introduce adaptive law-based transformation (ALT). ALT enhances LLT by incorporating variable-length shifted time windows, enabling it to capture distinguishing patterns of various lengths and thereby handle complex time series more effectively. By mapping features into a linearly separable space, ALT provides a fast, robust, and transparent solution that achieves state-of-the-art performance with only a few hyperparameters.
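The notion of variable-length shifted time windows can be sketched as below. This only illustrates the windowing step under assumed parameters; it is not the ALT package's API, and the function name `shifted_windows` and the arguments `lengths` and `shift` are hypothetical.

```python
def shifted_windows(series, lengths, shift):
    """Extract windows of several lengths, each moved along the series by `shift`.

    Returns a dict mapping each window length to the list of windows
    of that length, in order of their starting position.
    """
    windows = {}
    for length in lengths:
        starts = range(0, len(series) - length + 1, shift)
        windows[length] = [series[s:s + length] for s in starts]
    return windows

# Short windows can pick up brief patterns, longer ones slower patterns.
toy_windows = shifted_windows(list(range(6)), lengths=(2, 4), shift=2)
```

In a full pipeline, each extracted window would then be passed to a law-finding step, so that patterns at several temporal scales contribute features.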

ME · Oct 25, 2024
Unified Causality Analysis Based on the Degrees of Freedom

András Telcs, Marcell T. Kurbucz, Antal Jakovác

Temporally evolving systems are typically modeled by dynamic equations. A key challenge in accurate modeling is understanding the causal relationships between subsystems, as well as identifying the presence and influence of unobserved hidden drivers on the observed dynamics. This paper presents a unified method capable of identifying fundamental causal relationships between pairs of systems, whether deterministic or stochastic. Notably, the method also uncovers hidden common causes beyond the observed variables. By analyzing the degrees of freedom in the system, our approach provides a more comprehensive understanding of both causal influence and hidden confounders. This unified framework is validated through theoretical models and simulations, demonstrating its robustness and potential for broader application.

LG · Mar 12
Adaptive Conditional Forest Sampling for Spectral Risk Optimisation under Decision-Dependent Uncertainty

Marcell T. Kurbucz

Minimising a spectral risk objective, defined as a convex combination of expected cost and Conditional Value-at-Risk (CVaR), is challenging when the uncertainty distribution is decision-dependent, making both surrogate modelling and simulation-based ranking sensitive to tail estimation error. We propose Adaptive Conditional Forest Sampling (ACFS), a four-phase simulation-optimisation framework that integrates Generalised Random Forests for decision-conditional distribution approximation, CEM-guided global exploration, rank-weighted focused augmentation, and surrogate-to-oracle two-stage reranking before multi-start gradient-based refinement. We evaluate ACFS on two structurally distinct data-generating processes: a decision-dependent Student-t copula and a Gaussian copula with log-normal marginals, across three penalty-weight configurations and 100 replications per setting. ACFS achieves the lowest median oracle spectral risk on the second benchmark in every configuration, with median gaps over GP-BO ranging from 6.0% to 20.0%. On the first benchmark, ACFS and GP-BO are statistically indistinguishable in median objective, but ACFS reduces cross-replication dispersion by approximately 1.8 to 1.9 times on the first benchmark and 1.7 to 2.0 times on the second, indicating materially improved run-to-run reliability. ACFS also outperforms CEM-SO, SGD-CVaR, and KDE-SO in nearly all settings, while ablation and sensitivity analyses support the contribution and robustness of the proposed design.
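The objective itself is easy to state on a sample of simulated costs. The sketch below is a simple empirical estimator, not the ACFS framework: it assumes CVaR is estimated as the average of the worst (1 - alpha) share of costs, and the names `cvar`, `spectral_risk`, `alpha`, and `weight` are hypothetical.

```python
import math

def cvar(costs, alpha):
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of costs."""
    ordered = sorted(costs, reverse=True)
    # Round before ceil to guard against floating-point noise in the product.
    k = max(1, math.ceil(round((1 - alpha) * len(ordered), 9)))
    return sum(ordered[:k]) / k

def spectral_risk(costs, alpha, weight):
    """Convex combination (1 - weight) * mean + weight * CVaR_alpha."""
    mean = sum(costs) / len(costs)
    return (1 - weight) * mean + weight * cvar(costs, alpha)

sample_costs = list(range(1, 11))          # costs 1, 2, ..., 10
tail = cvar(sample_costs, alpha=0.8)       # mean of the two largest costs
risk = spectral_risk(sample_costs, alpha=0.8, weight=0.5)
```

Because the tail average uses only a small share of the sample, its estimate is noisy, which is the sensitivity to tail estimation error that the abstract refers to.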

LG · May 21, 2025
SplitWise Regression: Stepwise Modeling with Adaptive Dummy Encoding

Marcell T. Kurbucz, Nikolaos Tzivanakis, Nilufer Sari Aslam et al.

Capturing nonlinear relationships without sacrificing interpretability remains a persistent challenge in regression modeling. We introduce SplitWise, a novel framework that enhances stepwise regression. It adaptively transforms numeric predictors into threshold-based binary features using shallow decision trees, but only when such transformations improve model fit, as assessed by the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC). This approach preserves the transparency of linear models while flexibly capturing nonlinear effects. Implemented as a user-friendly R package, SplitWise is evaluated on both synthetic and real-world datasets. The results show that it consistently produces more parsimonious and generalizable models than traditional stepwise and penalized regression techniques.
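The decision of whether to dummy-encode a predictor can be sketched for a single variable. This is a simplified Python illustration, not the SplitWise R package: it assumes a depth-one split chosen by least squares, a Gaussian AIC of the form n*log(RSS/n) + 2k, and the same parameter count for both encodings. All function names are hypothetical, and `x` is assumed to contain at least two distinct values.

```python
import math

def ols_rss(x, y):
    """Residual sum of squares of the fit y ~ intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    return sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))

def gaussian_aic(rss, n, n_params):
    """AIC up to an additive constant; rss is floored to avoid log(0)."""
    return n * math.log(max(rss, 1e-12) / n) + 2 * n_params

def best_threshold(x, y):
    """Depth-one split: the midpoint threshold whose dummy gives the lowest RSS."""
    values = sorted(set(x))
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        dummy = [1.0 if a > t else 0.0 for a in x]
        rss = ols_rss(dummy, y)
        if best is None or rss < best[1]:
            best = (t, rss)
    return best

def choose_encoding(x, y):
    """Keep x linear unless the threshold dummy yields a lower AIC."""
    n = len(x)
    aic_linear = gaussian_aic(ols_rss(x, y), n, 2)
    threshold, rss_dummy = best_threshold(x, y)
    aic_dummy = gaussian_aic(rss_dummy, n, 2)
    if aic_dummy < aic_linear:
        return "dummy", threshold
    return "linear", None

x = [float(i) for i in range(10)]
step_y = [0.0] * 5 + [1.0] * 5                       # jump between 4 and 5
line_y = [2.0 * a + (0.1 if i % 2 else -0.1) for i, a in enumerate(x)]
```

Calling `choose_encoding(x, step_y)` selects the dummy at the jump, while `choose_encoding(x, line_y)` keeps the linear term, mirroring the rule that a transformation is applied only when it improves model fit.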

LG · May 25, 2023
Feature space reduction method for ultrahigh-dimensional, multiclass data: Random forest-based multiround screening (RFMS)

Gergely Hanczár, Marcell Stippinger, Dávid Hanák et al.

In recent years, numerous screening methods have been published for ultrahigh-dimensional data that contain hundreds of thousands of features; however, most of these methods cannot handle data with thousands of classes. Prediction models built to authenticate users based on multichannel biometric data give rise to this type of problem. In this study, we present a novel method known as random forest-based multiround screening (RFMS) that can be effectively applied under such circumstances. The proposed algorithm divides the feature space into small subsets and executes a series of partial model builds. These partial models are used to implement tournament-based sorting and the selection of features based on their importance. To benchmark RFMS, a synthetic biometric feature space generator known as BiometricBlender is employed. Based on the results, RFMS is on par with industry-standard feature screening methods while simultaneously possessing many advantages over these methods.

LG · Feb 23, 2022
Reconstruction of observed mechanical motions with Artificial Intelligence tools

Antal Jakovác, Marcell T. Kurbucz, Péter Pósfay

The goal of this paper is to determine the laws of observed trajectories, assuming that there is a mechanical system in the background, and to use these laws to continue the observed motion in a plausible way. The laws are represented by neural networks with a limited number of parameters. The training of the networks follows the Extreme Learning Machine idea. We determine laws for different levels of embedding; thus, we can represent not only the equation of motion but also symmetries of different kinds. In the recursive numerical evolution of the system, we require the fulfillment of all the observed laws, within the determined numerical precision. In this way, we can successfully reconstruct both integrable and chaotic motions, as we demonstrate in the example of the gravity pendulum and the double pendulum.
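The Extreme Learning Machine idea mentioned above can be sketched generically: the hidden layer is drawn at random and left fixed, and only the output weights are obtained by least squares. The code below is a standard ELM regression example under assumed settings (tanh activation, 50 hidden units), not the paper's law-finding networks; `fit_elm` and `predict_elm` are hypothetical names.

```python
import numpy as np

def fit_elm(x, y, n_hidden=50, seed=0):
    """Fit an ELM: random fixed hidden layer, least-squares output weights."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(x.shape[1], n_hidden))   # random input weights, never trained
    b = rng.normal(size=n_hidden)                 # random biases, never trained
    hidden = np.tanh(x @ w + b)
    beta, *_ = np.linalg.lstsq(hidden, y, rcond=None)
    return w, b, beta

def predict_elm(x, params):
    """Apply the fixed hidden layer and the fitted output weights."""
    w, b, beta = params
    return np.tanh(x @ w + b) @ beta

# Fit a smooth one-dimensional target as a toy example.
x_train = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
y_train = np.sin(3.0 * x_train[:, 0])
params = fit_elm(x_train, y_train)
train_mse = np.mean((predict_elm(x_train, params) - y_train) ** 2)
```

Since training reduces to a single linear least-squares solve, no iterative gradient-based optimisation is needed, which keeps the number of trained parameters small.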

ST · Jan 24, 2022
Linear Laws of Markov Chains with an Application for Anomaly Detection in Bitcoin Prices

Marcell T. Kurbucz, Péter Pósfay, Antal Jakovác

The goals of this paper are twofold: (1) to present a new method that is able to find linear laws governing the time evolution of Markov chains and (2) to apply this method for anomaly detection in Bitcoin prices. To accomplish these goals, first, the linear laws of Markov chains are derived by using the time embedding of their (categorical) autocorrelation function. Then, a binary series is generated from the first difference of the Bitcoin exchange rate (against the United States Dollar). Finally, the minimum number of parameters describing the linear laws of this series is identified through stepped time windows. Based on the results, linear laws typically became more complex (containing an additional third parameter that indicates hidden Markov property) in two periods: before the crash of cryptocurrency markets induced by the COVID-19 pandemic (12 March 2020), and before the record-breaking surge in the price of Bitcoin (Q4 2020 - Q1 2021). In addition, the locally high values of this third parameter are often related to short-term price peaks, which suggests price manipulation.