Florian von Wangenheim

HC
h-index: 29
5 papers
2 citations
Novelty: 55%
AI Score: 48

5 Papers

IR · Jul 5, 2024
EventChat: Implementation and user-centric evaluation of a large language model-driven conversational recommender system for exploring leisure events in an SME context

Hannes Kunstmann, Joseph Ollier, Joel Persson et al.

Large language models (LLMs) represent an enormous evolution in the strategic potential of conversational recommender systems (CRS). Yet to date, research has predominantly focused on technical frameworks for implementing LLM-driven CRS rather than on end-user evaluations or strategic implications for firms, particularly from the perspective of the small and medium-sized enterprises (SMEs) that make up the bedrock of the global economy. In the current paper, we detail the design of an LLM-driven CRS in an SME setting and its subsequent performance in the field, using both objective system metrics and subjective user evaluations. While doing so, we additionally outline a short-form revised ResQue model for evaluating LLM-driven CRS, enabling replicability in a rapidly evolving field. Our results reveal good system performance from a user experience perspective (85.5% recommendation accuracy) but underscore latency, cost, and quality issues that challenge business viability. Notably, with a median cost of $0.04 per interaction and a latency of 5.7 s, cost-effectiveness and response time emerge as crucial areas for achieving a more user-friendly and economically viable LLM-driven CRS for SME settings. One major driver of these costs is the use of an advanced LLM as a ranker within the retrieval-augmented generation (RAG) technique. Our results additionally indicate that relying solely on approaches such as prompt-based learning, with ChatGPT as the underlying LLM, makes it challenging to achieve satisfactory quality in a production environment. Strategic considerations for SMEs deploying an LLM-driven CRS are outlined, particularly considering trade-offs in the current technical landscape.
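The retrieve-then-rank pattern the abstract refers to can be illustrated with a minimal sketch. This is not the EventChat implementation: the keyword-overlap retriever, the `cheapest_first` ranker, and the event fields are all hypothetical stand-ins. In a real system the ranker would be an LLM call, which is plausibly where per-interaction cost and latency concentrate.

```python
def retrieve(query, events, k=3):
    """Cheap first stage: keep the k events sharing the most words with the query."""
    query_words = set(query.lower().split())
    scored = []
    for event in events:
        overlap = len(query_words & set(event["description"].lower().split()))
        if overlap > 0:
            scored.append((overlap, event))
    # Stable sort: ties keep their original catalogue order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [event for _, event in scored[:k]]


def recommend(query, events, ranker, k=3, top_n=2):
    """Second stage: hand only the retrieved candidates to the (expensive) ranker."""
    candidates = retrieve(query, events, k)
    return ranker(query, candidates)[:top_n]


def cheapest_first(query, candidates):
    """Hypothetical stand-in for an LLM ranker: order candidates by price."""
    return sorted(candidates, key=lambda event: event["price"])


# Hypothetical event catalogue, for illustration only.
events = [
    {"name": "Jazz Night", "description": "live jazz music concert", "price": 30},
    {"name": "Open Air Concert", "description": "outdoor rock music concert", "price": 15},
    {"name": "Pottery Class", "description": "hands-on pottery workshop", "price": 40},
]

picks = recommend("music concert tonight", events, ranker=cheapest_first)
names = [event["name"] for event in picks]
```

Because the ranker only ever sees `k` candidates, swapping in a cheaper or more expensive ranking function changes cost without touching the retrieval stage.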

ME · Feb 23
Detecting and Mitigating Group Bias in Heterogeneous Treatment Effects

Joel Persson, Jurriën Bakker, Dennis Bohle et al.

Heterogeneous treatment effects (HTEs) are increasingly estimated using machine learning models that produce highly personalized predictions of treatment effects. In practice, however, predicted treatment effects are rarely interpreted, reported, or audited at the individual level but, instead, are often aggregated to broader subgroups, such as demographic segments, risk strata, or markets. We show that such aggregation can induce systematic bias of the group-level causal effect: even when models for predicting the individual-level conditional average treatment effect (CATE) are correctly specified and trained on data from randomized experiments, aggregating the predicted CATEs up to the group level does not, in general, recover the corresponding group average treatment effect (GATE). We develop a unified statistical framework to detect and mitigate this form of group bias in randomized experiments. We first define group bias as the discrepancy between the model-implied and experimentally identified GATEs, derive an asymptotically normal estimator, and then provide a simple-to-implement statistical test. For mitigation, we propose a shrinkage-based bias-correction, and show that the theoretically optimal and empirically feasible solutions have closed-form expressions. The framework is fully general, imposes minimal assumptions, and only requires computing sample moments. We analyze the economic implications of mitigating detected group bias for profit-maximizing personalized targeting, thereby characterizing when bias correction alters targeting decisions and profits, and the trade-offs involved. Applications to large-scale experimental data at major digital platforms validate our theoretical results and demonstrate empirical performance.
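The group-bias definition in the abstract (model-implied GATE minus experimentally identified GATE) can be sketched as a plug-in computation. This is a minimal illustration under stated assumptions, not the authors' estimator: the record layout, the field names, and the use of a simple difference in means as the experimental GATE are assumptions, and the paper's test statistic and shrinkage correction are not reproduced here.

```python
from statistics import mean


def group_bias(records):
    """Per group: mean predicted CATE minus the difference-in-means GATE.

    Each record is a dict with the (assumed) keys
    'group', 'treated' (0 or 1), 'outcome', and 'cate_hat'.
    """
    by_group = {}
    for record in records:
        by_group.setdefault(record["group"], []).append(record)

    report = {}
    for group, rows in by_group.items():
        # Model-implied GATE: average the individual-level predictions.
        model_gate = mean(row["cate_hat"] for row in rows)
        # Experimental GATE: valid as a difference in means under randomization.
        treated = [row["outcome"] for row in rows if row["treated"] == 1]
        control = [row["outcome"] for row in rows if row["treated"] == 0]
        experimental_gate = mean(treated) - mean(control)
        report[group] = {
            "model_gate": model_gate,
            "experimental_gate": experimental_gate,
            "bias": model_gate - experimental_gate,
        }
    return report


# Synthetic data for illustration only.
rows = [
    {"group": "a", "treated": 1, "outcome": 3.0, "cate_hat": 2.0},
    {"group": "a", "treated": 1, "outcome": 5.0, "cate_hat": 2.0},
    {"group": "a", "treated": 0, "outcome": 1.0, "cate_hat": 2.0},
    {"group": "a", "treated": 0, "outcome": 1.0, "cate_hat": 2.0},
]
report = group_bias(rows)
```

In this toy group the model implies an effect of 2.0 while the experiment identifies 3.0, so the aggregated predictions understate the group effect by 1.0.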

HC · May 22
Detecting Drunk Driving Using Off-the-Shelf Smartwatches

Robin Deuber, Lanlan Yang, Michal Bechny et al.

Alcohol-impaired driving remains a major yet preventable cause of road traffic injury and death, with many drivers underestimating their level of intoxication. Compared to in-vehicle systems, mobile drunk-driving detection using consumer smartwatches offers a scalable way to trigger preventive interventions and increase awareness without additional in-vehicle hardware. We introduce a system that leverages wrist accelerometer data and heart rate variability-derived physiological signals to detect alcohol-related driving impairment. We collected data in a randomized, controlled three-arm test-track study (n=54) and trained both logistic regression models with window-aggregated features and a two-tower 1D convolutional neural network (CNN) to detect alcohol-impaired driving. The CNN achieved a participant-averaged area under the receiver operating characteristic curve (AUROC) of 0.88 for detecting any alcohol intoxication and 0.86 for detecting driving above the WHO-recommended limit of 0.05 g/dL. To the best of our knowledge, this is the first work to (1) demonstrate drunk-driving detection using consumer smartwatches, (2) develop and evaluate such a system in a real vehicle on a closed test track, and (3) rigorously assess generalization to unseen participants. Together, these findings highlight the potential of wearable-based sensing to support scalable, measurement-driven prevention of alcohol-related traffic harm.
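One plausible reading of "participant-averaged AUROC" is to compute the AUROC separately for each held-out participant and then average those values, so that participants with many windows do not dominate. The sketch below assumes that reading; the data layout and the function names are hypothetical, and the AUROC is computed through its pairwise-ranking interpretation rather than any specific library.

```python
from statistics import mean


def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def participant_averaged_auroc(per_participant):
    """Average per-participant AUROCs; input maps participant id -> (labels, scores)."""
    return mean(auroc(labels, scores) for labels, scores in per_participant.values())


# Synthetic predictions for two hypothetical participants.
predictions = {
    "p1": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "p2": ([0, 1], [0.2, 0.9]),
}
averaged = participant_averaged_auroc(predictions)
```

Here participant `p1` scores 0.75 and `p2` scores 1.0, so the participant-averaged value is 0.875 even though `p1` contributes twice as many windows.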

HC · Mar 7
Pre-Clinical Latency Characterization of VRxBioRelax: A Real-Time EMG Biofeedback System for Muscle Relaxation in Virtual Reality

Melanie Baumgartner, Raphael Weibel, Tobias Hoesli et al.

Chronic tension in the upper trapezius (UT), often caused by poor ergonomics, prolonged posture, or psychological stress, contributes to musculoskeletal discomfort, headaches, and impaired interoceptive awareness. Although surface electromyography (sEMG) biofeedback can promote UT relaxation, traditional systems using conventional displays often fail to sustain engagement. Virtual reality (VR) offers a more immersive alternative, provided that latency remains below perceptual thresholds. We introduce VRxBioRelax, a closed-loop VR biofeedback system that streams sEMG data from Delsys Trigno Avanti sensors via MQTT to a Unity scene. Muscle activation drives a dynamic dawn-to-dusk landscape synchronized with a progressive muscle relaxation protocol. To validate system responsiveness, 87,716 EMG samples from the NinaPro DB2 dataset were replayed at $\sim$75 Hz. Timestamps at four key stages, namely acquisition, root mean square (RMS) processing, network receipt, and rendering, revealed mean latencies of 0.50 ms (processing), 5.62 ms (network), and 19.22 ms (rendering), yielding an average end-to-end delay of 25.34 ms. Notably, 99.3% of frames arrived within 50 ms. One-sided t-tests confirmed mean latency was significantly lower than both the 30 ms VR comfort limit ($t_{87\,715}=-25.2$, $p=5.9{\times}10^{-140}$) and the 50 ms clinical benchmark ($t_{87\,715}=-133.3$, $p<10^{-300}$). These findings support VRxBioRelax for use in remote interoceptive training, stress reduction, and telepresence-enabled rehabilitation.
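The latency breakdown described above can be sketched as follows: stage latencies are differences between consecutive timestamps, and the one-sided test compares the mean end-to-end delay against a threshold. The timestamp layout and the synthetic values are assumptions for illustration, not the study's data. Only the t statistic is computed here; a p-value would additionally require a Student's t distribution function, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, stdev


def stage_latencies(stamps):
    """Split one sample's four timestamps (ms) into per-stage latencies."""
    acquired, processed, received, rendered = stamps
    return {
        "processing": processed - acquired,
        "network": received - processed,
        "rendering": rendered - received,
        "end_to_end": rendered - acquired,
    }


def t_statistic(values, threshold):
    """One-sample t statistic; strongly negative values support mean < threshold."""
    n = len(values)
    return (mean(values) - threshold) / (stdev(values) / sqrt(n))


def fraction_within(values, limit):
    """Share of samples whose latency does not exceed the limit."""
    return sum(1 for v in values if v <= limit) / len(values)


# Synthetic timestamps in ms: (acquisition, RMS processing, network receipt, rendering).
samples = [
    (0, 1, 6, 24),
    (100, 101, 106, 125),
    (200, 200, 206, 226),
    (300, 301, 307, 325),
]
end_to_end = [stage_latencies(s)["end_to_end"] for s in samples]
t_vs_30ms = t_statistic(end_to_end, 30)
share_under_50ms = fraction_within(end_to_end, 50)
```

The three stage latencies always sum to the end-to-end delay, which matches the abstract's figures: 0.50 ms + 5.62 ms + 19.22 ms = 25.34 ms.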

LG · Oct 3, 2025
RAxSS: Retrieval-Augmented Sparse Sampling for Explainable Variable-Length Medical Time Series Classification

Aydin Javadov, Samir Garibov, Tobias Hoesli et al.

Medical time series analysis is challenging due to data sparsity, noise, and highly variable recording lengths. Prior work has shown that stochastic sparse sampling effectively handles variable-length signals, while retrieval-augmented approaches improve explainability and robustness to noise and weak temporal correlations. In this study, we generalize the stochastic sparse sampling framework for retrieval-informed classification. Specifically, we weight window predictions by within-channel similarity and aggregate them in probability space, yielding convex series-level scores and an explicit evidence trail for explainability. Our method achieves competitive intracranial EEG (iEEG) classification performance and provides practitioners with greater transparency and explainability. We evaluate our method on iEEG recordings collected at four medical centers, demonstrating its potential for reliable and explainable clinical variable-length time series classification.
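The aggregation step described above, weighting window predictions by similarity and combining them in probability space, can be sketched as a convex combination. The softmax used to turn similarities into weights, the `temperature` parameter, and the example numbers are assumptions for illustration; the abstract does not specify how the weights are normalized.

```python
from math import exp


def softmax(values, temperature=1.0):
    """Turn raw similarities into non-negative weights that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_windows(window_probs, similarities, temperature=1.0):
    """Convex combination of per-window class probabilities.

    Returns the series-level probabilities and the per-window weights,
    which double as a simple evidence trail (which windows mattered most).
    """
    weights = softmax(similarities, temperature)
    n_classes = len(window_probs[0])
    series_probs = [
        sum(w * probs[c] for w, probs in zip(weights, window_probs))
        for c in range(n_classes)
    ]
    return series_probs, weights


# Synthetic example: three windows, two classes, hypothetical similarity scores.
window_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
similarities = [2.0, 0.0, 1.0]
series_probs, weights = aggregate_windows(window_probs, similarities)
```

Because the weights are non-negative and sum to 1, the series-level score for each class stays between the smallest and largest window probability for that class, and the output is itself a valid probability vector.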