LG | HC | Mar 4

Evaluating Large Language Models for Gait Classification Using Text-Encoded Kinematic Waveforms

Carlo Dindorf, Jonas Dully, Rebecca Keilhauer, Michael Lorenz, Michael Fröhlich

arXiv:2603.13317 | h-index: 10

AI Analysis

This work addresses the need for interpretable gait-analysis tools in clinical settings, but its contribution is incremental: it shows that general-purpose LLMs underperform a conventional supervised classifier.

The study evaluated whether general-purpose large language models (LLMs) can classify gait patterns from text-encoded kinematic data. The best LLM achieved a multiclass Matthews Correlation Coefficient (MCC) of 0.70, below the 0.88 reached by a supervised KNN classifier. Restricting the evaluation to the LLM's high-confidence predictions raised its MCC to 0.83 on that filtered subset.
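To make the two reported quantities concrete, the following is a minimal sketch of the standard multiclass MCC formula, plus a filter that keeps only high-confidence predictions. The function names, the 0-to-1 confidence scale, the threshold, and the toy labels are all assumptions for illustration. None of them are taken from the study.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews Correlation Coefficient from two label sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)                                   # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correctly classified
    t_counts = Counter(y_true)                        # true occurrences per class
    p_counts = Counter(y_pred)                        # predicted occurrences per class
    labels = set(t_counts) | set(p_counts)
    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    cov_true = s * s - sum(v * v for v in t_counts.values())
    cov_pred = s * s - sum(v * v for v in p_counts.values())
    denominator = sqrt(cov_true * cov_pred)
    # Convention: return 0.0 when the score is undefined (e.g. a single class).
    return numerator / denominator if denominator else 0.0


def filter_by_confidence(y_true, y_pred, confidence, threshold):
    """Keep predictions whose confidence is at least `threshold`.

    Returns the kept true labels, the kept predictions, and the coverage
    (fraction of samples retained). The confidence scale is hypothetical.
    """
    kept = [(t, p) for t, p, c in zip(y_true, y_pred, confidence) if c >= threshold]
    coverage = len(kept) / len(y_true) if y_true else 0.0
    return [t for t, _ in kept], [p for _, p in kept], coverage


if __name__ == "__main__":
    # Toy labels and confidences, invented for illustration only.
    truth = ["normal", "limp", "normal", "limp"]
    preds = ["normal", "normal", "normal", "limp"]
    conf = [0.9, 0.4, 0.8, 0.95]
    print("all predictions:", multiclass_mcc(truth, preds))
    kept_true, kept_pred, coverage = filter_by_confidence(truth, preds, conf, 0.8)
    print("high confidence:", multiclass_mcc(kept_true, kept_pred), "coverage:", coverage)
```

A score computed on a filtered subset covers fewer samples than the unfiltered score, so it is worth reporting the coverage alongside it.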

Background: Machine learning (ML) enhances gait analysis but often lacks the level of interpretability desired for clinical adoption. Large Language Models (LLMs) may offer explanatory capabilities and confidence-aware outputs when applied to structured kinematic data. This study therefore evaluated whether general-purpose LLMs can classify continuous gait kinematics when represented as textual numeric sequences and how their performance compares to conventional ML approaches.

Methods: Lower-body kinematics were recorded from 20 participants performing seven gait patterns. A supervised KNN classifier and a class-independent One-Class SVM (OCSVM) were compared against zero-shot LLMs (GPT-5, GPT-5-mini, GPT-4.1, and o4-mini). Models were evaluated using Leave-One-Subject-Out (LOSO) cross-validation. LLMs were tested both with and without explicit reference gait statistics.

Results: The supervised KNN achieved the highest performance (multiclass Matthews Correlation Coefficient, MCC = 0.88). The best-performing LLM (GPT-5) with reference grounding achieved a multiclass MCC of 0.70 and a binary MCC of 0.68, outperforming the class-independent OCSVM (binary MCC = 0.60). Performance of the LLM was highly dependent on explicit reference information and self-rated confidence; when restricted to high-confidence predictions, multiclass MCC increased to 0.83 on the filtered subset. Notably, the computationally efficient o4-mini model performed comparably to larger models.

Conclusion: When continuous kinematic waveforms were encoded as textual numeric tokens, general-purpose LLMs, even with reference grounding, did not match supervised multiclass classifiers for precise gait classification and are better regarded as exploratory systems requiring cautious, human-guided interpretation rather than diagnostic use.
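The Leave-One-Subject-Out protocol named in the methods can be sketched as follows: each fold holds out every sample from one participant and trains on the rest, so no participant appears in both sets. This is a dependency-free illustration with a simple majority-vote KNN. The feature vectors, labels, subject IDs, value of k, and function names are assumptions, not the authors' pipeline or data.

```python
from collections import Counter
from math import dist


def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training samples closest to x (Euclidean)."""
    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def loso_knn(X, y, subject_ids, k=3):
    """Return one out-of-subject prediction per sample."""
    predictions = [None] * len(X)
    for _, train, test in loso_splits(subject_ids):
        fold_X = [X[i] for i in train]
        fold_y = [y[i] for i in train]
        for i in test:
            predictions[i] = knn_predict(fold_X, fold_y, X[i], k)
    return predictions


if __name__ == "__main__":
    # Toy two-feature samples from three hypothetical subjects.
    X = [[0.0, 0.1], [1.0, 1.1], [0.1, 0.0], [1.1, 1.0], [0.05, 0.05], [1.05, 0.95]]
    y = ["normal", "altered", "normal", "altered", "normal", "altered"]
    subjects = ["s1", "s1", "s2", "s2", "s3", "s3"]
    print(loso_knn(X, y, subjects, k=1))
```

Splitting by subject rather than by sample keeps a participant's individual gait signature out of the training data for their own fold, which is the usual motivation for LOSO over a random split.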
