LG · Sep 9, 2025

Large language models surpass domain-specific architectures for antepartum electronic fetal monitoring analysis

Sheng Wong, Ravi Shankar, Beth Albert, Gabriel Davis Jones

arXiv:2509.18112v24.12 citations · h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses the need for accurate fetal health assessment in medical diagnostics, showing that LLMs can surpass specialized systems, but it is incremental as it builds on existing model comparisons in a specific domain.

The study addresses automated classification of antepartum electronic fetal monitoring (cardiotocography, CTG) recordings by benchmarking over 15 models, including domain-specific models and large language models (LLMs), on over 2,500 recordings. Fine-tuned LLMs consistently outperformed the other models, except when uterine-activity signals were absent, and they did so at higher computational cost.

Foundation models (FMs) and large language models (LLMs) have demonstrated promising generalization across diverse domains for time-series analysis, yet their potential for electronic fetal monitoring (EFM) and cardiotocography (CTG) analysis remains underexplored. Most existing CTG studies have relied on domain-specific models and lack systematic comparisons with modern foundation or language models, limiting our understanding of whether these models can outperform specialized systems in fetal health assessment. In this study, we present the first comprehensive benchmark of state-of-the-art architectures for automated antepartum CTG classification. Over 2,500 20-minute recordings were used to evaluate over 15 models spanning domain-specific, time-series, foundation, and language-model categories under a unified framework. Fine-tuned LLMs consistently outperformed both foundation and domain-specific models across data-availability scenarios, except when uterine-activity signals were absent, where domain-specific models showed greater robustness. These performance gains, however, required substantially higher computational resources. Our results highlight that while fine-tuned LLMs achieved state-of-the-art performance for CTG classification, practical deployment must balance performance with computational efficiency.
