AIMar 24

CLiGNet: Clinical Label-Interaction Graph Network for Medical Specialty Classification from Clinical Transcriptions

Pronob Kumar Barman, Pronoy Kumar Barman

arXiv:2603.227529.8h-index: 1

AI Analysis

This work addresses the need for accurate medical specialty classification for routing and decision support in clinical settings, though it is incremental as it builds on prior methods with a novel graph-based approach.

The paper tackled the problem of automated classification of clinical transcriptions into medical specialties by first identifying and fixing severe data leakage in the MTSamples benchmark, then introducing CLiGNet, which achieved a macro F1 of 0.279, with the GCN label graph providing a gain of 0.066.

Automated classification of clinical transcriptions into medical specialties is essential for routing, coding, and clinical decision support, yet prior work on the widely used MTSamples benchmark suffers from severe data leakage caused by applying SMOTE oversampling before train test splitting. We first document this methodological flaw and establish a leakage free benchmark across 40 medical specialties (4966 records), revealing that the true task difficulty is substantially higher than previously reported. We then introduce CLiGNet (Clinical Label Interaction Graph Network), a neural architecture that combines a Bio ClinicalBERT text encoder with a two layer Graph Convolutional Network operating on a specialty label graph constructed from semantic similarity and ICD 10 chapter priors. Per label attention gates fuse document and label graph representations, trained with focal binary cross entropy loss to handle extreme class imbalance (181 to 1 ratio). Across seven baselines ranging from TF IDF classifiers to Clinical Longformer, CLiGNet without calibration achieves the highest macro F1 of 0.279, with an ablation study confirming that the GCN label graph provides the single largest component gain (increase of 0.066 macro F1). Adding per label Platt scaling calibration yields an expected calibration error of 0.007, demonstrating a principled trade off between ranking performance and probability reliability. We provide comprehensive failure analysis covering pairwise specialty confusions, rare class behaviour, document length effects, and token level Integrated Gradients attribution, offering actionable insights for clinical NLP system deployment.

View on arXiv PDF

Similar