Pavithra Rajendran

31.0CLApr 5Code

A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models

Avish Vijayaraghavan, Jaskaran Singh Kawatra, Sebin Sabu et al.

Electronic Patient Record (EPR) systems contain valuable clinical information, but much of it is trapped in unstructured text, limiting its use for research and decision-making. Large language models can extract such information but require substantial computational resources to run locally, and sending sensitive clinical data to cloud-based services, even when deidentified, raises significant patient privacy concerns. In this study, we develop a resource-efficient semi-automated annotation workflow using small language models (SLMs) to extract structured information from unstructured EPR data, focusing on paediatric histopathology reports. As a proof-of-concept, we apply the workflow to paediatric renal biopsy reports, a domain chosen for its constrained diagnostic scope and well-defined underlying biology. We develop the workflow iteratively with clinical oversight across three meetings, manually annotating 400 reports from a dataset of 2,111 at Great Ormond Street Hospital as a gold standard, while developing an automated information extraction approach using SLMs. We frame extraction as a Question-Answering task grounded by clinician-guided entity guidelines and few-shot examples, evaluating five instruction-tuned SLMs with a disagreement modelling framework to prioritise reports for clinical review. Gemma 2 2B achieves the highest accuracy at 84.3%, outperforming off-the-shelf models including spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improved performance by 7-19% over the zero-shot baseline, and few-shot examples by 6-38%, though their benefits do not compound when combined. These results demonstrate that SLMs can extract structured information from specialised clinical domains on CPU-only infrastructure with minimal clinician involvement. Our code is available at https://github.com/gosh-dre/nlp_renal_biopsy.

CLFeb 26, 2021

A Meta-embedding-based Ensemble Approach for ICD Coding Prediction

Pavithra Rajendran, Alexandros Zenonos, Josh Spear et al.

International Classification of Diseases (ICD) are the de facto codes used globally for clinical coding. These codes enable healthcare providers to claim reimbursement and facilitate efficient storage and retrieval of diagnostic information. The problem of automatically assigning ICD codes has been approached in literature as a multilabel classification, using neural models on unstructured data. Our proposed approach enhances the performance of neural models by effectively training word vectors using routine medical data as well as external knowledge from scientific articles. Furthermore, we exploit the geometric properties of the two sets of word vectors and combine them into a common dimensional space, using meta-embedding techniques. We demonstrate the efficacy of this approach for a multimodal setting, using unstructured and structured information. We empirically show that our approach improves the current state-of-the-art deep learning architectures and benefits ensemble models.

Pavithra Rajendran

2 Papers