Richard Dobson

CL
h-index16
17papers
2,309citations
Novelty35%
AI Score31

17 Papers

CLOct 5, 2023Code
Validating transformers for redaction of text from electronic health records in real-world healthcare

Zeljko Kraljevic, Anthony Shek, Joshua Au Yeung et al.

Protecting patient privacy in healthcare records is a top priority, and redaction is a commonly used method for obscuring directly identifiable information in text. Rule-based methods have been widely used, but their precision is often low causing over-redaction of text and frequently not being adaptable enough for non-standardised or unconventional structures of personal health information. Deep learning techniques have emerged as a promising solution, but implementing them in real-world environments poses challenges due to the differences in patient record structure and language across different departments, hospitals, and countries. In this study, we present AnonCAT, a transformer-based model and a blueprint on how deidentification models can be deployed in real-world healthcare. AnonCAT was trained through a process involving manually annotated redactions of real-world documents from three UK hospitals with different electronic health record systems and 3116 documents. The model achieved high performance in all three hospitals with a Recall of 0.99, 0.99 and 0.96. Our findings demonstrate the potential of deep learning techniques for improving the efficiency and accuracy of redaction in global healthcare data and highlight the importance of building workflows which not just use these models but are also able to continually fine-tune and audit the performance of these algorithms to ensure continuing effectiveness in real-world settings. This approach provides a blueprint for the real-world use of de-identifying algorithms through fine-tuning and localisation, the code together with tutorials is available on GitHub (https://github.com/CogStack/MedCAT).

CLNov 14, 2022
Discharge Summary Hospital Course Summarisation of In Patient Electronic Health Record Text with Clinical Concept Guided Deep Pre-Trained Transformer Models

Thomas Searle, Zina Ibrahim, James Teo et al.

Brief Hospital Course (BHC) summaries are succinct summaries of an entire hospital encounter, embedded within discharge summaries, written by senior clinicians responsible for the overall care of a patient. Methods to automatically produce summaries from inpatient documentation would be invaluable in reducing clinician manual burden of summarising documents under high time-pressure to admit and discharge patients. Automatically producing these summaries from the inpatient course, is a complex, multi-document summarisation task, as source notes are written from various perspectives (e.g. nursing, doctor, radiology), during the course of the hospitalisation. We demonstrate a range of methods for BHC summarisation demonstrating the performance of deep learning summarisation models across extractive and abstractive summarisation scenarios. We also test a novel ensemble extractive and abstractive summarisation model that incorporates a medical concept ontology (SNOMED) as a clinical guidance signal and shows superior performance in 2 real-world clinical data sets.

CLAug 30, 2024
Improving Extraction of Clinical Event Contextual Properties from Electronic Health Records: A Comparative Study

Shubham Agarwal, Thomas Searle, Mart Ratas et al.

Electronic Health Records are large repositories of valuable clinical data, with a significant portion stored in unstructured text format. This textual data includes clinical events (e.g., disorders, symptoms, findings, medications and procedures) in context that if extracted accurately at scale can unlock valuable downstream applications such as disease prediction. Using an existing Named Entity Recognition and Linking methodology, MedCAT, these identified concepts need to be further classified (contextualised) for their relevance to the patient, and their temporal and negated status for example, to be useful downstream. This study performs a comparative analysis of various natural language models for medical text classification. Extensive experimentation reveals the effectiveness of transformer-based language models, particularly BERT. When combined with class imbalance mitigation techniques, BERT outperforms Bi-LSTM models by up to 28% and the baseline BERT model by up to 16% for recall of the minority classes. The method has been implemented as part of CogStack/MedCAT framework and made available to the community for further research.

LGJul 11, 2024
How Deep is your Guess? A Fresh Perspective on Deep Learning for Medical Time-Series Imputation

Linglong Qian, Tao Wang, Jun Wang et al.

We present a comprehensive analysis of deep learning approaches for Electronic Health Record (EHR) time-series imputation, examining how architectural and framework biases combine to influence model performance. Our investigation reveals varying capabilities of deep imputers in capturing complex spatiotemporal dependencies within EHRs, and that model effectiveness depends on how its combined biases align with medical time-series characteristics. Our experimental evaluation challenges common assumptions about model complexity, demonstrating that larger models do not necessarily improve performance. Rather, carefully designed architectures can better capture the complex patterns inherent in clinical data. The study highlights the need for imputation approaches that prioritise clinically meaningful data reconstruction over statistical accuracy. Our experiments show imputation performance variations of up to 20\% based on preprocessing and implementation choices, emphasising the need for standardised benchmarking methodologies. Finally, we identify critical gaps between current deep imputation methods and medical requirements, highlighting the importance of integrating clinical insights to achieve more reliable imputation approaches for healthcare applications.

CLDec 18, 2019Code
MedCAT -- Medical Concept Annotation Tool

Zeljko Kraljevic, Daniel Bean, Aurelie Mascio et al.

Biomedical documents such as Electronic Health Records (EHRs) contain a large amount of information in an unstructured format. The data in EHRs is a hugely valuable resource documenting clinical narratives and decisions, but whilst the text can be easily understood by human doctors it is challenging to use in research and clinical applications. To uncover the potential of biomedical documents we need to extract and structure the information they contain. The task at hand is Named Entity Recognition and Linking (NER+L). The number of entities, ambiguity of words, overlapping and nesting make the biomedical area significantly more difficult than many others. To overcome these difficulties, we have developed the Medical Concept Annotation Tool (MedCAT), an open-source unsupervised approach to NER+L. MedCAT uses unsupervised machine learning to disambiguate entities. It was validated on MIMIC-III (a freely accessible critical care database) and MedMentions (Biomedical papers annotated with mentions from the Unified Medical Language System). In case of NER+L, the comparison with existing tools shows that MedCAT improves the previous best with only unsupervised learning (F1=0.848 vs 0.691 for disease detection; F1=0.710 vs. 0.222 for general concept detection). A qualitative analysis of the vector embeddings learnt by MedCAT shows that it captures latent medical knowledge available in EHRs (MIMIC-III). Unsupervised learning can improve the performance of large scale entity extraction, but it has some limitations when working with only a couple of entities and a small dataset. In that case options are supervised learning or active learning, both of which are supported in MedCAT via the MedCATtrainer extension. Our approach can detect and link millions of different biomedical concepts with state-of-the-art performance, whilst being lightweight, fast and easy to use.

LGJan 4, 2024
Uncertainty-Aware Deep Attention Recurrent Neural Network for Heterogeneous Time Series Imputation

Linglong Qian, Zina Ibrahim, Richard Dobson

Missingness is ubiquitous in multivariate time series and poses an obstacle to reliable downstream analysis. Although recurrent network imputation achieved the SOTA, existing models do not scale to deep architectures that can potentially alleviate issues arising in complex data. Moreover, imputation carries the risk of biased estimations of the ground truth. Yet, confidence in the imputed values is always unmeasured or computed post hoc from model output. We propose DEep Attention Recurrent Imputation (DEARI), which jointly estimates missing values and their associated uncertainty in heterogeneous multivariate time series. By jointly representing feature-wise correlations and temporal dynamics, we adopt a self attention mechanism, along with an effective residual component, to achieve a deep recurrent neural network with good imputation performance and stable convergence. We also leverage self-supervised metric learning to boost performance by optimizing sample similarity. Finally, we transform DEARI into a Bayesian neural network through a novel Bayesian marginalization strategy to produce stochastic DEARI, which outperforms its deterministic equivalent. Experiments show that DEARI surpasses the SOTA in diverse imputation tasks using real-world datasets, namely air quality control, healthcare and traffic.

CLMar 26, 2025
Clean & Clear: Feasibility of Safe LLM Clinical Guidance

Julia Ive, Felix Jozsa, Nick Jackson et al.

Background: Clinical guidelines are central to safe evidence-based medicine in modern healthcare, providing diagnostic criteria, treatment options and monitoring advice for a wide range of illnesses. LLM-empowered chatbots have shown great promise in Healthcare Q&A tasks, offering the potential to provide quick and accurate responses to medical inquiries. Our main objective was the development and preliminary assessment of an LLM-empowered chatbot software capable of reliably answering clinical guideline questions using University College London Hospital (UCLH) clinical guidelines. Methods: We used the open-weight Llama-3.1-8B LLM to extract relevant information from the UCLH guidelines to answer questions. Our approach highlights the safety and reliability of referencing information over its interpretation and response generation. Seven doctors from the ward assessed the chatbot's performance by comparing its answers to the gold standard. Results: Our chatbot demonstrates promising performance in terms of relevance, with ~73% of its responses rated as very relevant, showcasing a strong understanding of the clinical context. Importantly, our chatbot achieves a recall of 1.00 for extracted guideline lines, substantially minimising the risk of missing critical information. Approximately 78% of responses were rated satisfactory in terms of completeness. A small portion (~14.5%) contained minor unnecessary information, indicating occasional lapses in precision. The chatbot' showed high efficiency, with an average completion time of 10 seconds, compared to 30 seconds for human respondents. Evaluation of clinical reasoning showed that 72% of the chatbot's responses were without flaws. Our chatbot demonstrates significant potential to speed up and improve the process of accessing locally relevant clinical information for healthcare professionals.

CLJan 27, 2025
RelCAT: Advancing Extraction of Clinical Inter-Entity Relationships from Unstructured Electronic Health Records

Shubham Agarwal, Vlad Dinu, Thomas Searle et al.

This study introduces RelCAT (Relation Concept Annotation Toolkit), an interactive tool, library, and workflow designed to classify relations between entities extracted from clinical narratives. Building upon the CogStack MedCAT framework, RelCAT addresses the challenge of capturing complete clinical relations dispersed within text. The toolkit implements state-of-the-art machine learning models such as BERT and Llama along with proven evaluation and training methods. We demonstrate a dataset annotation tool (built within MedCATTrainer), model training, and evaluate our methodology on both openly available gold-standard and real-world UK National Health Service (NHS) hospital clinical datasets. We perform extensive experimentation and a comparative analysis of the various publicly available models with varied approaches selected for model fine-tuning. Finally, we achieve macro F1-scores of 0.977 on the gold-standard n2c2, surpassing the previous state-of-the-art performance, and achieve performance of >=0.93 F1 on our NHS gathered datasets.

CLJul 7, 2021
MedGPT: Medical Concept Prediction from Clinical Narratives

Zeljko Kraljevic, Anthony Shek, Daniel Bean et al.

The data available in Electronic Health Records (EHRs) provides the opportunity to transform care, and the best way to provide better care for one patient is through learning from the data available on all other patients. Temporal modelling of a patient's medical history, which takes into account the sequence of past events, can be used to predict future events such as a diagnosis of a new disorder or complication of a previous or existing disorder. While most prediction approaches use mostly the structured data in EHRs or a subset of single-domain predictions and outcomes, we present MedGPT a novel transformer-based pipeline that uses Named Entity Recognition and Linking tools (i.e. MedCAT) to structure and organize the free text portion of EHRs and anticipate a range of future medical events (initially disorders). Since a large portion of EHR data is in text form, such an approach benefits from a granular and detailed view of a patient while introducing modest additional noise. MedGPT effectively deals with the noise and the added granularity, and achieves a precision of 0.344, 0.552 and 0.640 (vs LSTM 0.329, 0.538 and 0.633) when predicting the top 1, 3 and 5 candidate future disorders on real world hospital data from King's College Hospital, London, UK (\textasciitilde600k patients). We also show that our model captures medical knowledge by testing it on an experimental medical multiple choice question answering task, and by examining the attentional focus of the model using gradient-based saliency methods.

LGJun 12, 2020
Comparing Natural Language Processing Techniques for Alzheimer's Dementia Prediction in Spontaneous Speech

Thomas Searle, Zina Ibrahim, Richard Dobson

Alzheimer's Dementia (AD) is an incurable, debilitating, and progressive neurodegenerative condition that affects cognitive function. Early diagnosis is important as therapeutics can delay progression and give those diagnosed vital time. Developing models that analyse spontaneous speech could eventually provide an efficient diagnostic modality for earlier diagnosis of AD. The Alzheimer's Dementia Recognition through Spontaneous Speech task offers acoustically pre-processed and balanced datasets for the classification and prediction of AD and associated phenotypes through the modelling of spontaneous speech. We exclusively analyse the supplied textual transcripts of the spontaneous speech dataset, building and comparing performance across numerous models for the classification of AD vs controls and the prediction of Mental Mini State Exam scores. We rigorously train and evaluate Support Vector Machines (SVMs), Gradient Boosting Decision Trees (GBDT), and Conditional Random Fields (CRFs) alongside deep learning Transformer based models. We find our top performing models to be a simple Term Frequency-Inverse Document Frequency (TF-IDF) vectoriser as input into a SVM model and a pre-trained Transformer based model `DistilBERT' when used as an embedding layer into simple linear models. We demonstrate test set scores of 0.81-0.82 across classification metrics and a RMSE of 4.58.

CLMay 8, 2020
Comparative Analysis of Text Classification Approaches in Electronic Health Records

Aurelie Mascio, Zeljko Kraljevic, Daniel Bean et al.

Text classification tasks which aim at harvesting and/or organizing information from electronic health records are pivotal to support clinical and translational research. However these present specific challenges compared to other classification tasks, notably due to the particular nature of the medical lexicon and language used in clinical records. Recent advances in embedding methods have shown promising results for several clinical tasks, yet there is no exhaustive comparison of such approaches with other commonly used word representations and classification models. In this work, we analyse the impact of various word representations, text pre-processing and classification algorithms on the performance of four different text classification tasks. The results show that traditional approaches, when tailored to the specific language and structure of the text inherent to the classification task, can achieve or exceed the performance of more recent ones based on contextual embeddings such as BERT.

AIApr 3, 2020
Modeling Rare Interactions in Time Series Data Through Qualitative Change: Application to Outcome Prediction in Intensive Care Units

Zina Ibrahim, Honghan Wu, Richard Dobson

Many areas of research are characterised by the deluge of large-scale highly-dimensional time-series data. However, using the data available for prediction and decision making is hampered by the current lag in our ability to uncover and quantify true interactions that explain the outcomes.We are interested in areas such as intensive care medicine, which are characterised by i) continuous monitoring of multivariate variables and non-uniform sampling of data streams, ii) the outcomes are generally governed by interactions between a small set of rare events, iii) these interactions are not necessarily definable by specific values (or value ranges) of a given group of variables, but rather, by the deviations of these values from the normal state recorded over time, iv) the need to explain the predictions made by the model. Here, while numerous data mining models have been formulated for outcome prediction, they are unable to explain their predictions. We present a model for uncovering interactions with the highest likelihood of generating the outcomes seen from highly-dimensional time series data. Interactions among variables are represented by a relational graph structure, which relies on qualitative abstractions to overcome non-uniform sampling and to capture the semantics of the interactions corresponding to the changes and deviations from normality of variables of interest over time. Using the assumption that similar templates of small interactions are responsible for the outcomes (as prevalent in the medical domains), we reformulate the discovery task to retrieve the most-likely templates from the data.

CLFeb 7, 2020
Identifying physical health comorbidities in a cohort of individuals with severe mental illness: An application of SemEHR

Rebecca Bendayan, Honghan Wu, Zeljko Kraljevic et al.

Multimorbidity research in mental health services requires data from physical health conditions which is traditionally limited in mental health care electronic health records. In this study, we aimed to extract data from physical health conditions from clinical notes using SemEHR. Data was extracted from Clinical Record Interactive Search (CRIS) system at South London and Maudsley Biomedical Research Centre (SLaM BRC) and the cohort consisted of all individuals who had received a primary or secondary diagnosis of severe mental illness between 2007 and 2018. Three pairs of annotators annotated 2403 documents with an average Cohen's Kappa of 0.757. Results show that the NLP performance varies across different diseases areas (F1 0.601 - 0.954) suggesting that the language patterns or terminologies of different condition groups entail different technical challenges to the same NLP task.

IRJan 27, 2020
The side effect profile of Clozapine in real world data of three large mental hospitals

Ehtesham Iqbal, Risha Govind, Alvin Romero et al.

Objective: Mining the data contained within Electronic Health Records (EHRs) can potentially generate a greater understanding of medication effects in the real world, complementing what we know from Randomised control trials (RCTs). We Propose a text mining approach to detect adverse events and medication episodes from the clinical text to enhance our understanding of adverse effects related to Clozapine, the most effective antipsychotic drug for the management of treatment-resistant schizophrenia, but underutilised due to concerns over its side effects. Material and Methods: We used data from de-identified EHRs of three mental health trusts in the UK (>50 million documents, over 500,000 patients, 2835 of which were prescribed Clozapine). We explored the prevalence of 33 adverse effects by age, gender, ethnicity, smoking status and admission type three months before and after the patients started Clozapine treatment. We compared the prevalence of adverse effects with those reported in the Side Effects Resource (SIDER) where possible. Results: Sedation, fatigue, agitation, dizziness, hypersalivation, weight gain, tachycardia, headache, constipation and confusion were amongst the highest recorded Clozapine adverse effect in the three months following the start of treatment. Higher percentages of all adverse effects were found in the first month of Clozapine therapy. Using a significance level of (p< 0.05) out chi-square tests show a significant association between most of the ADRs in smoking status and hospital admissions and some in gender and age groups. Further, the data was combined from three trusts, and chi-square tests were applied to estimate the average effect of ADRs in each monthly interval. Conclusion: A better understanding of how the drug works in the real world can complement clinical trials and precision medicine.

QMDec 2, 2019
On Classifying Sepsis Heterogeneity in the ICU: Insight Using Machine Learning

Zina Ibrahim, Honghan Wu, Ahmed Hamoud et al.

Current machine learning models aiming to predict sepsis from Electronic Health Records (EHR) do not account for the heterogeneity of the condition, despite its emerging importance in prognosis and treatment. This work demonstrates the added value of stratifying the types of organ dysfunction observed in patients who develop sepsis in the ICU in improving the ability to recognise patients at risk of sepsis from their EHR data. Using an ICU dataset of 13,728 records, we identify clinically significant sepsis subpopulations with distinct organ dysfunction patterns. Classification experiments using Random Forest, Gradient Boost Trees and Support Vector Machines, aiming to distinguish patients who develop sepsis in the ICU from those who do not, show that features selected using sepsis subpopulations as background knowledge yield a superior performance regardless of the classification model used. Our findings can steer machine learning efforts towards more personalised models for complex conditions including sepsis.

HCJul 16, 2019
MedCATTrainer: A Biomedical Free Text Annotation Interface with Active Learning and Research Use Case Specific Customisation

Thomas Searle, Zeljko Kraljevic, Rebecca Bendayan et al.

We present MedCATTrainer an interface for building, improving and customising a given Named Entity Recognition and Linking (NER+L) model for biomedical domain text. NER+L is often used as a first step in deriving value from clinical text. Collecting labelled data for training models is difficult due to the need for specialist domain knowledge. MedCATTrainer offers an interactive web-interface to inspect and improve recognised entities from an underlying NER+L model via active learning. Secondary use of data for clinical research often has task and context specific criteria. MedCATTrainer provides a further interface to define and collect supervised learning training data for researcher specific use cases. Initial results suggest our approach allows for efficient and accurate collection of research use case specific training data.

CLNov 23, 2018
Application of Clinical Concept Embeddings for Heart Failure Prediction in UK EHR data

Spiros Denaxas, Pontus Stenetorp, Sebastian Riedel et al.

Electronic health records (EHR) are increasingly being used for constructing disease risk prediction models. Feature engineering in EHR data however is challenging due to their highly dimensional and heterogeneous nature. Low-dimensional representations of EHR data can potentially mitigate these challenges. In this paper, we use global vectors (GloVe) to learn word embeddings for diagnoses and procedures recorded using 13 million ontology terms across 2.7 million hospitalisations in national UK EHR. We demonstrate the utility of these embeddings by evaluating their performance in identifying patients which are at higher risk of being hospitalised for congestive heart failure. Our findings indicate that embeddings can enable the creation of robust EHR-derived disease risk prediction models and address some the limitations associated with manual clinical feature engineering.