CL IR LG · Jun 27, 2024

TocBERT: Medical Document Structure Extraction Using Bidirectional Transformers

Majd Saleh, Sarra Baghdadi, Stéphane Paquelet

arXiv:2406.19526v1 · 3.44 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses medical document structure extraction for NLP tasks like information retrieval, but it is incremental as it applies an existing transformer method to a specific domain.

The paper tackles the problem of segmenting medical documents by proposing TocBERT, a supervised method that fine-tunes a bidirectional transformer on discharge summaries from the MIMIC-III dataset. It achieves F1-scores of 84.6% for linear segmentation and 72.8% for hierarchical segmentation.

Text segmentation holds paramount importance in the field of Natural Language Processing (NLP). It plays an important role in several downstream NLP tasks like information retrieval and document summarization. In this work, we propose a new solution, namely TocBERT, for segmenting texts using bidirectional transformers. TocBERT is a supervised solution trained to detect titles and sub-titles from their semantic representations. This task is formulated as a named entity recognition (NER) problem. The solution has been applied to a medical text segmentation use-case where the Bio-ClinicalBERT model is fine-tuned to segment discharge summaries of the MIMIC-III dataset. The performance of TocBERT has been evaluated on a human-labeled ground truth corpus of 250 notes. It achieved an F1-score of 84.6% when evaluated on a linear text segmentation problem and 72.8% on a hierarchical text segmentation problem. It outperformed a carefully designed rule-based solution, particularly in distinguishing titles from subtitles.
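To illustrate how title and sub-title detections relate to linear and hierarchical segmentation, here is a minimal sketch of a post-processing step. It is not the authors' implementation: it assumes the tagger's output has already been reduced to one label per line, and the label names (`TITLE`, `SUBTITLE`, `O`), the `Section` class, the `build_toc` function, and the sample note are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """One node of the extracted table of contents (hypothetical structure)."""
    title: str
    body: list[str] = field(default_factory=list)
    subsections: list["Section"] = field(default_factory=list)


def build_toc(lines, labels):
    """Group lines into a two-level hierarchy from per-line labels.

    Assumed label set: "TITLE", "SUBTITLE", and "O" for ordinary text.
    """
    if len(lines) != len(labels):
        raise ValueError("lines and labels must have the same length")

    sections = []
    current = None      # currently open top-level section
    current_sub = None  # currently open subsection, if any

    for line, label in zip(lines, labels):
        if label == "TITLE":
            current = Section(line)
            sections.append(current)
            current_sub = None
        else:
            if current is None:
                # Content before any title goes into an untitled section.
                current = Section("")
                sections.append(current)
            if label == "SUBTITLE":
                current_sub = Section(line)
                current.subsections.append(current_sub)
            else:
                target = current_sub if current_sub is not None else current
                target.body.append(line)
    return sections


# Invented sample note, for illustration only.
lines = [
    "HISTORY OF PRESENT ILLNESS:",
    "Patient presented with chest pain.",
    "PHYSICAL EXAM:",
    "Vitals:",
    "BP 120/80",
    "Neuro:",
    "Alert and oriented.",
]
labels = ["TITLE", "O", "TITLE", "SUBTITLE", "O", "SUBTITLE", "O"]

toc = build_toc(lines, labels)
for section in toc:
    print(section.title)
    for sub in section.subsections:
        print("  " + sub.title)
```

In this sketch, treating `SUBTITLE` the same as `TITLE` would give a flat (linear) segmentation, while keeping the two labels distinct gives the two-level hierarchy. This may help explain why confusing titles with subtitles affects the hierarchical task more than the linear one.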
