Classifying Long Clinical Documents with Pre-trained Transformers
This work addresses the challenge of handling long documents in clinical text classification, but its contribution is incremental, since it builds on existing transformer methods.
The paper tackles the problem of classifying long clinical documents for automatic phenotyping. It evaluates strategies for incorporating pre-trained sentence encoders into document-level representations and finds that hierarchical transformers without pre-training are competitive with task pre-trained models.
Automatic phenotyping is the task of identifying cohorts of patients that match a predefined set of criteria. Phenotyping typically involves classifying long clinical documents that contain thousands of tokens. At the same time, recent state-of-the-art transformer-based pre-trained language models limit the input to a few hundred tokens (e.g., 512 tokens for BERT). We evaluate several strategies for incorporating pre-trained sentence encoders into document-level representations of clinical text, and find that hierarchical transformers without pre-training are competitive with task pre-trained models.
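To make the length mismatch concrete, the following is a minimal sketch of the general two-level idea: split a long document into pieces that fit the encoder's limit, encode each piece, then combine the piece vectors into one document vector. It is not the paper's implementation. The names `split_into_chunks`, `toy_encoder`, and `encode_document` are hypothetical, the encoder is a toy stand-in for a pre-trained sentence encoder, and mean pooling is a simplification used here in place of a second-level transformer.

```python
from typing import Callable, List, Sequence

MAX_TOKENS = 512  # BERT's input limit, as cited in the abstract


def split_into_chunks(tokens: Sequence[str], max_len: int = MAX_TOKENS) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [list(tokens[i:i + max_len]) for i in range(0, len(tokens), max_len)]


def toy_encoder(chunk: Sequence[str], dim: int = 4) -> List[float]:
    """Hypothetical stand-in for a pre-trained encoder.

    Buckets tokens by their length modulo dim and returns normalized counts,
    so every chunk maps to a fixed-size vector.
    """
    vec = [0.0] * dim
    for tok in chunk:
        vec[len(tok) % dim] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]


def encode_document(
    tokens: Sequence[str],
    encoder: Callable[[Sequence[str]], List[float]] = toy_encoder,
    max_len: int = MAX_TOKENS,
) -> List[float]:
    """Encode each chunk separately, then mean-pool the chunk vectors.

    Assumption: mean pooling replaces the document-level model here; a
    hierarchical transformer would attend over the chunk vectors instead.
    """
    chunks = split_into_chunks(tokens, max_len)
    if not chunks:
        raise ValueError("document has no tokens")
    chunk_vecs = [encoder(c) for c in chunks]
    dim = len(chunk_vecs[0])
    return [sum(v[d] for v in chunk_vecs) / len(chunk_vecs) for d in range(dim)]


if __name__ == "__main__":
    document = ["tok"] * 1300  # a document longer than one encoder window
    print([len(c) for c in split_into_chunks(document)])
    print(encode_document(document))
```

The point of the design is that no single encoder call sees more than `max_len` tokens, while the final vector still reflects the whole document. Swapping `toy_encoder` for a real sentence encoder and the pooling step for a learned aggregator yields the kind of hierarchical architecture the abstract refers to.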