Exploring Selective Retrieval-Augmentation for Long-Tail Legal Text Classification
This work addresses poor performance on rare classes in legal text classification, but the contribution is incremental: it builds on existing retrieval-augmentation methods and adds a selective focus.
The paper tackles poor model performance on rare classes in long-tail legal text classification by exploring Selective Retrieval-Augmentation (SRA), which augments samples of low-frequency labels without external data, and reports consistent gains in micro-F1 and macro-F1 over LexGLUE baselines on the LEDGAR and UNFAIR-ToS datasets.
Legal text classification is a fundamental NLP task in the legal domain. Benchmark datasets in this area often exhibit a long-tail label distribution, where many labels are underrepresented, leading to poor model performance on rare classes. This paper explores Selective Retrieval-Augmentation (SRA) as a proof-of-concept approach to this problem. SRA augments only those training samples that belong to low-frequency labels, which avoids introducing noise for well-represented classes and requires no changes to the model architecture. Retrieval is performed only from the training data, which avoids potential information leakage and removes the need for external corpora. SRA is tested on two legal text classification benchmark datasets with long-tail distributions: LEDGAR (single-label) and UNFAIR-ToS (multi-label). Results show that SRA achieves consistent gains in both micro-F1 and macro-F1 over LexGLUE baselines.
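The selective step described above can be illustrated with a minimal sketch for the single-label case. It is not the paper's implementation: the bag-of-words cosine similarity, the `min_count` rarity threshold, the `top_k` value, the ` [SEP] ` separator, and the choice to append retrieved text to the input are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenizer; a stand-in for whatever encoder is actually used.
    return text.lower().split()


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def selective_augment(train, min_count=2, top_k=1, sep=" [SEP] "):
    """Augment only samples whose label occurs at most `min_count` times.

    `train` is a list of (text, label) pairs. Neighbours are retrieved
    from `train` itself, so no external corpus is involved.
    All thresholds here are illustrative assumptions.
    """
    label_freq = Counter(label for _, label in train)
    vectors = [Counter(tokenize(text)) for text, _ in train]
    augmented = []
    for i, (text, label) in enumerate(train):
        if label_freq[label] > min_count:
            # Well-represented label: leave the sample untouched.
            augmented.append((text, label))
            continue
        # Rank every other training sample by similarity to this one.
        scored = sorted(
            ((cosine(vectors[i], vectors[j]), j)
             for j in range(len(train)) if j != i),
            key=lambda pair: (-pair[0], pair[1]),
        )
        neighbours = [train[j][0] for score, j in scored[:top_k] if score > 0]
        augmented.append((sep.join([text] + neighbours), label))
    return augmented


train = [
    ("this agreement is governed by the laws of new york", "Governing Law"),
    ("the laws of delaware govern this agreement", "Governing Law"),
    ("this agreement shall be governed by california law", "Governing Law"),
    ("each party waives the right to a jury trial", "Waiver of Jury Trial"),
    ("the parties waive trial by jury", "Waiver of Jury Trial"),
]
augmented = selective_augment(train, min_count=2, top_k=1)
for text, label in augmented:
    print(label, "|", text)
```

In this toy data, the three "Governing Law" samples pass through unchanged, while each "Waiver of Jury Trial" sample has its nearest training neighbour appended. A multi-label setting such as UNFAIR-ToS would need a rule for deciding when a sample counts as rare (for example, when any of its labels is rare); that rule is not specified here and would be a further assumption.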