Classifying long legal documents using short random chunks
This work addresses the problem of efficient, accurate classification of lengthy documents for legal professionals, though its contribution is incremental, as it adapts existing methods to a specific domain.
The paper tackles the challenge of classifying long legal documents by developing a classifier based on DeBERTa V3 and an LSTM that processes 48 random short chunks per document, achieving a weighted F-score of 0.898 and a median processing time of 498 seconds per 100 files on CPU.
Classifying legal documents is challenging: besides their specialized vocabulary, they can be very long. Feeding full documents to a Transformer-based model for classification may therefore be impossible, expensive, or slow. We present a legal document classifier based on DeBERTa V3 and an LSTM that takes as input a collection of 48 randomly selected short chunks (at most 128 tokens each). We also present its deployment pipeline using Temporal, a durable execution solution, which allows us to have a reliable and robust processing workflow. The best model achieved a weighted F-score of 0.898, while the pipeline running on CPU had a median processing time of 498 seconds per 100 files.
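The chunk-selection step described above can be illustrated with a minimal sketch. The abstract gives only the chunk count (48) and the maximum chunk length (128 tokens), so the rest is an assumption made for illustration: the hypothetical function name sample_chunks, the non-overlapping split, the choice to keep sampled chunks in document order, and the handling of documents with fewer than 48 chunks. Tokens are represented as plain strings here; the actual system would presumably use the DeBERTa V3 tokenizer.

```python
import random
from typing import List, Optional


def sample_chunks(
    tokens: List[str],
    num_chunks: int = 48,   # chunk count stated in the abstract
    max_len: int = 128,     # maximum chunk length stated in the abstract
    seed: Optional[int] = None,
) -> List[List[str]]:
    """Randomly select up to `num_chunks` chunks of at most `max_len` tokens.

    Assumptions (not stated in the source): chunks are non-overlapping,
    sampling is without replacement, and selected chunks keep their
    original document order.
    """
    rng = random.Random(seed)

    # Split the token sequence into consecutive, non-overlapping chunks.
    chunks = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

    # Short documents: return every chunk rather than padding or resampling.
    if len(chunks) <= num_chunks:
        return chunks

    # Sample chunk indices without replacement, then restore document order.
    picked = sorted(rng.sample(range(len(chunks)), num_chunks))
    return [chunks[i] for i in picked]


if __name__ == "__main__":
    # Toy document of 10,000 placeholder tokens (79 chunks before sampling).
    document = [f"tok{i}" for i in range(10_000)]
    selected = sample_chunks(document, seed=0)
    print(len(selected), max(len(c) for c in selected))
```

Under these assumptions, the model input is bounded at 48 × 128 = 6,144 tokens per document regardless of document length, which is what would keep inference cost predictable on CPU. Each selected chunk would then be encoded separately, with the sequence of chunk representations passed to the LSTM.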