CL · AI · Jul 30, 2023

LaFiCMIL: Rethinking Large File Classification from the Perspective of Correlated Multiple Instance Learning

arXiv:2308.01413v4 · 4 citations · h-index: 54
Originality: Incremental advance
AI Analysis

The paper addresses a practical problem for NLP practitioners, namely classifying files too large for standard Transformer inputs, and offers an efficient, high-performing solution. The advance is incremental because it builds on existing multiple instance learning and BERT techniques.

The paper tackles the challenge of classifying large files, which exceed typical token limits of Transformer models, by introducing LaFiCMIL, a method based on correlated multiple instance learning that scales BERT to handle nearly 20,000 tokens on a single GPU and achieves state-of-the-art performance across seven benchmark datasets.

Transformer-based models have significantly advanced natural language processing, in particular the performance in text classification tasks. Nevertheless, these models face challenges in processing large files, primarily due to their input constraints, which are generally restricted to hundreds or thousands of tokens. Attempts to address this issue in existing models usually consist in extracting only a fraction of the essential information from lengthy inputs, while often incurring high computational costs due to their complex architectures. In this work, we address the challenge of classifying large files from the perspective of correlated multiple instance learning. We introduce LaFiCMIL, a method specifically designed for large file classification. LaFiCMIL is optimized for efficient operation on a single GPU, making it a versatile solution for binary, multi-class, and multi-label classification tasks. We conducted extensive experiments using seven diverse and comprehensive benchmark datasets to assess LaFiCMIL's effectiveness. By integrating BERT for feature extraction, LaFiCMIL demonstrates exceptional performance, setting new benchmarks across all datasets. A notable achievement of our approach is its ability to scale BERT to handle nearly 20,000 tokens while operating on a single GPU with 32GB of memory. This efficiency, coupled with its state-of-the-art performance, highlights LaFiCMIL's potential as a groundbreaking approach in the field of large file classification.
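To make the multiple-instance framing concrete, here is a minimal sketch of how a long token sequence can be treated as a "bag" of encoder-sized "instances". It is not the paper's implementation. The 512-token window, the two reserved special-token slots, the helper names `split_into_instances` and `attention_pool`, and the plain softmax-weighted pooling are all assumptions chosen for illustration. In particular, the pooling shown is a generic attention-style aggregation standing in for the correlated aggregation the abstract refers to, whose details are not given here.

```python
import math
from typing import List, Sequence


def split_into_instances(
    token_ids: Sequence[int],
    max_len: int = 512,   # assumed BERT-style input limit
    n_special: int = 2,   # assumed slots reserved for [CLS] and [SEP]
) -> List[List[int]]:
    """Split one long token sequence into chunks that each fit the encoder.

    Each chunk is one "instance"; the list of chunks is the "bag".
    """
    body = max_len - n_special
    if body <= 0:
        raise ValueError("max_len must exceed the number of special tokens")
    return [list(token_ids[i:i + body]) for i in range(0, len(token_ids), body)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    instance_vectors: Sequence[Sequence[float]],
    scores: Sequence[float],
) -> List[float]:
    """Combine per-instance feature vectors into a single bag-level vector.

    Generic softmax-weighted average (illustrative stand-in only).
    """
    weights = softmax(scores)
    dim = len(instance_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, instance_vectors))
        for d in range(dim)
    ]


if __name__ == "__main__":
    # A file of 20,000 tokens, roughly the scale quoted in the abstract.
    bag = split_into_instances(list(range(20_000)))
    print(len(bag), len(bag[0]), len(bag[-1]))
```

Under these assumptions a 20,000-token file becomes 40 instances (39 full chunks of 510 tokens plus one of 110), each of which a standard BERT encoder could embed separately before a bag-level classifier sees the pooled vector.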

