CL · LG · May 21, 2025

MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation

arXiv:2505.15696v2 · 3 citations · h-index: 6 · EMNLP
Originality: Incremental advance
AI Analysis

This work addresses BERT classification performance, particularly in low-resource scenarios. The advance is incremental: it builds on the existing BERT architecture with lightweight modifications.

The paper tackles the limitation of using BERT's [CLS] token as the fixed representation for classification. It proposes MaxPoolBERT, which aggregates information across layers and tokens and improves classification accuracy over standard BERT base on low-resource GLUE tasks.

The [CLS] token in BERT is commonly used as a fixed-length representation for classification tasks, yet prior work has shown that both other tokens and intermediate layers encode valuable contextual information. In this work, we study lightweight extensions to BERT that refine the [CLS] representation by aggregating information across layers and tokens. Specifically, we explore three modifications: (i) max-pooling the [CLS] token across multiple layers, (ii) enabling the [CLS] token to attend over the entire final layer using an additional multi-head attention (MHA) layer, and (iii) combining max-pooling across the full sequence with MHA. Our approach, called MaxPoolBERT, enhances BERT's classification accuracy (especially on low-resource tasks) without requiring new pre-training or significantly increasing model size. Experiments on the GLUE benchmark show that MaxPoolBERT consistently outperforms the standard BERT base model on the low-resource tasks.
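The three modifications listed in the abstract can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `params` dictionary, the choice of `k`, and the reading of (iii) as "max-pool every token over the last k layers, then let [CLS] attend over the pooled sequence" are all hypothetical. Random arrays stand in for BERT hidden states and for attention weights that would normally be learned during fine-tuning; biases and padding masks are omitted.

```python
import numpy as np

def max_pool_cls(hidden_states, k):
    """(i) Element-wise max of the [CLS] vector over the last k layers.

    hidden_states: (num_layers, seq_len, d_model); [CLS] is assumed to be token 0.
    """
    return hidden_states[-k:, 0, :].max(axis=0)

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cls_attention(tokens, params, num_heads):
    """(ii) Multi-head attention with [CLS] (row 0) as the only query.

    tokens: (seq_len, d_model). Keys and values come from every row.
    """
    seq_len, d_model = tokens.shape
    d_head = d_model // num_heads
    q = tokens[:1] @ params["Wq"]  # (1, d_model)
    k = tokens @ params["Wk"]      # (seq_len, d_model)
    v = tokens @ params["Wv"]      # (seq_len, d_model)
    # Split the model dimension into heads: (num_heads, n, d_head).
    q = q.reshape(1, num_heads, d_head).transpose(1, 0, 2)
    k = k.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = v.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, 1, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (num_heads, 1, d_head)
    merged = heads.transpose(1, 0, 2).reshape(1, d_model)
    return (merged @ params["Wo"])[0]                     # (d_model,)

def max_pool_then_attend(hidden_states, k, params, num_heads):
    """(iii) Assumed reading: pool each token over the last k layers, then attend."""
    pooled = hidden_states[-k:].max(axis=0)  # (seq_len, d_model)
    return cls_attention(pooled, params, num_heads)

# Toy dimensions and random stand-ins (all hypothetical).
rng = np.random.default_rng(0)
num_layers, seq_len, d_model, num_heads = 12, 8, 16, 4
hidden_states = rng.normal(size=(num_layers, seq_len, d_model))
params = {
    name: rng.normal(scale=0.1, size=(d_model, d_model))
    for name in ("Wq", "Wk", "Wv", "Wo")
}

cls_i = max_pool_cls(hidden_states, k=3)
cls_ii = cls_attention(hidden_states[-1], params, num_heads)
cls_iii = max_pool_then_attend(hidden_states, k=3, params=params, num_heads=num_heads)
print(cls_i.shape, cls_ii.shape, cls_iii.shape)
```

Each variant returns a single `d_model`-sized vector, so any of them could be passed to the usual classification head in place of the final-layer [CLS] vector. With `k=1` the pooling is a no-op: variant (i) reduces to the standard [CLS] vector and variant (iii) reduces to variant (ii).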

