CL, IR | Mar 4, 2025

Zero-Shot Multi-Label Classification of Bangla Documents: Large Decoders Vs. Classic Encoders

Souvika Sarkar, Md. Najib Hasan, Santu Karmaker

arXiv:2503.02993v1 | 8.32 citations | h-index: 2

Originality: Synthesis-oriented

AI Analysis

This work addresses the shortage of NLP resources for Bangla, a language spoken by over 300 million people. Its contribution is incremental: it establishes a benchmark without introducing new methods.

The paper tackles zero-shot multi-label classification of Bangla documents by comparing large decoder-based models with classic encoder-based models. Both types struggle to achieve high accuracy, which indicates a need for more research and resources in Bangla NLP.

Bangla, a language spoken by over 300 million native speakers and ranked as the sixth most spoken language worldwide, presents unique challenges in natural language processing (NLP) due to its complex morphological characteristics and limited resources. While recent large decoder-based models (LLMs), such as GPT, LLaMA, and DeepSeek, have demonstrated excellent performance across many NLP tasks, their effectiveness in Bangla remains largely unexplored. In this paper, we establish the first benchmark comparing decoder-based LLMs with classic encoder-based models on the Zero-Shot Multi-Label Classification (Zero-Shot-MLC) task in Bangla. Our evaluation of 32 state-of-the-art models reveals that existing so-called powerful encoders and decoders still struggle to achieve high accuracy on the Bangla Zero-Shot-MLC task, suggesting a need for more research and resources for Bangla NLP.
