Large Language Models Do Multi-Label Classification Differently
This work addresses the understudied behavior of LLMs in multi-label classification for subjective tasks and offers practical improvements for researchers and practitioners who work with such models.
The paper investigates how autoregressive large language models perform multi-label classification in subjective tasks. It finds that the initial probability distribution often does not reflect the final output and that the models tend to suppress all but one label at each generation step. The researchers propose distribution alignment methods that improve both alignment with annotator responses and predictive performance; one simple method improves classification F1 without any additional computation.
Multi-label classification is prevalent in real-world settings, but the behavior of Large Language Models (LLMs) in this setting is understudied. We investigate how autoregressive LLMs perform multi-label classification, focusing on subjective tasks, by analyzing the output distributions of the models at each label generation step. We find that the initial probability distribution for the first label often does not reflect the eventual final output, even in terms of relative order, and that LLMs tend to suppress all but one label at each generation step. We further observe that as models scale up, their token distributions exhibit lower entropy and higher single-label confidence, but the internal relative ranking of the labels improves. Finetuning methods such as supervised finetuning and reinforcement learning amplify this phenomenon. We introduce the task of distribution alignment for multi-label settings: aligning LLM-derived label distributions with empirical distributions estimated from annotator responses in subjective tasks. We propose both zero-shot and supervised methods that improve both alignment and predictive performance over existing approaches. We find that one method, taking the max probability over all label generation distributions instead of using only the initial probability distribution, improves both distribution alignment and overall classification F1 without any additional computation.
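
To make the max-over-steps idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that the per-step label probabilities have already been extracted from the model and are stored as one dictionary per label generation step. The function names, the example labels and probabilities, and the 0.5 decision threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# One dict per label generation step, mapping label -> probability.
# How these are extracted from a model is outside this sketch (assumption).
StepDists = List[Dict[str, float]]


def first_step_scores(step_dists: StepDists) -> Dict[str, float]:
    """Baseline: score labels using only the first generation step."""
    return dict(step_dists[0])


def max_over_steps_scores(step_dists: StepDists) -> Dict[str, float]:
    """Score each label by its maximum probability across all steps."""
    labels = set().union(*step_dists)  # every label seen at any step
    return {
        label: max(dist.get(label, 0.0) for dist in step_dists)
        for label in labels
    }


def predict_labels(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep labels whose score meets the (hypothetical) threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)


# Hypothetical example: the model generates "joy" first, then "anger".
# At each step nearly all probability mass sits on a single label.
steps = [
    {"joy": 0.90, "anger": 0.07, "sadness": 0.03},  # step 1
    {"joy": 0.02, "anger": 0.88, "sadness": 0.10},  # step 2
]

baseline = predict_labels(first_step_scores(steps))      # ['joy']
max_based = predict_labels(max_over_steps_scores(steps))  # ['anger', 'joy']
```

In this toy case the first-step distribution alone would miss "anger", because that label only receives high probability at the second generation step. Note that the max-based scores need not sum to one, so they act as per-label confidences rather than a normalized distribution. The step distributions are already produced during generation, which is consistent with the abstract's remark that the method requires no additional computation.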