Fine-Grained Emotion Detection on GoEmotions: Experimental Comparison of Classical Machine Learning, BiLSTM, and Transformer Models
This work addresses fine-grained emotion detection for NLP researchers and practitioners, but its contribution is incremental: it benchmarks existing methods on a known dataset.
The paper treats fine-grained emotion recognition as a multi-label NLP task and benchmarks logistic regression, BiLSTM, and BERT models on the GoEmotions dataset. Logistic regression achieves the highest Micro-F1 (0.51), while BERT attains the best overall balance, with Macro-F1 0.49, Hamming Loss 0.036, and Subset Accuracy 0.36.
Fine-grained emotion recognition is a challenging multi-label NLP task due to label overlap and class imbalance. In this work, we benchmark three modeling families on the GoEmotions dataset: a TF-IDF-based logistic regression system trained with binary relevance, a BiLSTM with attention, and a BERT model fine-tuned for multi-label classification. Experiments follow the official train/validation/test split, and class imbalance is mitigated using inverse-frequency class weights. We evaluate with four metrics: Micro-F1, Macro-F1, Hamming Loss, and Subset Accuracy. Logistic regression attains the highest Micro-F1 (0.51), while BERT achieves the best overall balance, surpassing the results reported in the official paper, with Macro-F1 0.49, Hamming Loss 0.036, and Subset Accuracy 0.36. These results suggest that frequent emotions often rely on surface lexical cues, whereas contextual representations improve performance on rarer emotions and more ambiguous examples.
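The four evaluation metrics and the inverse-frequency weighting can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the toy label matrices, the convention that F1 is 0 when a label has no true or predicted positives, and the specific weight formula (number of examples divided by the number of positives for each label) are all assumptions chosen for illustration. The paper does not state which inverse-frequency variant it uses.

```python
from typing import Dict, List

# Rows are examples, columns are emotion labels; entries are 0 or 1.
LabelMatrix = List[List[int]]


def _f1(tp: int, fp: int, fn: int) -> float:
    # Assumed convention: F1 is 0.0 when there is nothing to score.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_metrics(y_true: LabelMatrix, y_pred: LabelMatrix) -> Dict[str, float]:
    n_examples = len(y_true)
    n_labels = len(y_true[0])

    # Per-label (tp, fp, fn) counts.
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))

    # Micro-F1 pools counts over all labels, so frequent labels dominate.
    micro = _f1(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    # Macro-F1 averages per-label F1, so rare labels count equally.
    macro = sum(_f1(*c) for c in counts) / n_labels

    # Hamming Loss: fraction of individual label decisions that are wrong.
    wrong = sum(
        1
        for t, p in zip(y_true, y_pred)
        for t_j, p_j in zip(t, p)
        if t_j != p_j
    )
    hamming = wrong / (n_examples * n_labels)

    # Subset Accuracy: fraction of examples whose full label set is exact.
    subset = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n_examples

    return {
        "micro_f1": micro,
        "macro_f1": macro,
        "hamming_loss": hamming,
        "subset_accuracy": subset,
    }


def inverse_frequency_weights(y_true: LabelMatrix) -> List[float]:
    # Assumed variant: weight_j = N / (positives for label j); 0.0 if none.
    n_examples = len(y_true)
    weights = []
    for j in range(len(y_true[0])):
        positives = sum(row[j] for row in y_true)
        weights.append(n_examples / positives if positives else 0.0)
    return weights


# Toy data (hypothetical): 4 examples, 3 labels.
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 0]]

metrics = multilabel_metrics(y_true, y_pred)
weights = inverse_frequency_weights(y_true)
print(metrics)
print(weights)
```

On this toy data the rare third label is never predicted, which pulls Macro-F1 below Micro-F1. That mirrors the distinction the abstract draws between performance on frequent and rare emotions.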