Do Schwartz Higher-Order Values Help Sentence-Level Human Value Detection? A Study of Hierarchical Gating and Calibration

arXiv:2602.00913 | 20.32 citations | h-index: 2

AI Analysis

This work addresses the challenge of human value detection for NLP applications, but it is incremental as it focuses on comparing existing methods rather than introducing new architectures.

The study addresses the detection of human values from single sentences, a sparse and imbalanced multi-label task. It evaluates whether Schwartz higher-order categories improve performance under a compute-frugal budget on the ValueEval'24/ValuesML dataset. Calibration and ensembling gave the most reliable gains (for example, threshold tuning improved Social Focus vs. Personal Focus from 0.41 to 0.57), while hard hierarchical gating did not consistently help.

Human value detection from single sentences is a sparse, imbalanced multi-label task. We study whether Schwartz higher-order (HO) categories help in this setting on ValueEval'24 / ValuesML (74K English sentences) under a compute-frugal budget. Rather than proposing a new architecture, we compare direct supervised transformers, hard HO$\rightarrow$values pipelines, Presence$\rightarrow$HO$\rightarrow$values cascades, compact instruction-tuned large language models (LLMs), QLoRA, and low-cost upgrades such as threshold tuning and small ensembles. HO categories are learnable: the easiest bipolar pair, Growth vs. Self-Protection, reaches Macro-$F_1=0.58$. The most reliable gains come from calibration and ensembling: threshold tuning improves Social Focus vs. Personal Focus from $0.41$ to $0.57$ ($+0.16$), transformer soft voting lifts Growth from $0.286$ to $0.303$, and a Transformer+LLM hybrid reaches $0.353$ on Self-Protection. In contrast, hard hierarchical gating does not consistently improve the end task. Compact LLMs also underperform supervised encoders as stand-alone systems, although they sometimes add useful diversity in hybrid ensembles. Under this benchmark, the HO structure is more useful as an inductive bias than as a rigid routing rule.
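
The three low-cost mechanisms named in the abstract (per-label threshold tuning, soft voting, and hard HO gating) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the threshold grid, the 0.5 gate cutoff, and the `ho_of_value` mapping from each value label to its parent HO category are all assumptions made for illustration.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 for one binary label; returns 0.0 when there are no positives at all."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def tune_thresholds(probs, y_true, grid=None):
    """Per label, pick the threshold that maximizes F1 on validation data.

    probs, y_true: arrays of shape (n_sentences, n_labels).
    The grid is a hypothetical choice, not taken from the paper.
    """
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    best = np.full(probs.shape[1], 0.5)
    for j in range(probs.shape[1]):
        scores = [f1_binary(y_true[:, j], (probs[:, j] >= t).astype(int))
                  for t in grid]
        best[j] = grid[int(np.argmax(scores))]  # first maximum wins on ties
    return best

def soft_vote(prob_list):
    """Soft voting: average the predicted probabilities of several models."""
    return np.mean(np.stack(prob_list, axis=0), axis=0)

def hard_gate(value_probs, ho_probs, ho_of_value, ho_threshold=0.5):
    """Hard gating (assumed form): zero out a value's probability whenever
    its parent HO category is predicted absent.

    ho_of_value[j] is the column index in ho_probs of value j's parent.
    """
    gate = (ho_probs >= ho_threshold).astype(float)  # (n, n_ho)
    return value_probs * gate[:, ho_of_value]        # (n, n_values)
```

In this sketch a hard gate discards a correct value prediction whenever the upstream HO classifier misses, which is one plausible reason a rigid routing rule may fail to help, whereas threshold tuning and soft voting only rescale or average scores that the value classifier already produced.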
