Estimating Commonsense Plausibility through Semantic Shifts
This work addresses the challenge of evaluating language models on commonsense plausibility. The contribution is incremental: it introduces a new discriminative method for an existing bottleneck.
The paper tackles fine-grained commonsense plausibility estimation for language models by proposing ComPaSS, a discriminative framework that measures the semantic shift induced by augmenting a sentence with commonsense-related information. ComPaSS consistently outperforms baselines across tasks and backbones, and VLMs integrated with ComPaSS outperform LMs on vision-grounded tasks.
Commonsense plausibility estimation is critical for evaluating language models (LMs), yet existing generative approaches, which rely on likelihoods or verbalized judgments, struggle with fine-grained discrimination. In this paper, we propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts when augmenting sentences with commonsense-related information. Plausible augmentations induce minimal shifts in semantics, while implausible ones result in substantial deviations. Evaluations on two types of fine-grained commonsense plausibility estimation tasks across different backbones, including LLMs and vision-language models (VLMs), show that ComPaSS consistently outperforms baselines. This demonstrates the advantage of discriminative approaches over generative methods in fine-grained commonsense plausibility evaluation. Experiments also show that (1) when integrated with ComPaSS, VLMs outperform LMs on vision-grounded commonsense tasks, and (2) contrastive pre-training sharpens backbone models' ability to capture semantic nuances, thereby further enhancing ComPaSS.
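
The core idea, that a plausible augmentation shifts sentence semantics less than an implausible one, can be illustrated with a minimal sketch. This is not the authors' implementation. The hand-made word vectors, the mean-pooling encoder `toy_embed`, the use of cosine similarity as the shift measure, and the function names `plausibility` and `rank_candidates` are all assumptions made for illustration; in practice the `embed` argument would be a real sentence encoder built on an LM or VLM backbone.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]

# Hypothetical hand-made 2-d word vectors, for illustration only.
# A real system would use embeddings from an LM or VLM backbone.
TOY_VECTORS: Dict[str, Vector] = {
    "man": [1.0, 0.2],
    "eats": [0.9, 0.1],
    "apple": [0.8, 0.3],
    "rock": [-0.2, 1.0],
}


def toy_embed(sentence: str) -> Vector:
    """Mean-pool the toy word vectors; words without a vector are skipped."""
    vecs = [TOY_VECTORS[w] for w in sentence.lower().split() if w in TOY_VECTORS]
    if not vecs:
        raise ValueError(f"no known words in sentence: {sentence!r}")
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def plausibility(
    base: str,
    augmented: str,
    embed: Callable[[str], Vector] = toy_embed,
) -> float:
    """Score an augmentation by how little it shifts the base semantics.

    Assumption: shift is measured as cosine similarity between the base
    and augmented sentence embeddings (higher = smaller shift).
    """
    return cosine(embed(base), embed(augmented))


def rank_candidates(
    base: str,
    candidates: Sequence[str],
    embed: Callable[[str], Vector] = toy_embed,
) -> List[str]:
    """Order augmented sentences from most to least plausible."""
    return sorted(
        candidates,
        key=lambda c: plausibility(base, c, embed),
        reverse=True,
    )


if __name__ == "__main__":
    base = "man eats"
    for cand in rank_candidates(base, ["man eats rock", "man eats apple"]):
        print(f"{plausibility(base, cand):.3f}  {cand}")
```

With these toy vectors, "man eats apple" stays closer to "man eats" than "man eats rock" does, so it ranks first. Passing the encoder as a parameter keeps the scoring logic independent of the backbone, which mirrors the abstract's comparison across different backbones.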