Modelling Adjectival Modification Effects on Semantic Plausibility
This work addresses a gap in modeling plausibility changes for tasks like dialogue generation and commonsense reasoning, but it is incremental as it builds on existing benchmarks and methods.
The paper tackles the problem of assessing how adjectival modifiers affect semantic plausibility, using the ADEPT benchmark of 16K sentence pairs. It finds that both sentence transformers and transformer-based models struggle with the task, with sentence transformers underperforming models like RoBERTa.
While the task of assessing the plausibility of events such as "news is relevant" has been addressed by a growing body of work, less attention has been paid to capturing changes in plausibility triggered by event modification. Understanding such changes is relevant for tasks such as dialogue generation, commonsense reasoning, and hallucination detection, as it allows a model to correctly interpret, for example, "gentle sarcasm" as a sign of closeness rather than unkindness among friends [9]. In this work, we tackle the ADEPT challenge benchmark [6], which consists of 16K English sentence pairs differing by exactly one adjectival modifier. Our modeling experiments introduce a conceptually novel method based on sentence transformers, and reveal that both sentence transformers and transformer-based models struggle with the task at hand; despite their conceptual alignment with the task, sentence transformers even underperform in comparison to models like RoBERTa. Furthermore, an in-depth comparison with prior work highlights the importance of a more realistic, balanced evaluation method: imbalances distort model performance and evaluation metrics, and weaken the trustworthiness of results.
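The point about imbalanced evaluation can be illustrated with a minimal sketch. The label names and the 70/20/10 class split below are hypothetical and are not taken from ADEPT or from our experiments; they only show how a trivial majority-class predictor can reach a high accuracy while its macro-averaged F1 stays low.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of examples whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores, so every class counts equally."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Hypothetical, deliberately imbalanced gold labels (assumption, not real data).
gold = ["equal"] * 70 + ["less"] * 20 + ["more"] * 10

# A trivial baseline that always predicts the most frequent class.
majority_label = Counter(gold).most_common(1)[0][0]
pred = [majority_label] * len(gold)

acc = accuracy(gold, pred)
mf1 = macro_f1(gold, pred)
print(f"accuracy = {acc:.3f}, macro-F1 = {mf1:.3f}")
```

Under these assumed proportions, accuracy rewards the baseline for the dominant class alone, whereas macro-F1 penalizes it for never predicting the two minority classes. This is one reason to report class-balanced metrics, or to evaluate on a balanced split, when comparing models.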