CL | May 14, 2024

Is Less More? Quality, Quantity and Context in Idiom Processing with Natural Language Models

arXiv:2405.08497v1 | 4 citations | h-index: 5 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses a specific NLP challenge, the handling of non-compositional language, but it is incremental: its contribution centers on dataset creation and optimization strategies.

The paper tackles the problem of processing idiomatic expressions in language models by exploring the trade-off between data quantity and quality for idiomaticity detection. It finds that dataset quality is the stronger factor for context-enriched models, while quantity also plays a role for models without context.

Compositionality in language models presents a problem when processing idiomatic expressions, as their meaning often cannot be directly derived from their individual parts. Although fine-tuning and other optimization strategies can be used to improve representations of idiomatic expressions, this depends on the availability of relevant data. We present the Noun Compound Synonym Substitution in Books - NCSSB - datasets, which are created by substitution of synonyms of potentially idiomatic English noun compounds in public domain book texts. We explore the trade-off between data quantity and quality when training models for idiomaticity detection, in conjunction with contextual information obtained locally (from the surrounding sentences) or externally (through language resources). Performance on an idiomaticity detection task indicates that dataset quality is a stronger factor for context-enriched models, but that quantity also plays a role in models without context inclusion strategies.
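The dataset construction described in the abstract can be illustrated with a minimal sketch. The abstract does not say which direction the substitution runs or how synonyms are obtained, so everything here is an assumption made for illustration: the synonym table, the function name `substitute_synonyms`, and the choice to replace a synonym found in the text with the noun compound. It is not the authors' pipeline.

```python
import re

# Hypothetical synonym table (noun compound -> synonyms), illustrative only.
# The real NCSSB resources are not reproduced here.
SYNONYMS = {
    "silver lining": ["consolation", "bright side"],
}


def substitute_synonyms(sentence, synonyms=SYNONYMS):
    """Return (new_sentence, compound) pairs, one per synonym found.

    Assumption: each whole-word occurrence of a synonym is replaced by the
    noun compound it belongs to, yielding a sentence that uses the compound.
    """
    results = []
    for compound, syns in synonyms.items():
        for syn in syns:
            # Whole-word, case-insensitive match of the synonym.
            pattern = re.compile(r"\b" + re.escape(syn) + r"\b", re.IGNORECASE)
            if pattern.search(sentence):
                # A callable replacement avoids backslash-escape surprises.
                new_sentence = pattern.sub(lambda m: compound, sentence)
                results.append((new_sentence, compound))
    return results


if __name__ == "__main__":
    for text, compound in substitute_synonyms(
        "She looked for the bright side of the delay."
    ):
        print(compound, "->", text)
```

A sentence containing none of the listed synonyms yields an empty list, so the function can be mapped over a book's sentences and the non-empty results collected as candidate training examples.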

Code Implementations: 1 repo