CoAM: Corpus of All-Type Multiword Expressions
CoAM provides a more reliable dataset for MWE identification, which is important for NLP tasks such as machine translation, though the contribution is incremental because it builds on existing data collection methods.
The authors tackled the problem of inconsistently annotated, single-type, or small datasets for multiword expression (MWE) identification by creating CoAM, a new dataset of 1.3K sentences whose quality is enhanced through human annotation, human review, and automated consistency checking, and whose MWEs carry type tags for fine-grained analysis. In experiments on CoAM, a fine-tuned large language model outperformed MWEasWSD, the method that had achieved state-of-the-art performance on the DiMSUM dataset, and analysis by MWE type revealed that Verb MWEs are easier to identify than Noun MWEs.
Multiword expressions (MWEs) refer to idiomatic sequences of multiple words. MWE identification, i.e., detecting MWEs in text, can play a key role in downstream tasks such as machine translation, but existing datasets for the task are inconsistently annotated, limited to a single type of MWE, or limited in size. To enable reliable and comprehensive evaluation, we created CoAM: Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences constructed through a multi-step process, consisting of human annotation, human review, and automated consistency checking, to enhance data quality. Additionally, for the first time in a dataset for MWE identification, CoAM's MWEs are tagged with MWE types, such as Noun and Verb, enabling fine-grained error analysis. Annotations for CoAM were collected using a new interface created with our interface generator, which allows easy and flexible annotation of MWEs in any form. Through experiments using CoAM, we find that a fine-tuned large language model outperforms MWEasWSD, which achieved state-of-the-art performance on the DiMSUM dataset. Furthermore, analysis using our MWE-type-tagged data reveals that Verb MWEs are easier to identify than Noun MWEs across approaches.
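To make the type-wise analysis concrete, the following is a minimal sketch of how per-type recall could be computed when gold MWEs carry a type tag and system predictions do not. It is not the evaluation code used for CoAM: the function name `recall_by_type`, the data layout (each MWE as a tuple of token indices, which also covers discontinuous MWEs), the exact-match criterion, and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter


def recall_by_type(gold, predicted):
    """Return exact-match recall per MWE type.

    gold:      one list per sentence of (token_indices, mwe_type) pairs
    predicted: one list per sentence of token_indices tuples (no types)
    A gold MWE counts as found only if a prediction in the same sentence
    covers exactly the same set of token indices.
    """
    total = Counter()
    found = Counter()
    for gold_mwes, pred_mwes in zip(gold, predicted):
        pred_sets = {frozenset(p) for p in pred_mwes}
        for indices, mwe_type in gold_mwes:
            total[mwe_type] += 1
            if frozenset(indices) in pred_sets:
                found[mwe_type] += 1
    return {t: found[t] / total[t] for t in total}


# Toy data invented for illustration (not taken from CoAM).
# Sentence 0: "She gave up the hot dog"
# Sentence 1: "He took the red tape into account"
gold = [
    [((1, 2), "VERB"), ((4, 5), "NOUN")],
    [((1, 5, 6), "VERB"), ((3, 4), "NOUN")],
]
predicted = [
    [(1, 2)],             # "gave up" found, "hot dog" missed
    [(1, 5, 6), (3, 4)],  # "took ... into account" and "red tape" found
]

scores = recall_by_type(gold, predicted)
print(scores)
```

Recall is used here rather than F1 because the predictions in this sketch have no type labels, so a false positive cannot be attributed to a type. In the toy data, both Verb MWEs are recovered but only one of the two Noun MWEs is.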