To Be or Not To Be a Verbal Multiword Expression: A Quest for Discriminating Features
This work addresses the challenge of VMWE identification for downstream semantic applications, but it is incremental: it focuses on a subproblem (seen VMWEs) and builds on existing shared task benchmarks.
The paper tackles the identification of previously seen verbal multiword expressions (VMWEs) by determining an optimal set of features for supervised classification. It finds that a simple custom frequency-based feature selection method outperforms standard methods, and that an SVM classifier using only 6 features achieves better results than the best systems from a recent shared task on the French seen data.
Automatic identification of multiword expressions (MWEs) is a prerequisite for semantically-oriented downstream applications. This task is challenging because MWEs, especially verbal ones (VMWEs), exhibit surface variability. However, this variability is usually more restricted than in regular (non-VMWE) constructions, which leads to various variability profiles. We use this fact to determine the optimal set of features which could be used in a supervised classification setting to solve a subproblem of VMWE identification: the identification of occurrences of previously seen VMWEs. Surprisingly, a simple custom frequency-based feature selection method proves more efficient than other standard methods such as the chi-squared test, information gain, or decision trees. An SVM classifier using the optimal set of only 6 features outperforms the best systems from a recent shared task on the French seen data.
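To make the feature selection setting concrete, the sketch below ranks binary features with the chi-squared test, one of the standard baselines named above. It is not the custom frequency-based method, whose details are not given here. The feature names (`same_verb_inflection`, `components_adjacent`), the toy candidates, and the helper functions are all hypothetical and serve only as an illustration.

```python
from collections import Counter

def chi_squared(feature_values, labels):
    """Pearson chi-squared statistic for one binary feature against a binary label."""
    counts = Counter(zip(feature_values, labels))
    a = counts[(1, 1)]  # feature present, candidate is a VMWE
    b = counts[(1, 0)]  # feature present, candidate is not a VMWE
    c = counts[(0, 1)]  # feature absent, candidate is a VMWE
    d = counts[(0, 0)]  # feature absent, candidate is not a VMWE
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a constant feature or label carries no information
    return n * (a * d - b * c) ** 2 / denom

def rank_features(rows, labels):
    """Return (feature name, score) pairs sorted from most to least discriminating."""
    names = rows[0].keys()
    scores = {name: chi_squared([r[name] for r in rows], labels) for name in names}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: each row is one candidate occurrence of a seen VMWE,
# each label says whether the candidate really is a VMWE (1) or not (0).
rows = [
    {"same_verb_inflection": 1, "components_adjacent": 1},
    {"same_verb_inflection": 1, "components_adjacent": 0},
    {"same_verb_inflection": 0, "components_adjacent": 1},
    {"same_verb_inflection": 0, "components_adjacent": 0},
]
labels = [1, 1, 0, 0]

ranking = rank_features(rows, labels)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy sample the first feature separates the two classes perfectly and the second is independent of the label, so a selection step that keeps only the top-ranked features would retain the first. The retained features would then be passed to a classifier such as the SVM mentioned above.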