CL Nov 7, 2023

SpaDeLeF: A Dataset for Hierarchical Classification of Lexical Functions for Collocations in Spanish

Yevhen Kostiuk, Grigori Sidorov, Olga Kolesnikova

arXiv:2311.04189v1 0.5 h-index: 31

Originality Synthesis-oriented

AI Analysis

This work addresses the need for labeled data for Spanish collocation analysis in NLP, but its contribution is incremental: it focuses on dataset creation rather than novel methods.

The authors tackled the problem of hierarchical classification of lexical functions for Spanish verb-noun collocations by creating a dataset with 37 classes organized in a tree structure, providing baselines and data splits for evaluation.

In natural language processing (NLP), a lexical function is a concept, first introduced in Meaning-Text Theory, for unambiguously representing the semantic and syntactic features of words and phrases in text. Hierarchical classification of lexical functions involves organizing these features into a tree-like hierarchy of categories or labels. This is a challenging task, as it requires a good understanding of the context and of the relationships among words and phrases in text. It also requires large amounts of labeled data to train language models effectively. In this paper, we present a dataset of the most frequent Spanish verb-noun collocations and the sentences in which they occur. Each collocation is assigned to one of 37 lexical functions, which are defined as classes for a hierarchical classification task. Each class represents a relation between the noun and the verb in a collocation, involving their semantic and syntactic features. We combine the classes into a tree-based structure and introduce classification objectives for each level of the structure. The dataset was created by dependency tree parsing and matching of phrases in Spanish news. We provide baselines and data splits for each objective.
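To make the idea of per-level objectives concrete, here is a minimal Python sketch of how a leaf class in a label tree could be mapped to one target per level. It is not the authors' code. The tree shape, the intermediate node names (`Oper`, `Real`, `Caus`), the helper names, and the two example collocations with their class assignments are all assumptions chosen for illustration; the dataset's actual 37 classes and their grouping are not listed in the abstract.

```python
# Hypothetical label tree stored as child -> parent.
# Node names are illustrative only, not the dataset's real hierarchy.
PARENT = {
    "Oper1": "Oper",
    "Oper2": "Oper",
    "Real1": "Real",
    "CausFunc0": "Caus",
    "Oper": "ROOT",
    "Real": "ROOT",
    "Caus": "ROOT",
}


def path_from_root(label):
    """Return the labels from the coarsest level down to `label` (ROOT excluded)."""
    path = []
    while label != "ROOT":
        path.append(label)
        label = PARENT[label]
    return path[::-1]


def label_at_level(label, level):
    """Target class for the objective at `level` (0 = coarsest).

    If the path is shorter than the requested level, the deepest label is reused.
    """
    path = path_from_root(label)
    return path[min(level, len(path) - 1)]


# Hypothetical annotated collocations: (verb, noun, leaf class).
examples = [("dar", "paso", "Oper1"), ("cumplir", "promesa", "Real1")]

# One training target per level of the tree for each collocation.
targets = [
    (verb, noun, label_at_level(leaf, 0), label_at_level(leaf, 1))
    for verb, noun, leaf in examples
]
```

Under these assumptions, each collocation gets a coarse target for the top-level objective and a fine-grained target for the next level, so a separate classifier (or a separate evaluation) can be set up per level.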
