CL, Feb 22, 2018

LIDIOMS: A Multilingual Linked Idioms Data Set

arXiv:1802.08148v1 1095 citations
Originality: Synthesis-oriented
AI Analysis

The paper provides a structured data set for NLP researchers and developers who work with idioms across languages. The contribution is incremental, since it builds on existing data sources and established linking practices.

The paper aims to support natural language processing applications by creating LIDIOMS, a multilingual linked data set of idioms in RDF covering five languages. The result is a quality-assured resource linked to existing data sets such as BabelNet.

In this paper, we describe the LIDIOMS data set, a multilingual RDF representation of idioms currently containing five languages: English, German, Italian, Portuguese, and Russian. The data set is intended to support natural language processing applications by providing links between idioms across languages. The underlying data was crawled and integrated from various sources. To ensure the quality of the crawled data, all idioms were evaluated by at least two native speakers. Herein, we present the model devised for structuring the data. We also provide the details of linking LIDIOMS to well-known multilingual data sets such as BabelNet. The resulting data set complies with best practices according to the Linguistic Linked Open Data Community.
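
To make the idea of "links between idioms across languages" concrete, the following is a minimal sketch of idioms stored as subject-predicate-object triples and queried for cross-language equivalents. It is not the LIDIOMS model: the namespace `http://example.org/idioms/`, the predicates `label`, `language`, and `translation`, and the helper names are all hypothetical and chosen only for illustration. The example pair (English "kick the bucket", German "den Löffel abgeben") is an illustration and is not claimed to be taken from the data set.

```python
# Hypothetical namespace and predicate names; NOT the vocabulary used by LIDIOMS.
EX = "http://example.org/idioms/"
LABEL = EX + "label"
LANG = EX + "language"
TRANSLATION = EX + "translation"

# Each triple is (subject, predicate, object), mirroring the shape of RDF statements.
triples = [
    (EX + "en/kick_the_bucket", LABEL, "kick the bucket"),
    (EX + "en/kick_the_bucket", LANG, "en"),
    (EX + "de/den_loeffel_abgeben", LABEL, "den Löffel abgeben"),
    (EX + "de/den_loeffel_abgeben", LANG, "de"),
    # Cross-language link between the two idiom resources.
    (EX + "en/kick_the_bucket", TRANSLATION, EX + "de/den_loeffel_abgeben"),
]


def value(triples, subject, predicate):
    """Return the first object found for (subject, predicate), or None."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None


def linked_idioms(triples, subject):
    """Return (language, label) pairs for idioms linked to `subject`.

    Translation links are followed in both directions, so a link stated
    once is visible from either idiom.
    """
    targets = []
    for s, p, o in triples:
        if p != TRANSLATION:
            continue
        if s == subject:
            targets.append(o)
        elif o == subject:
            targets.append(s)
    return [(value(triples, t, LANG), value(triples, t, LABEL)) for t in targets]


if __name__ == "__main__":
    for lang, label in linked_idioms(triples, EX + "en/kick_the_bucket"):
        print(lang, label)
```

In practice an RDF library and the data set's own vocabulary would replace the plain tuples and the made-up predicates used here.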

Code Implementations: 1 repo