CL AI CY LG
Nov 22, 2018

Creating a contemporary corpus of similes in Serbian by using natural language processing

arXiv:1811.10422v1
2 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for a contemporary corpus of similes in Serbian, which matters for preserving linguistic and cultural heritage. The contribution is incremental, since it builds on an existing collection.

The researchers developed a semi-automated methodology, based on text mining and machine learning, for collecting similes in Serbian. They gathered 442 similes from the web and added them to an existing corpus of 333 similes; together with crowdsourced contributions, the resulting corpus contains 787 unique similes.

A simile is a figure of speech that compares two things by means of connecting words, where the comparison is not intended to be taken literally. Similes are common in everyday communication, and they are also part of linguistic cultural heritage. In this paper we present a methodology for the semi-automated collection of similes from the World Wide Web using text mining and machine learning techniques. We expanded an existing corpus, collected by Vuk Stefanovic Karadzic and containing 333 similes, by adding 442 similes collected from the internet. We also introduce crowdsourcing to the collection of figures of speech, which helped us build a corpus containing 787 unique similes.
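As a rough illustration of what the text mining step of such a pipeline might look like, the sketch below pulls simile candidates out of raw text by matching a connecting word, then merges several collections while dropping duplicates. It is not the authors' method. The choice of the Serbian connective "kao" ("as", "like"), the single-word pattern around it, and the function names `extract_candidates` and `merge_unique` are all assumptions made for illustration. Candidates found this way would still need filtering, for example by a classifier or by human reviewers, because many "kao" phrases are literal comparisons.

```python
import re

# Assumption: candidates have the shape "<word> kao <word>",
# e.g. "hladan kao led" ("cold as ice"). Real similes can be longer.
CANDIDATE = re.compile(r"\b(\w+)\s+kao\s+(\w+)\b", re.IGNORECASE)


def extract_candidates(text):
    """Return lowercased (left word, right word) pairs around 'kao'."""
    return [(m.group(1).lower(), m.group(2).lower())
            for m in CANDIDATE.finditer(text)]


def merge_unique(*collections):
    """Merge collections in order, keeping the first copy of each item."""
    seen = set()
    merged = []
    for collection in collections:
        for item in collection:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


if __name__ == "__main__":
    web_text = "Bio je hladan kao led i brz kao munja."
    from_web = extract_candidates(web_text)
    # Hypothetical entry standing in for an existing collection.
    existing = [("hladan", "led")]
    print(merge_unique(existing, from_web))
```

Deduplicating on normalized pairs is one simple way to count unique similes across sources such as an older collection, web text, and crowdsourced submissions.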
