CL · Dec 5, 2020

Codeswitched Sentence Creation using Dependency Parsing

arXiv:2012.02990v1 · 3 citations
AI Analysis

This work tackles the problem of limited codeswitched data for multilingual NLP tasks, particularly for languages prevalent in India, which is a significant bottleneck for researchers and developers in this domain.

This paper addresses the scarcity of codeswitched data by proposing a novel algorithm that leverages English grammar's syntactic structure to generate grammatically sound codeswitched sentences for English-Hindi, English-Marathi, and English-Kannada language pairs. The method ensures abundant data generation from small initial datasets and is evaluated using qualitative metrics and baseline NLP task results.

Codeswitching has become one of the most common occurrences among multilingual speakers worldwide, especially in countries like India, which has around 23 official languages and around 300 million bilingual speakers. The scarcity of Codeswitched data is a bottleneck for exploring this domain across various Natural Language Processing (NLP) tasks. We thus present a novel algorithm that harnesses the syntactic structure of English grammar to develop grammatically sensible Codeswitched versions of English-Hindi, English-Marathi and English-Kannada data. Apart from largely maintaining grammatical soundness, our methodology also guarantees abundant generation of data from a minuscule snapshot of given data. We use multiple datasets to showcase the capabilities of our algorithm, assess the quality of the generated Codeswitched data using some qualitative metrics, and provide baseline results for a couple of NLP tasks.
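The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of using an English dependency parse to decide which words may be switched. Everything in it is an assumption made for illustration: the hand-written parse, the `SWITCHABLE` relation set, the toy romanized-Hindi `LEXICON`, and the helper names `switch_candidates` and `generate_codeswitched` are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Token:
    idx: int     # 1-based position in the sentence
    text: str    # surface form
    head: int    # idx of the syntactic head, 0 for the root
    deprel: str  # dependency relation to the head


# Hand-written parse of "She reads a good book" (assumed, not parser output).
PARSE = [
    Token(1, "She", 2, "nsubj"),
    Token(2, "reads", 0, "root"),
    Token(3, "a", 5, "det"),
    Token(4, "good", 5, "amod"),
    Token(5, "book", 2, "obj"),
]

# Toy English -> romanized Hindi lexicon (illustrative; ignores agreement).
LEXICON = {"good": "accha", "book": "kitaab"}

# Relations whose tokens this sketch allows to switch (an assumption).
SWITCHABLE = {"amod", "obj"}


def switch_candidates(parse, lexicon, switchable):
    """Return tokens that have a switchable relation and a lexicon entry."""
    return [t for t in parse
            if t.deprel in switchable and t.text.lower() in lexicon]


def generate_codeswitched(parse, lexicon, switchable):
    """Yield one sentence per non-empty subset of switch candidates."""
    candidates = switch_candidates(parse, lexicon, switchable)
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = {t.idx for t in subset}
            words = [lexicon[t.text.lower()] if t.idx in chosen else t.text
                     for t in parse]
            yield " ".join(words)


variants = list(generate_codeswitched(PARSE, LEXICON, SWITCHABLE))
for sentence in variants:
    print(sentence)
```

In this sketch, k switchable tokens yield 2^k - 1 variants, which is one way a small seed set could expand into many sentences. The paper's own selection rules and its handling of word order and agreement may differ.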

Code Implementations: 1 repo