CL · Sep 13, 2023

Native Language Identification with Big Bird Embeddings

Sergey Kramp, Giovanni Cassani, Chris Emmery

arXiv:2309.06923v1 · 0.5 · h-index: 8 · Has Code

Originality: Incremental advance

AI Analysis

The paper offers a more effective and computationally efficient method for identifying an author's native language from their writing, a domain-specific task in computational linguistics.

The paper addresses Native Language Identification (NLI) and shows that classifiers trained on Big Bird embeddings outperform traditional linguistic feature engineering models by a large margin on the Reddit-L2 dataset.

Native Language Identification (NLI) intends to classify an author's native language based on their writing in another language. Historically, the task has heavily relied on time-consuming linguistic feature engineering, and transformer-based NLI models have thus far failed to offer effective, practical alternatives. The current work investigates if input size is a limiting factor, and shows that classifiers trained using Big Bird embeddings outperform linguistic feature engineering models by a large margin on the Reddit-L2 dataset. Additionally, we provide further insight into input length dependencies, show consistent out-of-sample performance, and qualitatively analyze the embedding space. Given the effectiveness and computational efficiency of this method, we believe it offers a promising avenue for future NLI work.
