CL · Jul 2, 2025

Data interference: emojis, homoglyphs, and issues of data fidelity in corpora and their results

arXiv:2507.01764v1
h-index: 1
Originality: Synthesis-oriented
AI Analysis

This paper addresses data-fidelity issues in corpus linguistics and is aimed at researchers in that field, but its contribution is incremental because it focuses on specific preprocessing challenges.

The paper tackles the problem of tokenization discrepancies caused by emojis and homoglyphs in corpus linguistics, showing that preprocessing these elements is necessary to maintain data fidelity and ensure reliable linguistic analysis.

Tokenisation - "the process of splitting text into atomic parts" (Brezina & Timperley, 2017: 1) - is a crucial step for corpus linguistics, as it provides the basis for any applicable quantitative method (e.g. collocations) while ensuring the reliability of qualitative approaches. This paper examines how discrepancies in tokenisation affect the representation of language data and the validity of analytical findings: investigating the challenges posed by emojis and homoglyphs, the study highlights the necessity of preprocessing these elements to maintain corpus fidelity to the source data. The research presents methods for ensuring that digital texts are accurately represented in corpora, thereby supporting reliable linguistic analysis and guaranteeing the repeatability of linguistic interpretations. The findings emphasise the necessity of a detailed understanding of both linguistic and technical aspects involved in digital textual data to enhance the accuracy of corpus analysis, and have significant implications for both quantitative and qualitative approaches in corpus-based research.
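The kind of discrepancy the abstract describes can be illustrated with a minimal Python sketch. This is not the paper's method. The sample string, the `naive_tokens` and `scripts` helpers, and the script check based on Unicode character names are all assumptions made for illustration. It shows a regex tokeniser silently dropping an emoji, and a homoglyph (Cyrillic U+0430 in place of Latin "a") splitting one visible word into two distinct types.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical sample: the second "pay" uses Cyrillic U+0430 instead of Latin "a",
# and the text ends with an emoji (U+1F600).
text = "pay p\u0430y \U0001F600"

def naive_tokens(s):
    # \w+ matches letters, digits and underscore only, so the emoji is lost.
    return re.findall(r"\w+", s)

def scripts(token):
    # Approximate the script by the first word of each character's Unicode name,
    # e.g. "LATIN" or "CYRILLIC". A heuristic, not a full script lookup.
    return {unicodedata.name(ch).split()[0] for ch in token if ch.isalpha()}

tokens = naive_tokens(text)
counts = Counter(tokens)

# Tokens that mix scripts are homoglyph candidates worth inspecting.
mixed = [t for t in tokens if len(scripts(t)) > 1]

print(len(text.split()), "whitespace tokens vs", len(tokens), "regex tokens")
print("distinct types:", len(counts))
print("mixed-script tokens:", [t.encode("unicode_escape").decode() for t in mixed])
```

Here the two visually identical words are counted as separate types, and the token count changes depending on whether the emoji survives tokenisation. Both effects would distort frequency-based measures such as collocations.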

