Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data
This work addresses the problem of reliably analysing global conversations, a concern for researchers and practitioners in natural language processing, but its contribution is incremental: it compares existing methods on a single case study.
This study tackles the challenge of analysing multilingual social media data by evaluating four cross-lingual classification approaches for filtering relevant content out of noisy keyword-based collections. It uses a dataset of over nine million tweets in English, Japanese, Hindi, and Korean from 2013 to 2022, and finds key trade-offs between translation-based and multilingual methods for topic discovery.
Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span diverse languages. This study investigates how different approaches to cross-lingual text classification can support reliable analysis of global conversations. Using hydrogen energy as a case study, we analyse a decade-long dataset of over nine million tweets in English, Japanese, Hindi, and Korean (2013--2022) for topic discovery. Keyword-driven online data collection yields a significant amount of irrelevant content. We explore four approaches to filtering relevant content: (1) translating English annotated data into the target languages to build a language-specific model for each target language, (2) translating unlabelled data from all languages into English to create a single model based on English annotations, (3) applying multilingual transformers fine-tuned on English data directly to the data in each target language, and (4) a hybrid strategy that combines translated annotations with multilingual training. Each approach is evaluated for its ability to filter hydrogen-related tweets from noisy keyword-based collections. Subsequently, topic modelling is performed to extract dominant themes within the relevant subsets. The results highlight key trade-offs between translation and multilingual approaches, offering actionable insights into optimising cross-lingual pipelines for large-scale social media analysis.
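The per-approach evaluation of the relevance filter described above can be illustrated with a minimal sketch. The abstract does not state which metrics were used, so precision, recall, and F1 on the "relevant" class are an assumption here. The function names `filter_scores` and `scores_by_language` are hypothetical, and the labels are made-up toy values, not results from the study.

```python
from collections import defaultdict


def filter_scores(gold, predicted):
    """Precision, recall and F1 for the 'relevant' class (label 1).

    Assumption: binary labels, 1 = hydrogen-related, 0 = irrelevant.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def scores_by_language(records):
    """Group (language, gold, predicted) triples and score each language."""
    grouped = defaultdict(lambda: ([], []))
    for lang, gold, pred in records:
        grouped[lang][0].append(gold)
        grouped[lang][1].append(pred)
    return {lang: filter_scores(g, p) for lang, (g, p) in grouped.items()}


# Toy, invented predictions for one hypothetical approach.
toy_records = [
    ("en", 1, 1), ("en", 0, 0), ("en", 1, 1), ("en", 0, 1),
    ("ja", 1, 1), ("ja", 1, 0), ("ja", 0, 0), ("ja", 0, 0),
]
results = scores_by_language(toy_records)
for lang, scores in results.items():
    print(lang, {name: round(value, 3) for name, value in scores.items()})
```

Running the same scoring once per approach and per language would make the trade-offs between translation-based and multilingual filtering directly comparable, under the assumption that a small manually labelled test set exists for each language.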