DL | CL | MM | May 9, 2025

Differentiating Emigration from Return Migration of Scholars Using Name-Based Nationality Detection Models

arXiv:2505.06107v1 | h-index: 8 | ICWSM
Originality: Incremental advance
AI Analysis

The paper addresses a challenge in research on the migration of scholars by providing a method to infer nationality from names. The advance is incremental, since it applies existing machine learning techniques to a specific domain problem.

The paper tackles the problem of differentiating emigration from return migration of scholars. To address left-censoring in migration research, it develops name-based nationality detection models that achieve weighted F1 scores of up to 84%. It shows that using name origin instead of academic origin reveals higher return migration rates, for example 48% vs. 33% for the USA.
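For readers unfamiliar with the metric, here is a minimal sketch of a weighted F1 score: the per-class F1 averaged with weights proportional to each class's share of the true labels. The function name `weighted_f1` and the toy labels are illustrative assumptions, not taken from the paper or its code.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score

# Toy labels (hypothetical, not the paper's data)
y_true = ["CN", "CN", "US", "IN"]
y_pred = ["CN", "US", "US", "IN"]
print(weighted_f1(y_true, y_pred))
```

Weighting by support matters here because nationality classes are unlikely to be equally represented, so frequent classes contribute more to the reported score than rare ones.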

Most web and digital trace data do not include information about an individual's nationality due to privacy concerns. The lack of data on nationality can create challenges for migration research. It can lead to a left-censoring issue since we are uncertain about the migrant's country of origin. Once we observe an emigration event, if we know the nationality, we can differentiate it from return migration. We propose methods to detect the nationality with the least available data, i.e., full names. We use the detected nationality in comparison with the country of academic origin, which is a common approach in studying the migration of researchers. We gathered 2.6 million unique name-nationality pairs from Wikipedia and categorized them into families of nationalities with three granularity levels to use as our training data. Using a character-based machine learning model, we achieved a weighted F1 score of 84% for the broadest and 67% for the most granular, country-level categorization. In our empirical study, we used the trained and tested model to assign nationality to 8+ million scholars' full names in Scopus data. Our results show that using the country of first publication as a proxy for nationality underestimates the size of return flows, especially for countries with a more diverse academic workforce, such as the USA, Australia, and Canada. We found that around 48% of emigration from the USA was return migration once we used the country of name origin, in contrast to 33% based on academic origin. In the most recent period, 79% of scholars whose affiliation has consistently changed from the USA to China, and are considered emigrants, have Chinese names in contrast to 41% with a Chinese academic origin. Our proposed methods for addressing left-censoring issues are beneficial for other research that uses digital trace data to study migration.
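The comparison between the two origin definitions can be illustrated with a small sketch. It assumes a move counts as return migration when the destination country equals the scholar's origin, where origin is either the country of first publication (academic origin) or the nationality predicted from the name. The `Move` record, the `return_share` helper, and the toy data are hypothetical and do not reproduce the paper's pipeline or figures.

```python
from dataclasses import dataclass

@dataclass
class Move:
    from_country: str     # country the scholar left
    to_country: str       # country the scholar moved to
    academic_origin: str  # country of first publication
    name_origin: str      # nationality predicted from the full name

def return_share(moves, source, origin_field):
    """Share of moves out of `source` whose destination equals the chosen origin."""
    out = [m for m in moves if m.from_country == source]
    if not out:
        return 0.0
    returns = sum(1 for m in out if m.to_country == getattr(m, origin_field))
    return returns / len(out)

# Toy moves out of the USA (hypothetical data)
moves = [
    Move("US", "CN", academic_origin="US", name_origin="CN"),
    Move("US", "CN", academic_origin="CN", name_origin="CN"),
    Move("US", "DE", academic_origin="US", name_origin="US"),
    Move("US", "IN", academic_origin="US", name_origin="IN"),
]
print(return_share(moves, "US", "academic_origin"))  # 0.25
print(return_share(moves, "US", "name_origin"))      # 0.75
```

In this toy data, scholars who began publishing in the USA but have names associated with another country are counted as emigrants under academic origin and as return migrants under name origin, which is the direction of the underestimate the abstract describes.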

Code Implementations: 1 repo
