CL · Jul 22, 2017

Predicting the Gender of Indonesian Names

arXiv:1707.07129v2 · 0.71 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses a practical problem for applications that need gender inference in Indonesia, where most people do not carry a family name as a surname. It is incremental, however, because it applies an existing method to a specific dataset.

The paper tackled predicting gender from Indonesian names using a character-level LSTM, achieving 92.25% accuracy with full names and 90.65% with first names, outperforming classical machine learning methods like Naive Bayes and XGBoost.
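A character-level model of this kind reads a name as a sequence of character indices rather than as a single token. The sketch below shows one plausible way to prepare such input for both the full-name and first-name settings. The helper names, the reserved padding and unknown indices, the `max_len` value, and the two sample names are illustrative assumptions and are not taken from the paper or its dataset.

```python
# Hypothetical preprocessing for a character-level model.
# All names and constants here are assumptions made for illustration.

PAD, UNK = 0, 1  # reserved indices: padding and characters unseen in training


def first_name(full_name):
    """Return the first whitespace-separated token of a name."""
    parts = full_name.split()
    return parts[0] if parts else ""


def build_vocab(names):
    """Map every character in the lowercased names to an index starting at 2."""
    chars = sorted({ch for name in names for ch in name.lower()})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode(name, vocab, max_len):
    """Convert a name to a fixed-length list of character indices."""
    ids = [vocab.get(ch, UNK) for ch in name.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))  # right-pad short names


# Placeholder names, not drawn from the paper's data.
names = ["Siti Aminah", "Budi Santoso"]
vocab = build_vocab(names)

full_ids = encode(names[0], vocab, max_len=20)               # full-name input
first_ids = encode(first_name(names[0]), vocab, max_len=20)  # first-name input
```

Sequences like `full_ids` would then be fed to an embedding layer followed by an LSTM in whichever deep learning framework is used. That part is omitted here because the listing does not describe the architecture in detail.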

We investigated a way to predict the gender of a name using a character-level Long Short-Term Memory network (char-LSTM). We compared our method with conventional machine learning methods, namely Naive Bayes, logistic regression, and XGBoost, with n-grams as the features. We evaluated the models on a dataset consisting of the names of Indonesian people. It is not common to use a family name as the surname in Indonesian culture, except in some ethnicities. Therefore, we inferred the gender from both full names and first names. The results show that we can achieve 92.25% accuracy from full names, while using first names only yields 90.65% accuracy. These results are better than the ones from applying the classical machine learning algorithms to n-grams.
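To make the n-gram baselines concrete, here is a minimal sketch of one of them: a multinomial Naive Bayes classifier over character n-gram counts, written with the standard library only. The n-gram range (1 to 3), the boundary markers `^` and `$`, the add-one smoothing, the class name `NgramNaiveBayes`, and the four toy names are all assumptions chosen for illustration. The abstract does not specify these details, so this should not be read as the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(name, n_min=1, n_max=3):
    """Return all character n-grams of a lowercased, boundary-marked name."""
    text = "^" + name.lower() + "$"  # assumed start/end markers
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def fit(self, names, labels):
        self.label_counts = Counter(labels)
        self.ngram_counts = defaultdict(Counter)
        for name, label in zip(names, labels):
            self.ngram_counts[label].update(char_ngrams(name))
        self.vocab = {g for c in self.ngram_counts.values() for g in c}
        return self

    def predict(self, name):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(count / total)  # log prior
            for gram in char_ngrams(name):
                if gram in self.vocab:  # skip n-grams never seen in training
                    score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy placeholder data, not drawn from the paper's dataset.
train_names = ["Siti Aminah", "Dewi Lestari", "Budi Santoso", "Agus Setiawan"]
train_labels = ["F", "F", "M", "M"]

model = NgramNaiveBayes().fit(train_names, train_labels)
prediction = model.predict("Siti Aminah")
```

With four training names the predictions carry no statistical meaning. The point is only to show how a name becomes a bag of character n-grams and how those counts drive a classical classifier, in contrast to the sequence model the paper proposes.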
