CL, Dec 23, 2024

An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification

arXiv:2412.17361v1 | 1 citation | h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work addresses tokenizer selection for Japanese sentiment analysis. Its contribution is incremental, since it applies standard methods to a specific language task.

The study compared three Japanese tokenizers (MeCab, Sudachi, SentencePiece) for sentiment classification, finding that SentencePiece with TF-IDF and Logistic Regression achieved the best classification performance.

This study investigates the performance of three popular tokenization tools (MeCab, Sudachi, and SentencePiece) when applied as a preprocessing step for sentiment-based text classification of Japanese texts. Using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization, we evaluate two traditional machine learning classifiers: Multinomial Naive Bayes and Logistic Regression. The results reveal that Sudachi produces tokens closely aligned with dictionary definitions, while MeCab and SentencePiece demonstrate faster processing speeds. The combination of SentencePiece, TF-IDF, and Logistic Regression outperforms the other alternatives in terms of classification performance.
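The pipeline shape described in the abstract (tokenizer, then TF-IDF, then a classifier) can be sketched in a few lines. What follows is a minimal pure-Python illustration, not the paper's code. The tokenizer `char_bigrams` is a placeholder assumption standing in for MeCab, Sudachi, or SentencePiece. The four training sentences are toy examples invented for illustration. The helper names (`fit_idf`, `tfidf_vector`, `train`, `predict`) are hypothetical. The sketch uses a smoothed IDF with L2 normalisation and a hand-rolled Multinomial Naive Bayes; Logistic Regression, the other classifier in the study, is omitted to keep the example short.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Placeholder tokenizer (assumption): overlapping character bigrams."""
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]


def fit_idf(token_lists):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(token_lists)
    df = Counter(tok for toks in token_lists for tok in set(toks))
    return {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}


def tfidf_vector(tokens, idf):
    """L2-normalised TF-IDF weights; tokens unseen during fitting are dropped."""
    tf = Counter(tok for tok in tokens if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {tok: w / norm for tok, w in vec.items()}


def train(texts, labels, tokenize, alpha=1.0):
    """Fit IDF weights and a Multinomial Naive Bayes model on TF-IDF vectors."""
    token_lists = [tokenize(t) for t in texts]
    idf = fit_idf(token_lists)
    vectors = [tfidf_vector(toks, idf) for toks in token_lists]
    classes = sorted(set(labels))
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    log_like = {}
    for c in classes:
        # Sum the TF-IDF weight of every token over the documents of class c.
        totals = Counter()
        for vec, y in zip(vectors, labels):
            if y == c:
                totals.update(vec)
        # Additive (Laplace) smoothing over the whole vocabulary.
        denom = sum(totals.values()) + alpha * len(idf)
        log_like[c] = {tok: math.log((totals[tok] + alpha) / denom) for tok in idf}
    return {"tokenize": tokenize, "idf": idf,
            "log_prior": log_prior, "log_like": log_like}


def predict(model, text):
    """Return the class with the highest posterior log-score for `text`."""
    vec = tfidf_vector(model["tokenize"](text), model["idf"])
    scores = {
        c: prior + sum(w * model["log_like"][c][tok] for tok, w in vec.items())
        for c, prior in model["log_prior"].items()
    }
    return max(scores, key=scores.get)


# Toy data, invented for illustration only (not from the paper's dataset).
texts = ["とても良い映画", "最高に良い", "とても悪い映画", "最低で悪い"]
labels = ["positive", "positive", "negative", "negative"]

model = train(texts, labels, char_bigrams)
print(predict(model, "良い"), predict(model, "悪い"))
```

Because `train` accepts any callable that maps a string to a list of tokens, comparing tokenizers in the spirit of the study amounts to passing a different callable and re-running the same evaluation.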

Code Implementations: 1 repo
