CL · AI · LG · Sep 28, 2022

Multilingual Search with Subword TF-IDF

arXiv:2209.14281v2 · 2 citations · h-index: 11 · Has Code
AI Analysis

Subword TF-IDF provides a more accurate and language-agnostic search method for multilingual information retrieval, though the contribution appears incremental because it builds on standard TF-IDF and swaps in subword tokenization.

The paper tackled the problem of multilingual search by proposing subword TF-IDF (STF-IDF), which achieved 85.4% accuracy for English and over 80% for 10 other languages on the XQuAD evaluation without heuristics-based preprocessing.

Multilingual search can be achieved with subword tokenization. The accuracy of traditional TF-IDF approaches depends on manually curated tokenization, stop words, and stemming rules, whereas subword TF-IDF (STF-IDF) can offer higher accuracy without such heuristics. Moreover, multilingual support can be incorporated inherently as part of the subword tokenization model training. XQuAD evaluation demonstrates the advantages of STF-IDF: superior information retrieval accuracy of 85.4% for English and over 80% for 10 other languages without any heuristics-based preprocessing. The software to reproduce these results is open-sourced as part of Text2Text: https://github.com/artitw/text2text
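To make the idea concrete, here is a minimal sketch of TF-IDF retrieval over subword units, using only the Python standard library. It is not the Text2Text implementation: character trigrams stand in for a trained multilingual subword tokenizer, and the function names (`subword_tokens`, `build_index`, `search`), the smoothed IDF formula, and the sample documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def subword_tokens(text, n=3):
    """Stand-in subword tokenizer (assumption): character n-grams per word.

    A real STF-IDF setup would use a trained subword model instead.
    """
    tokens = []
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        if len(padded) <= n:
            tokens.append(padded)
        else:
            tokens.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return tokens


def normalize(vec):
    """Scale a sparse vector (dict) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


def build_index(docs):
    """Return (idf, vectors): smoothed IDF weights and unit TF-IDF vectors."""
    counts_per_doc = [Counter(subword_tokens(d)) for d in docs]
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    n_docs = len(docs)
    # Smoothed IDF (an illustrative choice, not taken from the paper).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [
        normalize({t: tf * idf[t] for t, tf in counts.items()})
        for counts in counts_per_doc
    ]
    return idf, vectors


def search(query, idf, vectors):
    """Rank document indices by cosine similarity to the query."""
    counts = Counter(subword_tokens(query))
    q = normalize({t: tf * idf[t] for t, tf in counts.items() if t in idf})
    scores = [sum(w * vec.get(t, 0.0) for t, w in q.items()) for vec in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)


# Hypothetical mixed-language corpus; no stop words or stemming rules are used.
docs = [
    "The Eiffel Tower is located in Paris.",
    "La Gran Muralla China es visible desde lejos.",
    "Der Rhein fließt durch Köln.",
]
idf, vectors = build_index(docs)
print(search("Where is the Eiffel Tower?", idf, vectors)[0])  # index of best match
```

The same pipeline handles all three languages without any language-specific preprocessing, which is the property the abstract attributes to STF-IDF. Swapping `subword_tokens` for a trained subword model would bring the sketch closer to the described approach.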
