Cross-Dialect Information Retrieval: Information Access in Low-Resource and High-Variance Languages
This work addresses information access for users of low-resource, high-variance dialects, though its contribution is incremental, since it builds on existing cross-lingual retrieval methods.
The paper tackles cross-dialect information retrieval (CDIR) by creating WikiDIR, the first German dialect retrieval dataset. It shows that lexical methods and zero-shot cross-lingual transfer perform poorly because of high lexical variation and scarce resources, while document translation effectively reduces the dialect gap.
A large amount of local and culture-specific knowledge (e.g., people, traditions, food) can only be found in documents written in dialects. While cross-lingual information retrieval (CLIR) has been researched extensively, cross-dialect retrieval (CDIR) has received limited attention. Dialect retrieval poses unique challenges due to the limited availability of resources to train retrieval models and the high variability in non-standardized languages. We study these challenges using German dialects as an example and introduce the first German dialect retrieval dataset, dubbed WikiDIR, which consists of seven German dialects extracted from Wikipedia. Using WikiDIR, we demonstrate the weakness of lexical methods in dealing with high lexical variation in dialects. We further show that the commonly used zero-shot cross-lingual transfer approach with multilingual encoders does not transfer well to extremely low-resource setups, motivating the need for resource-lean and dialect-specific retrieval models. Finally, we demonstrate that (document) translation is an effective way to reduce the dialect gap in CDIR.
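To make the point about lexical variation concrete, the following is a minimal sketch, not the paper's method or evaluation setup. It uses plain Jaccard overlap as a stand-in for exact-match lexical retrieval, and the dialect-like spellings ("huus", "ziit") are illustrative assumptions, not examples taken from WikiDIR. The character n-gram score is included only as a contrast, to show that surface similarity exists which exact token matching cannot see.

```python
def word_tokens(text):
    """Whitespace tokens, lowercased; a stand-in for an exact-match lexical index."""
    return set(text.lower().split())


def char_ngrams(text, n=3):
    """Character n-grams over the space-padded, lowercased text."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical example: a standard German query and a dialect-like document.
# The spellings are assumptions chosen for illustration only.
query = "haus zeit"
document = "huus ziit"

# Exact token matching finds no shared terms at all.
word_score = jaccard(word_tokens(query), word_tokens(document))

# Character trigrams still share a few fragments (e.g. word endings).
char_score = jaccard(char_ngrams(query), char_ngrams(document))

print(f"word overlap: {word_score:.2f}")
print(f"char trigram overlap: {char_score:.2f}")
```

In this toy case the word-level overlap is exactly zero, because no token is spelled identically on both sides, while the character-level overlap is small but nonzero. This only illustrates the vocabulary mismatch problem; it says nothing about how any particular retrieval model performs on WikiDIR.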