CL, Jul 1, 2024

Preserving Multilingual Quality While Tuning Query Encoder on English Only

arXiv:2407.00923v4, 12 citations, h-index: 8
Originality: Incremental advance
AI Analysis

This paper addresses the problem of preserving a model's versatility during domain-specific tuning for retrieval systems, offering a practical way to keep multilingual capabilities without recomputing the stored document embeddings.

The study investigated whether tuning a query encoder on a narrow English-only dataset degrades its original multilingual and general qualities. It found that tuning not only preserves the multilingual qualities but can improve them, and that embedding quality on distinctly different data is also improved or at least preserved.

The query encoder of a dual passage retrieval system can be tuned for specific types of queries or domains while the precomputed and stored document representations are kept intact. Switching from one query encoder to another when needed is easy, unlike overhauling the embeddings of a whole knowledge base. In this work we ask: can the generic, original qualities of the encoder be preserved, or at least not degraded too much, when it is tuned on a narrow domain? We conducted experiments on a high-quality multilingual embedding model. Tuning it on a single English-only dataset, we observe that the tuning not only preserves the multilingual qualities but even improves them. The embedding qualities on distinctly different data are also improved or at least preserved. Drawing on our observations, we suggest a more general hypothesis: tuning with an intentionally low learning rate can preserve or improve a system's properties that were acquired in training but not specifically targeted by tuning. We call this adiabatic tuning and provide tentative explanations.
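
The setup described in the abstract (frozen, precomputed document representations; only the query encoder updated, with a deliberately small learning rate) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the query encoder is a plain linear map `W`, the loss is a standard softmax contrastive loss over all stored documents, and the names `tune_query_encoder` and `mean_loss`, the data, and the value `lr=1e-3` are hypothetical. It shows only the mechanics, not the paper's multilingual results.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def encode(W, x):
    """Toy query encoder (assumption): a linear map W applied to query features x."""
    return [dot(row, x) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_loss(W, doc_embs, pairs):
    """Average of -log p(positive document | query) over the tuning pairs."""
    total = 0.0
    for x, pos in pairs:
        probs = softmax([dot(encode(W, x), d) for d in doc_embs])
        total -= math.log(probs[pos])
    return total / len(pairs)

def tune_query_encoder(W, doc_embs, pairs, lr=1e-3, epochs=50):
    """Plain SGD on the query encoder only; doc_embs are read, never written."""
    W = [row[:] for row in W]  # work on a copy so the original encoder is kept
    for _ in range(epochs):
        for x, pos in pairs:
            q = encode(W, x)
            probs = softmax([dot(q, d) for d in doc_embs])
            # Gradient of the loss w.r.t. q: sum_j probs[j] * d_j - d_pos
            grad_q = [
                dot(probs, [d[k] for d in doc_embs]) - doc_embs[pos][k]
                for k in range(len(q))
            ]
            # Since q = W x, the gradient w.r.t. W[k][i] is grad_q[k] * x[i]
            for k, row in enumerate(W):
                for i, xi in enumerate(x):
                    row[i] -= lr * grad_q[k] * xi
    return W

# Hypothetical data: three stored document embeddings and two (query, positive) pairs.
doc_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # stand-in for the original encoder
pairs = [([1.0, 0.2, 0.0], 0), ([0.1, 1.0, 0.3], 1)]

docs_before = [d[:] for d in doc_embs]
W1 = tune_query_encoder(W0, doc_embs, pairs)
drift = max(abs(a - b) for r0, r1 in zip(W0, W1) for a, b in zip(r0, r1))

print("loss before:", mean_loss(W0, doc_embs, pairs))
print("loss after: ", mean_loss(W1, doc_embs, pairs))
print("max weight change:", drift)
```

Because `tune_query_encoder` returns a new `W` and leaves both `W0` and `doc_embs` untouched, switching between the original and the tuned query encoder needs no change to the stored document index, which is the practical point the abstract makes.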

