97.6CLMar 17
Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and SpeechOmnilingual SONAR Team, João Maria Janeiro, Pere-Lluís Huguet Cabot et al. · meta-ai
Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual and cross-modal sentence embedding models that natively embed text, speech, code, and mathematical expressions in a single semantic space, while delivering state-of-the-art downstream performance at the scale of thousands of languages, from high-resource to extremely low-resource varieties. To reach this scale without representation collapse, we use progressive training. We first learn a strong foundational space for 200 languages with an LLM-initialized encoder-decoder, combining token-level decoding with a novel split-softmax contrastive loss and synthetic hard negatives. Building on this foundation, we expand to several thousands language varieties via a two-stage teacher-student encoder distillation framework. Finally, we demonstrate the cross-modal extensibility of this space by seamlessly mapping 177 spoken languages into it. OmniSONAR halves cross-lingual similarity search error on the 200-language FLORES dataset and reduces error by a factor of 15 on the 1,560-language BIBLE benchmark. It also enables strong translation, outperforming NLLB-3B on multilingual benchmarks and exceeding prior models (including much larger LLMs) by 15 chrF++ points on 1,560 languages into English BIBLE translation. OmniSONAR also performs strongly on MTEB and XLCoST. For speech, OmniSONAR achieves a 43% lower similarity-search error and reaches 97% of SeamlessM4T speech-to-text quality, despite being zero-shot for translation (trained only on ASR data). Finally, by training an encoder-decoder LM, Spectrum, exclusively on English text processing OmniSONAR embedding sequences, we unlock high-performance transfer to thousands of languages and speech for complex downstream tasks.
CLAug 23, 2024
Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource TranslationVivek Iyer, Bhavitvya Malik, Pavel Stepachev et al.
Despite the recent popularity of Large Language Models (LLMs) in Machine Translation (MT), their performance in low-resource languages (LRLs) still lags significantly behind Neural Machine Translation (NMT) models. In this work, we explore what it would take to adapt LLMs for the low-resource setting. Particularly, we re-examine the role of two factors: a) the importance and application of parallel data, and b) diversity in Supervised Fine-Tuning (SFT). Recently, parallel data has seen reduced use in adapting LLMs for MT, while data diversity has been embraced to promote transfer across languages and tasks. However, for low-resource LLM-MT, we show that the opposite is true for both considerations: a) parallel data is critical during both pre-training and SFT; b) diversity tends to cause interference instead of transfer. Our experiments with three LLMs across two low-resourced language groups -- Indigenous American and North-East Indian -- reveal consistent trends, underscoring the generalizability of our findings. We believe these insights will be valuable for scaling to massively multilingual LLM-MT models that can effectively serve LRLs.
CLSep 20, 2023
Towards Effective Disambiguation for Machine Translation with Large Language ModelsVivek Iyer, Pinzhen Chen, Alexandra Birch
Resolving semantic ambiguity has long been recognised as a central challenge in the field of Machine Translation. Recent work on benchmarking translation performance on ambiguous sentences has exposed the limitations of conventional Neural Machine Translation (NMT) systems, which fail to handle many such cases. Large language models (LLMs) have emerged as a promising alternative, demonstrating comparable performance to traditional NMT models while introducing new paradigms for controlling the target outputs. In this paper, we study the capabilities of LLMs to translate "ambiguous sentences" - i.e. those containing highly polysemous words and/or rare word senses. We also propose two ways to improve their disambiguation capabilities, through a) in-context learning and b) fine-tuning on carefully curated ambiguous datasets. Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions. Our research provides valuable insights into effectively adapting LLMs to become better disambiguators during Machine Translation. We release our curated disambiguation corpora and resources at https://data.statmt.org/ambiguous-europarl.
CLOct 21, 2023
Code-Switching with Word Senses for Pretraining in Neural Machine TranslationVivek Iyer, Edoardo Barba, Alexandra Birch et al.
Lexical ambiguity is a significant and pervasive challenge in Neural Machine Translation (NMT), with many state-of-the-art (SOTA) NMT systems struggling to handle polysemous words (Campolungo et al., 2022). The same holds for the NMT pretraining paradigm of denoising synthetic "code-switched" text (Pan et al., 2021; Iyer et al., 2023), where word senses are ignored in the noising stage -- leading to harmful sense biases in the pretraining data that are subsequently inherited by the resulting models. In this work, we introduce Word Sense Pretraining for Neural Machine Translation (WSP-NMT) - an end-to-end approach for pretraining multilingual NMT models leveraging word sense-specific information from Knowledge Bases. Our experiments show significant improvements in overall translation quality. Then, we show the robustness of our approach to scale to various challenging data and resource-scarce scenarios and, finally, report fine-grained accuracy improvements on the DiBiMT disambiguation benchmark. Our studies yield interesting and novel insights into the merits and challenges of integrating word sense information and structured knowledge in multilingual pretraining for NMT.
CLAug 24, 2024
Cultural Adaptation of Menus: A Fine-Grained ApproachZhonghe Zhang, Xiaoyu He, Vivek Iyer et al.
Machine Translation of Culture-Specific Items (CSIs) poses significant challenges. Recent work on CSI translation has shown some success using Large Language Models (LLMs) to adapt to different languages and cultures; however, a deeper analysis is needed to examine the benefits and pitfalls of each method. In this paper, we introduce the ChineseMenuCSI dataset, the largest for Chinese-English menu corpora, annotated with CSI vs Non-CSI labels and a fine-grained test set. We define three levels of CSI figurativeness for a more nuanced analysis and develop a novel methodology for automatic CSI identification, which outperforms GPT-based prompts in most categories. Importantly, we are the first to integrate human translation theories into LLM-driven translation processes, significantly improving translation accuracy, with COMET scores increasing by up to 7 points.
CLOct 20, 2022
The University of Edinburgh's Submission to the WMT22 Code-Mixing Shared Task (MixMT)Faheem Kirefu, Vivek Iyer, Pinzhen Chen et al.
The University of Edinburgh participated in the WMT22 shared task on code-mixed translation. This consists of two subtasks: i) generating code-mixed Hindi/English (Hinglish) text generation from parallel Hindi and English sentences and ii) machine translation from Hinglish to English. As both subtasks are considered low-resource, we focused our efforts on careful data generation and curation, especially the use of backtranslation from monolingual resources. For subtask 1 we explored the effects of constrained decoding on English and transliterated subwords in order to produce Hinglish. For subtask 2, we investigated different pretraining techniques, namely comparing simple initialisation from existing machine translation models and aligned augmentation. For both subtasks, we found that our baseline systems worked best. Our systems for both subtasks were one of the overall top-performing submissions.
CLDec 18, 2024Code
Towards Automatic Evaluation for Image TranscreationSimran Khanuja, Vivek Iyer, Claire He et al.
Beyond conventional paradigms of translating speech and text, recently, there has been interest in automated transcreation of images to facilitate localization of visual content across different cultures. Attempts to define this as a formal Machine Learning (ML) problem have been impeded by the lack of automatic evaluation mechanisms, with previous work relying solely on human evaluation. In this paper, we seek to close this gap by proposing a suite of automatic evaluation metrics inspired by machine translation (MT) metrics, categorized into: a) Object-based, b) Embedding-based, and c) VLM-based. Drawing on theories from translation studies and real-world transcreation practices, we identify three critical dimensions of image transcreation: cultural relevance, semantic equivalence and visual similarity, and design our metrics to evaluate systems along these axes. Our results show that proprietary VLMs best identify cultural relevance and semantic equivalence, while vision-encoder representations are adept at measuring visual similarity. Meta-evaluation across 7 countries shows our metrics agree strongly with human ratings, with average segment-level correlations ranging from 0.55-0.87. Finally, through a discussion of the merits and demerits of each metric, we offer a robust framework for automated image transcreation evaluation, grounded in both theoretical foundations and practical application. Our code can be found here: https://github.com/simran-khanuja/automatic-eval-img-transcreation.
CLFeb 8, 2021Code
VeeAlign: Multifaceted Context Representation using Dual Attention for Ontology AlignmentVivek Iyer, Arvind Agarwal, Harshit Kumar
Ontology Alignment is an important research problem applied to various fields such as data integration, data transfer, data preparation, etc. State-of-the-art (SOTA) Ontology Alignment systems typically use naive domain-dependent approaches with handcrafted rules or domain-specific architectures, making them unscalable and inefficient. In this work, we propose VeeAlign, a Deep Learning based model that uses a novel dual-attention mechanism to compute the contextualized representation of a concept which, in turn, is used to discover alignments. By doing this, not only is our approach able to exploit both syntactic and semantic information encoded in ontologies, it is also, by design, flexible and scalable to different domains with minimal effort. We evaluate our model on four different datasets from different domains and languages, and establish its superiority through these results as well as detailed ablation studies. The code and datasets used are available at https://github.com/Remorax/VeeAlign.
CLMar 29, 2025
XL-Suite: Cross-Lingual Synthetic Training and Evaluation Data for Open-Ended GenerationVivek Iyer, Pinzhen Chen, Ricardo Rei et al.
Cross-lingual open-ended generation - responding in a language different from that of the query - is an important yet understudied problem. This work proposes XL-Instruct, a novel technique for generating high-quality synthetic data, and introduces XL-AlpacaEval, a new benchmark for evaluating cross-lingual generation capabilities of large language models (LLMs). Our experiments show that fine-tuning with just 8K instructions generated using XL-Instruct significantly improves model performance, increasing the win rate against GPT-4o-Mini from 7.4% to 21.5% and improving on several fine-grained quality metrics. Moreover, base LLMs fine-tuned on XL-Instruct exhibit strong zero-shot improvements to question answering in the same language, as shown on our machine-translated m-AlpacaEval. These consistent gains highlight the promising role of XL-Instruct in the post-training of multilingual LLMs. Finally, we publicly release XL-Suite, a collection of training and evaluation data to facilitate research in cross-lingual open-ended generation.
CLJun 10, 2024
mHuBERT-147: A Compact Multilingual HuBERT ModelMarcely Zanon Boito, Vivek Iyer, Nikolaos Lagos et al.
We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment than the original method. We also apply a new multilingual batching up-sampling strategy, leveraging both language and dataset diversity. After 3 training iterations, our compact 95M parameter mHuBERT-147 outperforms larger models trained on substantially more data. We rank second and first on the ML-SUPERB 10min and 1h leaderboards, with SOTA scores for 3 tasks. Across ASR/LID tasks, our model consistently surpasses XLS-R (300M params; 436K hours) and demonstrates strong competitiveness against the much larger MMS (1B params; 491K hours). Our findings indicate that mHuBERT-147 is a promising model for multilingual speech tasks, offering an unprecedented balance between high performance and parameter efficiency.
CLDec 16, 2021
A Deep Learning Approach for Ontology Enrichment from Unstructured TextLalit Mohan Sanagavarapu, Vivek Iyer, Raghu Reddy
Information Security in the cyber world is a major cause for concern, with a significant increase in the number of attack surfaces. Existing information on vulnerabilities, attacks, controls, and advisories available on the web provides an opportunity to represent knowledge and perform security analytics to mitigate some of the concerns. Representing security knowledge in the form of ontology facilitates anomaly detection, threat intelligence, reasoning and relevance attribution of attacks, and many more. This necessitates dynamic and automated enrichment of information security ontologies. However, existing ontology enrichment algorithms based on natural language processing and ML models have issues with contextual extraction of concepts in words, phrases, and sentences. This motivates the need for sequential Deep Learning architectures that traverse through dependency paths in text and extract embedded vulnerabilities, threats, controls, products, and other security-related concepts and instances from learned path representations. In the proposed approach, Bidirectional LSTMs trained on a large DBpedia dataset and Wikipedia corpus of 2.8 GB along with Universal Sentence Encoder is deployed to enrich ISO 27001-based information security ontology. The model is trained and tested on a high-performance computing (HPC) environment to handle Wiki text dimensionality. The approach yielded a test accuracy of over 80% when tested with knocked-out concepts from ontology and web page instances to validate the robustness.
AIOct 16, 2020
Multifaceted Context Representation using Dual Attention for Ontology AlignmentVivek Iyer, Arvind Agarwal, Harshit Kumar
Ontology Alignment is an important research problem that finds application in various fields such as data integration, data transfer, data preparation etc. State-of-the-art (SOTA) architectures in Ontology Alignment typically use naive domain-dependent approaches with handcrafted rules and manually assigned values, making them unscalable and inefficient. Deep Learning approaches for ontology alignment use domain-specific architectures that are not only in-extensible to other datasets and domains, but also typically perform worse than rule-based approaches due to various limitations including over-fitting of models, sparsity of datasets etc. In this work, we propose VeeAlign, a Deep Learning based model that uses a dual-attention mechanism to compute the contextualized representation of a concept in order to learn alignments. By doing so, not only does our approach exploit both syntactic and semantic structure of ontologies, it is also, by design, flexible and scalable to different domains with minimal effort. We validate our approach on various datasets from different domains and in multilingual settings, and show its superior performance over SOTA methods.