Eduardo Sánchez

CL
h-index35
9papers
197citations
Novelty41%
AI Score36

9 Papers

CLSep 18, 2024
Linguini: A benchmark for language-agnostic linguistic reasoning

Eduardo Sánchez, Belen Alastruey, Christophe Ropers et al.

We propose a new benchmark to measure a language model's linguistic reasoning skills without relying on pre-existing language-specific knowledge. The test covers 894 questions grouped in 160 problems across 75 (mostly) extremely low-resource languages, extracted from the International Linguistic Olympiad corpus. To attain high accuracy on this benchmark, models don't need previous knowledge of the tested language, as all the information needed to solve the linguistic puzzle is presented in the context. We find that, while all analyzed models rank below 25% accuracy, there is a significant gap between open and closed models, with the best-performing proprietary model at 24.05% and the best-performing open model at 8.84%.

CLJul 23, 2024
Machine Translation Hallucination Detection for Low and High Resource Languages using Large Language Models

Kenza Benkirane, Laura Gongas, Shahar Pelles et al.

Recent advancements in massively multilingual machine translation systems have significantly enhanced translation accuracy; however, even the best performing systems still generate hallucinations, severely impacting user trust. Detecting hallucinations in Machine Translation (MT) remains a critical challenge, particularly since existing methods excel with High-Resource Languages (HRLs) but exhibit substantial limitations when applied to Low-Resource Languages (LRLs). This paper evaluates sentence-level hallucination detection approaches using Large Language Models (LLMs) and semantic similarity within massively multilingual embeddings. Our study spans 16 language directions, covering HRLs, LRLs, with diverse scripts. We find that the choice of model is essential for performance. On average, for HRLs, Llama3-70B outperforms the previous state of the art by as much as 0.16 MCC (Matthews Correlation Coefficient). However, for LRLs we observe that Claude Sonnet outperforms other LLMs on average by 0.03 MCC. The key takeaway from our study is that LLMs can achieve performance comparable or even better than previously proposed models, despite not being explicitly trained for any machine translation task. However, their advantage is less significant for LRLs.

CLSep 6, 2023
Gender-specific Machine Translation with Large Language Models

Eduardo Sánchez, Pierre Andrews, Pontus Stenetorp et al.

While machine translation (MT) systems have seen significant improvements, it is still common for translations to reflect societal biases, such as gender bias. Decoder-only Large Language Models (LLMs) have demonstrated potential in MT, albeit with performance slightly lagging behind traditional encoder-decoder Neural Machine Translation (NMT) systems. However, LLMs offer a unique advantage: the ability to control the properties of the output through prompts. In this study, we leverage this flexibility to explore LLaMa's capability to produce gender-specific translations. Our results indicate that LLaMa can generate gender-specific translations with translation accuracy and gender bias comparable to NLLB, a state-of-the-art multilingual NMT system. Furthermore, our experiments reveal that LLaMa's gender-specific translations rely on coreference resolution to determine gender, showing higher gender variance in gender-ambiguous datasets but maintaining consistency in less ambiguous contexts. This research investigates the potential and challenges of using LLMs for gender-specific translations as an instance of the controllability of outputs offered by LLMs.

CLDec 11, 2024
Large Concept Models: Language Modeling in a Sentence Representation Space

LCM team, Loïc Barrault, Paul-Ambroise Duquenne et al.

LLMs have revolutionized the field of artificial intelligence and have emerged as the de-facto tool for many tasks. The current established technology of LLMs is to process input and generate output at the token level. This is in sharp contrast to humans who operate at multiple levels of abstraction, well beyond single words, to analyze information and to generate creative content. In this paper, we present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a concept. Concepts are language- and modality-agnostic and represent a higher level idea or action in a flow. Hence, we build a "Large Concept Model". In this study, as proof of feasibility, we assume that a concept corresponds to a sentence, and use an existing sentence embedding space, SONAR, which supports up to 200 languages in both text and speech modalities. The Large Concept Model is trained to perform autoregressive sentence prediction in an embedding space. We explore multiple approaches, namely MSE regression, variants of diffusion-based generation, and models operating in a quantized SONAR space. These explorations are performed using 1.6B parameter models and training data in the order of 1.3T tokens. We then scale one architecture to a model size of 7B parameters and training data of about 2.7T tokens. We perform an experimental evaluation on several generative tasks, namely summarization and a new task of summary expansion. Finally, we show that our model exhibits impressive zero-shot generalization performance to many languages, outperforming existing LLMs of the same size. The training code of our models is freely available.

CLDec 11, 2024
LCFO: Long Context and Long Form Output Dataset and Benchmarking

Marta R. Costa-jussà, Pierre Andrews, Mariano Coria Meglioli et al.

This paper presents the Long Context and Form Output (LCFO) benchmark, a novel evaluation framework for assessing gradual summarization and summary expansion capabilities across diverse domains. LCFO consists of long input documents (5k words average length), each of which comes with three summaries of different lengths (20%, 10%, and 5% of the input text), as well as approximately 15 questions and answers (QA) related to the input content. Notably, LCFO also provides alignments between specific QA pairs and corresponding summaries in 7 domains. The primary motivation behind providing summaries of different lengths is to establish a controllable framework for generating long texts from shorter inputs, i.e. summary expansion. To establish an evaluation metric framework for summarization and summary expansion, we provide human evaluation scores for human-generated outputs, as well as results from various state-of-the-art large language models (LLMs). GPT-4o-mini achieves best human scores among automatic systems in both summarization and summary expansion tasks (~ +10% and +20%, respectively). It even surpasses human output quality in the case of short summaries (~ +7%). Overall automatic metrics achieve low correlations with human evaluation scores (~ 0.4) but moderate correlation on specific evaluation aspects such as fluency and attribution (~ 0.6).

CLFeb 6, 2025
BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation

The Omnilingual MT Team, Pierre Andrews, Mikel Artetxe et al.

BOUQuET is a multi-way, multicentric and multi-register/domain dataset and benchmark, and a broader collaborative initiative. This dataset is handcrafted in 8 non-English languages. Each of these source languages are representative of the most widely spoken ones and therefore they have the potential to serve as pivot languages that will enable more accurate translations. The dataset is multicentric to enforce representation of multilingual language features. In addition, the dataset goes beyond the sentence level, as it is organized in paragraphs of various lengths. Compared with related machine translation datasets, we show that BOUQuET has a broader representation of domains while simplifying the translation task for non-experts. Therefore, BOUQuET is specially suitable for crowd-source extension for which we are launching a call aiming at collecting a multi-way parallel corpus covering any written language.

CLDec 11, 2024
Y-NQ: English-Yorùbá Evaluation dataset for Open-Book Reading Comprehension and Text Generation

Marta R. Costa-jussà, Joy Chen, Ifeoluwanimi Adebara et al.

The purpose of this work is to share an English-Yorùbá evaluation dataset for open-book reading comprehension and text generation to assess the performance of models both in a high- and a low- resource language. The dataset contains 358 questions and answers on 338 English documents and 208 Yorùbá documents. The average document length is ~ 10k words for English and 430 words for Yorùbá. Experiments show a consistent disparity in performance between the two languages, with Yorùbá falling behind English for automatic metrics even if documents are much shorter for this language. For a small set of documents with comparable length, performance of Yorùbá drops by x2.5 times. When analyzing performance by length, we observe that Yorùbá decreases performance dramatically for documents that reach 1500 words while English performance is barely affected at that length. Our dataset opens the door to showcasing if English LLM reading comprehension capabilities extend to Yorùbá, which for the evaluated LLMs is not the case.

CLSep 17, 2025
Translate, then Detect: Leveraging Machine Translation for Cross-Lingual Toxicity Classification

Samuel J. Bell, Eduardo Sánchez, David Dale et al.

Multilingual toxicity detection remains a significant challenge due to the scarcity of training data and resources for many languages. While prior work has leveraged the translate-test paradigm to support cross-lingual transfer across a range of classification tasks, the utility of translation in supporting toxicity detection at scale remains unclear. In this work, we conduct a comprehensive comparison of translation-based and language-specific/multilingual classification pipelines. We find that translation-based pipelines consistently outperform out-of-distribution classifiers in 81.3% of cases (13 of 16 languages), with translation benefits strongly correlated with both the resource level of the target language and the quality of the machine translation (MT) system. Our analysis reveals that traditional classifiers outperform large language model (LLM) judges, with this advantage being particularly pronounced for low-resource languages, where translate-classify methods dominate translate-judge approaches in 6 out of 7 cases. We additionally show that MT-specific fine-tuning on LLMs yields lower refusal rates compared to standard instruction-tuned models, but it can negatively impact toxicity detection accuracy for low-resource languages. These findings offer actionable guidance for practitioners developing scalable multilingual content moderation systems.

CLNov 12, 2024
On the Role of Speech Data in Reducing Toxicity Detection Bias

Samuel J. Bell, Mariano Coria Meglioli, Megan Richards et al.

Text toxicity detection systems exhibit significant biases, producing disproportionate rates of false positives on samples mentioning demographic groups. But what about toxicity detection in speech? To investigate the extent to which text-based biases are mitigated by speech-based systems, we produce a set of high-quality group annotations for the multilingual MuTox dataset, and then leverage these annotations to systematically compare speech- and text-based toxicity classifiers. Our findings indicate that access to speech data during inference supports reduced bias against group mentions, particularly for ambiguous and disagreement-inducing samples. Our results also suggest that improving classifiers, rather than transcription pipelines, is more helpful for reducing group bias. We publicly release our annotations and provide recommendations for future toxicity dataset construction.