CLSep 14, 2023
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow InstructionsFederico Bianchi, Mirac Suzgun, Giuseppe Attanasio et al. · stanford
Training large language models to follow instructions makes them perform better on a wide range of tasks and generally become more helpful. However, a perfectly helpful model will follow even the most malicious instructions and readily generate harmful content. In this paper, we raise concerns over the safety of models that only emphasize helpfulness, not harmlessness, in their instruction-tuning. We show that several popular instruction-tuned models are highly unsafe. Moreover, we show that adding just 3% safety examples (a few hundred demonstrations) when fine-tuning a model like LLaMA can substantially improve its safety. Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks. However, we do find exaggerated safety behaviours, where too much safety-tuning makes models refuse perfectly safe prompts if they superficially resemble unsafe ones. As a whole, our results illustrate trade-offs in training LLMs to be helpful and training them to be safe.
CLAug 2, 2023
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language ModelsPaul Röttger, Hannah Rose Kirk, Bertie Vidgen et al. · oxford, stanford
Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This risk motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and thus not be helpful. Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or mention sensitive topics. In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way. XSTest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models, for most applications, should refuse. We describe XSTest's creation and composition, and then use the test suite to highlight systematic failure modes in state-of-the-art language models as well as more general challenges in building safer language models.
IRApr 8, 2022
Contrastive language and vision learning of general fashion conceptsPatrick John Chia, Giuseppe Attanasio, Federico Bianchi et al. · stanford
The steady rise of online shopping goes hand in hand with the development of increasingly complex ML and NLP models. While most use cases are cast as specialized supervised learning problems, we argue that practitioners would greatly benefit from more transferable representations of products. In this work, we build on recent developments in contrastive learning to train FashionCLIP, a CLIP-like model for the fashion industry. We showcase its capabilities for retrieval, classification and grounding, and release our model and code to the community.
90.8CLJun 4
Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication ScenariosGiuseppe Attanasio, Beatrice Savoldi, Daniel Chechelnitsky et al.
Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users' communication needs. We introduce Ouvia, an evaluation framework for measuring user-perceived usability of speech translation outputs in real-world settings. Ouvia focuses on one-to-one communication: an English speaker needs to convey a request to a Portuguese speaker, and the message is automatically translated. Through a custom web app and multi-phase study design, we collect more than 1,750 such interactions in healthcare and everyday situations, mediated by four ST systems, involving speakers from three English dialects and two genders. We find that modern ST serves people only to a limited extent -- only around half of interactions are rated as usable -- with significant gaps in reported usability across demographic groups. Moreover, among quality metrics, we find that QA-based evaluation is a substantially stronger predictor of real-world usability than standard approaches. Together, these findings stress the importance of situated, user-centered evaluation frameworks that go beyond holistic quality scores and attend to who the technology serves -- and how well.
CLOct 13, 2022
Is It Worth the (Environmental) Cost? Limited Evidence for Temporal Adaptation via Continuous TrainingGiuseppe Attanasio, Debora Nozza, Federico Bianchi et al. · stanford
Language is constantly changing and evolving, leaving language models to become quickly outdated. Consequently, we should continuously update our models with new data to expose them to new events and facts. However, that requires additional computing, which means new carbon emissions. Do any measurable benefits justify this cost? This paper looks for empirical evidence to support continuous training. We reproduce existing benchmarks and extend them to include additional time periods, models, and tasks. Our results show that the downstream task performance of temporally adapted English models for social media data do not improve over time. Pretrained models without temporal adaptation are actually significantly more effective and efficient. However, we also note a lack of suitable temporal benchmarks. Our findings invite a critical reflection on when and how to temporally adapt language models, accounting for sustainability.
CLMar 17, 2022
Entropy-based Attention Regularization Frees Unintended Bias Mitigation from ListsGiuseppe Attanasio, Debora Nozza, Dirk Hovy et al.
Natural Language Processing (NLP) models risk overfitting to specific terms in the training data, thereby reducing their performance, fairness, and generalizability. E.g., neural hate speech detection models are strongly influenced by identity terms like gay, or women, resulting in false positives, severe unintended bias, and lower performance. Most mitigation techniques use lists of identity terms or samples from the target domain during training. However, this approach requires a-priori knowledge and introduces further bias if important terms are neglected. Instead, we propose a knowledge-free Entropy-based Attention Regularization (EAR) to discourage overfitting to training-specific terms. An additional objective function penalizes tokens with low self-attention entropy. We fine-tune BERT via EAR: the resulting model matches or exceeds state-of-the-art performance for hate speech classification and bias metrics on three benchmark corpora in English and Italian. EAR also reveals overfitting terms, i.e., terms most likely to induce bias, to help identify their effect on the model, task, and predictions.
CLJun 14, 2023
ITALIC: An Italian Intent Classification DatasetAlkis Koudounas, Moreno La Quatra, Lorenzo Vaiani et al.
Recent large-scale Spoken Language Understanding datasets focus predominantly on English and do not account for language-specific phenomena such as particular phonemes or words in different lects. We introduce ITALIC, the first large-scale speech dataset designed for intent classification in Italian. The dataset comprises 16,521 crowdsourced audio samples recorded by 70 speakers from various Italian regions and annotated with intent labels and additional metadata. We explore the versatility of ITALIC by evaluating current state-of-the-art speech and text models. Results on intent classification suggest that increasing scale and running language adaptation yield better speech models, monolingual text models outscore multilingual ones, and that speech recognition on ITALIC is more challenging than on existing Italian benchmarks. We release both the dataset and the annotation scheme to streamline the development of new Italian SLU models and language-specific datasets.
CLSep 14, 2023
Explaining Speech Classification Models via Word-Level Audio Segments and Paralinguistic FeaturesEliana Pastor, Alkis Koudounas, Giuseppe Attanasio et al.
Recent advances in eXplainable AI (XAI) have provided new insights into how models for vision, language, and tabular data operate. However, few approaches exist for understanding speech models. Existing work focuses on a few spoken language understanding (SLU) tasks, and explanations are difficult to interpret for most users. We introduce a new approach to explain speech classification models. We generate easy-to-interpret explanations via input perturbation on two information levels. 1) Word-level explanations reveal how each word-related audio segment impacts the outcome. 2) Paralinguistic features (e.g., prosody and background noise) answer the counterfactual: ``What would the model prediction be if we edited the audio signal in this way?'' We validate our approach by explaining two state-of-the-art SLU models on two speech classification tasks in English and Italian. Our findings demonstrate that the explanations are faithful to the model's inner workings and plausible to humans. Our method and findings pave the way for future research on interpreting speech models.
CLOct 18, 2023
A Tale of Pronouns: Interpretability Informs Gender Bias Mitigation for Fairer Instruction-Tuned Machine TranslationGiuseppe Attanasio, Flor Miriam Plaza-del-Arco, Debora Nozza et al.
Recent instruction fine-tuned models can solve multiple NLP tasks when prompted to do so, with machine translation (MT) being a prominent use case. However, current research often focuses on standard performance benchmarks, leaving compelling fairness and ethical considerations behind. In MT, this might lead to misgendered translations, resulting, among other harms, in the perpetuation of stereotypes and prejudices. In this work, we address this gap by investigating whether and to what extent such models exhibit gender bias in machine translation and how we can mitigate it. Concretely, we compute established gender bias metrics on the WinoMT corpus from English to German and Spanish. We discover that IFT models default to male-inflected translations, even disregarding female occupational stereotypes. Next, using interpretability methods, we unveil that models systematically overlook the pronoun indicating the gender of a target occupation in misgendered translations. Finally, based on this finding, we propose an easy-to-implement and effective bias mitigation solution based on few-shot learning that leads to significantly fairer translations.
CLAug 2, 2022
ferret: a Framework for Benchmarking Explainers on TransformersGiuseppe Attanasio, Eliana Pastor, Chiara Di Bonaventura et al.
As Transformers are increasingly relied upon to solve complex NLP problems, there is an increased need for their decisions to be humanly interpretable. While several explainable AI (XAI) techniques for interpreting the outputs of transformer-based models have been proposed, there is still a lack of easy access to using and comparing them. We introduce ferret, a Python library to simplify the use and comparisons of XAI methods on transformer-based classifiers. With ferret, users can visualize and compare transformers-based models output explanations using state-of-the-art XAI methods on any free-text or existing XAI corpora. Moreover, users can also evaluate ad-hoc XAI metrics to select the most faithful and plausible explanations. To align with the recently consolidated process of sharing and using transformers-based models from Hugging Face, ferret interfaces directly with its Python library. In this paper, we showcase ferret to benchmark XAI methods used on transformers for sentiment analysis and hate speech detection. We show how specific methods provide consistently better explanations and are preferable in the context of transformer models.
CLSep 5, 2023
Weigh Your Own Words: Improving Hate Speech Counter Narrative Generation via Attention RegularizationHelena Bonaldi, Giuseppe Attanasio, Debora Nozza et al.
Recent computational approaches for combating online hate speech involve the automatic generation of counter narratives by adapting Pretrained Transformer-based Language Models (PLMs) with human-curated data. This process, however, can produce in-domain overfitting, resulting in models generating acceptable narratives only for hatred similar to training data, with little portability to other targets or to real-world toxic language. This paper introduces novel attention regularization methodologies to improve the generalization capabilities of PLMs for counter narratives generation. Overfitting to training-specific terms is then discouraged, resulting in more diverse and richer narratives. We experiment with two attention-based regularization techniques on a benchmark English dataset. Regularized models produce better counter narratives than state-of-the-art approaches in most cases, both in terms of automatic metrics and human evaluation, especially when hateful targets are not present in the training data. This work paves the way for better and more flexible counter-speech generation models, a task for which datasets are highly challenging to produce.
CLDec 4, 2025
Challenging the Abilities of Large Language Models in Italian: a Community InitiativeMalvina Nissim, Danilo Croce, Viviana Patti et al.
The rapid progress of Large Language Models (LLMs) has transformed natural language processing and broadened its impact across research and society. Yet, systematic evaluation of these models, especially for languages beyond English, remains limited. "Challenging the Abilities of LAnguage Models in ITAlian" (CALAMITA) is a large-scale collaborative benchmarking initiative for Italian, coordinated under the Italian Association for Computational Linguistics. Unlike existing efforts that focus on leaderboards, CALAMITA foregrounds methodology: it federates more than 80 contributors from academia, industry, and the public sector to design, document, and evaluate a diverse collection of tasks, covering linguistic competence, commonsense reasoning, factual consistency, fairness, summarization, translation, and code generation. Through this process, we not only assembled a benchmark of over 20 tasks and almost 100 subtasks, but also established a centralized evaluation pipeline that supports heterogeneous datasets and metrics. We report results for four open-weight LLMs, highlighting systematic strengths and weaknesses across abilities, as well as challenges in task-specific evaluation. Beyond quantitative results, CALAMITA exposes methodological lessons: the necessity of fine-grained, task-representative metrics, the importance of harmonized pipelines, and the benefits and limitations of broad community engagement. CALAMITA is conceived as a rolling benchmark, enabling continuous integration of new tasks and models. This makes it both a resource -- the most comprehensive and diverse benchmark for Italian to date -- and a framework for sustainable, community-driven evaluation. We argue that this combination offers a blueprint for other languages and communities seeking inclusive and rigorous LLM evaluation practices.
CLDec 3, 2025
Generative AI Practices, Literacy, and Divides: An Empirical Analysis in the Italian ContextBeatrice Savoldi, Giuseppe Attanasio, Olga Gorodetskaya et al.
The rise of Artificial Intelligence (AI) language technologies, particularly generative AI (GenAI) chatbots accessible via conversational interfaces, is transforming digital interactions. While these tools hold societal promise, they also risk widening digital divides due to uneven adoption and low awareness of their limitations. This study presents the first comprehensive empirical mapping of GenAI adoption, usage patterns, and literacy in Italy, based on newly collected survey data from 1,906 Italian-speaking adults. Our findings reveal widespread adoption for both work and personal use, including sensitive tasks like emotional support and medical advice. Crucially, GenAI is supplanting other technologies to become a primary information source: this trend persists despite low user digital literacy, posing a risk as users struggle to recognize errors or misinformation. Moreover, we identify a significant gender divide -- particularly pronounced in older generations -- where women are half as likely to adopt GenAI and use it less frequently than men. While we find literacy to be a key predictor of adoption, it only partially explains this disparity, suggesting that other barriers are at play. Overall, our data provide granular insights into the multipurpose usage of GenAI, highlighting the dual need for targeted educational initiatives and further investigation into the underlying barriers to equitable participation that competence alone cannot explain.
CLFeb 28, 2024Code
Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance GapsGiuseppe Attanasio, Beatrice Savoldi, Dennis Fucci et al.
Current automatic speech recognition (ASR) models are designed to be used across many languages and tasks without substantial changes. However, this broad language coverage hides performance gaps within languages, for example, across genders. Our study systematically evaluates the performance of two widely used multilingual ASR models on three datasets, encompassing 19 languages from eight language families and two speaking conditions. Our findings reveal clear gender disparities, with the advantaged group varying across languages and models. Surprisingly, those gaps are not explained by acoustic or lexical properties. However, probing internal model states reveals a correlation with gendered performance gap. That is, the easier it is to distinguish speaker gender in a language using probes, the more the gap reduces, favoring female speakers. Our results show that gender disparities persist even in state-of-the-art models. Our findings have implications for the improvement of multilingual ASR systems, underscoring the importance of accessibility to training data and nuanced evaluation to predict and mitigate gender gaps. We release all code and artifacts at https://github.com/g8a9/multilingual-asr-gender-gap.
89.0CLMar 27
AMALIA Technical Report: A Fully Open Source Large Language Model for European PortugueseAfonso Simplício, Gonçalo Vinagre, Miguel Moura Ramos et al.
Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, with machine-translated benchmarks likely missing the variant's linguistic and cultural nuances. We introduce AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages. To evaluate pt-PT more faithfully, we release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.
CLJun 10, 2024Code
Building Bridges: A Dataset for Evaluating Gender-Fair Machine Translation into GermanManuel Lardelli, Giuseppe Attanasio, Anne Lauscher
The translation of gender-neutral person-referring terms (e.g., the students) is often non-trivial. Translating from English into German poses an interesting case -- in German, person-referring nouns are usually gender-specific, and if the gender of the referent(s) is unknown or diverse, the generic masculine (die Studenten (m.)) is commonly used. This solution, however, reduces the visibility of other genders, such as women and non-binary people. To counteract gender discrimination, a societal movement towards using gender-fair language exists (e.g., by adopting neosystems). However, gender-fair German is currently barely supported in machine translation (MT), requiring post-editing or manual translations. We address this research gap by studying gender-fair language in English-to-German MT. Concretely, we enrich a community-created gender-fair language dictionary and sample multi-sentence test instances from encyclopedic text and parliamentary speeches. Using these novel resources, we conduct the first benchmark study involving two commercial systems and six neural MT models for translating words in isolation and natural contexts across two domains. Our findings show that most systems produce mainly masculine forms and rarely gender-neutral variants, highlighting the need for future research. We release code and data at https://github.com/g8a9/building-bridges-gender-fair-german-mt.
89.0CLMay 7
SEQUOR: A Multi-Turn Benchmark for Realistic Constraint FollowingBeatriz Canaverde, Duarte M. Alves, José Pombal et al.
In a conversation, a helpful assistant must reliably follow user directives, even as they refine, modify, or contradict earlier requests. Yet most instruction-following benchmarks focus on single-turn or short multi-turn scenarios, leaving open how well models handle long-horizon instruction-following tasks. To bridge this gap, we present SEQUOR, an automatic benchmark for evaluating constraint adherence in long multi-turn conversations. SEQUOR consists of simulated persona-driven interactions built with constraints extracted from real-world conversations. Our results show that even when following a single constraint, instruction-following accuracy consistently decreases as the conversation grows longer, with drops exceeding 11%. This decline becomes larger when models have to follow multiple constraints simultaneously, reducing their accuracy by over 40%. In scenarios where constraints are added or replaced at arbitrary points of the conversation, model accuracy decreases by more than 9%. Taken together, our results reveal that current models still struggle to follow user instructions in multi-turn conversations, and provide a way for better measuring instruction-following capabilities in assistants.
CLMar 7, 2024
Classist Tools: Social Class Correlates with Performance in NLPAmanda Cercas Curry, Giuseppe Attanasio, Zeerak Talat et al.
Since the foundational work of William Labov on the social stratification of language (Labov, 1964), linguistics has made concentrated efforts to explore the links between sociodemographic characteristics and language production and perception. But while there is strong evidence for socio-demographic characteristics in language, they are infrequently used in Natural Language Processing (NLP). Age and gender are somewhat well represented, but Labov's original target, socioeconomic status, is noticeably absent. And yet it matters. We show empirically that NLP disadvantages less-privileged socioeconomic groups. We annotate a corpus of 95K utterances from movies with social class, ethnicity and geographical language variety and measure the performance of NLP systems on three tasks: language modelling, automatic speech recognition, and grammar error correction. We find significant performance disparities that can be attributed to socioeconomic status as well as ethnicity and geographical differences. With NLP technologies becoming ever more ubiquitous and quotidian, they must accommodate all language varieties to avoid disadvantaging already marginalised groups. We argue for the inclusion of socioeconomic class in future language technologies.
CLOct 14, 2024
Watching the Watchers: Exposing Gender Disparities in Machine Translation Quality EstimationEmmanouil Zaranis, Giuseppe Attanasio, Sweta Agrawal et al.
Quality estimation (QE)-the automatic assessment of translation quality-has recently become crucial across several stages of the translation pipeline, from data curation to training and decoding. While QE metrics have been optimized to align with human judgments, whether they encode social biases has been largely overlooked. Biased QE risks favoring certain demographic groups over others, e.g., by exacerbating gaps in visibility and usability. This paper defines and investigates gender bias of QE metrics and discusses its downstream implications for machine translation (MT). Experiments with state-of-the-art QE metrics across multiple domains, datasets, and languages reveal significant bias. When a human entity's gender in the source is undisclosed, masculine-inflected translations score higher than feminine-inflected ones, and gender-neutral translations are penalized. Even when contextual cues disambiguate gender, using context-aware QE metrics leads to more errors in selecting the correct translation inflection for feminine referents than for masculine ones. Moreover, a biased QE metric affects data filtering and quality-aware decoding. Our findings underscore the need for a renewed focus on developing and evaluating QE metrics centered on gender.
CLJan 17, 2025
MSTS: A Multimodal Safety Test Suite for Vision-Language ModelsPaul Röttger, Giuseppe Attanasio, Felix Friedrich et al.
Vision-language models (VLMs), which process image and text inputs, are increasingly integrated into chat assistants and other consumer AI applications. Without proper safeguards, however, VLMs may give harmful advice (e.g. how to self-harm) or encourage unsafe behaviours (e.g. to consume drugs). Despite these clear hazards, little work so far has evaluated VLM safety and the novel risks created by multimodal inputs. To address this gap, we introduce MSTS, a Multimodal Safety Test Suite for VLMs. MSTS comprises 400 test prompts across 40 fine-grained hazard categories. Each test prompt consists of a text and an image that only in combination reveal their full unsafe meaning. With MSTS, we find clear safety issues in several open VLMs. We also find some VLMs to be safe by accident, meaning that they are safe because they fail to understand even simple test prompts. We translate MSTS into ten languages, showing non-English prompts to increase the rate of unsafe model responses. We also show models to be safer when tested with text only rather than multimodal prompts. Finally, we explore the automation of VLM safety assessments, finding even the best safety classifiers to be lacking.
CLJan 16, 2025
Mind the Inclusivity Gap: Multilingual Gender-Neutral Translation Evaluation with mGeNTEBeatrice Savoldi, Giuseppe Attanasio, Eleonora Cupin et al.
Avoiding the propagation of undue (binary) gender inferences and default masculine language remains a key challenge towards inclusive multilingual technologies, particularly when translating into languages with extensive gendered morphology. Gender-neutral translation (GNT) represents a linguistic strategy towards fairer communication across languages. However, research on GNT is limited to a few resources and language pairs. To address this gap, we introduce mGeNTE, an expert-curated resource, and use it to conduct the first systematic multilingual evaluation of inclusive translation with state-of-the-art instruction-following language models (LMs). Experiments on en-es/de/it/el reveal that while models can recognize when neutrality is appropriate, they cannot consistently produce neutral translations, limiting their usability. To probe this behavior, we enrich our evaluation with interpretability analyses that identify task-relevant features and offer initial insights into the internal dynamics of LM-based GNT.
75.9CYApr 1
Translating With Feeling: Centering Translator Perspectives within Translation TechnologiesDaniel Chechelnitsky, Sireesh Gururaja, Seyi Olojo et al.
Rapid development of Large Language Models (LLMs) and similar automated approaches for translation tasks is increasingly affecting the landscape of translation technologies. As concerns about the outsourcing of translator work to these automated translation tools grow, it becomes increasingly crucial to gather insights from the translation community directly. To this end, we conduct an interview study with 19 professional translators working across 11 languages and 11 domains to understand their perspectives, experiences, and concerns with using translation technologies in their work. We find that translators are cautious when incorporating new tools into their workflow, with several expressing concerns machine translation (MT) and LLMs are infringing on the necessary human aspects and verification steps of translation, worried that these tools have potential for harmful downstream effects due to compromising the human aspect of translation work. These findings demonstrate the need to develop translation technologies that directly serve translators' needs rather than replacing human translation. This can be done by focusing more on the assistive, rather than the automating aspects of these tools.
CLOct 8, 2025
GAMBIT+: A Challenge Set for Evaluating Gender Bias in Machine Translation Quality Estimation MetricsGiorgos Filandrianos, Orfeas Menis Mastromichalakis, Wafaa Mohammed et al.
Gender bias in machine translation (MT) systems has been extensively documented, but bias in automatic quality estimation (QE) metrics remains comparatively underexplored. Existing studies suggest that QE metrics can also exhibit gender bias, yet most analyses are limited by small datasets, narrow occupational coverage, and restricted language variety. To address this gap, we introduce a large-scale challenge set specifically designed to probe the behavior of QE metrics when evaluating translations containing gender-ambiguous occupational terms. Building on the GAMBIT corpus of English texts with gender-ambiguous occupations, we extend coverage to three source languages that are genderless or natural-gendered, and eleven target languages with grammatical gender, resulting in 33 source-target language pairs. Each source text is paired with two target versions differing only in the grammatical gender of the occupational term(s) (masculine vs. feminine), with all dependent grammatical elements adjusted accordingly. An unbiased QE metric should assign equal or near-equal scores to both versions. The dataset's scale, breadth, and fully parallel design, where the same set of texts is aligned across all languages, enables fine-grained bias analysis by occupation and systematic comparisons across languages.
CLJun 20, 2025
Instituto de Telecomunicações at IWSLT 2025: Aligning Small-Scale Speech and Language Models for Speech-to-Text LearningGiuseppe Attanasio, Sonal Sannigrahi, Ben Peters et al.
This paper presents the IT-IST submission to the IWSLT 2025 Shared Task on Instruction Following Speech Processing. We submit results for the Short Track, i.e., speech recognition, translation, and spoken question answering. Our model is a unified speech-to-text model that integrates a pre-trained continuous speech encoder and text decoder through a first phase of modality alignment and a second phase of instruction fine-tuning. Crucially, we focus on using small-scale language model backbones (< 2B) and restrict to high-quality, CC-BY data along with synthetic data generation to supplement existing resources.
CLJun 2, 2025
Different Speech Translation Models Encode and Translate Speaker Gender DifferentlyDennis Fucci, Marco Gaido, Matteo Negri et al.
Recent studies on interpreting the hidden states of speech models have shown their ability to capture speaker-specific features, including gender. Does this finding also hold for speech translation (ST) models? If so, what are the implications for the speaker's gender assignment in translation? We address these questions from an interpretability perspective, using probing methods to assess gender encoding across diverse ST models. Results on three language directions (English-French/Italian/Spanish) indicate that while traditional encoder-decoder models capture gender information, newer architectures -- integrating a speech encoder with a machine translation system via adapters -- do not. We also demonstrate that low gender encoding capabilities result in systems' tendency toward a masculine default, a translation bias that is more pronounced in newer architectures.
CLAug 19, 2021
Contrastive Language-Image Pre-training for the Italian LanguageFederico Bianchi, Giuseppe Attanasio, Raphael Pisoni et al.
CLIP (Contrastive Language-Image Pre-training) is a very recent multi-modal model that jointly learns representations of images and texts. The model is trained on a massive amount of English data and shows impressive performance on zero-shot classification tasks. Training the same model on a different language is not trivial, since data in other languages might be not enough and the model needs high-quality translations of the texts to guarantee a good performance. In this paper, we present the first CLIP model for the Italian Language (CLIP-Italian), trained on more than 1.4 million image-text pairs. Results show that CLIP-Italian outperforms the multilingual CLIP model on the tasks of image retrieval and zero-shot classification.