Filip Trhlik

CL
h-index32
4papers
188citations
Novelty48%
AI Score44

4 Papers

CLMay 13, 2024Code
RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors

Liam Dugan, Alyssa Hwang, Filip Trhlik et al.

Many commercial and open-source models claim to detect machine-generated text with extremely high accuracy (99% or more). However, very few of these detectors are evaluated on shared benchmark datasets and even when they are, the datasets used for evaluation are insufficiently challenging-lacking variations in sampling strategy, adversarial attacks, and open-source generative models. In this work we present RAID: the largest and most challenging benchmark dataset for machine-generated text detection. RAID includes over 6 million generations spanning 11 models, 8 domains, 11 adversarial attacks and 4 decoding strategies. Using RAID, we evaluate the out-of-domain and adversarial robustness of 8 open- and 4 closed-source detectors and find that current detectors are easily fooled by adversarial attacks, variations in sampling strategies, repetition penalties, and unseen generative models. We release our data along with a leaderboard to encourage future research.

CLJan 14
Bias Dynamics in BabyLMs: Towards a Compute-Efficient Sandbox for Democratising Pre-Training Debiasing

Filip Trhlik, Andrew Caines, Paula Buttery

Pre-trained language models (LMs) have, over the last few years, grown substantially in both societal adoption and training costs. This rapid growth in size has constrained progress in understanding and mitigating their biases. Since re-training LMs is prohibitively expensive, most debiasing work has focused on post-hoc or masking-based strategies, which often fail to address the underlying causes of bias. In this work, we seek to democratise pre-model debiasing research by using low-cost proxy models. Specifically, we investigate BabyLMs, compact BERT-like models trained on small and mutable corpora that can approximate bias acquisition and learning dynamics of larger models. We show that BabyLMs display closely aligned patterns of intrinsic bias formation and performance development compared to standard BERT models, despite their drastically reduced size. Furthermore, correlations between BabyLMs and BERT hold across multiple intra-model and post-model debiasing methods. Leveraging these similarities, we conduct pre-model debiasing experiments with BabyLMs, replicating prior findings and presenting new insights regarding the influence of gender imbalance and toxicity on bias formation. Our results demonstrate that BabyLMs can serve as an effective sandbox for large-scale LMs, reducing pre-training costs from over 500 GPU-hours to under 30 GPU-hours. This provides a way to democratise pre-model debiasing research and enables faster, more accessible exploration of methods for building fairer LMs.

AIJan 13, 2025
Rethinking AI Cultural Alignment

Michal Bravansky, Filip Trhlik, Fazl Barez

As general-purpose artificial intelligence (AI) systems become increasingly integrated with diverse human communities, cultural alignment has emerged as a crucial element in their deployment. Most existing approaches treat cultural alignment as one-directional, embedding predefined cultural values from standardized surveys and repositories into AI systems. To challenge this perspective, we highlight research showing that humans' cultural values must be understood within the context of specific AI systems. We then use a GPT-4o case study to demonstrate that AI systems' cultural alignment depends on how humans structure their interactions with the system. Drawing on these findings, we argue that cultural alignment should be reframed as a bidirectional process: rather than merely imposing standardized values on AIs, we should query the human cultural values most relevant to each AI-based system and align it to these values through interaction frameworks shaped by human users.

CLJun 16, 2024
Quantifying Generative Media Bias with a Corpus of Real-world and Generated News Articles

Filip Trhlik, Pontus Stenetorp

Large language models (LLMs) are increasingly being utilised across a range of tasks and domains, with a burgeoning interest in their application within the field of journalism. This trend raises concerns due to our limited understanding of LLM behaviour in this domain, especially with respect to political bias. Existing studies predominantly focus on LLMs undertaking political questionnaires, which offers only limited insights into their biases and operational nuances. To address this gap, our study establishes a new curated dataset that contains 2,100 human-written articles and utilises their descriptions to generate 56,700 synthetic articles using nine LLMs. This enables us to analyse shifts in properties between human-authored and machine-generated articles, with this study focusing on political bias, detecting it using both supervised models and LLMs. Our findings reveal significant disparities between base and instruction-tuned LLMs, with instruction-tuned models exhibiting consistent political bias. Furthermore, we are able to study how LLMs behave as classifiers, observing their display of political bias even in this role. Overall, for the first time within the journalistic domain, this study outlines a framework and provides a structured dataset for quantifiable experiments, serving as a foundation for further research into LLM political bias and its implications.