CL, LG · Jun 23, 2023

Bring Your Own Data! Self-Supervised Evaluation for Large Language Models

arXiv:2306.13651v2 · 27 citations · h-index: 72
Originality: Incremental advance
AI Analysis

This work addresses the need for reliable, scalable evaluation of LLMs in real-world deployments, such as client-facing chatbots, and offers a method that complements evaluation on labeled data.

The paper tackles the problem of evaluating large language models (LLMs) on realistic data by proposing a self-supervised framework that analyzes model sensitivity or invariance to input transformations, avoiding the drawbacks of small, labeled datasets. It demonstrates the approach on closed-book knowledge, toxicity, and long-range context dependence, among other properties, and finds strong correlations with human-labeled benchmarks where comparable ones exist.

With the rise of Large Language Models (LLMs) and their ubiquitous deployment in diverse domains, measuring language model behavior on realistic data is imperative. For example, a company deploying a client-facing chatbot must ensure that the model will not respond to client requests with profanity. Current evaluations approach this problem using small, domain-specific datasets with human-curated labels. These evaluation sets are often sampled from a narrow and simplified distribution, and data sources can unknowingly be leaked into the training set, which can lead to misleading evaluations. To bypass these drawbacks, we propose a framework for self-supervised evaluation of LLMs by analyzing their sensitivity or invariance to transformations on the input text. Self-supervised evaluation can directly monitor LLM behavior on datasets collected in the wild or streamed during live model deployment. We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence, in addition to sensitivity to grammatical structure and tokenization errors. When comparisons to similar human-labeled benchmarks are available, we find strong correlations between self-supervised and human-supervised evaluations. The self-supervised paradigm complements current evaluation strategies that rely on labeled data.
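
The core idea in the abstract, scoring a model by how much its output changes when the input text is transformed, can be sketched in a few lines. This is only an illustrative sketch and not the paper's exact metric: the unigram scorer is a toy stand-in for an LLM's log-perplexity, and the names `make_unigram_scorer`, `sensitivity`, `insert_negation`, and `identity` are hypothetical helpers introduced here.

```python
import math
from collections import Counter
from typing import Callable, Iterable, Sequence


def make_unigram_scorer(corpus: Iterable[str]) -> Callable[[str], float]:
    """Build a toy scorer: mean negative log-probability per token under an
    add-one-smoothed unigram model. ASSUMPTION: this stands in for a real
    LLM's log-perplexity so the example runs without any model download."""
    counts = Counter(tok for line in corpus for tok in line.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def score(text: str) -> float:
        tokens = text.lower().split()
        if not tokens:
            return 0.0
        nll = -sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
        return nll / len(tokens)

    return score


def sensitivity(
    texts: Sequence[str],
    transform: Callable[[str], str],
    score: Callable[[str], float],
) -> float:
    """Mean absolute change in score between each text and its transformed
    version. No human labels are involved: the transformation itself
    supplies the supervision signal."""
    if not texts:
        raise ValueError("texts must be non-empty")
    diffs = [abs(score(transform(t)) - score(t)) for t in texts]
    return sum(diffs) / len(diffs)


def insert_negation(text: str) -> str:
    """Hypothetical transformation: negate the first ' is ' in the text."""
    return text.replace(" is ", " is not ", 1)


def identity(text: str) -> str:
    """Baseline transformation that leaves the text unchanged."""
    return text


corpus = ["the sky is blue", "water is wet", "the sun is hot"]
score = make_unigram_scorer(corpus)

# The identity transform yields zero sensitivity; negation yields a positive one.
print("identity:", sensitivity(corpus, identity, score))
print("negation:", sensitivity(corpus, insert_negation, score))
```

In practice, `score` would wrap a real model's perplexity or output distribution, and `texts` could be any unlabeled corpus collected in the wild or streamed from a live deployment, which is the setting the abstract describes.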

Code Implementations: 1 repo
