MILE: A Mutation Testing Framework of In-Context Learning Systems
This work addresses the need for better testing methods for ICL systems, which matters to researchers and practitioners who rely on these models. The contribution is incremental, however, because it builds on existing testing techniques from machine learning.
The paper tackles the problem of evaluating the reliability and quality of test data for in-context learning (ICL) systems built on large language models. Such systems suffer from black-box mechanisms and sensitivity to example selection. The authors propose a mutation testing framework with specialized mutation operators and mutation scores, and show its effectiveness through comprehensive experiments.
In-context learning (ICL) has achieved notable success in applications of large language models (LLMs). Given only a few input-output pairs that demonstrate a new task, an LLM can efficiently learn the task during inference without modifying the model parameters. This mysterious ability of LLMs has attracted great research interest in understanding, formatting, and improving in-context demonstrations, yet ICL still suffers from drawbacks such as black-box mechanisms and sensitivity to the selection of examples. In this work, inspired by the foundations of adopting testing techniques in machine learning (ML) systems, we propose a mutation testing framework designed to characterize the quality and effectiveness of test data for ICL systems. First, we propose several mutation operators specialized for ICL demonstrations, as well as corresponding mutation scores for ICL test sets. Then, with comprehensive experiments, we showcase the effectiveness of our framework in evaluating the reliability and quality of ICL test suites. Our code is available at https://github.com/weizeming/MILE.
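To make the idea of mutation operators and mutation scores concrete, the following is a minimal sketch and not the MILE implementation. The abstract does not specify the operators or the score definitions, so everything here is an assumption for illustration: the `flip_label` operator, the `toy_model` stand-in for an LLM (a word-overlap lookup), and the score itself, which uses the classical mutation-testing kill ratio (the fraction of mutants on which at least one test input changes the prediction).

```python
from typing import Callable, List, Tuple

Demo = Tuple[str, str]                 # one (input, output) demonstration
Prompt = List[Demo]                    # the in-context demonstrations
Model = Callable[[Prompt, str], str]   # hypothetical interface: (demos, query) -> prediction


def flip_label(demos: Prompt, index: int, labels: List[str]) -> Prompt:
    """Hypothetical operator: give one demonstration a different (wrong) label."""
    x, y = demos[index]
    wrong = next(label for label in labels if label != y)
    mutated = list(demos)              # copy so the original prompt is untouched
    mutated[index] = (x, wrong)
    return mutated


def toy_model(demos: Prompt, query: str) -> str:
    """Stand-in for an LLM: return the label of the demonstration sharing the most words."""
    words = set(query.split())
    best = max(demos, key=lambda d: len(words & set(d[0].split())))
    return best[1]


def mutation_score(model: Model, demos: Prompt,
                   mutants: List[Prompt], test_inputs: List[str]) -> float:
    """Assumed score: fraction of mutants 'killed', i.e. whose predictions on the
    test inputs differ from those obtained with the original demonstrations."""
    if not mutants:
        return 0.0
    baseline = [model(demos, x) for x in test_inputs]
    killed = sum(
        1 for mutant in mutants
        if [model(mutant, x) for x in test_inputs] != baseline
    )
    return killed / len(mutants)


labels = ["positive", "negative"]
demos = [("great movie", "positive"), ("terrible movie", "negative")]
mutants = [flip_label(demos, i, labels) for i in range(len(demos))]

weak_suite = ["great film"]                      # exercises only the first demonstration
strong_suite = ["great film", "terrible film"]   # exercises both demonstrations

print(mutation_score(toy_model, demos, mutants, weak_suite))    # 0.5
print(mutation_score(toy_model, demos, mutants, strong_suite))  # 1.0
```

In this toy setting the weaker test suite never touches the second demonstration, so the mutant that corrupts it survives and the score drops. This mirrors the intuition that a higher mutation score indicates a test set that probes the demonstrations more thoroughly.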