LeCov: Multi-level Testing Criteria for Large Language Models
This work addresses the problem of ensuring LLM trustworthiness, a concern for both developers and users. Its contribution is incremental: it builds on existing testing methods by formalizing testing criteria.
The paper tackles the lack of systematic testing criteria for large language models (LLMs) by proposing LeCov, a set of multi-level criteria based on three internal components: the attention mechanism, feed-forward neurons, and uncertainty. Experiments on three models and four datasets indicate that the criteria are useful for test prioritization and coverage-guided testing.
Large Language Models (LLMs) are widely used in many different domains, but because of their limited interpretability, there are questions about how trustworthy they are from various perspectives, e.g., truthfulness and toxicity. Recent research has started developing testing methods for LLMs, aiming to uncover trustworthiness issues, i.e., defects, before deployment. However, systematic and formalized testing criteria are lacking, which hinders a comprehensive assessment of the extent and adequacy of testing exploration. To mitigate this threat, we propose a set of multi-level testing criteria, LeCov, for LLMs. The criteria consider three crucial LLM internal components, i.e., the attention mechanism, feed-forward neurons, and uncertainty, and comprise nine types of testing criteria in total. We apply the criteria in two scenarios: test prioritization and coverage-guided testing. The experimental evaluation, on three models and four datasets, demonstrates the usefulness and effectiveness of LeCov.
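To make the test prioritization scenario more concrete, the following is a minimal sketch of coverage-based prioritization in general. It does not reproduce LeCov's nine criteria, which the abstract does not define. The threshold-based notion of a "covered" unit, the function names `covered_units` and `prioritize`, and the toy activation vectors are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def covered_units(activations: List[float], threshold: float = 0.5) -> Set[int]:
    """Return indices of units whose activation exceeds the threshold.

    Assumption: a unit counts as "covered" when its activation is above a
    fixed threshold. This is a generic coverage notion, not a LeCov criterion.
    """
    return {i for i, a in enumerate(activations) if a > threshold}


def prioritize(tests: Dict[str, List[float]], threshold: float = 0.5) -> List[str]:
    """Order tests greedily by how many not-yet-covered units each one adds."""
    remaining = {name: covered_units(acts, threshold) for name, acts in tests.items()}
    covered: Set[int] = set()
    order: List[str] = []
    while remaining:
        # Iterate names in sorted order so ties are broken deterministically.
        best = max(sorted(remaining), key=lambda n: len(remaining[n] - covered))
        covered |= remaining.pop(best)
        order.append(best)
    return order


if __name__ == "__main__":
    # Hypothetical per-test activation vectors for four units.
    toy_tests = {
        "a": [0.9, 0.1, 0.1, 0.1],
        "b": [0.9, 0.8, 0.7, 0.1],
        "c": [0.1, 0.1, 0.1, 0.9],
    }
    print(prioritize(toy_tests))
```

In this sketch, tests that exercise previously unseen units are run first, on the assumption that they are more likely to reveal new defects. A coverage-guided testing loop could reuse the same coverage signal to decide which mutated inputs to keep.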