CL | Apr 23, 2024

Identifying Fairness Issues in Automatically Generated Testing Content

Kevin Stowe, Benny Longwill, Alyssa Francis, Tatsuya Aoyama, Debanjan Ghosh, Swapna Somasundaran

arXiv:2404.15104v2 | 14.427 citations | h-index: 18 | BEA

Originality: Incremental advance

AI Analysis

This work addresses fairness problems in automatically generated standardized testing content, which matters for test validity and equity. The advance is incremental because it builds on existing bias-detection methods.

The study tackled fairness issues in automatically generated test content for a large-scale English proficiency test, identifying content that could unfairly impact scores, and found that combining prompt self-correction with few-shot learning achieved an F1 score of 0.79 on a held-out test set.

Natural language generation tools are powerful and effective for generating content. However, language models are known to display bias and fairness issues, making them impractical to deploy for many use cases. Here we focus on how fairness issues impact automatically generated test content, which can have stringent requirements to ensure the test measures only what it was intended to measure. Specifically, we review test content generated for a large-scale standardized English proficiency test with the goal of identifying content that only pertains to a certain subset of the test population as well as content that has the potential to be upsetting or distracting to some test takers. Issues like these could inadvertently impact a test taker's score and thus should be avoided. This kind of content does not reflect the more commonly acknowledged biases, making it challenging even for modern models that contain safeguards. We build a dataset of 601 generated texts annotated for fairness and explore a variety of methods for classification: fine-tuning, topic-based classification, and prompting, including few-shot and self-correcting prompts. We find that combining prompt self-correction and few-shot learning performs best, yielding an F1 score of 0.79 on our held-out test set, while much smaller BERT- and topic-based models have competitive performance on out-of-domain data.
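The combination of few-shot prompting and self-correction described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the example texts, the prompt wording, the `fair`/`unfair` label names, and the `llm` callable (any function mapping a prompt string to a response string) are all assumptions made for illustration. The F1 helper scores the `unfair` class only; the abstract does not say which F1 variant was reported.

```python
from typing import Callable, List, Tuple

# Hypothetical labelled examples; NOT taken from the paper's dataset.
FEW_SHOT: List[Tuple[str, str]] = [
    ("A paragraph about scheduling a team meeting.", "fair"),
    ("A paragraph describing a fatal car accident in detail.", "unfair"),
]


def build_prompt(text: str) -> str:
    """Few-shot prompt: instruction, labelled examples, then the query text."""
    lines = ["Label each text as 'fair' or 'unfair' for use in a language test.", ""]
    for example, label in FEW_SHOT:
        lines += [f"Text: {example}", f"Label: {label}", ""]
    lines += [f"Text: {text}", "Label:"]
    return "\n".join(lines)


def build_correction_prompt(text: str, first_label: str) -> str:
    """Self-correction prompt: show the first answer and ask for a review."""
    return (
        f"{build_prompt(text)} {first_label}\n\n"
        "Review the label above. Could the text upset or distract some test "
        "takers, or rely on knowledge only some of them have? "
        "Answer with the final label, 'fair' or 'unfair'."
    )


def parse_label(response: str) -> str:
    """Return 'unfair' if the response mentions it, otherwise default to 'fair'."""
    return "unfair" if "unfair" in response.lower() else "fair"


def classify(text: str, llm: Callable[[str], str]) -> str:
    """Two calls: a few-shot first pass, then a self-correction pass."""
    first = parse_label(llm(build_prompt(text)))
    return parse_label(llm(build_correction_prompt(text, first)))


def f1_score(gold: List[str], pred: List[str], positive: str = "unfair") -> float:
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

To use the sketch, pass any prompt-to-text function as `llm`, call `classify` on each generated text, and compare the predictions against human annotations with `f1_score`.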
