CL · Mar 1, 2023

How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks

arXiv:2303.00293v1 · 112 citations · h-index: 70
Originality: Synthesis-oriented
AI Analysis

This work identifies key robustness challenges for GPT-3.5. Identifying them is crucial for improving the model's stability and trustworthiness in real-world AI applications, though the contribution is incremental because it builds on existing evaluation methods.

The study assessed GPT-3.5's robustness on 21 datasets across 9 NLU tasks using 66 text transformations. It found that while GPT-3.5 outperforms fine-tuned models on some tasks, its performance degrades significantly on transformed inputs: by up to 35.74% in natural language inference and up to 43.59% in sentiment analysis.

The GPT-3.5 models have demonstrated impressive performance in various Natural Language Processing (NLP) tasks, showcasing their strong understanding and reasoning capabilities. However, their robustness and abilities to handle various complexities of the open world have yet to be explored, which is especially crucial in assessing the stability of models and is a key aspect of trustworthy AI. In this study, we perform a comprehensive experimental analysis of GPT-3.5, exploring its robustness using 21 datasets (about 116K test samples) with 66 text transformations from TextFlint that cover 9 popular Natural Language Understanding (NLU) tasks. Our findings indicate that while GPT-3.5 outperforms existing fine-tuned models on some tasks, it still encounters significant robustness degradation, such as its average performance dropping by up to 35.74% and 43.59% in natural language inference and sentiment analysis tasks, respectively. We also show that GPT-3.5 faces some specific robustness challenges, including robustness instability, prompt sensitivity, and number sensitivity. These insights are valuable for understanding its limitations and guiding future research in addressing these challenges to enhance GPT-3.5's overall performance and generalization abilities.
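
To make the idea of "robustness degradation" concrete, here is a minimal sketch of how a performance drop under a text transformation could be measured. It does not use TextFlint or GPT-3.5: `toy_transform`, `toy_model`, and the four examples are hypothetical stand-ins invented for illustration. The abstract does not say whether the reported drops are absolute or relative, so the sketch computes both.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input text, gold label)


def accuracy(model: Callable[[str], str], data: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if not data:
        raise ValueError("data must not be empty")
    correct = sum(1 for text, gold in data if model(text) == gold)
    return correct / len(data)


def performance_drop(original: float, transformed: float) -> Tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = (original - transformed) * 100
    relative = (original - transformed) / original * 100 if original else 0.0
    return absolute, relative


# Hypothetical stand-ins (assumptions, not the paper's setup):
def toy_transform(text: str) -> str:
    """Toy label-preserving perturbation: rephrase 'good' as 'not bad'."""
    return text.replace("good", "not bad")


def toy_model(text: str) -> str:
    """Toy keyword classifier that the perturbation above can fool."""
    return "positive" if "good" in text else "negative"


original_data = [
    ("a good film", "positive"),
    ("good acting throughout", "positive"),
    ("dull and slow", "negative"),
    ("a waste of time", "negative"),
]
# Apply the transformation to the inputs; gold labels stay unchanged.
transformed_data = [(toy_transform(text), gold) for text, gold in original_data]

acc_orig = accuracy(toy_model, original_data)
acc_trans = accuracy(toy_model, transformed_data)
abs_drop, rel_drop = performance_drop(acc_orig, acc_trans)

print(f"original accuracy={acc_orig:.2f}, transformed accuracy={acc_trans:.2f}")
print(f"absolute drop={abs_drop:.1f} points, relative drop={rel_drop:.1f}%")
```

In a real study the toy pieces would be replaced by an actual transformation suite and model calls, and the drop would typically be averaged over many transformations and datasets.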

