Revisiting Acceptability Judgements
This work addresses the scarcity of linguistic acceptability resources for non-Indo-European languages by providing a Chinese benchmark for evaluating and improving language models. Its contribution is incremental in that it extends existing acceptability concepts to a new language.
The authors revisit linguistic acceptability for large language models by creating CoLAC, the first large-scale acceptability dataset for a non-Indo-European language (Chinese), with two sets of labels: one from linguists and one from crowd annotators. They found that even the largest InstructGPT model performed only at chance level, while ChatGPT scored 48.30 MCC, well below supervised models (59.03 MCC) and humans (65.11 MCC).
In this work, we revisit linguistic acceptability in the context of large language models. We introduce CoLAC (Corpus of Linguistic Acceptability in Chinese), the first large-scale acceptability dataset for a non-Indo-European language. It is verified by native speakers and is the first acceptability dataset that comes with two sets of labels: a linguist label and a crowd label. Our experiments show that even the largest InstructGPT model performs only at chance level on CoLAC, while ChatGPT's performance (48.30 MCC) is also well below that of supervised models (59.03 MCC) and humans (65.11 MCC). Through cross-lingual transfer experiments and fine-grained linguistic analysis, we provide a detailed analysis of the model predictions and demonstrate for the first time that knowledge of linguistic acceptability can be transferred across typologically distinct languages, as well as be traced back to pre-training. Our dataset is publicly available at https://github.com/huhailinguist/CoLAC.
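The scores above are reported in MCC, the Matthews correlation coefficient, scaled by 100. As a minimal sketch of how such a score is computed for binary acceptability labels, the function below implements the standard MCC formula from confusion-matrix counts. The function name `mcc`, the 1/0 label encoding, and the toy labels are assumptions for illustration; they are not taken from the CoLAC repository or its evaluation scripts.

```python
import math

def mcc(gold, pred):
    """Matthews correlation coefficient for binary labels.

    Assumed encoding (illustrative): 1 = acceptable, 0 = unacceptable.
    Returns a value in [-1, 1]; 0 corresponds to chance-level prediction.
    """
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: a degenerate confusion matrix (e.g. a single predicted class) scores 0.
        return 0.0
    return (tp * tn - fp * fn) / denom

# Toy labels, invented for illustration only.
gold = [1, 1, 0, 0]
print(round(100 * mcc(gold, [1, 0, 0, 0]), 2))  # 57.74
print(round(100 * mcc(gold, [1, 1, 1, 1]), 2))  # 0.0 (always predicting "acceptable")
```

Because a model that labels every sentence as acceptable scores 0 under this metric regardless of class balance, MCC makes "chance level" a meaningful reference point in a way that raw accuracy does not.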