An Empirical Study On Contrastive Search And Contrastive Decoding For Open-ended Text Generation
This work addresses the evaluation of decoding methods for open-ended text generation and reveals a mismatch between automatic metrics and human preferences. The contribution is incremental because it compares existing methods.
The study empirically compares Contrastive Search (CS) and Contrastive Decoding (CD) for open-ended text generation, finding that CS performs worse on MAUVE but better on diversity and coherence metrics, with human evaluations strongly favoring CS over CD.
In this study, we empirically compare two recently proposed decoding methods, i.e. Contrastive Search (CS) and Contrastive Decoding (CD), for open-ended text generation. The automatic evaluation results suggest that, while CS performs worse than CD on the MAUVE metric, it substantially surpasses CD on the diversity and coherence metrics. More notably, extensive human evaluations across three different domains demonstrate that human annotators consistently prefer CS over CD by substantial margins. The contradictory results between MAUVE and human evaluations reveal that MAUVE does not accurately reflect human preferences. Therefore, we call upon the research community to develop better evaluation metrics for open-ended text generation. To ensure the reproducibility of our work, we have open-sourced all our code, evaluation results, as well as human annotations at https://github.com/yxuansu/Contrastive_Search_versus_Contrastive_Decoding.
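As a rough illustration of what a diversity metric measures, the sketch below computes the product of distinct n-gram ratios for n = 2, 3, 4, a definition commonly used in this line of work. The abstract does not specify the formula, so this definition, the function names `distinct_ngram_ratio` and `diversity`, and the whitespace tokenization are all assumptions made for illustration, not the evaluation code of the study.

```python
from typing import List


def distinct_ngram_ratio(tokens: List[str], n: int) -> float:
    """Fraction of n-grams in `tokens` that are unique (1.0 means no repeats)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0  # too short to contain any n-gram, so nothing repeats
    return len(set(ngrams)) / len(ngrams)


def diversity(tokens: List[str]) -> float:
    """Assumed definition: product of distinct n-gram ratios for n = 2, 3, 4."""
    score = 1.0
    for n in (2, 3, 4):
        score *= distinct_ngram_ratio(tokens, n)
    return score


# Hypothetical example texts, tokenized on whitespace for simplicity.
varied = "the cat sat on the mat while the dog slept".split()
looping = "the cat sat the cat sat the cat sat the cat sat".split()

print(f"varied:  {diversity(varied):.3f}")
print(f"looping: {diversity(looping):.3f}")
```

Under this assumed definition, text with no repeated n-grams scores 1.0, while text that loops over the same phrase scores close to 0.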