Zero-shot Conversational Summarization Evaluations with small Large Language Models
This work addresses the under-explored problem of conversational summarization and is relevant to both researchers and practitioners, but it is incremental: it evaluates existing models rather than introducing new methods.
The study evaluates small large language models (approx. 10 billion parameters) on conversational summarization. It finds that performance varies significantly across prompts, with poorly chosen prompts sometimes causing steep drops in ROUGE scores, and it uses human evaluations to highlight the models' limitations.
Large Language Models (LLMs) exhibit powerful summarization abilities. However, their capabilities on conversational summarization remain under-explored. In this work, we evaluate LLMs (approx. 10 billion parameters) on conversational summarization and showcase their performance on various prompts. We show that the summaries generated by the models depend on the instructions and that the performance of LLMs varies with different instructions, sometimes resulting in a steep drop in ROUGE scores if prompts are not selected carefully. We also evaluate the models with human evaluations and discuss the limitations of the models on conversational summarization.
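To make the prompt-sensitivity claim concrete, the following is a minimal sketch of how summaries produced under different instructions could be scored against one reference. It is an illustration under stated assumptions, not the evaluation setup used in this work: `rouge1_f1` is a simplified unigram-overlap F1 (whitespace tokenization, no stemming) rather than the official ROUGE toolkit, and the reference, prompts, and summaries are hypothetical.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap, whitespace tokens.

    Assumption: lowercasing and whitespace splitting stand in for the
    tokenization and stemming a full ROUGE implementation would apply.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference summary and model outputs for two instructions.
reference = "alice will send the report to bob on friday"
summaries_by_prompt = {
    "Summarize the conversation.": "alice will send the report to bob on friday",
    "Write a TL;DR.": "they talked about a report",
}

# One score per instruction; a large gap signals prompt sensitivity.
scores = {
    prompt: rouge1_f1(reference, summary)
    for prompt, summary in summaries_by_prompt.items()
}

for prompt, score in scores.items():
    print(f"{prompt!r}: ROUGE-1 F1 = {score:.3f}")
```

Comparing the per-prompt scores side by side, as in the final loop, is one simple way to spot an instruction that causes a steep drop before committing to it for a full evaluation.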