CL · Sep 26, 2025

How Accurate Are LLMs at Multi-Question Answering on Conversational Transcripts?

arXiv:2509.21732v1 · 1 citation · h-index: 2 · EMNLP
Originality: Synthesis-oriented
AI Analysis

The paper addresses the computational cost and latency of industrial QA deployments, though it appears incremental because it benchmarks existing models rather than introducing new methods.

The paper tackles the challenge of using Large Language Models to answer multiple questions about the same conversational transcript, and finds that fine-tuned public LLMs with up to 8 billion parameters can surpass GPT-4o in accuracy on this task.

Deploying Large Language Models (LLMs) for question answering (QA) over lengthy contexts is a significant challenge. In industrial settings, this process is often hindered by high computational costs and latency, especially when multiple questions must be answered based on the same context. In this work, we explore the capabilities of LLMs to answer multiple questions based on the same conversational context. We conduct extensive experiments and benchmark a range of both proprietary and public models on this challenging task. Our findings highlight that while strong proprietary LLMs like GPT-4o achieve the best overall performance, fine-tuned public LLMs with up to 8 billion parameters can surpass GPT-4o in accuracy, which demonstrates their potential for transparent and cost-effective deployment in real-world applications.

