On Wednesdays, We Ask Questions: Optimizing "Active Listening" in Automated Legal Triage and Referral
For legal intake workers and applicants, this work proposes an evaluation rubric for legal triage questions and highlights the need for higher-cost models and domain-specific screening protocols.
The FETCH classifier uses LLMs to generate follow-up questions for legal triage. Results show that low-cost LLMs produce poor-quality questions, while adding a high-cost model (GPT-5) improves classification accuracy, though fact elicitation remains uneven across legal categories.
The FETCH classifier uses a low-cost ensemble of LLMs to generate follow-up questions that refine its match to the applicant's legal problem. In this paper, we describe an expert-attorney and LLM-assisted evaluation of the follow-up question approach in FETCH and show that, while low-cost LLMs perform well at classification tasks, generating high-quality plain-language questions in this setting appears to require a more sophisticated and higher-cost model. Through discussion with legal intake workers, we propose a rubric for evaluating legal intake classification questions, and we find that prompt engineering alone is not enough to improve question quality for intake purposes. We also find that LLM-as-judge and human ratings diverge. We demonstrate that, with the addition of a single high-cost model, GPT-5, the classifier can elicit relevant information from applicants for legal help, and that the resulting questions lead to more accurate classification. We also find that fact elicitation is uneven across categories, including domestic violence, which is at odds with family law screening protocols and suggests the value of including dedicated screening panels for certain areas of law.
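As an illustration of how divergence between LLM-as-judge and human ratings might be quantified, the following is a minimal sketch that computes Cohen's kappa for two raters over the same set of questions. The abstract does not say which agreement measure the paper uses, so Cohen's kappa is an assumption chosen here as one common option. The function name `cohen_kappa`, the labels "good" and "poor", and the ratings themselves are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: computed from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical ratings for six generated questions (made up for illustration).
    human_ratings = ["good", "good", "poor", "poor", "good", "poor"]
    judge_ratings = ["good", "good", "good", "poor", "good", "good"]
    print(f"kappa = {cohen_kappa(human_ratings, judge_ratings):.2f}")
```

A kappa near 1 would indicate that the LLM judge and the human raters largely agree beyond chance, while a value near 0 would indicate agreement no better than chance. A rubric with ordinal scores might instead call for a weighted variant, which this sketch does not implement.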