POSSCORE: A Simple Yet Effective Evaluation of Conversational Search with Part of Speech Labelling
This work addresses a problem that developers and researchers face: evaluating conversational search systems. It offers an incremental improvement by incorporating syntactic information into existing metrics.
The authors tackle the challenge of evaluating conversational search systems by proposing POSSCORE, an automatic evaluation method that incorporates part-of-speech information. POSSCORE achieves significant improvements over state-of-the-art baseline metrics in correlating with human preferences.
Conversational search systems, such as Google Assistant and Microsoft Cortana, provide a new search paradigm in which users communicate with search systems through natural language dialogue. Evaluating such systems is very challenging because search results are presented in the form of natural language sentences. Because the number of possible responses is effectively unlimited, collecting relevance assessments for every possible response is infeasible. In this paper, we propose POSSCORE, a simple yet effective automatic evaluation method for conversational search. The proposed embedding-based metric accounts for the part of speech (POS) of each term in the response. To the best of our knowledge, our work is the first to systematically demonstrate the importance of incorporating syntactic information, such as POS labels, into conversational search evaluation. Experimental results show that our metrics correlate with human preferences, achieving significant improvements over state-of-the-art baseline metrics.
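The abstract states only that the metric is embedding-based and takes POS into account. It does not give the formula. The following is therefore a minimal sketch of one way such a metric could work, not the authors' method. It assumes BERTScore-style greedy matching of token embeddings, in which each token's best cosine match is scaled by a weight looked up from its POS tag. The names `POS_WEIGHTS`, `weighted_greedy_match` and `pos_weighted_f1`, the weight values, and the toy vectors are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# A token is (text, POS tag, embedding). In practice the tags would come from
# a POS tagger and the vectors from an encoder; here they are hand-written.
Token = Tuple[str, str, Sequence[float]]

# Hypothetical weights: content words count more than function words.
# Tags not listed (e.g. "AUX") fall back to DEFAULT_WEIGHT.
POS_WEIGHTS: Dict[str, float] = {"NOUN": 2.0, "PROPN": 2.0, "VERB": 1.5, "ADJ": 1.5}
DEFAULT_WEIGHT = 1.0


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weighted_greedy_match(source: List[Token], target: List[Token]) -> float:
    """POS-weighted mean of each source token's best cosine match in target."""
    if not source or not target:
        return 0.0
    total_score = total_weight = 0.0
    for _, pos, vec in source:
        weight = POS_WEIGHTS.get(pos, DEFAULT_WEIGHT)
        best = max(cosine(vec, t_vec) for _, _, t_vec in target)
        total_score += weight * best
        total_weight += weight
    return total_score / total_weight


def pos_weighted_f1(candidate: List[Token], reference: List[Token]) -> float:
    """Harmonic mean of weighted precision (candidate -> reference)
    and weighted recall (reference -> candidate)."""
    precision = weighted_greedy_match(candidate, reference)
    recall = weighted_greedy_match(reference, candidate)
    if precision + recall <= 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy 3-d embeddings: "paris" and "london" are orthogonal, as are "is" and "was".
reference = [("paris", "PROPN", [1.0, 0.0, 0.0]), ("is", "AUX", [0.0, 1.0, 0.0])]
content_match = [("paris", "PROPN", [1.0, 0.0, 0.0]), ("was", "AUX", [0.0, 0.0, 1.0])]
function_match = [("london", "PROPN", [0.0, 0.0, 1.0]), ("is", "AUX", [0.0, 1.0, 0.0])]

print(f"{pos_weighted_f1(content_match, reference):.3f}")   # 0.667
print(f"{pos_weighted_f1(function_match, reference):.3f}")  # 0.333
```

With uniform weights, both partial matches would score 0.5. The POS weights are what make the response that preserves the proper noun score higher than the one that preserves only the auxiliary verb. The abstract does not say whether POSSCORE weights tags in this way.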