CL HC Sep 12, 2025

Multi-Intent Recognition in Dialogue Understanding: A Comparison Between Smaller Open-Source LLMs

Adnan Ahmad, Philine Kowol, Stefan Hillmann, Sebastian Möller

arXiv:2509.10010v16.72 citations h-index: 1 Has Code

Originality Synthesis-oriented

AI Analysis

This work addresses the problem of multi-intent recognition for developers of task-oriented chatbots, but it is incremental as it compares existing models on a known benchmark.

This paper compared the performance of three open-source LLMs (LLama2-7B-hf, Mistral-7B-v0.1, Yi-6B) on multi-label intent classification using the MultiWOZ 2.1 dataset in a few-shot setup, finding that Mistral-7B-v0.1 outperformed the others with a weighted F1 score of 0.50, but a BERT-based supervised classifier achieved superior results.

In this paper, we provide an extensive analysis of multi-label intent classification using Large Language Models (LLMs) that are open-source, publicly available, and can be run on consumer hardware. We use the MultiWOZ 2.1 dataset, a benchmark in the dialogue system domain, to investigate the efficacy of three popular open-source pre-trained LLMs, namely LLama2-7B-hf, Mistral-7B-v0.1, and Yi-6B. We perform the classification task in a few-shot setup, giving 20 examples in the prompt with some instructions. Our approach focuses on the differences in performance of these models across several performance metrics by methodically assessing them on multi-label intent classification tasks. Additionally, we compare the performance of the instruction-based fine-tuning approach with supervised learning, using the smaller transformer model BertForSequenceClassification as a baseline. To evaluate the models, we use accuracy, precision, and recall as well as micro, macro, and weighted F1 scores. We also report the inference time, VRAM requirements, etc. Mistral-7B-v0.1 outperforms the two other generative models on 11 of the 14 intent classes in terms of F-Score, with a weighted average of 0.50. It also has a relatively lower Hamming Loss and higher Jaccard Similarity, making it the winning model in the few-shot setting. We find that the BERT-based supervised classifier outperforms the best-performing few-shot generative LLM. The study provides a framework for small open-source LLMs in detecting complex multi-intent dialogues, enhancing the Natural Language Understanding aspect of task-oriented chatbots.
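The multi-label metrics named in the abstract (Hamming Loss, Jaccard Similarity, and micro, macro, and weighted F1) can be illustrated with a small sketch. This is not the authors' evaluation code: the function names, the intent labels, and the toy predictions below are assumptions chosen for illustration, and each utterance is represented as a set of intent labels.

```python
from typing import Dict, List, Set

# Hypothetical intent labels and toy data, not taken from the paper's results.
LABELS = ["Hotel-Inform", "Restaurant-Request", "general-thank"]
Y_TRUE = [{"Hotel-Inform", "Restaurant-Request"}, {"general-thank"}, {"Hotel-Inform"}]
Y_PRED = [{"Hotel-Inform"}, {"general-thank"}, {"Hotel-Inform", "general-thank"}]


def hamming_loss(y_true: List[Set[str]], y_pred: List[Set[str]], labels: List[str]) -> float:
    """Fraction of (utterance, label) decisions that are wrong.

    Assumes every label in the sets also appears in `labels`.
    """
    wrong = sum(len(t ^ p) for t, p in zip(y_true, y_pred))  # symmetric difference
    return wrong / (len(y_true) * len(labels))


def jaccard_similarity(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Mean over utterances of |true & pred| / |true | pred|."""
    scores = []
    for t, p in zip(y_true, y_pred):
        union = t | p
        # Two empty sets agree perfectly, so score them as 1.0.
        scores.append(len(t & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def _f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_scores(y_true: List[Set[str]], y_pred: List[Set[str]], labels: List[str]) -> Dict[str, float]:
    """Micro, macro, and support-weighted F1 over the given labels."""
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for t, p in zip(y_true, y_pred):
        for label in labels:
            if label in t and label in p:
                tp[label] += 1
            elif label in p:
                fp[label] += 1
            elif label in t:
                fn[label] += 1
    per_class = {label: _f1(tp[label], fp[label], fn[label]) for label in labels}
    support = {label: tp[label] + fn[label] for label in labels}  # true occurrences
    total = sum(support.values())
    return {
        "micro": _f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        "macro": sum(per_class.values()) / len(labels),
        "weighted": sum(per_class[l] * support[l] for l in labels) / total if total else 0.0,
    }


if __name__ == "__main__":
    print("Hamming loss:", hamming_loss(Y_TRUE, Y_PRED, LABELS))
    print("Jaccard similarity:", jaccard_similarity(Y_TRUE, Y_PRED))
    print("F1:", f1_scores(Y_TRUE, Y_PRED, LABELS))
```

Lower Hamming Loss and higher Jaccard Similarity both indicate closer agreement between predicted and reference label sets. Weighted F1 averages per-class F1 by class frequency, so frequent intents dominate it, while macro F1 treats all classes equally.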
