Maaike H. T. de Boer

AI
5papers
12citations
Novelty21%
AI Score34

5 Papers

61.5IRMay 1
Seeking Information with RAG-Assistants: Does Model Size Matter in Human-AI Collaborations?

Lennard C. Froma, Tom Kouwenhoven, Maaike H. T. de Boer et al.

Much research on LLMs has focused on increasing benchmark performance. However, the evaluation of such models in real-world collaborative human-AI workflows has stayed behind. This work evaluates a chatbot-style assistant based on Retrieval-Augmented Generation (RAG) in a realistic multi-turn information-seeking scenario inspired by workplace settings where compliance with local legislation and secure handling of sensitive data are often key. Specifically, we examine the performance of humans (N=112) assisted by RAG-assistants compared to LLM-only or LLM+RAG baselines. In this setting, we investigate how underlying model size (3B, 8B, and 70B) shapes the human-AI collaborative dynamic and how it influences perceived usability and satisfaction. Results show that the performance gain of human-AI collaboration over the model-only baselines is significant, irrespective of model size, suggesting that hybrid systems are beneficial in information-seeking scenarios. Interestingly, however, perceived usability and satisfaction among participants showed little difference across model sizes. This demonstrates a nuanced trade-off between model size, performance, and user perception. Our work highlights the added value of evaluating AI applications in actual multi-turn interactions with human users, looking at usability and satisfaction besides accuracy, rather than focusing on benchmark performance only.

AIDec 5, 2025
Ontology Learning with LLMs: A Benchmark Study on Axiom Identification

Roos M. Bakker, Daan L. Di Scala, Maaike H. T. de Boer et al.

Ontologies are an important tool for structuring domain knowledge, but their development is a complex task that requires significant modelling and domain expertise. Ontology learning, aimed at automating this process, has seen advancements in the past decade with the improvement of Natural Language Processing techniques, and especially with the recent growth of Large Language Models (LLMs). This paper investigates the challenge of identifying axioms: fundamental ontology components that define logical relations between classes and properties. In this work, we introduce an Ontology Axiom Benchmark OntoAxiom, and systematically test LLMs on that benchmark for axiom identification, evaluating different prompting strategies, ontologies, and axiom types. The benchmark consists of nine medium-sized ontologies with together 17.118 triples, and 2.771 axioms. We focus on subclass, disjoint, subproperty, domain, and range axioms. To evaluate LLM performance, we compare twelve LLMs with three shot settings and two prompting strategies: a Direct approach where we query all axioms at once, versus an Axiom-by-Axiom (AbA) approach, where each prompt queries for one axiom only. Our findings show that the AbA prompting leads to higher F1 scores than the direct approach. However, performance varies across axioms, suggesting that certain axioms are more challenging to identify. The domain also influences performance: the FOAF ontology achieves a score of 0.642 for the subclass axiom, while the music ontology reaches only 0.218. Larger LLMs outperform smaller ones, but smaller models may still be viable for resource-constrained settings. Although performance overall is not high enough to fully automate axiom identification, LLMs can provide valuable candidate axioms to support ontology engineers with the development and refinement of ontologies.

AISep 20, 2021
Modular Design Patterns for Hybrid Actors

André Meyer-Vitali, Wico Mulder, Maaike H. T. de Boer

Recently, a boxology (graphical language) with design patterns for hybrid AI was proposed, combining symbolic and sub-symbolic learning and reasoning. In this paper, we extend this boxology with actors and their interactions. The main contributions of this paper are: 1) an extension of the taxonomy to describe distributed hybrid AI systems with actors and interactions; and 2) showing examples using a few design patterns relevant in multi-agent systems and human-agent interaction.

CLSep 29, 2020
Gender prediction using limited Twitter Data

Maaike Burghoorn, Maaike H. T. de Boer, Stephan Raaijmakers

Transformer models have shown impressive performance on a variety of NLP tasks. Off-the-shelf, pre-trained models can be fine-tuned for specific NLP classification tasks, reducing the need for large amounts of additional training data. However, little research has addressed how much data is required to accurately fine-tune such pre-trained transformer models, and how much data is needed for accurate prediction. This paper explores the usability of BERT (a Transformer model for word embedding) for gender prediction on social media. Forensic applications include detecting gender obfuscation, e.g. males posing as females in chat rooms. A Dutch BERT model is fine-tuned on different samples of a Dutch Twitter dataset labeled for gender, varying in the number of tweets used per person. The results show that finetuning BERT contributes to good gender classification performance (80% F1) when finetuned on only 200 tweets per person. But when using just 20 tweets per person, the performance of our classifier deteriorates non-steeply (to 70% F1). These results show that even with relatively small amounts of data, BERT can be fine-tuned to accurately help predict the gender of Twitter users, and, consequently, that it is possible to determine gender on the basis of just a low volume of tweets. This opens up an operational perspective on the swift detection of gender.

CLOct 21, 2019
Opinion aspect extraction in Dutch childrens diary entries

Hella Haanstra, Maaike H. T. de Boer

Aspect extraction can be used in dialogue systems to understand the topic of opinionated text. Expressing an empathetic reaction to an opinion can strengthen the bond between a human and, for example, a robot. The aim of this study is three-fold: 1. create a new annotated dataset for both aspect extraction and opinion words for Dutch childrens language, 2. acquire aspect extraction results for this task and 3. improve current results for aspect extraction in Dutch reviews. This was done by training a deep learning Gated Recurrent Unit (GRU) model, originally developed for an English review dataset, on Dutch restaurant review data to classify both opinion words and their respective aspects. We obtained state-of-the-art performance on the Dutch restaurant review dataset. Additionally, we acquired aspect extraction results for the Dutch childrens dataset. Since the model was trained on standardised language, these results are quite promising.