Bevan Koopman

IR
h-index: 41
28 papers
1,089 citations
Novelty: 48%
AI Score: 59

28 Papers

IR · Oct 14, 2023 · Code
A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models

Shengyao Zhuang, Honglei Zhuang, Bevan Koopman et al. · deepmind

We propose a novel zero-shot document ranking approach based on Large Language Models (LLMs): the Setwise prompting approach. Our approach complements existing prompting approaches for LLM-based zero-shot ranking: Pointwise, Pairwise, and Listwise. Through the first-of-its-kind comparative evaluation within a consistent experimental framework, considering factors such as model size, token consumption, and latency, we show that existing approaches are inherently characterised by trade-offs between effectiveness and efficiency. We find that while Pointwise approaches score high on efficiency, they suffer from poor effectiveness. Conversely, Pairwise approaches demonstrate superior effectiveness but incur high computational overhead. Our Setwise approach, instead, reduces the number of LLM inferences and the amount of prompt token consumption during the ranking procedure, compared to previous methods. This significantly improves the efficiency of LLM-based zero-shot ranking, while also retaining high zero-shot ranking effectiveness. We make our code and results publicly available at https://github.com/ielab/llm-rankers.
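The efficiency argument can be illustrated with a minimal sketch. It assumes a hypothetical `pick_best(query, group)` callable standing in for one LLM inference that returns the most relevant document in a small set. The function names, the champion-versus-next-group selection strategy, and the toy judge are illustrative assumptions, not the authors' implementation (the linked repository is the reference for that).

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for one LLM inference: given a query and a small set
# of documents, return the document judged most relevant.
PickBest = Callable[[str, Sequence[str]], str]


def setwise_top_k(query: str, docs: Sequence[str], pick_best: PickBest,
                  k: int = 3, set_size: int = 4) -> Tuple[List[str], int]:
    """Select the top-k documents, comparing `set_size` documents per call.

    Returns the ranked top-k and the number of (simulated) LLM calls.
    With set_size=2 this degenerates to pairwise selection.
    Assumes the documents are distinct strings.
    """
    if set_size < 2:
        raise ValueError("set_size must be at least 2")
    remaining = list(docs)
    ranked: List[str] = []
    calls = 0
    while remaining and len(ranked) < k:
        champion = remaining[0]
        i = 1
        while i < len(remaining):
            # The current champion competes against the next set_size - 1 documents.
            group = [champion] + remaining[i:i + set_size - 1]
            champion = pick_best(query, group)
            calls += 1
            i += set_size - 1
        ranked.append(champion)
        remaining.remove(champion)
    return ranked, calls


# Toy judge for the demo: relevance is a fixed score per document (an assumption).
scores = {f"doc{i}": s for i, s in enumerate([3, 9, 1, 7, 5, 8, 2, 6, 4, 0])}


def toy_judge(query: str, group: Sequence[str]) -> str:
    return max(group, key=scores.__getitem__)


top_setwise, calls_setwise = setwise_top_k("q", list(scores), toy_judge, k=1, set_size=4)
top_pairwise, calls_pairwise = setwise_top_k("q", list(scores), toy_judge, k=1, set_size=2)
```

In this toy setting, finding the single best of ten documents takes three set-of-four calls versus nine pairwise calls, which is the kind of saving in inferences the abstract describes.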

CR · May 27
Can It Reach the Generator? Investigating the Survival of Prompt-Injection Attacks in Realistic RAG Settings

Yu Yin, Shuai Wang, Bevan Koopman et al.

Recent generative engine optimisation (GEO) research has shown that prompt-injection attacks can push a target product to the top of an LLM's recommendation list, with the strongest attacks reporting around 80% success and raising serious security concerns about RAG-based recommendation. However, these results assume the attacked document is always fed directly to the generator, bypassing the retriever and reranker. This is unrealistic: in deployed RAG systems, the attack modifies the document content, which can in turn change whether the document is retrieved and reranked highly enough to reach the generator at all. In this paper, we re-evaluate seven GEO attacks under a realistic three-stage pipeline (retriever → LLM reranker → LLM generator). We find that prior protocols substantially overstate attack effectiveness: gradient-based and instruction override attacks largely collapse before reaching the generator, and only LLM-driven prompt injections remain effective end-to-end. Our analysis further reveals that current GEO attacks are easily detectable: a lightweight prompt-injection guard finetuned on a small attack dataset already detects every attack. Our code and data are available at https://anonymous.4open.science/r/geo_injection_rag_survival_anonymizations-8C12.

IR · Oct 20, 2023 · Code
Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking

Shengyao Zhuang, Bing Liu, Bevan Koopman et al.

In the field of information retrieval, Query Likelihood Models (QLMs) rank documents based on the probability of generating the query given the content of a document. Recently, advanced large language models (LLMs) have emerged as effective QLMs, showcasing promising ranking capabilities. This paper focuses on investigating the genuine zero-shot ranking effectiveness of recent LLMs, which are solely pre-trained on unstructured text data without supervised instruction fine-tuning. Our findings reveal the robust zero-shot ranking ability of such LLMs, highlighting that additional instruction fine-tuning may hinder effectiveness unless a question generation task is present in the fine-tuning dataset. Furthermore, we introduce a novel state-of-the-art ranking system that integrates LLM-based QLMs with a hybrid zero-shot retriever, demonstrating exceptional effectiveness in both zero-shot and few-shot scenarios. We make our codebase publicly available at https://github.com/ielab/llm-qlm.
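The query likelihood principle itself, ranking documents by the probability of generating the query given the document, can be shown with a small sketch. Note the substitution: instead of an LLM, this example uses a classical Dirichlet-smoothed unigram language model, so it illustrates the scoring rule only and says nothing about the paper's results. The function names and the smoothing parameter `mu` are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def query_log_likelihood(query: str, doc: str, coll_tf: Dict[str, int],
                         coll_len: int, mu: float = 10.0) -> float:
    """log P(query | doc) under a Dirichlet-smoothed unigram model."""
    doc_tokens = doc.lower().split()
    doc_tf = Counter(doc_tokens)
    score = 0.0
    for term in query.lower().split():
        # Back off to collection statistics for terms missing from the document.
        p_coll = coll_tf.get(term, 0) / coll_len
        p = (doc_tf[term] + mu * p_coll) / (len(doc_tokens) + mu)
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score


def rank_by_query_likelihood(query: str, docs: Sequence[str]) -> List[str]:
    """Order documents by decreasing log P(query | doc)."""
    coll_tokens = [t for d in docs for t in d.lower().split()]
    if not coll_tokens:
        return list(docs)
    coll_tf = Counter(coll_tokens)
    return sorted(
        docs,
        key=lambda d: query_log_likelihood(query, d, coll_tf, len(coll_tokens)),
        reverse=True,
    )


docs = [
    "boolean query for systematic review",
    "zero shot ranking with language models",
    "chest x-ray report generation",
]
ranking = rank_by_query_likelihood("zero shot ranking", docs)
```

An LLM-based QLM replaces the unigram estimate with the model's token-level probability of the query conditioned on a prompt containing the document; the ranking step stays the same.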

IR · May 29
On the impact of retrieved content representations in RAG Pipelines

Jonathan J Ross, Bevan Koopman, Anton van der Vegt et al.

Retrieval-Augmented Generation (RAG) supplements a language model's input with retrieved documents, yet most RAG pipelines inherit retrieval components designed for human readers. How retrieved content should be represented when the consumer is a large language model (LLM) rather than a human is less well understood. Recent work has proposed transformations of retrieved content and identified properties that affect generation, but each examines a single transformation or property in isolation, leaving open which features of a document's representation matter most. We address this with a controlled comparison: holding retrieval fixed, we vary only the representation of retrieved documents, comparing an original baseline against thirteen transformations spanning selection, summarisation, and reformulation, in query-dependent and query-independent variants. Across these fourteen representations we measure question-answering accuracy for four generators, and for each representation we also measure answer retention: whether a known answer-bearing document still supports its answer after transformation. We find that answer retention is the primary determinant of generator accuracy; notably, when retention is high, a representation's wording, structure, length, and query-dependence have limited effect. This suggests that accuracy gains attributed to specific mechanisms in prior work may be partly explained by how well those mechanisms preserve answer-bearing content, an attribution that cannot be settled without controlling for retention.

CV · Jul 19, 2023 · Code
Longitudinal Data and a Semantic Similarity Reward for Chest X-Ray Report Generation

Aaron Nicolson, Jason Dowling, Bevan Koopman

Radiologists face high burnout rates, partially due to the increasing volume of Chest X-rays (CXRs) requiring interpretation and reporting. Automated CXR report generation holds promise for reducing this burden and improving patient care. While current models show potential, their diagnostic accuracy is limited. Our proposed CXR report generator integrates elements of the radiologist workflow and introduces a novel reward for reinforcement learning. Our approach leverages longitudinal data from a patient's prior CXR study and effectively handles cases where no prior study exists, thus mirroring the radiologist's workflow. In contrast, existing models typically lack this flexibility, often requiring prior studies for the model to function optimally. Our approach also incorporates all CXRs from a patient's study and distinguishes between report sections through section embeddings. Our reward for reinforcement learning leverages CXR-BERT, which forces our model to learn the clinical semantics of radiology reporting. We conduct experiments on publicly available datasets (MIMIC-CXR and Open-i IU X-ray) with metrics shown to more closely correlate with radiologists' assessment of reporting. Results from our study demonstrate that the proposed model generates reports that are more aligned with radiologists' reports than state-of-the-art models, such as those utilising large language models, reinforcement learning, and multi-task learning. The proposed model improves the diagnostic accuracy of CXR report generation, which could one day reduce radiologists' workload and enhance patient care. Our Hugging Face checkpoint (https://huggingface.co/aehrc/cxrmate) and code (https://github.com/aehrc/cxrmate) are publicly available.

IR · Feb 3, 2023
Can ChatGPT Write a Good Boolean Query for Systematic Review Literature Search?

Shuai Wang, Harrisen Scells, Bevan Koopman et al.

Systematic reviews are comprehensive reviews of the literature for a highly focused research question. These reviews are often treated as the highest form of evidence in evidence-based medicine, and are the key strategy to answer research questions in the medical field. To create a high-quality systematic review, complex Boolean queries are often constructed to retrieve studies for the review topic. However, it often takes a long time for systematic review researchers to construct a high-quality systematic review Boolean query, and often the resulting queries are far from effective. Poor queries may lead to biased or invalid reviews, because they fail to retrieve key evidence, or to a substantial increase in review costs, because they retrieve too many irrelevant studies. Recent advances in Transformer-based generative models have shown great potential to effectively follow instructions from users and generate answers based on those instructions. In this paper, we investigate the effectiveness of the latest of such models, ChatGPT, in generating effective Boolean queries for systematic review literature search. Through a number of extensive experiments on standard test collections for the task, we find that ChatGPT is capable of generating queries that lead to high search precision, although at the cost of recall. Overall, our study demonstrates the potential of ChatGPT in generating effective Boolean queries for systematic review literature search. The ability of ChatGPT to follow complex instructions and generate queries with high precision makes it a valuable tool for researchers conducting systematic reviews, particularly for rapid reviews where time is a constraint and trading off lower recall for higher precision is often acceptable.

IR · Jun 1
Whole-Pool Setwise Reranking with Long-Context Language Models

Hang Li, Chuting Yu, Teerapong Leelanupab et al.

Previous LLM-based passage re-rankers are often expensive and slow because the input context constraints require the LLM to make many dependent model calls. We study how recent long-context LLMs change this problem: when the full set of retrieved candidate passages can be shown to the model at once, ranking no longer has to be reconstructed from many overlapping local comparisons. We propose Whole-Pool Setwise re-ranking, where each call considers all currently unranked candidate passages, and introduce DualEnd, which identifies both the most and least relevant passages in one call. By filling the ranking from both ends, DualEnd ranks 100 candidates with 50 serial LLM calls, compared with 99 calls for comparable one-passage-at-a-time whole-pool methods. Experiments with nine open-weight LLMs on two passage re-ranking benchmarks, measuring effectiveness, call count, token use, runtime, and output reliability, show that long context is not merely more prompt space, but an opportunity to make LLM re-rankers both effective and efficient.
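The call counts quoted above (50 versus 99 for 100 candidates) follow directly from filling the ranking from both ends, as a minimal sketch shows. The callables `pick_best_and_worst` and `pick_best` are hypothetical stand-ins for a single long-context LLM call over the whole unranked pool; the toy judge and all names are assumptions for illustration, not the authors' code.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-ins for one LLM call over the whole unranked pool.
BestWorst = Callable[[str, Sequence[str]], Tuple[str, str]]
BestOnly = Callable[[str, Sequence[str]], str]


def dual_end_rank(query: str, docs: Sequence[str],
                  pick_best_and_worst: BestWorst) -> Tuple[List[str], int]:
    """Fill the ranking from both ends; each call places two passages."""
    pool = list(docs)          # assumes distinct passages
    head: List[str] = []       # most relevant first
    tail: List[str] = []       # least relevant first, reversed at the end
    calls = 0
    while len(pool) > 1:
        best, worst = pick_best_and_worst(query, pool)
        calls += 1
        head.append(best)
        tail.append(worst)
        pool.remove(best)
        pool.remove(worst)
    # At most one passage is left; it belongs in the middle.
    return head + pool + tail[::-1], calls


def single_end_rank(query: str, docs: Sequence[str],
                    pick_best: BestOnly) -> Tuple[List[str], int]:
    """One-passage-at-a-time whole-pool baseline."""
    pool = list(docs)
    ranked: List[str] = []
    calls = 0
    while len(pool) > 1:
        best = pick_best(query, pool)
        calls += 1
        ranked.append(best)
        pool.remove(best)
    return ranked + pool, calls


# Toy relevance with 100 distinct scores (37 is coprime with 101).
relevance: Dict[str, int] = {f"p{i}": (i * 37) % 101 for i in range(100)}


def judge_both(query: str, pool: Sequence[str]) -> Tuple[str, str]:
    return max(pool, key=relevance.__getitem__), min(pool, key=relevance.__getitem__)


def judge_best(query: str, pool: Sequence[str]) -> str:
    return max(pool, key=relevance.__getitem__)


dual_ranking, dual_calls = dual_end_rank("q", list(relevance), judge_both)
single_ranking, single_calls = single_end_rank("q", list(relevance), judge_best)
```

With a perfect judge both procedures return the same ordering; the difference is only in how many serial calls are needed to produce it.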

CL · Feb 23, 2023
Dr ChatGPT, tell me what I want to hear: How prompt knowledge impacts health answer correctness

Guido Zuccon, Bevan Koopman

Generative pre-trained language models (GPLMs) like ChatGPT encode in the model's parameters knowledge the models observe during the pre-training phase. This knowledge is then used at inference to address the task specified by the user in their prompt. For example, for the question-answering task, the GPLMs leverage the knowledge and linguistic patterns learned at training to produce an answer to a user question. Aside from the knowledge encoded in the model itself, answers produced by GPLMs can also leverage knowledge provided in the prompts. For example, a GPLM can be integrated into a retrieve-then-generate paradigm where a search engine is used to retrieve documents relevant to the question; the content of the documents is then transferred to the GPLM via the prompt. In this paper we study the differences in answer correctness generated by ChatGPT when leveraging the model's knowledge alone vs. in combination with the prompt knowledge. We study this in the context of consumers seeking health advice from the model. Aside from measuring the effectiveness of ChatGPT in this context, we show that the knowledge passed in the prompt can overturn the knowledge encoded in the model and this is, in our experiments, to the detriment of answer correctness. This work has important implications for the development of more robust and transparent question-answering systems based on generative pre-trained language models.

AI · Sep 17, 2023
ChatGPT Hallucinates when Attributing Answers

Guido Zuccon, Bevan Koopman, Razia Shaik

Can ChatGPT provide evidence to support its answers? Does the evidence it suggests actually exist and does it really support its answer? We investigate these questions using a collection of domain-specific knowledge-based questions, specifically prompting ChatGPT to provide both an answer and supporting evidence in the form of references to external sources. We also investigate how different prompts impact answers and evidence. We find that ChatGPT provides correct or partially correct answers in about half of the cases (50.6% of the time), but its suggested references exist only 14% of the time. We further provide insights on the generated references that reveal common traits among the references that ChatGPT generates, and show that even if a reference provided by the model does exist, this reference often does not support the claims ChatGPT attributes to it. Our findings are important because (1) they are the first systematic analysis of the references created by ChatGPT in its answers; (2) they suggest that the model may leverage good quality information in producing correct answers, but is unable to attribute real evidence to support its answers. Prompts, raw result files and manual analysis are made publicly available.

IR · Dec 18, 2022
Neural Rankers for Effective Screening Prioritisation in Medical Systematic Review Literature Search

Shuai Wang, Harrisen Scells, Bevan Koopman et al.

Medical systematic reviews typically require assessing all the documents retrieved by a search. The reason is two-fold: the task aims for "total recall"; and documents retrieved using Boolean search are an unordered set, and thus it is unclear how an assessor could examine only a subset. Screening prioritisation is the process of ranking the (unordered) set of retrieved documents, allowing assessors to begin the downstream processes of the systematic review creation earlier, leading to earlier completion of the review, or even avoiding screening documents ranked least relevant. Screening prioritisation requires highly effective ranking methods. Pre-trained language models are state-of-the-art on many IR tasks but have yet to be applied to systematic review screening prioritisation. In this paper, we apply several pre-trained language models to the systematic review document ranking task, both directly and fine-tuned. An empirical analysis compares the effectiveness of neural methods with that of traditional methods for this task. We also investigate different types of document representations for neural methods and their impact on ranking performance. Our results show that BERT-based rankers outperform the current state-of-the-art screening prioritisation methods. However, BERT rankers and existing methods can actually be complementary, and thus, further improvements may be achieved if used in conjunction.

IR · Sep 19, 2022
Automated MeSH Term Suggestion for Effective Query Formulation in Systematic Reviews Literature Search

Shuai Wang, Harrisen Scells, Bevan Koopman et al.

High-quality medical systematic reviews require comprehensive literature searches to ensure the recommendations and outcomes are sufficiently reliable. Indeed, searching for relevant medical literature is a key phase in constructing systematic reviews and often involves domain (medical researchers) and search (information specialists) experts in developing the search queries. Queries in this context are highly complex, based on Boolean logic, include free-text terms and index terms from standardised terminologies (e.g., the Medical Subject Headings (MeSH) thesaurus), and are difficult and time-consuming to build. The use of MeSH terms, in particular, has been shown to improve the quality of the search results. However, identifying the correct MeSH terms to include in a query is difficult: information experts are often unfamiliar with the MeSH database and unsure about the appropriateness of MeSH terms for a query. Consequently, the value of the MeSH terminology is often not fully exploited. This article investigates methods to suggest MeSH terms based on an initial Boolean query that includes only free-text terms. In this context, we devise lexical and pre-trained language model-based methods. These methods promise to automatically identify highly effective MeSH terms for inclusion in a systematic review query. Our study contributes an empirical evaluation of several MeSH term suggestion methods. We further contribute an extensive analysis of MeSH term suggestions for each method and how these suggestions impact the effectiveness of Boolean queries.

IR · Sep 11, 2023
Generating Natural Language Queries for More Effective Systematic Review Screening Prioritisation

Shuai Wang, Harrisen Scells, Martin Potthast et al.

Screening prioritisation in medical systematic reviews aims to rank the set of documents retrieved by complex Boolean queries. Prioritising the most important documents ensures that subsequent review steps can be carried out more efficiently and effectively. The current state of the art uses the final title of the review as a query to rank the documents using BERT-based neural rankers. However, the final title is only formulated at the end of the review process, which makes this approach impractical as it relies on ex post facto information. At the time of screening, only a rough working title is available, with which the BERT-based ranker performs significantly worse than with the final title. In this paper, we explore alternative sources of queries for prioritising screening, such as the Boolean query used to retrieve the documents to be screened and queries generated by instruction-based generative large-scale language models such as ChatGPT and Alpaca. Our best approach is not only viable based on the information available at the time of screening, but also has similar effectiveness to the final title.

CV · Aug 7, 2024 · Code
e-Health CSIRO at RRG24: Entropy-Augmented Self-Critical Sequence Training for Radiology Report Generation

Aaron Nicolson, Jinghui Liu, Jason Dowling et al.

The Shared Task on Large-Scale Radiology Report Generation (RRG24) aims to expedite the development of assistive systems for interpreting and reporting on chest X-ray (CXR) images. This task challenges participants to develop models that generate the findings and impression sections of radiology reports from CXRs from a patient's study, using five different datasets. This paper outlines the e-Health CSIRO team's approach, which achieved multiple first-place finishes in RRG24. The core novelty of our approach lies in the addition of entropy regularisation to self-critical sequence training, to maintain a higher entropy in the token distribution. This prevents overfitting to common phrases and ensures a broader exploration of the vocabulary during training, essential for handling the diversity of the radiology reports in the RRG24 datasets. Our model is available on Hugging Face https://huggingface.co/aehrc/cxrmate-rrg24.
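The idea of adding an entropy term to self-critical sequence training (SCST) can be sketched in a few lines. The policy-gradient term below is the standard SCST form, in which the reward of a greedy-decoded report serves as the baseline for a sampled report. The way the entropy bonus is combined with it (mean per-token entropy scaled by a hypothetical `entropy_weight`) is an assumption for illustration; the abstract does not give the exact formulation or coefficient.

```python
import math
from typing import Sequence


def token_entropy(dist: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def scst_loss_with_entropy(sample_logprobs: Sequence[float],
                           token_dists: Sequence[Sequence[float]],
                           reward_sample: float,
                           reward_greedy: float,
                           entropy_weight: float = 0.01) -> float:
    """SCST loss for one sampled sequence, minus an entropy bonus.

    sample_logprobs: log-probabilities of the sampled tokens.
    token_dists:     full next-token distribution at each step.
    entropy_weight:  hypothetical coefficient (assumed, not from the paper).
    """
    if not token_dists:
        raise ValueError("token_dists must not be empty")
    # Self-critical baseline: reward of the greedy-decoded sequence.
    advantage = reward_sample - reward_greedy
    policy_term = -advantage * sum(sample_logprobs)
    mean_entropy = sum(token_entropy(d) for d in token_dists) / len(token_dists)
    # Subtracting the entropy rewards flatter token distributions.
    return policy_term - entropy_weight * mean_entropy


peaked = [[1.0, 0.0, 0.0, 0.0]] * 2
uniform = [[0.25, 0.25, 0.25, 0.25]] * 2
loss_peaked = scst_loss_with_entropy([-1.0, -2.0], peaked, 0.8, 0.5)
loss_uniform = scst_loss_with_entropy([-1.0, -2.0], uniform, 0.8, 0.5)
```

Holding everything else fixed, the higher-entropy distributions give the lower loss, which is the pressure against collapsing onto common phrases that the abstract describes.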

CL · Jul 3, 2024 · Code
e-Health CSIRO at "Discharge Me!" 2024: Generating Discharge Summary Sections with Fine-tuned Language Models

Jinghui Liu, Aaron Nicolson, Jason Dowling et al.

Clinical documentation is an important aspect of clinicians' daily work and often demands a significant amount of time. The BioNLP 2024 Shared Task on Streamlining Discharge Documentation (Discharge Me!) aims to alleviate this documentation burden by automatically generating discharge summary sections, including brief hospital course and discharge instruction, which are often time-consuming to synthesize and write manually. We approach the generation task by fine-tuning multiple open-source language models (LMs), including both decoder-only and encoder-decoder LMs, with various configurations on input context. We also examine different setups for decoding algorithms, model ensembling or merging, and model specialization. Our results show that conditioning on the content of the discharge summary prior to the target sections is effective for the generation task. Furthermore, we find that smaller encoder-decoder LMs can work as well as or even slightly better than larger decoder-based LMs fine-tuned through LoRA. The model checkpoints from our team (aehrc) are openly available.

IR · May 8 · Code
DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models

Shuai Wang, Yin Yu, Shengyao Zhuang et al.

PromptReps showed that an autoregressive language model can be used directly as a retriever by prompting it to generate dense and sparse representations of a query or passage. Extending this to multiple representatives is inefficient for autoregressive models, since tokens must be generated sequentially, and prior multi-token variants did not reliably improve over single-token decoding. We show that the bottleneck is sequential generation, not the multi-token idea itself. DiffRetriever is a representative-token retriever for diffusion language models: it appends K masked positions to the prompt and reads all K in a single bidirectional forward pass. Across in-domain and out-of-domain evaluation, multi-token DiffRetriever substantially improves over single-token on every diffusion backbone we test, while autoregressive multi-token is flat or negative and pays a latency cost that scales with K where diffusion does not. After supervised fine-tuning, DiffRetriever on Dream is the strongest BEIR-7 retriever in our comparison, ahead of PromptReps, the encoder-style DiffEmbed baseline on the same diffusion backbones, and the contrastively fine-tuned single-vector RepLLaMA. A per-query oracle on the frozen base model exceeds contrastive fine-tuning at the same fixed budget, pointing to adaptive budget selection as future work. Code is available at https://github.com/ielab/diffretriever.

CV · Apr 21
Toward Clinically Acceptable Chest X-ray Report Generation: A Qualitative Retrospective Pilot Study of CXRMate-2

Aaron Nicolson, Elizabeth J. Cooper, Hwan-Jin Yoon et al.

Chest X-ray (CXR) radiology report generation (RRG) models have shown rapid progress, yet their clinical utility remains uncertain due to limited evaluation by radiologists. We present CXRMate-2, a state-of-the-art CXR RRG model that integrates structured multimodal conditioning and reinforcement learning with a composite reward for semantic alignment with radiologist reports. Across the MIMIC-CXR, CheXpert Plus, and ReXgradient datasets, CXRMate-2 achieves statistically significant improvements over strong benchmarks, including gains of 11.2% and 24.4% in GREEN and RadGraph-XL, respectively, on MIMIC-CXR relative to MedGemma 1.5 (4B). To directly compare CXRMate-2 against radiologist reporting, we conduct a blinded, randomised qualitative retrospective evaluation. Three consultant radiologists compare generated and radiologist reports across 120 studies from the MIMIC-CXR test set. Generated reports were deemed acceptable (defined as preferred or rated equally to radiologist reports) in 45% of ratings, with no statistically significant difference in preference rates between radiologist reports and acceptable generated reports for seven of the eight analysed findings. Preference for radiologist reports was driven primarily by higher recall, while generated reports were often preferred for readability. Together, these results suggest a credible pathway to clinically acceptable CXR RRG. Improvements in recall, alongside better detection of subtle findings (e.g., pulmonary congestion), are likely sufficient to achieve non-inferiority to radiologist reporting. With these targeted advances, CXR RRG systems may be ready for prospective evaluation in assistive roles within radiologist-led workflows.

IR · Nov 26, 2024 · Code
2D Matryoshka Training for Information Retrieval

Shuai Wang, Shengyao Zhuang, Bevan Koopman et al.

2D Matryoshka Training is an advanced embedding representation training approach designed to train an encoder model simultaneously across various layer-dimension setups. This method has demonstrated higher effectiveness in Semantic Text Similarity (STS) tasks over traditional training approaches when using sub-layers for embeddings. Despite its success, discrepancies exist between two published implementations, leading to varied comparative results with baseline models. In this reproducibility study, we implement and evaluate both versions of 2D Matryoshka Training on STS tasks and extend our analysis to retrieval tasks. Our findings indicate that while both versions achieve higher effectiveness than traditional Matryoshka training on sub-dimensions, and traditional full-sized model training approaches, they do not outperform models trained separately on specific sub-layer and sub-dimension setups. Moreover, these results generalize well to retrieval tasks, both in supervised (MSMARCO) and zero-shot (BEIR) settings. Further exploration of different loss computations reveals more suitable implementations for retrieval tasks, such as incorporating full-dimension loss and training on a broader range of target dimensions. Conversely, some intuitive approaches, such as fixing document encoders to full model outputs, do not yield improvements. Our reproduction code is available at https://github.com/ielab/2DMSE-Reproduce.

CV · Jan 24, 2022 · Code
Improving Chest X-Ray Report Generation by Leveraging Warm Starting

Aaron Nicolson, Jason Dowling, Bevan Koopman

Automatically generating a report from a patient's Chest X-Rays (CXRs) is a promising solution to reducing clinical workload and improving patient care. However, current CXR report generators, which are predominantly encoder-to-decoder models, lack the diagnostic accuracy to be deployed in a clinical setting. To improve CXR report generation, we investigate warm starting the encoder and decoder with recent open-source computer vision and natural language processing checkpoints, such as the Vision Transformer (ViT) and PubMedBERT. To this end, each checkpoint is evaluated on the MIMIC-CXR and IU X-Ray datasets. Our experimental investigation demonstrates that the Convolutional vision Transformer (CvT) ImageNet-21K and the Distilled Generative Pre-trained Transformer 2 (DistilGPT2) checkpoints are best for warm starting the encoder and decoder, respectively. Compared to the state-of-the-art (M² Transformer Progressive), CvT2DistilGPT2 attained an improvement of 8.3% for CE F-1, 1.8% for BLEU-4, 1.6% for ROUGE-L, and 1.0% for METEOR. The reports generated by CvT2DistilGPT2 have a higher similarity to radiologist reports than previous approaches. This indicates that leveraging warm starting improves CXR report generation. Code and checkpoints for CvT2DistilGPT2 are available at https://github.com/aehrc/cvt2distilgpt2.

IR · Mar 8, 2025
Rank-R1: Enhancing Reasoning in LLM-based Document Rerankers via Reinforcement Learning

Shengyao Zhuang, Xueguang Ma, Bevan Koopman et al.

In this paper, we introduce Rank-R1, a novel LLM-based reranker that performs reasoning over both the user query and candidate documents before performing the ranking task. Existing document reranking methods based on large language models (LLMs) typically rely on prompting or fine-tuning LLMs to order or label candidate documents according to their relevance to a query. For Rank-R1, we use a reinforcement learning algorithm along with only a small set of relevance labels (without any reasoning supervision) to enhance the reasoning ability of LLM-based rerankers. Our hypothesis is that adding reasoning capabilities to the rerankers can improve their relevance assessment and ranking capabilities. Our experiments on the TREC DL and BRIGHT datasets show that Rank-R1 is highly effective, especially for complex queries. In particular, we find that Rank-R1 achieves effectiveness on in-domain datasets on par with that of supervised fine-tuning methods, while using only 18% of the training data used by the fine-tuning methods. We also find that the model largely outperforms zero-shot and supervised fine-tuning when applied to out-of-domain datasets featuring complex queries, especially when a 14B-size model is used. Finally, we qualitatively observe that Rank-R1's reasoning process improves the explainability of the ranking results, opening new opportunities for search engine results presentation and fruition.

IR · Jan 12, 2024
Zero-shot Generative Large Language Models for Systematic Review Screening Automation

Shuai Wang, Harrisen Scells, Shengyao Zhuang et al.

Systematic reviews are crucial for evidence-based medicine as they comprehensively analyse published research findings on specific questions. Conducting such reviews is often resource- and time-intensive, especially in the screening phase, where abstracts of publications are assessed for inclusion in a review. This study investigates the effectiveness of using zero-shot large language models (LLMs) for automatic screening. We evaluate the effectiveness of eight different LLMs and investigate a calibration technique that uses a predefined recall threshold to determine whether a publication should be included in a systematic review. Our comprehensive evaluation using five standard test collections shows that instruction fine-tuning plays an important role in screening, that calibration renders LLMs practical for achieving a targeted recall, and that combining both with an ensemble of zero-shot models saves significant screening time compared to state-of-the-art approaches.
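One way to read the calibration step is as choosing a score cut-off on labelled calibration data so that a target recall is met. The sketch below works under that reading; the inclusion scores, the helper names, and the choice to calibrate on a held-out labelled set are assumptions for illustration, since the abstract does not describe the procedure in detail.

```python
import math
from typing import List, Sequence


def calibrate_threshold(scores: Sequence[float], labels: Sequence[int],
                        target_recall: float = 0.95) -> float:
    """Highest score cut-off that still reaches `target_recall` on the
    calibration set, when documents with score >= cut-off are included."""
    if not 0.0 < target_recall <= 1.0:
        raise ValueError("target_recall must be in (0, 1]")
    positives = sorted((s for s, y in zip(scores, labels) if y), reverse=True)
    if not positives:
        raise ValueError("calibration set contains no relevant documents")
    # Number of relevant documents that must score at or above the cut-off.
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]


def screen(scores: Sequence[float], threshold: float) -> List[bool]:
    """Include a publication when its score reaches the calibrated cut-off."""
    return [s >= threshold for s in scores]


# Hypothetical LLM inclusion scores with gold labels (1 = should be included).
cal_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
cal_labels = [1, 0, 1, 1, 0, 1]
threshold = calibrate_threshold(cal_scores, cal_labels, target_recall=0.75)
decisions = screen(cal_scores, threshold)
```

Raising the target recall lowers the cut-off and increases the number of publications that must be screened manually, which is the trade-off the calibration makes explicit.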

IR · Jan 31, 2024
ReSLLM: Large Language Models are Strong Resource Selectors for Federated Search

Shuai Wang, Shengyao Zhuang, Bevan Koopman et al.

Federated search, which involves integrating results from multiple independent search engines, will become increasingly pivotal in the context of Retrieval-Augmented Generation pipelines empowering LLM-based applications such as chatbots. These systems often distribute queries among various search engines, ranging from specialized (e.g., PubMed) to general (e.g., Google), based on the nature of user utterances. A critical aspect of federated search is resource selection: the selection of appropriate resources prior to issuing the query, to ensure high-quality and rapid responses and to contain the costs associated with calling the external search engines. However, current SOTA resource selection methodologies primarily rely on feature-based learning approaches. These methods often involve the labour-intensive and expensive creation of training labels for each resource. In contrast, LLMs have exhibited strong effectiveness as zero-shot methods across NLP and IR tasks. We hypothesise that in the context of federated search LLMs can assess the relevance of resources without the need for extensive predefined labels or features. In this paper, we propose ReSLLM. Our ReSLLM method exploits LLMs to drive the selection of resources in federated search in a zero-shot setting. In addition, we devise an unsupervised fine-tuning protocol, the Synthetic Label Augmentation Tuning (SLAT), where the relevance of previously logged queries and snippets from resources is predicted using an off-the-shelf LLM and then in turn used to fine-tune ReSLLM with respect to resource selection. Our empirical evaluation and analysis detail the factors influencing the effectiveness of LLMs in this context. The results showcase the merits of ReSLLM for resource selection: not only competitive effectiveness in the zero-shot setting, but also large improvements when fine-tuned using the SLAT protocol.

IR · Feb 20, 2024
Understanding and Mitigating the Threat of Vec2Text to Dense Retrieval Systems

Shengyao Zhuang, Bevan Koopman, Xiaoran Chu et al.

The emergence of Vec2Text -- a method for text embedding inversion -- has raised serious privacy concerns for dense retrieval systems which use text embeddings, such as those offered by OpenAI and Cohere. This threat comes from the ability of a malicious attacker with access to embeddings to reconstruct the original text. In this paper, we investigate various factors related to embedding models that may impact text recoverability via Vec2Text. We explore factors such as distance metrics, pooling functions, bottleneck pre-training, training with noise addition, embedding quantization, and embedding dimensions, which were not considered in the original Vec2Text paper. Through a comprehensive analysis of these factors, our objective is to gain a deeper understanding of the key elements that affect the trade-offs between the text recoverability and retrieval effectiveness of dense retrieval systems, offering insights for practitioners designing privacy-aware dense retrieval systems. We also propose a simple embedding transformation fix that guarantees equal ranking effectiveness while mitigating the recoverability risk. Overall, this study reveals that Vec2Text could pose a threat to current dense retrieval systems, but there are some effective methods to patch such systems.
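
The abstract does not say which embedding transformation the authors use, so the following is only an illustration of why a transformation can leave rankings unchanged. It assumes a secret signed permutation of the embedding coordinates, chosen here because it is about the simplest map that preserves inner products when applied to both queries and documents. The function names and the seed-as-secret idea are hypothetical, and the sketch makes no claim about how well such a map resists inversion.

```python
import random
from typing import List, Tuple

Vector = List[float]


def make_secret_transform(dim: int, seed: int) -> Tuple[List[int], List[int]]:
    """Build a hypothetical secret: a coordinate permutation plus sign flips."""
    rng = random.Random(seed)
    perm = list(range(dim))
    rng.shuffle(perm)
    signs = [rng.choice((-1, 1)) for _ in range(dim)]
    return perm, signs


def transform(vec: Vector, perm: List[int], signs: List[int]) -> Vector:
    """Apply the signed permutation; this is an orthogonal linear map."""
    return [signs[i] * vec[perm[i]] for i in range(len(perm))]


def dot(a: Vector, b: Vector) -> float:
    """Inner product used as the retrieval score."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    perm, signs = make_secret_transform(dim=4, seed=7)
    query = [1.0, 2.0, 3.0, 4.0]
    doc = [4.0, 3.0, 2.0, 1.0]
    # Scores match because each sign is squared away and the same
    # coordinate products are summed, only in a different order.
    print(dot(query, doc))
    print(dot(transform(query, perm, signs), transform(doc, perm, signs)))
```

Because the same products are summed in a different order, floating-point scores may differ in the last bits for arbitrary real-valued embeddings; in exact arithmetic the scores, and therefore the ranking, are identical.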

IRApr 2, 2025
LLM-VPRF: Large Language Model Based Vector Pseudo Relevance Feedback

Hang Li, Shengyao Zhuang, Bevan Koopman et al.

Vector Pseudo Relevance Feedback (VPRF) has shown promising results in improving BERT-based dense retrieval systems through iterative refinement of query representations. This paper investigates the generalizability of VPRF to Large Language Model (LLM) based dense retrievers. We introduce LLM-VPRF and evaluate its effectiveness across multiple benchmark datasets, analyzing how different LLMs impact the feedback mechanism. Our results demonstrate that VPRF's benefits successfully extend to LLM architectures, establishing it as a robust technique for enhancing dense retrieval performance regardless of the underlying models. This work bridges the gap between VPRF with traditional BERT-based dense retrievers and modern LLMs, while providing insights into their future directions.

IRMar 19, 2025
Pseudo Relevance Feedback is Enough to Close the Gap Between Small and Large Dense Retrieval Models

Hang Li, Xiao Wang, Bevan Koopman et al.

Scaling dense retrievers to larger large language model (LLM) backbones has been a dominant strategy for improving their retrieval effectiveness. However, this has substantial cost implications: larger backbones require more expensive hardware (e.g. GPUs with more memory) and lead to higher indexing and querying costs (latency, energy consumption). In this paper, we challenge this paradigm by introducing PromptPRF, a feature-based pseudo-relevance feedback (PRF) framework that enables small LLM-based dense retrievers to achieve effectiveness comparable to much larger models. PromptPRF uses LLMs to extract query-independent, structured and unstructured features (e.g., entities, summaries, chain-of-thought keywords, essay) from top-ranked documents. These features are generated offline and integrated into dense query representations via prompting, enabling efficient retrieval without additional training. Unlike prior methods such as GRF, which rely on online, query-specific generation and sparse retrieval, PromptPRF decouples feedback generation from query processing and supports dense retrievers in a fully zero-shot setting. Experiments on TREC DL and BEIR benchmarks demonstrate that PromptPRF consistently improves retrieval effectiveness and offers favourable cost-effectiveness trade-offs. We further present ablation studies to understand the role of positional feedback and analyse the interplay between feature extractor size, PRF depth, and model performance. Our findings demonstrate that with effective PRF design, scaling the retriever is not always necessary, narrowing the gap between small and large models while reducing inference cost.

IRMay 12, 2025
Reassessing Large Language Model Boolean Query Generation for Systematic Reviews

Shuai Wang, Harrisen Scells, Bevan Koopman et al.

Systematic reviews are comprehensive literature reviews that address highly focused research questions and represent the highest form of evidence in medicine. A critical step in this process is the development of complex Boolean queries to retrieve relevant literature. Given the difficulty of manually constructing these queries, recent efforts have explored Large Language Models (LLMs) to assist in their formulation. One of the first studies, Wang et al., investigated ChatGPT for this task, followed by Staudinger et al., which evaluated multiple LLMs in a reproducibility study. However, the latter overlooked several key aspects of the original work, including (i) validation of generated queries, (ii) output formatting constraints, and (iii) selection of examples for chain-of-thought (Guided) prompting. As a result, its findings diverged significantly from the original study. In this work, we systematically reproduce both studies while addressing these overlooked factors. Our results show that query effectiveness varies significantly across models and prompt designs, with guided query formulation benefiting from well-chosen seed studies. Overall, prompt design and model selection are key drivers of successful query formulation. Our findings provide a clearer understanding of LLMs' potential in Boolean query generation and highlight the importance of model- and prompt-specific optimisations. The complex nature of systematic reviews adds to challenges in both developing and reproducing methods but also highlights the importance of reproducibility studies in this domain.

CVJun 19, 2024
The Impact of Auxiliary Patient Data on Automated Chest X-Ray Report Generation and How to Incorporate It

Aaron Nicolson, Shengyao Zhuang, Jason Dowling et al.

This study investigates the integration of diverse patient data sources into multimodal language models for automated chest X-ray (CXR) report generation. Traditionally, CXR report generation relies solely on CXR images and limited radiology data, overlooking valuable information from patient health records, particularly from emergency departments. Utilising the MIMIC-CXR and MIMIC-IV-ED datasets, we incorporate detailed patient information such as vital signs, medicines, and clinical history to enhance diagnostic accuracy. We introduce a novel approach to transform these heterogeneous data sources into embeddings that prompt a multimodal language model; this significantly enhances the diagnostic accuracy of generated radiology reports. Our comprehensive evaluation demonstrates the benefits of using a broader set of patient data, underscoring the potential for enhanced diagnostic capabilities and better patient outcomes through the integration of multimodal data in CXR report generation.

CLJan 1, 2022
Semantic Search for Large Scale Clinical Ontologies

Duy-Hoa Ngo, Madonna Kemp, Donna Truran et al.

Finding concepts in large clinical ontologies can be challenging when queries use different vocabularies. A search algorithm that overcomes this problem is useful in applications such as concept normalisation and ontology matching, where concepts can be referred to in different ways, using different synonyms. In this paper, we present a deep learning based approach to build a semantic search system for large clinical ontologies. We propose a Triplet-BERT model and a method that generates training data directly from the ontologies. The model is evaluated using five real benchmark data sets and the results show that our approach achieves high results on both free text to concept and concept to concept searching tasks, and outperforms all baseline methods.
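
As a rough illustration of how triplet training data could be derived from an ontology, the sketch below assumes that each concept carries a list of synonymous terms, that an anchor and a positive are two terms of the same concept, and that a negative is any term of a different concept. The abstract does not describe the actual generation procedure or loss, so the concept identifiers, the exhaustive negative enumeration, and the standard margin-based triplet loss are all assumptions made for the example.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (anchor, positive, negative)


def make_triplets(ontology: Dict[str, List[str]]) -> List[Triplet]:
    """Enumerate (anchor, positive, negative) term triplets from synonym lists."""
    triplets: List[Triplet] = []
    for concept, terms in ontology.items():
        # Every term belonging to a different concept is a candidate negative.
        negatives = [
            term
            for other, other_terms in ontology.items()
            if other != concept
            for term in other_terms
        ]
        # Ordered pairs of distinct synonyms give anchor/positive pairs.
        for anchor, positive in permutations(terms, 2):
            for negative in negatives:
                triplets.append((anchor, positive, negative))
    return triplets


def triplet_loss(d_pos: float, d_neg: float, margin: float = 1.0) -> float:
    """Standard triplet loss on two precomputed distances."""
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    # Hypothetical two-concept ontology; the identifiers are placeholders.
    toy = {
        "C1": ["heart attack", "myocardial infarction"],
        "C2": ["stroke"],
    }
    for triplet in make_triplets(toy):
        print(triplet)
```

On a large ontology the exhaustive enumeration above would be far too large, so negatives would normally be sampled; that choice is left open here.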

IRAug 25, 2021
Pseudo Relevance Feedback with Deep Language Models and Dense Retrievers: Successes and Pitfalls

Hang Li, Ahmed Mourad, Shengyao Zhuang et al.

Pseudo Relevance Feedback (PRF) is known to improve the effectiveness of bag-of-words retrievers. At the same time, deep language models have been shown to outperform traditional bag-of-words rerankers. However, it is unclear how to integrate PRF directly with emergent deep language models. In this article, we address this gap by investigating methods for integrating PRF signals into rerankers and dense retrievers based on deep language models. We consider text-based and vector-based PRF approaches, and investigate different ways of combining and scoring relevance signals. An extensive empirical evaluation was conducted across four different datasets and two task settings (retrieval and ranking). Text-based PRF results show that the use of PRF had a mixed effect on deep rerankers across different datasets. We found that the best effectiveness was achieved when (i) directly concatenating each PRF passage with the query, searching with the new set of queries, and then aggregating the scores; (ii) using Borda to aggregate scores from PRF runs. Vector-based PRF results show that the use of PRF enhanced the effectiveness of deep rerankers and dense retrievers over several evaluation metrics. We found that higher effectiveness was achieved when (i) the query retains either the majority or the same weight within the PRF mechanism, and (ii) a shallower PRF signal (i.e., a smaller number of top-ranked passages) was employed, rather than a deeper signal. Our vector-based PRF method is computationally efficient; thus this represents a general PRF method others can use with deep rerankers and dense retrievers.
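
The vector-based findings above can be made concrete with a minimal sketch. It assumes a Rocchio-style interpolation in which the new query vector is a weighted sum of the original query and the mean of the top-ranked passage vectors. The parameter names `depth` and `query_weight`, and the use of inner-product scoring, are illustrative choices, not the paper's exact formulation.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product used as the retrieval score."""
    return sum(x * y for x, y in zip(a, b))


def rank(query: Vector, passages: List[Vector]) -> List[int]:
    """Return passage indices sorted by descending score."""
    return sorted(
        range(len(passages)),
        key=lambda i: dot(query, passages[i]),
        reverse=True,
    )


def vector_prf(
    query: Vector,
    passages: List[Vector],
    depth: int = 3,
    query_weight: float = 0.5,
) -> Vector:
    """Interpolate the query with the mean of its top-`depth` passages."""
    if not 0.0 <= query_weight <= 1.0:
        raise ValueError("query_weight must be in [0, 1]")
    top = rank(query, passages)[:depth]
    if not top:
        return list(query)  # nothing to feed back
    feedback = [
        sum(passages[i][d] for i in top) / len(top)
        for d in range(len(query))
    ]
    return [
        query_weight * q + (1.0 - query_weight) * f
        for q, f in zip(query, feedback)
    ]


if __name__ == "__main__":
    passages = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.6]]
    query = [1.0, 0.0]
    new_query = vector_prf(query, passages, depth=2, query_weight=0.5)
    print(new_query)
    print(rank(new_query, passages))
```

Setting `query_weight` to 0.5 or higher and keeping `depth` small corresponds to the two conditions the abstract reports as working best: the query keeps at least equal weight, and the feedback signal is shallow.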