58.1LGMay 28Code
AMNESIA: A Large Scale Medical Unlearning Benchmark Suite with Disease-Informed AnalysisSaeedeh Davoudi, Reihaneh Iranmanesh, Ophir Frieder et al.
Medical knowledge is continuously evolving. This creates a need to update or selectively forget information encoded in already-trained medical LLMs. Machine unlearning aims to remove the influence of specific training data from a model without full retraining. Yet, existing unlearning benchmarks rely on synthetic or small-scale general data, leaving clinical unlearning understudied. We introduce AMNESIA, the first large-scale, open source benchmark for medical unlearning, with 70,560 question-answer pairs from 8,820 patient notes across 11 disease categories. AMNESIA includes both factual questions testing direct recall and reasoning questions testing clinical inference. We use it to evaluate four widely used unlearning methods at both random patient and disease-level, and introduce a new metric for detecting leakage of medical terminology. We show that unlearning individual patients erodes knowledge of others with the same condition, calling for methods that can better separate patients from shared clinical knowledge.
IRJul 31, 2023
Lexically-Accelerated Dense RetrievalHrishikesh Kulkarni, Sean MacAvaney, Nazli Goharian et al.
Retrieval approaches that score documents based on learned dense vectors (i.e., dense retrieval) rather than lexical signals (i.e., conventional retrieval) are increasingly popular. Their ability to identify related documents that do not necessarily contain the same terms as those appearing in the user's query (thereby improving recall) is one of their key advantages. However, to actually achieve these gains, dense retrieval approaches typically require an exhaustive search over the document collection, making them considerably more expensive at query-time than conventional lexical approaches. Several techniques aim to reduce this computational overhead by approximating the results of a full dense retriever. Although these approaches reasonably approximate the top results, they suffer in terms of recall -- one of the key advantages of dense retrieval. We introduce 'LADR' (Lexically-Accelerated Dense Retrieval), a simple-yet-effective approach that improves the efficiency of existing dense retrieval models without compromising on retrieval effectiveness. LADR uses lexical retrieval techniques to seed a dense retrieval exploration that uses a document proximity graph. We explore two variants of LADR: a proactive approach that expands the search space to the neighbors of all seed documents, and an adaptive approach that selectively searches the documents with the highest estimated relevance in an iterative fashion. Through extensive experiments across a variety of dense retrieval models, we find that LADR establishes a new dense retrieval effectiveness-efficiency Pareto frontier among approximate k nearest neighbor techniques. Further, we find that when tuned to take around 8ms per query in retrieval latency on our hardware, LADR consistently achieves both precision and recall that are on par with an exhaustive search on standard benchmarks.
CLAug 25, 2024
Genetic Approach to Mitigate Hallucination in Generative IRHrishikesh Kulkarni, Nazli Goharian, Ophir Frieder et al.
Generative language models hallucinate. That is, at times, they generate factually flawed responses. These inaccuracies are particularly insidious because the responses are fluent and well-articulated. We focus on the task of Grounded Answer Generation (part of Generative IR), which aims to produce direct answers to a user's question based on results retrieved from a search engine. We address hallucination by adapting an existing genetic generation approach with a new 'balanced fitness function' consisting of a cross-encoder model for relevance and an n-gram overlap metric to promote grounding. Our balanced fitness function approach quadruples the grounded answer generation accuracy while maintaining high relevance.
LGSep 1, 2022
Self-supervised Representation Learning on Electronic Health Records with Graph Kernel InfomaxHao-Ren Yao, Nairen Cao, Katina Russell et al.
Learning Electronic Health Records (EHRs) representation is a preeminent yet under-discovered research topic. It benefits various clinical decision support applications, e.g., medication outcome prediction or patient similarity search. Current approaches focus on task-specific label supervision on vectorized sequential EHR, which is not applicable to large-scale unsupervised scenarios. Recently, contrastive learning shows great success on self-supervised representation learning problems. However, complex temporality often degrades the performance. We propose Graph Kernel Infomax, a self-supervised graph kernel learning approach on the graphical representation of EHR, to overcome the previous problems. Unlike the state-of-the-art, we do not change the graph structure to construct augmented views. Instead, we use Kernel Subspace Augmentation to embed nodes into two geometrically different manifold views. The entire framework is trained by contrasting nodes and graph representations on those two manifold views through the commonly used contrastive objectives. Empirically, using publicly available benchmark EHR datasets, our approach yields performance on clinical downstream tasks that exceeds the state-of-the-art. Theoretically, the variation on distance metrics naturally creates different views as data augmentation without changing graph structures.
CLFeb 26
TARAZ: Persian Short-Answer Question Benchmark for Cultural Evaluation of Language ModelsReihaneh Iranmanesh, Saeedeh Davoudi, Pasha Abrishamchian et al.
This paper presents a comprehensive evaluation framework for assessing the cultural competence of large language models (LLMs) in Persian. Existing Persian cultural benchmarks rely predominantly on multiple-choice formats and English-centric metrics that fail to capture Persian's morphological complexity and semantic nuance. Our framework introduces a Persian-specific short-answer evaluation that combines rule-based morphological normalization with a hybrid syntactic and semantic similarity module, enabling robust soft-match scoring beyond exact string overlap. Through systematic evaluation of 15 state-of-the-art open- and closed-source models, we demonstrate that our hybrid evaluation improves scoring consistency by +10% compared to exact-match baselines by capturing meaning that surface-level methods cannot detect. We publicly release our evaluation framework, providing the first standardized benchmark for measuring cultural understanding in Persian and establishing a reproducible foundation for cross-cultural LLM evaluation research.
LGMay 30, 2025Code
Intercept Cancer: Cancer Pre-Screening with Large Scale Healthcare Foundation ModelsLiwen Sun, Hao-Ren Yao, Gary Gao et al.
Cancer screening, leading to early detection, saves lives. Unfortunately, existing screening techniques require expensive and intrusive medical procedures, not globally available, resulting in too many lost would-be-saved lives. We present CATCH-FM, CATch Cancer early with Healthcare Foundation Models, a cancer pre-screening methodology that identifies high-risk patients for further screening solely based on their historical medical records. With millions of electronic healthcare records (EHR), we establish the scaling law of EHR foundation models pretrained on medical code sequences, pretrain compute-optimal foundation models of up to 2.4 billion parameters, and finetune them on clinician-curated cancer risk prediction cohorts. In our retrospective evaluation comprising of thirty thousand patients, CATCH-FM achieves strong efficacy, with 50% sensitivity in predicting first cancer risks at 99% specificity cutoff, and outperforming feature-based tree models and both general and medical LLMs by up to 20% AUPRC. Despite significant demographic, healthcare system, and EHR coding differences, CATCH-FM achieves state-of-the-art pancreatic cancer risk prediction on the EHRSHOT few-shot leaderboard, outperforming EHR foundation models pretrained using on-site patient data. Our analysis demonstrates the robustness of CATCH-FM in various patient distributions, the benefits of operating in the ICD code space, and its ability to capture non-trivial cancer risk factors. Our code will be open-sourced.
CLJun 8, 2025
Manifesto from Dagstuhl Perspectives Workshop 24352 -- Conversational Agents: A Framework for Evaluation (CAFE)Christine Bauer, Li Chen, Nicola Ferro et al.
During the workshop, we deeply discussed what CONversational Information ACcess (CONIAC) is and its unique features, proposing a world model abstracting it, and defined the Conversational Agents Framework for Evaluation (CAFE) for the evaluation of CONIAC systems, consisting of six major components: 1) goals of the system's stakeholders, 2) user tasks to be studied in the evaluation, 3) aspects of the users carrying out the tasks, 4) evaluation criteria to be considered, 5) evaluation methodology to be applied, and 6) measures for the quantitative criteria chosen.
IRAug 29, 2021
TAR on Social Media: A Framework for Online Content ModerationEugene Yang, David D. Lewis, Ophir Frieder
Content moderation (removing or limiting the distribution of posts based on their contents) is one tool social networks use to fight problems such as harassment and disinformation. Manually screening all content is usually impractical given the scale of social media data, and the need for nuanced human interpretations makes fully automated approaches infeasible. We consider content moderation from the perspective of technology-assisted review (TAR): a human-in-the-loop active learning approach developed for high recall retrieval problems in civil litigation and other fields. We show how TAR workflows, and a TAR cost model, can be adapted to the content moderation problem. We then demonstrate on two publicly available content moderation data sets that a TAR workflow can reduce moderation costs by 20% to 55% across a variety of conditions.
IRAug 29, 2021
Certifying One-Phase Technology-Assisted ReviewsDavid D. Lewis, Eugene Yang, Ophir Frieder
Technology-assisted review (TAR) workflows based on iterative active learning are widely used in document review applications. Most stopping rules for one-phase TAR workflows lack valid statistical guarantees, which has discouraged their use in some legal contexts. Drawing on the theory of quantile estimation, we provide the first broadly applicable and statistically valid sample-based stopping rules for one-phase TAR. We further show theoretically and empirically that overshooting a recall target, which has been treated as innocuous or desirable in past evaluations of stopping rules, is a major source of excess cost in one-phase TAR workflows. Counterintuitively, incurring a larger sampling cost to reduce excess recall leads to lower total cost in almost all scenarios.
IRJun 18, 2021
Heuristic Stopping Rules For Technology-Assisted ReviewEugene Yang, David D. Lewis, Ophir Frieder
Technology-assisted review (TAR) refers to human-in-the-loop active learning workflows for finding relevant documents in large collections. These workflows often must meet a target for the proportion of relevant documents found (i.e. recall) while also holding down costs. A variety of heuristic stopping rules have been suggested for striking this tradeoff in particular settings, but none have been tested against a range of recall targets and tasks. We propose two new heuristic stopping rules, Quant and QuantCI based on model-based estimation techniques from survey research. We compare them against a range of proposed heuristics and find they are accurate at hitting a range of recall targets while substantially reducing review costs.
IRJun 18, 2021
On Minimizing Cost in Legal Document Review WorkflowsEugene Yang, David D. Lewis, Ophir Frieder
Technology-assisted review (TAR) refers to human-in-the-loop machine learning workflows for document review in legal discovery and other high recall review tasks. Attorneys and legal technologists have debated whether review should be a single iterative process (one-phase TAR workflows) or whether model training and review should be separate (two-phase TAR workflows), with implications for the choice of active learning algorithm. The relative cost of manual labeling for different purposes (training vs. review) and of different documents (positive vs. negative examples) is a key and neglected factor in this debate. Using a novel cost dynamics analysis, we show analytically and empirically that these relative costs strongly impact whether a one-phase or two-phase workflow minimizes cost. We also show how category prevalence, classification task difficulty, and collection size impact the optimal choice not only of workflow type, but of active learning method and stopping point.
IRMay 3, 2021
Goldilocks: Just-Right Tuning of BERT for Technology-Assisted ReviewEugene Yang, Sean MacAvaney, David D. Lewis et al.
Technology-assisted review (TAR) refers to iterative active learning workflows for document review in high recall retrieval (HRR) tasks. TAR research and most commercial TAR software have applied linear models such as logistic regression to lexical features. Transformer-based models with supervised tuning are known to improve effectiveness on many text classification tasks, suggesting their use in TAR. We indeed find that the pre-trained BERT model reduces review cost by 10% to 15% in TAR workflows simulated on the RCV1-v2 newswire collection. In contrast, we likewise determined that linear models outperform BERT for simulated legal discovery topics on the Jeb Bush e-mail collection. This suggests the match between transformer pre-training corpora and the task domain is of greater significance than generally appreciated. Additionally, we show that just-right language model fine-tuning on the task collection before starting active learning is critical. Too little or too much fine-tuning hinders performance, worse than that of linear models, even for a favorable corpus such as RCV1-v2.
LGFeb 4, 2021
The Analysis from Nonlinear Distance Metric to Kernel-based Drug Prescription Prediction SystemDer-Chen Chang, Ophir Frieder, Chi-Feng Hung et al.
Distance metrics and their nonlinear variant play a crucial role in machine learning based real-world problem solving. We demonstrated how Euclidean and cosine distance measures differ not only theoretically but also in real-world medical application, namely, outcome prediction of drug prescription. Euclidean distance exhibits favorable properties in the local geometry problem. To this regard, Euclidean distance can be applied under short-term disease with low-variation outcome observation. Moreover, when presenting to highly variant chronic disease, it is preferable to use cosine distance. These different geometric properties lead to different submanifolds in the original embedded space, and hence, to different optimizing nonlinear kernel embedding frameworks. We first established the geometric properties that we needed in these frameworks. From these properties interpreted their differences in certain perspectives. Our evaluation on real-world, large-scale electronic health records and embedding space visualization empirically validated our approach.
LGAug 4, 2020
Cross-Global Attention Graph Kernel Network Prediction of Drug PrescriptionHao-Ren Yao, Der-Chen Chang, Ophir Frieder et al.
We present an end-to-end, interpretable, deep-learning architecture to learn a graph kernel that predicts the outcome of chronic disease drug prescription. This is achieved through a deep metric learning collaborative with a Support Vector Machine objective using a graphical representation of Electronic Health Records. We formulate the predictive model as a binary graph classification problem with an adaptive learned graph kernel through novel cross-global attention node matching between patient graphs, simultaneously computing on multiple graphs without training pair or triplet generation. Results using the Taiwanese National Health Insurance Research Database demonstrate that our approach outperforms current start-of-the-art models both in terms of accuracy and interpretability.
CLJul 28, 2020
GUIR at SemEval-2020 Task 12: Domain-Tuned Contextualized Models for Offensive Language DetectionSajad Sotudeh, Tong Xiang, Hao-Ren Yao et al.
Offensive language detection is an important and challenging task in natural language processing. We present our submissions to the OffensEval 2020 shared task, which includes three English sub-tasks: identifying the presence of offensive language (Sub-task A), identifying the presence of target in offensive language (Sub-task B), and identifying the categories of the target (Sub-task C). Our experiments explore using a domain-tuned contextualized language model (namely, BERT) for this task. We also experiment with different components and configurations (e.g., a multi-view SVM) stacked upon BERT models for specific sub-tasks. Our submissions achieve F1 scores of 91.7% in Sub-task A, 66.5% in Sub-task B, and 63.2% in Sub-task C. We perform an ablation study which reveals that domain tuning considerably improves the classification performance. Furthermore, error analysis shows common misclassification errors made by our model and outlines research directions for future.
CLMay 18, 2020
Interaction Matching for Long-Tail Multi-Label ClassificationSean MacAvaney, Franck Dernoncourt, Walter Chang et al.
We present an elegant and effective approach for addressing limitations in existing multi-label classification models by incorporating interaction matching, a concept shown to be useful for ad-hoc search result ranking. By performing soft n-gram interaction matching, we match labels with natural language descriptions (which are common to have in most multi-labeling tasks). Our approach can be used to enhance existing multi-label classification approaches, which are biased toward frequently-occurring labels. We evaluate our approach on two challenging tasks: automatic medical coding of clinical notes and automatic labeling of entities from software tutorial text. Our results show that our method can yield up to an 11% relative improvement in macro performance, with most of the gains stemming labels that appear infrequently in the training set (i.e., the long tail of labels).
IRApr 29, 2020
Training Curricula for Open Domain Answer Re-RankingSean MacAvaney, Franco Maria Nardini, Raffaele Perego et al.
In precision-oriented tasks like answer ranking, it is more important to rank many relevant answers highly than to retrieve all relevant answers. It follows that a good ranking strategy would be to learn how to identify the easiest correct answers first (i.e., assign a high ranking score to answers that have characteristics that usually indicate relevance, and a low ranking score to those with characteristics that do not), before incorporating more complex logic to handle difficult cases (e.g., semantic matching or reasoning). In this work, we apply this idea to the training of neural answer rankers using curriculum learning. We propose several heuristics to estimate the difficulty of a given training sample. We show that the proposed heuristics can be used to build a training curriculum that down-weights difficult samples early in the training process. As the training process progresses, our approach gradually shifts to weighting all samples equally, regardless of difficulty. We present a comprehensive evaluation of our proposed idea on three answer ranking datasets. Results show that our approach leads to superior performance of two leading neural ranking architectures, namely BERT and ConvKNRM, using both pointwise and pairwise losses. When applied to a BERT-based ranker, our method yields up to a 4% improvement in MRR and a 9% improvement in P@1 (compared to the model trained without a curriculum). This results in models that can achieve comparable performance to more expensive state-of-the-art techniques.
IRApr 29, 2020
Efficient Document Re-Ranking for Transformers by Precomputing Term RepresentationsSean MacAvaney, Franco Maria Nardini, Raffaele Perego et al.
Deep pretrained transformer networks are effective at various ranking tasks, such as question answering and ad-hoc document ranking. However, their computational expenses deem them cost-prohibitive in practice. Our proposed approach, called PreTTR (Precomputing Transformer Term Representations), considerably reduces the query-time latency of deep transformer networks (up to a 42x speedup on web document ranking) making these networks more practical to use in a real-time ranking scenario. Specifically, we precompute part of the document term representations at indexing time (without a query), and merge them with the query representation at query time to compute the final ranking score. Due to the large size of the token representations, we also propose an effective approach to reduce the storage requirement by training a compression layer to match attention scores. Our compression technique reduces the storage required up to 95% and it can be applied without a substantial degradation in ranking performance.
IRApr 29, 2020
Expansion via Prediction of Importance with ContextualizationSean MacAvaney, Franco Maria Nardini, Raffaele Perego et al.
The identification of relevance with little textual context is a primary challenge in passage retrieval. We address this problem with a representation-based ranking approach that: (1) explicitly models the importance of each term using a contextualized language model; (2) performs passage expansion by propagating the importance to similar terms; and (3) grounds the representations in the lexicon, making them interpretable. Passage representations can be pre-computed at index time to reduce query-time latency. We call our approach EPIC (Expansion via Prediction of Importance with Contextualization). We show that EPIC significantly outperforms prior importance-modeling and document expansion approaches. We also observe that the performance is additive with the current leading first-stage retrieval methods, further narrowing the gap between inexpensive and cost-prohibitive passage ranking approaches. Specifically, EPIC achieves a MRR@10 of 0.304 on the MS-MARCO passage ranking dataset with 78ms average query latency on commodity hardware. We also find that the latency is further reduced to 68ms by pruning document representations, with virtually no difference in effectiveness.
IRJan 9, 2020
Topical Result Caching in Web Search EnginesIda Mele, Nicola Tonellotto, Ophir Frieder et al.
Caching search results is employed in information retrieval systems to expedite query processing and reduce back-end server workload. Motivated by the observation that queries belonging to different topics have different temporal-locality patterns, we investigate a novel caching model called STD (Static-Topic-Dynamic cache). It improves traditional SDC (Static-Dynamic Cache) that stores in a static cache the results of popular queries and manages the dynamic cache with a replacement policy for intercepting the temporal variations in the query stream. Our proposed caching scheme includes another layer for topic-based caching, where the entries are allocated to different topics (e.g., weather, education). The results of queries characterized by a topic are kept in the fraction of the cache dedicated to it. This permits to adapt the cache-space utilization to the temporal locality of the various topics and reduces cache misses due to those queries that are neither sufficiently popular to be in the static portion nor requested within short-time intervals to be in the dynamic portion. We simulate different configurations for STD using two real-world query streams. Experiments demonstrate that our approach outperforms SDC with an increase up to 3% in terms of hit rates, and up to 36% of gap reduction w.r.t. SDC from the theoretical optimal caching algorithm.
IRNov 21, 2018
Overcoming low-utility facets for complex answer retrievalSean MacAvaney, Andrew Yates, Arman Cohan et al.
Many questions cannot be answered simply; their answers must include numerous nuanced details and additional context. Complex Answer Retrieval (CAR) is the retrieval of answers to such questions. In their simplest form, these questions are constructed from a topic entity (e.g., `cheese') and a facet (e.g., `health effects'). While topic matching has been thoroughly explored, we observe that some facets use general language that is unlikely to appear verbatim in answers. We call these low-utility facets. In this work, we present an approach to CAR that identifies and addresses low-utility facets. We propose two estimators of facet utility. These include exploiting the hierarchical structure of CAR queries and using facet frequency information from training data. To improve the retrieval performance on low-utility headings, we also include entity similarity scores using knowledge graph embeddings. We apply our approaches to a leading neural ranking technique, and evaluate using the TREC CAR dataset. We find that our approach perform significantly better than the unmodified neural ranker and other leading CAR techniques. We also provide a detailed analysis of our results, and verify that low-utility facets are indeed more difficult to match, and that our approach improves the performance for these difficult queries.
IRMay 2, 2018
Characterizing Question Facets for Complex Answer RetrievalSean MacAvaney, Andrew Yates, Arman Cohan et al.
Complex answer retrieval (CAR) is the process of retrieving answers to questions that have multifaceted or nuanced answers. In this work, we present two novel approaches for CAR based on the observation that question facets can vary in utility: from structural (facets that can apply to many similar topics, such as 'History') to topical (facets that are specific to the question's topic, such as the 'Westward expansion' of the United States). We first explore a way to incorporate facet utility into ranking models during query term score combination. We then explore a general approach to reform the structure of ranking models to aid in learning of facet utility in the query-document term matching phase. When we use our techniques with a leading neural ranker on the TREC CAR dataset, our methods rank first in the 2017 TREC CAR benchmark, and yield up to 26% higher performance than the next best method.
IRJul 1, 2017
Content-Based Weak Supervision for Ad-Hoc Re-RankingSean MacAvaney, Andrew Yates, Kai Hui et al.
One challenge with neural ranking is the need for a large amount of manually-labeled relevance judgments for training. In contrast with prior work, we examine the use of weak supervision sources for training that yield pseudo query-document pairs that already exhibit relevance (e.g., newswire headline-content pairs and encyclopedic heading-paragraph pairs). We also propose filtering techniques to eliminate training samples that are too far out of domain using two techniques: a heuristic-based approach and novel supervised filter that re-purposes a neural ranker. Using several leading neural ranking architectures and multiple weak supervision datasets, we show that these sources of training pairs are effective on their own (outperforming prior weak supervision techniques), and that filtering can further improve performance.