Kevin Small

CL
h-index24
17papers
3,949citations
Novelty47%
AI Score47

17 Papers

CLApr 29, 2022
Answer Consolidation: Formulation and Benchmarking

Wenxuan Zhou, Qiang Ning, Heba Elfardy et al. · amazon-science

Current question answering (QA) systems primarily consider the single-answer scenario, where each question is assumed to be paired with one correct answer. However, in many real-world QA applications, multiple answer scenarios arise where consolidating answers into a comprehensive and non-redundant set of answers is a more efficient user interface. In this paper, we formulate the problem of answer consolidation, where answers are partitioned into multiple groups, each representing different aspects of the answer set. Then, given this partitioning, a comprehensive and non-redundant set of answers can be constructed by picking one answer from each group. To initiate research on answer consolidation, we construct a dataset consisting of 4,699 questions and 24,006 sentences and evaluate multiple models. Despite a promising performance achieved by the best-performing supervised models, we still believe this task has room for further improvements.

CLOct 24, 2023
Background Summarization of Event Timelines

Adithya Pratapa, Kevin Small, Markus Dreyer · amazon-science, cmu

Generating concise summaries of news events is a challenging natural language processing task. While journalists often curate timelines to highlight key sub-events, newcomers to a news event face challenges in catching up on its historical context. In this paper, we address this need by introducing the task of background news summarization, which complements each timeline update with a background summary of relevant preceding events. We construct a dataset by merging existing timeline datasets and asking human annotators to write a background summary for each timestep of each news event. We establish strong baseline performance using state-of-the-art summarization systems and propose a query-focused variant to generate background summaries. To evaluate background summary quality, we present a question-answering-based evaluation metric, Background Utility Score (BUS), which measures the percentage of questions about a current event timestep that a background summary answers. Our experiments show the effectiveness of instruction fine-tuned systems such as Flan-T5, in addition to strong zero-shot performance using GPT-3.5.

CLSep 23, 2024
Learning When to Retrieve, What to Rewrite, and How to Respond in Conversational QA

Nirmal Roy, Leonardo F. R. Ribeiro, Rexhina Blloshmi et al. · amazon-science

Augmenting Large Language Models (LLMs) with information retrieval capabilities (i.e., Retrieval-Augmented Generation (RAG)) has proven beneficial for knowledge-intensive tasks. However, understanding users' contextual search intent when generating responses is an understudied topic for conversational question answering (QA). This conversational extension leads to additional concerns when compared to single-turn QA as it is more challenging for systems to comprehend conversational context and manage retrieved passages over multiple turns. In this work, we propose a method for enabling LLMs to decide when to retrieve in RAG settings given a conversational context. When retrieval is deemed necessary, the LLM then rewrites the conversation for passage retrieval and judges the relevance of returned passages before response generation. Operationally, we build on the single-turn SELF-RAG framework (Asai et al., 2023) and propose SELF-multi-RAG for conversational settings. SELF-multi-RAG demonstrates improved capabilities over single-turn variants with respect to retrieving relevant passages (by using summarized conversational context) and assessing the quality of generated responses. Experiments on three conversational QA datasets validate the enhanced response generation capabilities of SELF-multi-RAG, with improvements of ~13% measured by human annotation.

CLDec 2, 2022
SumREN: Summarizing Reported Speech about Events in News

Revanth Gangi Reddy, Heba Elfardy, Hou Pong Chan et al.

A primary objective of news articles is to establish the factual record for an event, frequently achieved by conveying both the details of the specified event (i.e., the 5 Ws; Who, What, Where, When and Why regarding the event) and how people reacted to it (i.e., reported statements). However, existing work on news summarization almost exclusively focuses on the event details. In this work, we propose the novel task of summarizing the reactions of different speakers, as expressed by their reported statements, to a given event. To this end, we create a new multi-document summarization benchmark, SUMREN, comprising 745 summaries of reported statements from various public figures obtained from 633 news articles discussing 132 events. We propose an automatic silver training data generation approach for our task, which helps smaller models like BART achieve GPT-3 level performance on this task. Finally, we introduce a pipeline-based framework for summarizing reported speech, which we empirically show to generate summaries that are more abstractive and factual than baseline query-focused summarization approaches.

CLMay 24, 2022
PLAtE: A Large-scale Dataset for List Page Web Extraction

Aidan San, Yuan Zhuang, Jan Bakus et al.

Recently, neural models have been leveraged to significantly improve the performance of information extraction from semi-structured websites. However, a barrier for continued progress is the small number of datasets large enough to train these models. In this work, we introduce the PLAtE (Pages of Lists Attribute Extraction) benchmark dataset as a challenging new web extraction task. PLAtE focuses on shopping data, specifically extractions from product review pages with multiple items encompassing the tasks of: (1) finding product-list segmentation boundaries and (2) extracting attributes for each product. PLAtE is composed of 52, 898 items collected from 6, 694 pages and 156, 014 attributes, making it the first largescale list page web extraction dataset. We use a multi-stage approach to collect and annotate the dataset and adapt three state-of-the-art web extraction models to the two tasks comparing their strengths and weaknesses both quantitatively and qualitatively.

CLJan 30
Evaluating the Utility of Grounding Documents with Reference-Free LLM-based Metrics

Yilun Hua, Giuseppe Castellucci, Peter Schulam et al.

Retrieval Augmented Generation (RAG)'s success depends on the utility the LLM derives from the content used for grounding. Quantifying content utility does not have a definitive specification and existing metrics ignore model-specific capabilities and/or rely on costly annotations. In this paper, we propose Grounding Generation Utility (GroGU), a model-specific and reference-free metric that defines utility as a function of the downstream LLM's generation confidence based on entropy. Despite having no annotation requirements, GroGU is largely faithful in distinguishing ground-truth documents while capturing nuances ignored by LLM-agnostic metrics. We apply GroGU to train a query-rewriter for RAG by identifying high-utility preference data for Direct Preference Optimization. Experiments show improvements by up to 18.2 points in Mean Reciprocal Rank and up to 9.4 points in answer accuracy.

CLJul 30, 2025Code
WINELL: Wikipedia Never-Ending Updating with LLM Agents

Revanth Gangi Reddy, Tanay Dixit, Jiaxin Qin et al.

Wikipedia, a vast and continuously consulted knowledge base, faces significant challenges in maintaining up-to-date content due to its reliance on manual human editors. Inspired by the vision of continuous knowledge acquisition in NELL and fueled by advances in LLM-based agents, this paper introduces WiNELL, an agentic framework for continuously updating Wikipedia articles. Our approach employs a multi-agent framework to aggregate online information, select new and important knowledge for a target entity in Wikipedia, and then generate precise edit suggestions for human review. Our fine-grained editing models, trained on Wikipedia's extensive history of human edits, enable incorporating updates in a manner consistent with human editing behavior. Our editor models outperform both open-source instruction-following baselines and closed-source LLMs (e.g., GPT-4o) in key information coverage and editing efficiency. End-to-end evaluation on high-activity Wikipedia pages demonstrates WiNELL's ability to identify and suggest timely factual updates. This opens up a promising research direction in LLM agents for automatically updating knowledge bases in a never-ending fashion.

CLMay 21, 2020Code
Fluent Response Generation for Conversational Question Answering

Ashutosh Baheti, Alan Ritter, Kevin Small

Question answering (QA) is an important aspect of open-domain conversational agents, garnering specific research focus in the conversational QA (ConvQA) subtask. One notable limitation of recent ConvQA efforts is the response being answer span extraction from the target corpus, thus ignoring the natural language generation (NLG) aspect of high-quality conversational agents. In this work, we propose a method for situating QA responses within a SEQ2SEQ NLG approach to generate fluent grammatical answer responses while maintaining correctness. From a technical perspective, we use data augmentation to generate training data for an end-to-end system. Specifically, we develop Syntactic Transformations (STs) to produce question-specific candidate answer responses and rank them using a BERT-based classifier (Devlin et al., 2019). Human evaluation on SQuAD 2.0 data (Rajpurkar et al., 2018) demonstrate that the proposed model outperforms baseline CoQA and QuAC models in generating conversational responses. We further show our model's scalability by conducting tests on the CoQA dataset. The code and data are available at https://github.com/abaheti95/QADialogSystem.

CLNov 7, 2019Code
Multi-domain Dialogue State Tracking as Dynamic Knowledge Graph Enhanced Question Answering

Li Zhou, Kevin Small

Multi-domain dialogue state tracking (DST) is a critical component for conversational AI systems. The domain ontology (i.e., specification of domains, slots, and values) of a conversational AI system is generally incomplete, making the capability for DST models to generalize to new slots, values, and domains during inference imperative. In this paper, we propose to model multi-domain DST as a question answering problem, referred to as Dialogue State Tracking via Question Answering (DSTQA). Within DSTQA, each turn generates a question asking for the value of a (domain, slot) pair, thus making it naturally extensible to unseen domains, slots, and values. Additionally, we use a dynamically-evolving knowledge graph to explicitly learn relationships between (domain, slot) pairs. Our model has a 5.80% and 12.21% relative improvement over the current state-of-the-art model on MultiWOZ 2.0 and MultiWOZ 2.1 datasets, respectively. Additionally, our model consistently outperforms the state-of-the-art model in domain adaptation settings. (Code is released at https://github.com/alexa/dstqa )

CLFeb 16, 2024
Persona-DB: Efficient Large Language Model Personalization for Response Prediction with Collaborative Data Refinement

Chenkai Sun, Ke Yang, Revanth Gangi Reddy et al.

The increasing demand for personalized interactions with large language models (LLMs) calls for methodologies capable of accurately and efficiently identifying user opinions and preferences. Retrieval augmentation emerges as an effective strategy, as it can accommodate a vast number of users without the costs from fine-tuning. Existing research, however, has largely focused on enhancing the retrieval stage and devoted limited exploration toward optimizing the representation of the database, a crucial aspect for tasks such as personalization. In this work, we examine the problem from a novel angle, focusing on how data can be better represented for more data-efficient retrieval in the context of LLM customization. To tackle this challenge, we introduce Persona-DB, a simple yet effective framework consisting of a hierarchical construction process to improve generalization across task contexts and collaborative refinement to effectively bridge knowledge gaps among users. In the evaluation of response prediction, Persona-DB demonstrates superior context efficiency in maintaining accuracy with a significantly reduced retrieval size, a critical advantage in scenarios with extensive histories or limited context windows. Our experiments also indicate a marked improvement of over 10% under cold-start scenarios, when users have extremely sparse data. Furthermore, our analysis reveals the increasing importance of collaborative knowledge as the retrieval capacity expands.

CLApr 2, 2024
Towards Better Generalization in Open-Domain Question Answering by Mitigating Context Memorization

Zixuan Zhang, Revanth Gangi Reddy, Kevin Small et al.

Open-domain Question Answering (OpenQA) aims at answering factual questions with an external large-scale knowledge corpus. However, real-world knowledge is not static; it updates and evolves continually. Such a dynamic characteristic of knowledge poses a vital challenge for these models, as the trained models need to constantly adapt to the latest information to make sure that the answers remain accurate. In addition, it is still unclear how well an OpenQA model can transfer to completely new knowledge domains. In this paper, we investigate the generalization performance of a retrieval-augmented QA model in two specific scenarios: 1) adapting to updated versions of the same knowledge corpus; 2) switching to completely different knowledge domains. We observe that the generalization challenges of OpenQA models stem from the reader's over-reliance on memorizing the knowledge from the external corpus, which hinders the model from generalizing to a new knowledge corpus. We introduce Corpus-Invariant Tuning (CIT), a simple but effective training strategy, to mitigate the knowledge over-memorization by controlling the likelihood of retrieved contexts during training. Extensive experimental results on multiple OpenQA benchmarks show that CIT achieves significantly better generalizability without compromising the model's performance in its original corpus and domain.

CLDec 16, 2021
NewsClaims: A New Benchmark for Claim Detection from News with Attribute Knowledge

Revanth Gangi Reddy, Sai Chetan, Zhenhailong Wang et al.

Claim detection and verification are crucial for news understanding and have emerged as promising technologies for mitigating misinformation and disinformation in the news. However, most existing work has focused on claim sentence analysis while overlooking additional crucial attributes (e.g., the claimer and the main object associated with the claim). In this work, we present NewsClaims, a new benchmark for attribute-aware claim detection in the news domain. We extend the claim detection problem to include extraction of additional attributes related to each claim and release 889 claims annotated over 143 news articles. NewsClaims aims to benchmark claim detection systems in emerging scenarios, comprising unseen topics with little or no training data. To this end, we see that zero-shot and prompt-based baselines show promising performance on this benchmark, while still considerably behind human performance.

CLSep 10, 2021
Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning

Li Zhou, Kevin Small, Yong Zhang et al.

Motivated by suggested question generation in conversational news recommendation systems, we propose a model for generating question-answer pairs (QA pairs) with self-contained, summary-centric questions and length-constrained, article-summarizing answers. We begin by collecting a new dataset of news articles with questions as titles and pairing them with summaries of varying length. This dataset is used to learn a QA pair generation model producing summaries as answers that balance brevity with sufficiency jointly with their corresponding questions. We then reinforce the QA pair generation process with a differentiable reward function to mitigate exposure bias, a common problem in natural language generation. Both automatic metrics and human evaluation demonstrate these QA pairs successfully capture the central gists of the articles and achieve high answer accuracy.

CLOct 19, 2020
Summary-Oriented Question Generation for Informational Queries

Xusen Yin, Li Zhou, Kevin Small et al.

Users frequently ask simple factoid questions for question answering (QA) systems, attenuating the impact of myriad recent works that support more complex questions. Prompting users with automatically generated suggested questions (SQs) can improve user understanding of QA system capabilities and thus facilitate more effective use. We aim to produce self-explanatory questions that focus on main document topics and are answerable with variable length passages as appropriate. We satisfy these requirements by using a BERT-based Pointer-Generator Network trained on the Natural Questions (NQ) dataset. Our model shows SOTA performance of SQ generation on the NQ dataset (20.1 BLEU-4). We further apply our model on out-of-domain news articles, evaluating with a QA system due to the lack of gold questions and demonstrate that our model produces better SQs for news articles -- with further confirmation via a human evaluation.

LGAug 16, 2020
Inverse Reinforcement Learning with Natural Language Goals

Li Zhou, Kevin Small

Humans generally use natural language to communicate task requirements to each other. Ideally, natural language should also be usable for communicating goals to autonomous machines (e.g., robots) to minimize friction in task specification. However, understanding and mapping natural language goals to sequences of states and actions is challenging. Specifically, existing work along these lines has encountered difficulty in generalizing learned policies to new natural language goals and environments. In this paper, we propose a novel adversarial inverse reinforcement learning algorithm to learn a language-conditioned policy and reward function. To improve generalization of the learned policy and reward function, we use a variational goal generator to relabel trajectories and sample diverse goals during training. Our algorithm outperforms multiple baselines by a large margin on a vision-based natural language instruction following dataset (Room-2-Room), demonstrating a promising advance in enabling the use of natural language instructions in specifying agent goals.

LGNov 30, 2018
Active Learning in Recommendation Systems with Multi-level User Preferences

Yuheng Bu, Kevin Small

While recommendation systems generally observe user behavior passively, there has been an increased interest in directly querying users to learn their specific preferences. In such settings, considering queries at different levels of granularity to optimize user information acquisition is crucial to efficiently providing a good user experience. In this work, we study the active learning problem with multi-level user preferences within the collective matrix factorization (CMF) framework. CMF jointly captures multi-level user preferences with respect to items and relations between items (e.g., book genre, cuisine type), generally resulting in improved predictions. Motivated by finite-sample analysis of the CMF model, we propose a theoretically optimal active learning strategy based on the Fisher information matrix and use this to derive a realizable approximation algorithm for practical recommendations. Experiments are conducted using both the Yelp dataset directly and an illustrative synthetic dataset in the three settings of personalized active learning, cold-start recommendations, and noisy data -- demonstrating strong improvements over several widely used active learning methods.

AIDec 7, 2017
End-to-End Offline Goal-Oriented Dialog Policy Learning via Policy Gradient

Li Zhou, Kevin Small, Oleg Rokhlenko et al.

Learning a goal-oriented dialog policy is generally performed offline with supervised learning algorithms or online with reinforcement learning (RL). Additionally, as companies accumulate massive quantities of dialog transcripts between customers and trained human agents, encoder-decoder methods have gained popularity as agent utterances can be directly treated as supervision without the need for utterance-level annotations. However, one potential drawback of such approaches is that they myopically generate the next agent utterance without regard for dialog-level considerations. To resolve this concern, this paper describes an offline RL method for learning from unannotated corpora that can optimize a goal-oriented policy at both the utterance and dialog level. We introduce a novel reward function and use both on-policy and off-policy policy gradient to learn a policy offline without requiring online user interaction or an explicit state space definition.