CL · Oct 21, 2023
RTSUM: Relation Triple-based Interpretable Summarization with Multi-level Salience Visualization
Seonglae Cho, Yonggi Cho, HoonJae Lee et al.
In this paper, we present RTSUM, an unsupervised summarization framework that uses relation triples as the basic unit for summarization. Given an input document, RTSUM first selects salient relation triples via multi-level salience scoring and then generates a concise summary from the selected triples using a text-to-text language model. Building on RTSUM, we also develop a web demo of an interpretable summarization tool that provides fine-grained interpretations alongside the output summary. With support for customization options, our tool visualizes the salience of textual units at three distinct levels: sentences, relation triples, and phrases. The code is publicly available.
AI · Apr 6, 2023
Evidentiality-aware Retrieval for Overcoming Abstractiveness in Open-Domain Question Answering
Yongho Song, Dahyun Lee, Myungha Jang et al.
The long-standing goal of dense retrievers in abstractive open-domain question answering (ODQA) tasks is to learn to capture evidence passages among relevant passages for any given query, such that the reader produces factually correct outputs from the evidence passages. One key challenge is the insufficient amount of training data with supervision on the answerability of passages. Recent studies rely on iterative pipelines to annotate answerability using signals from the reader, but their high computational costs hamper practical applications. In this paper, we instead focus on a data-centric approach and propose Evidentiality-Aware Dense Passage Retrieval (EADPR), which leverages synthetic distractor samples to learn to discriminate evidence passages from distractors. We conduct extensive experiments to validate the effectiveness of our proposed method on multiple abstractive ODQA tasks.
IR · Jun 20, 2018
Explaining Controversy on Social Media via Stance Summarization
Myungha Jang, James Allan
In an era in which new controversies rapidly emerge and evolve on social media, navigating social media platforms to learn about a new controversy can be an overwhelming task. In this light, there has been significant work that studies how to identify and measure controversy online. However, we currently lack a tool for effectively understanding controversy in social media. For example, users have to manually examine postings to find the arguments of conflicting stances that make up the controversy. In this paper, we study methods to generate a stance-aware summary that explains a given controversy by collecting arguments of two conflicting stances. We focus on Twitter and treat stance summarization as a ranking problem of finding the top k tweets that best summarize the two conflicting stances of a controversial topic. We formalize the characteristics of a good stance summary and propose a ranking model accordingly. We first evaluate our methods on five controversial topics on Twitter. Our user evaluation shows that our methods consistently outperform other baseline techniques in generating a summary that explains the given controversy.
IR · Mar 29, 2017
Is Climate Change Controversial? Modeling Controversy as Contention Within Populations
Shiri Dori-Hacohen, Myungha Jang, James Allan
A growing body of research focuses on computationally detecting controversial topics and understanding the stances people hold on them. Yet gaps remain in our theoretical and practical understanding of how to define controversy, how it manifests, and how to measure it. In this paper, we introduce a novel measure we call "contention", defined with respect to a topic and a population. We model contention from a mathematical standpoint. We validate our model by examining a diverse set of sources: real-world polling data sets, actual voter data, and Twitter coverage on several topics. In our publicly released Twitter data set of nearly 100M tweets, we examine several topics such as Brexit, the 2016 U.S. Elections, and "The Dress", and cross-reference them with other sources. We demonstrate that the contention measure holds explanatory power for a wide variety of observed phenomena, such as controversies over climate change and other topics that are well within scientific consensus. Finally, we re-examine the notion of controversy, and present a theoretical framework that defines it in terms of population. We present preliminary evidence suggesting that contention is one dimension of controversy, along with others, such as "importance". Our new contention measure, along with the hypothesized model of controversy, suggests several avenues for future work in this emerging interdisciplinary research area.
IR · Mar 16, 2017
Improving Document Clustering by Eliminating Unnatural Language
Myungha Jang, Jinho D. Choi, James Allan
Technical documents contain a fair amount of unnatural language, such as tables, formulas, and pseudocode. Unnatural language can be a significant source of confusion for existing NLP tools. This paper presents an effective method for distinguishing unnatural language from natural language, and evaluates the impact of unnatural language detection on NLP tasks such as document clustering. We view this problem as an information extraction task and build a multiclass classification model that classifies unnatural language components into four categories. First, we create a new annotated corpus by collecting slides and papers in various formats (PPT, PDF, and HTML), in which unnatural language components are annotated with these four categories. We then explore features available from plain text to build a statistical model that can handle any format as long as it is converted into plain text. Our experiments show that removing unnatural language components gives an absolute improvement in document clustering of up to 15%. Our corpus and tool are publicly available.