CY Feb 28, 2019
Citation Needed: A Taxonomy and Algorithmic Assessment of Wikipedia's Verifiability
Miriam Redi, Besnik Fetahu, Jonathan Morgan et al.
Wikipedia is playing an increasingly central role on the web, and the policies its contributors follow when sourcing and fact-checking content affect millions of readers. Among these core guiding principles, verifiability policies have a particularly important role. Verifiability requires that information included in a Wikipedia article be corroborated against reliable secondary sources. Because of the manual labor needed to curate and fact-check Wikipedia at scale, however, its contents do not always evenly comply with these policies. Citations (i.e., references to external sources) may not conform to verifiability requirements or may be missing altogether, potentially weakening the reliability of specific topic areas of the free encyclopedia. In this paper, we aim to provide an empirical characterization of why and how Wikipedia cites external sources to comply with its own verifiability guidelines. First, we construct a taxonomy of reasons why inline citations are required by collecting labeled data from editors of multiple Wikipedia language editions. We then collect a large-scale crowdsourced dataset of Wikipedia sentences annotated with categories derived from this taxonomy. Finally, we design and evaluate algorithmic models to determine if a statement requires a citation, and to predict the citation reason based on our taxonomy. We evaluate the robustness of such models across different classes of Wikipedia articles of varying quality, as well as on an additional dataset of claims annotated for fact-checking purposes.
CL Oct 31, 2018
WikiConv: A Corpus of the Complete Conversational History of a Large Online Collaborative Community
Yiqing Hua, Cristian Danescu-Niculescu-Mizil, Dario Taraborelli et al.
We present a corpus that encompasses the complete history of conversations between contributors to Wikipedia, one of the largest online collaborative communities. By recording the intermediate states of conversations---including not only comments and replies, but also their modifications, deletions and restorations---this data offers an unprecedented view of online conversation. This level of detail supports new research questions pertaining to the process (and challenges) of large-scale online collaboration. We illustrate the corpus' potential with two case studies that highlight new perspectives on earlier work. First, we explore how a person's conversational behavior depends on how they relate to the discussion's venue. Second, we show that community moderation of toxic behavior happens at a higher rate than previously estimated. Finally, the reconstruction framework is designed to be language agnostic, and we show that it can extract high-quality conversational data in both Chinese and English.
CL May 14, 2018
Conversations Gone Awry: Detecting Early Signs of Conversational Failure
Justine Zhang, Jonathan P. Chang, Cristian Danescu-Niculescu-Mizil et al.
One of the main challenges online social systems face is the prevalence of antisocial behavior, such as harassment and personal attacks. In this work, we introduce the task of predicting from the very start of a conversation whether it will get out of hand. As opposed to detecting undesirable behavior after the fact, this task aims to enable early, actionable prediction at a time when the conversation might still be salvaged. To this end, we develop a framework for capturing pragmatic devices---such as politeness strategies and rhetorical prompts---used to start a conversation, and analyze their relation to its future trajectory. Applying this framework in a controlled setting, we demonstrate the feasibility of detecting early warning signs of antisocial behavior in online discussions.
IR Mar 10, 2017
Building automated vandalism detection tools for Wikidata
Amir Sarabadani, Aaron Halfaker, Dario Taraborelli
Wikidata, like Wikipedia, is a knowledge base that anyone can edit. This open collaboration model is powerful in that it reduces barriers to participation and allows a large number of people to contribute. However, it exposes the knowledge base to the risk of vandalism and low-quality contributions. In this work, we build on past work detecting vandalism in Wikipedia to detect vandalism in Wikidata. This work is novel in that identifying damaging changes in a structured knowledge base requires substantially different feature engineering work than in a text-based wiki like Wikipedia. We also discuss the utility of these classifiers for reducing the overall workload of vandalism patrollers in Wikidata. We describe a machine classification strategy that is able to catch 89% of vandalism while reducing patrollers' workload by 98%, by drawing lightly from contextual features of an edit and heavily from the characteristics of the user making the edit.
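The two figures quoted in this abstract (share of vandalism caught versus reduction in patrollers' workload) are two sides of one thresholding decision. The following is a minimal sketch of how such a pair of numbers could be computed, assuming a classifier that assigns each edit a vandalism score and patrollers who review only edits at or above a threshold. The function name `review_tradeoff`, the toy scores, the labels, and the threshold are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def review_tradeoff(
    scores: List[float], is_vandalism: List[bool], threshold: float
) -> Tuple[float, float]:
    """Return (recall, workload_reduction) for a given review threshold.

    Assumption (not from the paper): patrollers review only edits whose
    score is at or above `threshold`; all other edits are left unreviewed.
    """
    if not scores or len(scores) != len(is_vandalism):
        raise ValueError("scores and is_vandalism must be non-empty and equal length")

    # An edit is flagged for human review when its score reaches the threshold.
    flagged = [s >= threshold for s in scores]

    # Recall: fraction of all vandalism that lands in the flagged set.
    total_vandalism = sum(is_vandalism)
    caught = sum(1 for f, v in zip(flagged, is_vandalism) if f and v)
    recall = caught / total_vandalism if total_vandalism else 0.0

    # Workload reduction: fraction of edits patrollers no longer need to review.
    workload_reduction = 1.0 - sum(flagged) / len(scores)
    return recall, workload_reduction


# Toy data (hypothetical): ten edits, three of which are vandalism.
scores = [0.95, 0.90, 0.20, 0.10, 0.05, 0.85, 0.02, 0.01, 0.03, 0.04]
labels = [True, True, False, False, False, False, False, False, False, True]

recall, reduction = review_tradeoff(scores, labels, threshold=0.8)
print(f"recall={recall:.2f}, workload reduction={reduction:.2f}")
```

Lowering the threshold raises recall but shrinks the workload reduction, so a reported pair such as the one in the abstract corresponds to one particular operating point on that curve.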
SI Sep 4, 2014
MoodBar: Increasing new user retention in Wikipedia through lightweight socialization
Giovanni Luca Ciampaglia, Dario Taraborelli
Socialization in online communities allows existing members to welcome and recruit newcomers, introduce them to community norms and practices, and sustain their early participation. However, socializing newcomers does not come for free: in large communities, socialization can result in a significant workload for mentors and is hard to scale. In this study we present results from an experiment that measured the effect of a lightweight socialization tool on the activity and retention of newly registered users attempting to edit Wikipedia for the first time. Wikipedia is struggling with the retention of newcomers, and our results indicate that a mechanism to elicit lightweight feedback and to provide early mentoring to newcomers improves their chances of becoming long-term contributors.