Abram Handler

CL
7 papers
3,113 citations
Novelty 34%
AI Score 41

7 Papers

11.6 SI May 7
TubeCensus: A Transparent, Replicable, and Large-Scale Census of YouTube Channels and their Subscriber Counts Over Time

Chloe Eggleston, Abram Handler, Maria Leonor Pacheco

YouTube is central to contemporary mass media. However, the official YouTube API does not provide access to the full set of creators or creator metadata on the platform. This lack of basic visibility into the YouTube ecosystem hinders understanding of the platform's creator economy. Researchers currently have no easy, transparent, or replicable way to construct large-scale datasets of YouTube creators and their audiences over time. This makes it challenging to study vital social questions, such as how changes to the YouTube recommendation algorithm shape creator incentives and by extension the mass media on the platform. We address this gap with TubeCensus, a large-scale longitudinal dataset of YouTube creators and subscriber counts, constructed by collecting, linking, and organizing nearly two decades of YouTube page captures from the Internet Archive. This approach is transparent and replicable and does not require interaction with the YouTube API, whose output can change over time. We validate the coverage of TubeCensus against prior estimates of YouTube's size and find that our resource includes creators responsible for at least 30-36% of all YouTube content. We also find that TubeCensus provides good coverage of prominent creators. To support future research, we hide the substantial complexities of the YouTube identifier system and Internet Archive capture system by distributing our dataset via an easy-to-use pip package. Finally, we use our resource to complete basic exploratory analysis of YouTube channel content and the mechanisms associated with YouTube channel growth.

CL Sep 7, 2019
Investigating Sports Commentator Bias within a Large Corpus of American Football Broadcasts

Jack Merullo, Luke Yeh, Abram Handler et al.

Sports broadcasters inject drama into play-by-play commentary by building team and player narratives through subjective analyses and anecdotes. Prior studies based on small datasets and manual coding show that such theatrics evince commentator bias in sports broadcasts. To examine this phenomenon, we assemble FOOTBALL, which contains 1,455 broadcast transcripts from American football games across six decades that are automatically annotated with 250K player mentions and linked with racial metadata. We identify major confounding factors for researchers examining racial bias in FOOTBALL, and perform a computational analysis that supports conclusions from prior social science studies.

CL Apr 19, 2019
Query-focused Sentence Compression in Linear Time

Abram Handler, Brendan O'Connor

Search applications often display shortened sentences which must contain certain query terms and must fit within the space constraints of a user interface. This work introduces a new transition-based sentence compression technique developed for such settings. Our query-focused method constructs length and lexically constrained compressions in linear time, by growing a subgraph in the dependency parse of a sentence. This theoretically efficient approach achieves an 11X empirical speedup over baseline ILP methods, while better reconstructing gold constrained shortenings. Such speedups help query-focused applications, because users are measurably hindered by interface lags. Additionally, our technique does not require an ILP solver or a GPU.

CL Feb 1, 2019
Human acceptability judgements for extractive sentence compression

Abram Handler, Brian Dillon, Brendan O'Connor

Recent approaches to English-language sentence compression rely on parallel corpora consisting of sentence-compression pairs. However, a sentence may be shortened in many different ways, which each might be suited to the needs of a particular application. Therefore, in this work, we collect and model crowdsourced judgements of the acceptability of many possible sentence shortenings. We then show how a model of such judgements can be used to support a flexible approach to the compression task. We release our model and dataset for future work.

HC Aug 6, 2017
Rookie: A unique approach for exploring news archives

Abram Handler, Brendan O'Connor

News archives are an invaluable primary source for placing current events in historical context. But current search engine tools do a poor job of uncovering broad themes and narratives across documents. We present Rookie: a practical software system which uses natural language processing (NLP) to help readers, reporters and editors uncover broad stories in news archives. Unlike prior work, Rookie's design emerged from 18 months of iterative development in consultation with editors and computational journalists. This process led to a dramatically different approach from previous academic systems with similar goals. Our efforts offer a generalizable case study for others building real-world journalism software using NLP.

CL Jul 22, 2017
Identifying civilians killed by police with distantly supervised entity-event extraction

Katherine A. Keith, Abram Handler, Michael Pinkham et al.

We propose a new, socially-impactful task for natural language processing: from a news corpus, extract names of persons who have been killed by police. We present a newly collected police fatality corpus, which we release publicly, and present a model to solve this problem that uses EM-based distant supervision with logistic regression and convolutional neural network classifiers. Our model outperforms two off-the-shelf event extractor systems, and it can suggest candidate victim names in some cases faster than one of the major manually-collected police fatality databases.

ML Jun 20, 2016
Visualizing textual models with in-text and word-as-pixel highlighting

Abram Handler, Su Lin Blodgett, Brendan O'Connor

We explore two techniques which use color to make sense of statistical text models. One method uses in-text annotations to illustrate a model's view of particular tokens in particular documents. Another uses a high-level, "words-as-pixels" graphic to display an entire corpus. Together, these methods offer both zoomed-in and zoomed-out perspectives into a model's understanding of text. We show how these interconnected methods help diagnose a classifier's poor performance on Twitter slang, and make sense of a topic model on historical political texts.