IR | May 1, 2019 | Code
The Literary Theme Ontology for Media Annotation and Information Retrieval
Paul Sheridan, Mikael Onsjö, Janna Hastings
Literary theme identification and interpretation is a focal point of literary studies scholarship. Classical forms of literary scholarship, such as close reading, have flourished with scarcely any need for commonly defined literary themes. However, the rise in popularity of collaborative and algorithmic analyses of literary themes in works of fiction, together with a requirement for computational searching and indexing facilities for large corpora, creates the need for a collection of shared literary themes to ensure common terminology and definitions. To address this need, we here introduce a first draft of the Literary Theme Ontology. Inspired by a traditional framing from literary theory, the ontology comprises literary themes drawn from the authors' own analyses, reference books, and online sources. The ontology is available at https://github.com/theme-ontology/lto under a Creative Commons Attribution 4.0 International license (CC BY 4.0).
IR | Feb 26, 2020
The hypergeometric test performs comparably to TF-IDF on standard text analysis tasks
Paul Sheridan, Mikael Onsjö
Term frequency-inverse document frequency, or TF-IDF for short, and its many variants form a class of term weighting functions whose members are widely used in text analysis applications. While TF-IDF was originally proposed as a heuristic, theoretical justifications grounded in information theory, probability, and the divergence from randomness paradigm have been advanced. In this work, we present an empirical study showing that TF-IDF corresponds very nearly with the hypergeometric test of statistical significance on selected real-data document retrieval, summarization, and classification tasks. These findings suggest that a fundamental mathematical connection between TF-IDF and the negative logarithm of the hypergeometric test P-value (i.e., a hypergeometric distribution tail probability) remains to be elucidated. We advance the empirical analyses herein as a first step toward explaining the long-standing effectiveness of TF-IDF through a statistical significance testing lens. It is our aspiration that these results will open the door to the systematic evaluation of term weighting functions derived from significance testing in text analysis applications.
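The comparison described in this abstract can be illustrated with a minimal sketch. It assumes one common formulation that the abstract itself does not spell out: a term occurring k times in a document of n tokens, and K times in a corpus of N tokens, is scored by the negative logarithm of the hypergeometric tail probability P(X >= k), while TF-IDF is taken as raw term count times log(D / df). The function names `hypergeom_tail` and `term_scores` and the toy corpus `DOCS` are hypothetical, and the paper's exact variants may differ.

```python
import math
from collections import Counter


def hypergeom_tail(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric(population N, K marked, n drawn)."""
    total = math.comb(N, n)
    hits = sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, min(K, n) + 1))
    return hits / total


def term_scores(docs, doc_index):
    """Return {term: (tfidf, -log p_value)} for one document.

    Assumed formulation (not taken from the paper): tokens are the
    sampling units, and TF-IDF is count * log(D / document frequency).
    """
    corpus_counts = Counter(t for d in docs for t in d)
    doc_freq = Counter(t for d in docs for t in set(d))
    N = sum(corpus_counts.values())      # total tokens in the corpus
    doc = docs[doc_index]
    n = len(doc)                         # tokens in this document
    scores = {}
    for term, k in Counter(doc).items():
        tfidf = k * math.log(len(docs) / doc_freq[term])
        p_value = hypergeom_tail(k, corpus_counts[term], n, N)
        scores[term] = (tfidf, -math.log(p_value))
    return scores


# Hypothetical toy corpus, invented purely for illustration.
DOCS = [
    ["theme", "ontology", "theme", "story"],
    ["story", "retrieval", "story", "index"],
    ["ontology", "retrieval", "index", "story"],
]

scores = term_scores(DOCS, 0)
for term, (tfidf, neg_log_p) in scores.items():
    print(f"{term}: tfidf={tfidf:.3f}  -log(p)={neg_log_p:.3f}")
```

On this toy corpus the two scores order the terms of the first document the same way. That is an illustration of the idea only, not evidence for the empirical claim in the abstract.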
IR | Jul 31, 2018
An Ontology-Based Recommender System with an Application to the Star Trek Television Franchise
Paul Sheridan, Mikael Onsjö, Claudia Becerra et al.
Recommender systems based on collaborative filtering have proven to be extremely successful in settings where user preference data on items is abundant. However, collaborative filtering algorithms are hindered by their weakness against the item cold-start problem and general lack of interpretability. Ontology-based recommender systems exploit hierarchical organizations of users and items to enhance browsing, recommendation, and profile construction. While ontology-based approaches address the shortcomings of their collaborative filtering counterparts, ontological organizations of items can be difficult to obtain for items that mostly belong to the same category (e.g., television series episodes). In this paper, we present an ontology-based recommender system that integrates the knowledge represented in a large ontology of literary themes to produce fiction content recommendations. The main novelty of this work is an ontology-based method for computing similarities between items and its integration with the classical Item-KNN (K-nearest neighbors) algorithm. As a case study, we evaluated the proposed method against other approaches by performing the classical rating prediction task on a collection of Star Trek television series episodes in an item cold-start scenario. This transverse evaluation provides insights into the utility of different information resources and methods for the initial stages of recommender system development. We found our proposed method to be a convenient alternative to collaborative filtering approaches for collections of mostly similar items, particularly when other content-based approaches are not applicable or otherwise unavailable. Aside from the new methods, this paper contributes a testbed for future research and an online framework to collaboratively extend the ontology of literary themes to cover other narrative content.
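The idea of plugging an ontology-based item similarity into Item-KNN can be sketched as follows. This is not the paper's similarity measure, which the abstract does not describe. As a stand-in, the sketch expands each item's themes with their ancestors in a theme hierarchy and takes the Jaccard overlap. The theme names, the `PARENT` map, the episode identifiers, and the ratings are all hypothetical.

```python
def expand(themes, parent):
    """Add every ancestor of each theme, following the parent map upward."""
    expanded = set()
    for theme in themes:
        while theme is not None:
            expanded.add(theme)
            theme = parent.get(theme)
    return expanded


def jaccard(a, b):
    """Jaccard overlap of two sets (a stand-in similarity, see lead-in)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_rating(user_ratings, target, item_themes, parent, k=2):
    """Item-KNN prediction: similarity-weighted mean over the k most
    similar items the user has already rated."""
    target_set = expand(item_themes[target], parent)
    neighbours = [
        (jaccard(target_set, expand(item_themes[item], parent)), rating)
        for item, rating in user_ratings.items()
        if item != target
    ]
    top = sorted(neighbours, key=lambda pair: pair[0], reverse=True)[:k]
    weight = sum(sim for sim, _ in top)
    if weight == 0:
        return None  # no rated item shares any theme with the target
    return sum(sim * rating for sim, rating in top) / weight


# Hypothetical theme hierarchy and annotations, invented for illustration.
PARENT = {
    "first contact": "contact with aliens",
    "alien invasion": "contact with aliens",
    "contact with aliens": "speculative",
    "time travel": "speculative",
}
ITEM_THEMES = {
    "ep1": {"first contact"},
    "ep2": {"alien invasion"},
    "ep3": {"time travel"},
    "new": {"first contact"},  # cold-start item: annotated but unrated
}
USER_RATINGS = {"ep1": 5, "ep2": 3, "ep3": 1}

prediction = predict_rating(USER_RATINGS, "new", ITEM_THEMES, PARENT, k=2)
print(prediction)
```

Because the similarity depends only on theme annotations, the unrated item `new` still receives a prediction, which is the property that matters in the item cold-start setting the abstract describes.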